|Did you know ...||Search Documentation:|
|library(semweb/rdf_db): The RDF database|
The central module of the RDF infrastructure is
It provides storage and indexed querying of RDF triples. RDF data is
stored as quintuples. The first three elements denote the RDF triple.
The extra Graph and Line elements provide information
about the origin of the triple.
The actual storage is provided by the foreign language (C)
module. Using a dedicated C-based implementation we can reduce memory
usage and improve indexing capabilities, for example by providing a
dedicated index to support entailment over
Currently the following indexes are provided (S=subject, P=predicate,
rdfs:subPropertyOfrelations. This index supports rdf_has/3 to query a property and all its children efficiently.
(rdf(R,_,_);rdf(_,_,R))normally produces many duplicate answers.
library(semweb/litindex)provides indexed search on tokens inside literals.
literal(Value)if the object is a literal value. If a value of the form NameSpaceID:LocalName is provided it is expanded to a ground atom using expand_goal/2. This implies you can use this construct in compiled code without paying a performance penalty. Literal values take one of the following forms:
rdf:datatypeTypeID. The Value is either the textual representation or a natural Prolog representation. See the option convert_typed_literal(:Convertor) of the parser. The storage layer provides efficient handling of atoms, integers (64-bit) and floats (native C-doubles). All other data is represented as a Prolog record.
For literal querying purposes, Object can be of the form
literal(+Query, -Value), where Query is one of the terms
below. If the Query takes a literal argument and the value has a numeric
type numerical comparison is performed.
icase(Text). Backward compatibility.
Backtracking never returns duplicate triples. Duplicates can be
retrieved using rdf/4. The predicate rdf/3
raises a type-error if called with improper arguments. If rdf/3
is called with a term
literal(_) as Subject or Predicate
object it fails silently. This allows for graph matching goals like
rdf(O,P2,O2) to proceed without
|Source||is a term Graph:Line. If Source is instatiated, passing an atom is the same as passing Atom:_.|
rdf(Subject, Predicate, Object)is true exploiting the rdfs:subPropertyOf predicate as well as inverse predicates declared using rdf_set_predicate/2 with the
If used with either Subject or Object unbound, it first returns the origin, followed by the reachable nodes in breadth-first search-order. The implementation internally looks one solution ahead and succeeds deterministically on the last solution. This predicate never generates the same node twice and is robust against cycles in the transitive relation.
With all arguments instantiated, it succeeds deterministically if a path can be found from Subject to Object. Searching starts at Subject, assuming the branching factor is normally lower. A call with both Subject and Object unbound raises an instantiation error. The following example generates all subclasses of rdfs:Resource:
?- rdf_reachable(X, rdfs:subClassOf, rdfs:'Resource'). X = 'http://www.w3.org/2000/01/rdf-schema#Resource' ; X = 'http://www.w3.org/2000/01/rdf-schema#Class' ; X = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property' ; ...
infiniteto impose no distance-limit.
The predicates below enumerate the basic objects of the RDF store. Most of these predicates also enumerate objects that are not associated to any currently visible triple. Objects are retained as long as they are visible in active queries or snapshots. After that, some are reclaimed by the RDF garbage collector, while others are never reclaimed.
This predicate is primarily intended as a way to process all resources without processing resources twice. The user must be aware that some of the returned resources may not appear in any visible triple.
Note that resources that have
are not automatically included in the result-set of this predicate,
while all resources that appear as the second argument of a
triple are included.
The predicates below modify the RDF store directly. In addition, data may be loaded using rdf_load/2 or by restoring a persistent database using rdf_attach_db/2. Modifications follow the Prolog logical update view semantics, which implies that modifications remain invisible to already running queries. Further isolation can be achieved using rdf_transaction/3.
user. Subject and Predicate are resources. Object is either a resource or a term
literal(Value). See rdf/3 for an explanation of Value for typed and language qualified literals. All arguments are subject to name-space expansion. Complete duplicates (including the same graph and `line' and with a compatible `lifespan') are not added to the database.
|Graph||is either the name of a graph (an atom) or a term Graph:Line, where Line is an integer that denotes a line number.|
The update semantics of the RDF database follows the conventional Prolog logical update view. In addition, the RDF database supports transactions and snapshots.
rdf_transaction(Goal, user, ). See rdf_transaction/3.
rdf_transaction(Goal, Id, ). See rdf_transaction/3.
Processed options are:
true, which implies that an anonymous snapshot is created at the current state of the store. Modifications due to executing Goal are only visible to Goal.
snapshotoption. A snapshot created outside a transaction exists until it is deleted. Snapshots taken inside a transaction can only be used inside this transaction.
_:. For backward compatibility reason,
__is also considered to be a blank node.
The RDF library can read and write triples in RDF/XML and a
proprietary binary format. There is a plugin interface defined to
support additional formats. The
this plugin API to support loading Turtle files using rdf_load/2.
rdf_load(FileOrList, ). See rdf_load/2.
share(default), equivalent blank nodes are shared in the same resource.
library(semweb/turtle)extend the set of recognised extensions.
file://URL when loading a file or, if the specification is a URL, its normalized version without the optional #fragment.
false, do not use or create a cache file.
xmlnsnamespace declarations or Turtle
@prefixprefixes using rdf_register_prefix/3 if there is no conflict.
true, the message reporting completion is printed using level
silent. Otherwise the level is
informational. See also print_message/2.
Other options are forwarded to process_rdf/3. By default, rdf_load/2 only loads RDF/XML from files. It can be extended to load data from other formats and locations using plugins. The full set of plugins relevant to support different formats and locations is below:
:- use_module(library(semweb/turtle)). % Turtle and TriG :- use_module(library(semweb/rdf_ntriples)). :- use_module(library(semweb/rdf_zlib_plugin)). :- use_module(library(semweb/rdf_http_plugin)). :- use_module(library(http/http_ssl_plugin)).
rdf_save(Out, ). See rdf_save/2 for details.
true) do not save blank nodes that do not appear (indirectly) as object of a named resource.
xml:langsaved with rdf:RDF element.
false), inline resources when encountered for the first time. Normally, only bnodes are handled this way.
false), emit subjects sorted on the full URI. Useful to make file comparison easier.
false, do not include the
xml:basedeclaration that is written normally when using the
true), never use xml attributes to save plain literal attributes, i.e., always used an XML element as in
|Out||Location to save the data. This can also
be a file-url (|
Sometimes it is necessary to make more arbitrary selections of
material to be saved or exchange RDF descriptions over an open network
link. The predicates in this section provide for this. Character
encoding issues are derived from the encoding of the Stream,
providing support for
Save an RDF header, with the XML header, DOCTYPE, ENTITY and opening the rdf:RDF element with appropriate namespace declarations. It uses the primitives from section 3.5 to generate the required namespaces and desired short-name. Options is one of:
rdfsare added to the provided List. If a namespace is not declared, the resource is emitted in non-abreviated form.
Fast loading and saving
Loading and saving RDF format is relatively slow. For this reason we
designed a binary format that is more compact, avoids the complications
of the RDF parser and avoids repetitive lookup of (URL) identifiers.
Especially the speed improvement of about 25 times is worth-while when
loading large databases. These predicates are used for caching by
rdf_load/2 under certain
conditions as well as for maintaining persistent snapshots of the
Many RDF stores turned triples into quadruples. This store is no exception, initially using the 4th argument to store the filename from which the triple was loaded. Currently, the 4th argument is the RDF named graph. A named graph maintains some properties, notably to track origin, changes and modified state.
Additional graph properties can be added by defining rules for the multifile predicate property_of_graph/2. Currently, the following extensions are defined:
trueif the graph is persistent.
Literal values are ordered and indexed using a skip list. The aim of this index is threefold.
As string literal matching is most frequently used for searching
purposes, the match is executed case-insensitive and after removal of
diacritics. Case matching and diacritics removal is based on Unicode
character properties and independent from the current locale. Case
conversion is based on the `simple uppercase mapping' defined by Unicode
and diacritic removal on the `decomposition type'. The approach is
lightweight, but somewhat simpleminded for some languages. The tables
are generated for Unicode characters upto 0x7fff. For more information,
please check the source-code of the mapping-table generator
unicode_map.pl available in the sources of this package.
Currently the total order of literals is first based on the type of literal using the ordering numeric < string < term Numeric values (integer and float) are ordered by value, integers preceed floats if they represent the same value. Strings are sorted alphabetically after case-mapping and diacritic removal as described above. If they match equal, uppercase preceeds lowercase and diacritics are ordered on their unicode value. If they still compare equal literals without any qualifier preceeds literals with a type qualifier which preceeds literals with a language qualifier. Same qualifiers (both type or both language) are sorted alphabetically.
The ordered tree is used for indexed execution of
literal(prefix(Prefix), Literal) as well as
if Like does not start with a `*'. Note that results of queries
that use the tree index are returned in alphabetical order.
The predicates below form an experimental interface to provide more
reasoning inside the kernel of the rdb_db engine. Note that
transitive are not yet
supported by the rest of the engine. Also note that there is no relation
to defined RDF properties. Properties that have no triples are not
reported by this predicate, while predicates that are involved in
triples do not need to be defined as an instance of rdf:Property.
symmetric(true)is the same as
inverse_of(Predicate), i.e., creating a predicate that is the inverse of itself.
rdf_subject_branch_factorproperty, uniqueness of the object value is computed from the hash key rather than the actual values.
rdf_subject_branch_factor, but also considering triples of `subPropertyOf' this relation. See also rdf_has/3.
rdf_object_branch_factor, but also considering triples of `subPropertyOf' this relation. See also rdf_has/3.
Prolog code often contains references to constant resources with a
prefix (also known as XML namespaces). For example,
http://www.w3.org/2000/01/rdf-schema#Class refers to the
most general notion of an RDFS class. Readability and maintability
concerns require for abstraction here. The RDF database maintains a
table of known prefixes. This table can be queried using rdf_current_ns/2
and can be extended using rdf_register_ns/3.
The prefix database is used to expand
that appear as arguments to calls which are known to accept a resource.
This expansion is achieved by Prolog preprocessor using expand_goal/2.
rdf_current_prefix(Prefix, Expansion), atom_concat(Expansion, Local, URI),
true, replace existing namespace alias. Please note that replacing a namespace is dangerous as namespaces affect preprocessing. Make sure all code that depends on a namespace is compiled after changing the registration.
trueand Alias is already defined, keep the original binding for Prefix and succeed silently.
Without options, an attempt to redefine an alias raises a permission error.
Predefined prefixes are:
Alias IRI prefix dc http://purl.org/dc/elements/1.1/ dcterms http://purl.org/dc/terms/ eor http://dublincore.org/2000/03/13/eor# foaf http://xmlns.com/foaf/0.1/ owl http://www.w3.org/2002/07/owl# rdf http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs http://www.w3.org/2000/01/rdf-schema# serql http://www.openrdf.org/schema/serql# skos http://www.w3.org/2004/02/skos/core# void http://rdfs.org/ns/void# xsd http://www.w3.org/2001/XMLSchema#
Explicit expansion is achieved using the predicates below. The predicate rdf_equal/2 performs this expansion at compile time, while the other predicates do it at runtime.
Note that this predicate is a meta-predicate on its output argument. This is necessary to get the module context while the first argument may be of the form (:)/2. The above mode description is correct, but should be interpreted as (?,?).
Namespace handling for custom predicates
If we implement a new predicate based on one of the predicates of the semweb libraries that expands namespaces, namespace expansion is not automatically available to it. Consider the following code computing the number of distinct objects for a certain property on a certain object.
cardinality(S, P, C) :- ( setof(O, rdf_has(S, P, O), Os) -> length(Os, C) ; C = 0 ).
Now assume we want to write labels/2 that returns the number of distict labels of a resource:
labels(S, C) :- cardinality(S, rdfs:label, C).
This code will not work because
rdfs:label is not
expanded at compile time. To make this work, we need to add an rdf_meta/1
:- rdf_meta cardinality(r,r,-).
The example below defines the rule concept/1.
:- use_module(library(semweb/rdf_db)). % for rdf_meta :- use_module(library(semweb/rdfs)). % for rdfs_individual_of :- rdf_meta concept(r). %% concept(?C) is nondet. % % True if C is a concept. concept(C) :- rdfs_individual_of(C, skos:'Concept').
In addition to expanding calls, rdf_meta/1 also causes expansion of clause heads for predicates that match a declaration. This is typically used write Prolog statements about resources. The following example produces three clauses with expanded (single-atom) arguments:
:- use_module(library(semweb/rdf_db)). :- rdf_meta label_predicate(r). label_predicate(rdfs:label). label_predicate(skos:prefLabel). label_predicate(skos:altLabel).
This section describes the remaining predicates of the
|Location||is a term File:Line.|
When inside a transaction, Generation is unified to a term TransactionStartGen + InsideTransactionGen. E.g., 4+3 means that the transaction was started at generation 4 of the global database and we have created 3 new generations inside the transaction. Note that this choice of representation allows for comparing generations using Prolog arithmetic. Comparing a generation in one transaction with a generation in another transaction is meaningless.
triplesfor the interpretation of this value.
like. For backward compatibility,
exactis a synonym for
Major*10000 + Minor*100 + Patch.
Storing RDF triples in main memory provides much better performance than using external databases. Unfortunately, although memory is fairly cheap these days, main memory is severely limited when compared to disks. Memory usage breaks down to the following categories. Rough estimates of the memory usage is given for 64-bit systems. 32-bit system use slightly more than half these amounts.
Bucket arrays are resized if necessary. Old triples remain at their original location. This implies that a query may need to scan multiple buckets. The garbage collector may relocate old indexed triples. It does so by copying the old triple. The old triple is later reclaimed by GC. Reindexed triples will be reused, but many reindexed triples may result in a significant memory fragmentation.
The hash parameters can be controlled with rdf_set/1. Applications that are tight on memory and for which the query characteristics are more or less known can optimize performance and memory by fixing the hash-tables. By fixing the hash-tables we can tailor them to the frequent query patterns, we avoid the need for to check multiple hash buckets (see above) and we avoid memory fragmentation due to optimizing triples for resized hashes.
set_hash_parameters :- rdf_set(hash(s, size, 1048576)), rdf_set(hash(p, size, 1024)), rdf_set(hash(sp, size, 2097152)), rdf_set(hash(o, size, 1048576)), rdf_set(hash(po, size, 2097152)), rdf_set(hash(spo, size, 2097152)), rdf_set(hash(g, size, 1024)), rdf_set(hash(sg, size, 1048576)), rdf_set(hash(pg, size, 2048)).
pg. Parameter is one of:
The garbage collector
The RDF store has a garbage collector that runs in a separate thread named =__rdf_GC=. The garbage collector removes the following objects:
rdfs:subPropertyOfrelations that are related to old queries.
In addition, the garbage collector reindexes triples associated to
the hash-tables before the table was resized. The most recent resize
operation leads to the largest number of triples that require
reindexing, while the oldest resize operation causes the largest
slowdown. The parameter
optimize_threshold controlled by rdf_set/1
can be used to determine the number of most recent resize operations for
which triples will not be reindexed. The default is 2.
Normally, the garbage collector does it job in the background at a low priority. The predicate rdf_gc/0 can be used to reclaim all garbage and optimize all indexes.Warming up the database
The RDF store performs many operations lazily or in background threads. For maximum performance, perform the following steps:
warm_indexes :- ignore(rdf(s, _, _)), ignore(rdf(_, p, _)), ignore(rdf(_, _, o)), ignore(rdf(s, p, _)), ignore(rdf(_, p, o)), ignore(rdf(s, p, o)), ignore(rdf(_, _, _, g)), ignore(rdf(s, _, _, g)), ignore(rdf(_, p, _, g)).
__rdf_GCperforms garbage collection as long as it is considered `useful'.
Using rdf_gc/0 should only be needed to ensure a fully clean database for analysis purposes such as leak detection.
The duplicates marks are used to reduce the administrative load of avoiding duplicate answers. Normally, the duplicates are marked using a background thread that is started on the first query that produces a substantial amount of duplicates.