This module finds literals of the RDF database based on words, stemming and sounds-like (metaphone) matching. The normal user-level predicates are rdf_find_literal/2 and rdf_find_literals/2.
- rdf_set_literal_index_option(+Options:list)
- Set options for the literal package. Currently defined options:
  - verbose(Bool)
  - If true, print progress messages while building the index tables.
  - index_threads(+Count)
  - Number of threads to use for the initial indexing of literals.
  - index(+How)
  - How to deal with indexing new literals. How is one of self (execute in the same thread), thread(N) (execute in N concurrent threads) or default (depends on the number of cores).
  - stopgap_threshold(+Count)
  - Add a token to the dynamic stopgap set if it appears in more than Count literals. The default is 50,000.
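For example, assuming the module is loaded as library(semweb/rdf_litindex) from the SWI-Prolog semweb package, indexing can be configured before the first query builds the tables (the option values below are merely illustrative):

    :- use_module(library(semweb/rdf_litindex)).

    % Print progress, use 4 indexing threads and keep the default
    % stopgap threshold.  Values are examples, not recommendations.
    :- rdf_set_literal_index_option([ verbose(true),
                                      index_threads(4),
                                      stopgap_threshold(50000)
                                    ]).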
- rdf_find_literal(+Spec, -Literal) is nondet
- rdf_find_literals(+Spec, -Literals) is det
- Find literals in the RDF database matching Spec. Spec is defined as:

    Spec ::= and(Spec,Spec)
    Spec ::= or(Spec,Spec)
    Spec ::= not(Spec)
    Spec ::= sounds(Like)
    Spec ::= stem(Like)            % same as stem(Like, en)
    Spec ::= stem(Like, Lang)
    Spec ::= prefix(Prefix)
    Spec ::= between(Low, High)    % Numerical between
    Spec ::= ge(High)              % Numerical greater-equal
    Spec ::= le(Low)               % Numerical less-equal
    Spec ::= Token

  sounds(Like) and stem(Like) both map to a disjunction. First we compile the spec to normal form: a disjunction of conjunctions on elementary tokens. Then we execute all the conjunctions and generate the union using ordered-set algorithms. Stopgaps are ignored. If the final result is only a stopgap, the predicate fails. A usage sketch follows the to-do note below.
- To be done
  - Exploit ordering of numbers and allow for > N, < N, etc.
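A usage sketch for the two predicates (the tokens are illustrative; the results depend on the literals loaded in the RDF database):

    % All literals that contain a token stemming like "rose" as well
    % as a token that sounds like "garden".
    ?- rdf_find_literals(and(stem(rose), sounds(garden)), Literals).

    % Enumerate matching literals one by one on backtracking.
    ?- rdf_find_literal(prefix(ros), Literal).

    % Numerical example: literals whose value lies in [1990,2000].
    ?- rdf_find_literals(between(1990, 2000), Years).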
- rdf_token_expansions(+Spec, -Extensions)
- Determine which extensions of a token contribute to finding
literals.
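For example (a sketch; the actual expansions depend on the tokens present in the index):

    % Which indexed tokens share the metaphone key of "rose"?
    ?- rdf_token_expansions(sounds(rose), Expansions).

    % And which tokens does a stem-based search for "rose" expand to?
    ?- rdf_token_expansions(stem(rose), Expansions).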
- compile_spec(+Spec, -Compiled)[private]
- Compile a specification as above into disjunctive normal form
- nnf(+Formula, -NNF)[private]
- Rewrite to Negation Normal Form (NNF), meaning negations appear only directly in front of elementary tokens.
- dnf(+NNF, -DNF)[private]
- Convert a formula in NNF to Disjunctive Normal Form (DNF)
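The two rewrite steps can be pictured with a small, self-contained sketch over and/2, or/2 and not/1 terms. This is illustrative only, not the library's actual implementation; the names nnf_sketch/2, dnf_sketch/2 and distribute/3 are invented for the sketch:

    %!  nnf_sketch(+Formula, -NNF) is det.
    %   Push negations inward using De Morgan's laws.
    nnf_sketch(not(and(A,B)), or(NA,NB)) :- !,
        nnf_sketch(not(A), NA), nnf_sketch(not(B), NB).
    nnf_sketch(not(or(A,B)), and(NA,NB)) :- !,
        nnf_sketch(not(A), NA), nnf_sketch(not(B), NB).
    nnf_sketch(not(not(A)), NNF) :- !,
        nnf_sketch(A, NNF).
    nnf_sketch(and(A,B), and(NA,NB)) :- !,
        nnf_sketch(A, NA), nnf_sketch(B, NB).
    nnf_sketch(or(A,B), or(NA,NB)) :- !,
        nnf_sketch(A, NA), nnf_sketch(B, NB).
    nnf_sketch(F, F).

    %!  dnf_sketch(+NNF, -DNF) is det.
    %   Distribute conjunctions over disjunctions.
    dnf_sketch(or(A,B), or(DA,DB)) :- !,
        dnf_sketch(A, DA), dnf_sketch(B, DB).
    dnf_sketch(and(A,B), DNF) :- !,
        dnf_sketch(A, DA), dnf_sketch(B, DB),
        distribute(DA, DB, DNF).
    dnf_sketch(F, F).

    distribute(or(A,B), C, or(P,Q)) :- !,
        distribute(A, C, P), distribute(B, C, Q).
    distribute(A, or(B,C), or(P,Q)) :- !,
        distribute(A, B, P), distribute(A, C, Q).
    distribute(A, B, and(A,B)).

For example, not(and(rose, garden)) rewrites to or(not(rose), not(garden)), which is already in DNF.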
- token_index(-Map)[private]
- Get the index of tokens. If not present, create one from the
current database. Once created, the map is kept up-to-date using
a monitor hook.
- make_literal_index[private]
- Create the initial literal index.
- clean_token_index[private]
- Clean after a reset.
- rdf_delete_literal_index(+Type)
- Fully delete a literal index
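For example, to drop the stem index (a sketch; token, stem and metaphone are the index types listed under rdf_literal_index/2 below, and the indexes are recreated on demand):

    ?- rdf_delete_literal_index(stem).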
- create_update_literal_thread(+Threads)[private]
- Set up literal monitoring using threads. While loading databases through rdf_attach_db/2 from rdf_persistency.pl, most of the time is spent updating the literal token database. While loading the RDF triples, most of the time is spent updating the AVL tree holding the literals. Updating the token index in turn hangs on updating the AVL trees holding the tokens. Both tasks can, however, run concurrently.
- check_index_workers(+Queue, +Keys)[private]
- Increase the number of workers indexing literals sent to Queue if the queue becomes too full.
- register_literal(+Literal)[private]
- Associate the tokens of a literal with the literal itself.
- unregister_literal(+Literal)[private]
- Literal is removed from the database. As we abstract from language and type qualifiers, we first have to check that this is the last such literal being destroyed.
- rdf_tokenize_literal(+Literal, -Tokens) is semidet
- Tokenize a literal. We make this hookable, as tokenization is generally domain-dependent.
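A quick interactive check of the default tokenizer (a sketch; the exact token list depends on the tokenization hook that is installed):

    % Plain text; tokens are typically words and integers.
    ?- rdf_tokenize_literal('A rose garden, planted 1872', Tokens).

    % A language-qualified literal value, using the standard
    % lang(Lang, Text) form from library(semweb/rdf_db).
    ?- rdf_tokenize_literal(lang(en, 'A rose garden'), Tokens).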
- rdf_stopgap_token(-Token) is nondet
- True when Token is a stopgap token. Currently, this implies one of:
  - exclude_from_index(token, Token) is true
  - default_stopgap(Token) is true
  - Token is an atom of length 1
  - Token was added to the dynamic stopgap token set because it appeared in more than stopgap_threshold literals.
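Because the predicate is nondet, the current stopgap set can simply be enumerated, for example:

    % Collect all tokens that are currently treated as stopgaps.
    ?- findall(Token, rdf_stopgap_token(Token), Stopgaps).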
- default_stopgap(?Token)[private]
- Tokens we do not wish to index, as they create huge amounts of data with little or no value. Is there a more general way to describe this? Experience shows that word count alone is not a good criterion, as it often rules out popular domain terms.
- text_of(+LiteralArg, -Lang, -Text) is semidet[private]
- Get the textual or (integer) numerical information from a
literal value. Lang is the language to use for stemming.
Currently we use English for untyped plain literals or literals
typed xsd:string. Formally, these should not be tokenized, but a
lot of data out there does not tag strings with their language.
- stem_index(-Map) is det[private]
- Get the stemming literal index. This index is created on demand.
If some thread is creating the index, other threads wait for its
completion.
- rdf_literal_index(+Type, -Index) is det
- True when Index is a literal map containing the index of Type. Type is one of:
  - token
  - Tokens are basically words of literal values. See rdf_tokenize_literal/2. The token map maps tokens to full literal texts.
  - stem
  - Index of stemmed tokens. If the language is available, the tokens are stemmed using the matching snowball stemmer. The stem map maps stemmed tokens to full tokens.
  - metaphone
  - Phonetic index of tokens. The metaphone map maps phonetic keys to tokens.
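For example, the token map can be obtained and queried directly. The sketch below assumes the low-level literal-map predicate rdf_find_literal_map/3 from the semweb package; the tokens are illustrative:

    % Fetch the token index (built on first use) and find the ordered
    % set of literal texts that contain both tokens.
    ?- rdf_literal_index(token, Map),
       rdf_find_literal_map(Map, [rose, garden], Literals).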