rdf_litindex.pl -- Search literals

This module finds literals in the RDF database based on words, stemming and sounds-like (metaphone) matching. The normal user-level predicate is rdf_find_literals/2.

rdf_set_literal_index_option(+Options:list)
Set options for the literal package. The currently defined options are:
verbose(Bool)
If true, print progress messages while building the index tables.
index_threads(+Count)
Number of threads to use for initial indexing of literals
index(+How)
How to deal with indexing new literals. How is one of self (execute in the same thread), thread(N) (execute in N concurrent threads) or default (depends on number of cores).
stopgap_threshold(+Count)
Add a token to the dynamic stopgap set if it appears in more than Count literals. The default is 50,000.
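
As a minimal sketch, the options could be set once before the index is first built. The option names are the ones listed above; the values (4 threads, a threshold of 100000) are arbitrary illustrations, and configure_literal_index/0 is a hypothetical helper name.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

%!  configure_literal_index is det.
%
%   Hypothetical helper. The values are illustrative, not recommended
%   settings.
configure_literal_index :-
    rdf_set_literal_index_option(
        [ verbose(true),              % report progress while indexing
          index_threads(4),           % threads for the initial index
          index(default),             % choose based on the number of cores
          stopgap_threshold(100000)   % raise the dynamic stopgap limit
        ]).
```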
rdf_find_literal(+Spec, -Literal) is nondet
rdf_find_literals(+Spec, -Literals) is det
Find literals in the RDF database matching Spec. Spec is defined as:
Spec ::= and(Spec,Spec)
Spec ::= or(Spec,Spec)
Spec ::= not(Spec)
Spec ::= sounds(Like)
Spec ::= stem(Like)             % same as stem(Like, en)
Spec ::= stem(Like, Lang)
Spec ::= prefix(Prefix)
Spec ::= between(Low, High)     % Numerical between
Spec ::= ge(Low)                % Numerical greater-equal
Spec ::= le(High)               % Numerical less-equal
Spec ::= Token

sounds(Like) and stem(Like) both map to a disjunction. First we compile the spec to normal form: a disjunction of conjunctions on elementary tokens. Then we execute all the conjunctions and generate the union using ordered-set algorithms.

Stopgaps are ignored. If the final result is only a stopgap, the predicate fails.
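
The following sketch shows how a few of these specifications might be used. The example.org IRIs, the sample literals and the helper predicate names are all hypothetical; no results are shown because they depend on the database contents.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Hypothetical sample data; the example.org IRIs are placeholders.
load_sample :-
    rdf_assert('http://example.org/r1', 'http://example.org/label',
               literal('amsterdam harbour cruise')),
    rdf_assert('http://example.org/r2', 'http://example.org/label',
               literal('rotterdam harbour tour')),
    rdf_assert('http://example.org/r3', 'http://example.org/label',
               literal('amsterdam museum guide')).

% All literals that contain both tokens.
both_tokens(Literals) :-
    rdf_find_literals(and(amsterdam, harbour), Literals).

% Literals with a token starting with "harb" that lack "rotterdam".
prefix_without(Literals) :-
    rdf_find_literals(and(prefix(harb), not(rotterdam)), Literals).

% Enumerate matches one at a time on backtracking.
either_token(Literal) :-
    rdf_find_literal(or(museum, tour), Literal).
```

After calling load_sample/0, a query such as `both_tokens(Ls)` binds Ls to a list, while `either_token(L)` yields one literal per solution.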

To be done
- Exploit ordering of numbers and allow for > N, < N, etc.
rdf_token_expansions(+Spec, -Extensions)
Determine which extensions of a token contribute to finding literals.
compile_spec(+Spec, -Compiled) [private]
Compile a specification as above into disjunctive normal form
nnf(+Formula, -NNF) [private]
Rewrite to Negation Normal Form, meaning negations appear only directly around elementary tokens.
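
To illustrate the rewrite, here is a small sketch that pushes not/1 inwards using De Morgan's laws and removes double negations. It is not the module's private nnf/2; spec_nnf/2 is a hypothetical name, and only and/2, or/2 and not/1 are handled.

```prolog
%!  spec_nnf(+Spec, -NNF) is det.
%
%   Illustrative sketch only. Call with NNF unbound.
spec_nnf(not(not(A)), NNF) :- !,
    spec_nnf(A, NNF).
spec_nnf(not(and(A,B)), or(NA,NB)) :- !,     % De Morgan
    spec_nnf(not(A), NA),
    spec_nnf(not(B), NB).
spec_nnf(not(or(A,B)), and(NA,NB)) :- !,     % De Morgan
    spec_nnf(not(A), NA),
    spec_nnf(not(B), NB).
spec_nnf(and(A,B), and(NA,NB)) :- !,
    spec_nnf(A, NA),
    spec_nnf(B, NB).
spec_nnf(or(A,B), or(NA,NB)) :- !,
    spec_nnf(A, NA),
    spec_nnf(B, NB).
spec_nnf(Elementary, Elementary).            % Token or not(Token)
```

For example, `spec_nnf(not(and(a, or(b, not(c)))), X)` gives `X = or(not(a), and(not(b), c))`.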
dnf(+NNF, -DNF) [private]
Convert a formula in NNF to Disjunctive Normal Form (DNF)
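
The conversion, and the ordered-set evaluation described for rdf_find_literals/2, can be sketched as follows. This is not the module's private code: spec_dnf/2 and eval_dnf/3 are hypothetical names, the DNF is represented as a list of lists, and Index is an assumed list of Token-OrdSet pairs standing in for the real token map. Treating a purely negative conjunction as empty is a choice made for this sketch.

```prolog
:- use_module(library(ordsets)).
:- use_module(library(lists)).
:- use_module(library(apply)).

%!  spec_dnf(+NNF, -DNF) is det.
%
%   DNF is a list of conjunctions; each conjunction is a list of
%   Token or not(Token).
spec_dnf(or(A,B), DNF) :- !,
    spec_dnf(A, DA),
    spec_dnf(B, DB),
    append(DA, DB, DNF).
spec_dnf(and(A,B), DNF) :- !,
    spec_dnf(A, DA),
    spec_dnf(B, DB),
    % distribute: pair every conjunction of A with every one of B
    findall(C, ( member(CA, DA), member(CB, DB), append(CA, CB, C) ), DNF).
spec_dnf(Elementary, [[Elementary]]).

%!  eval_dnf(+DNF, +Index, -Literals) is det.
%
%   Literals is the ordered union of the results of all conjunctions.
eval_dnf(DNF, Index, Literals) :-
    foldl(eval_conj(Index), DNF, [], Literals).

eval_conj(Index, Conj, Acc0, Acc) :-
    partition(is_negated, Conj, Neg, Pos),
    Pos = [P|Ps],                        % needs one positive token
    !,
    lookup(Index, P, Set0),
    foldl(intersect(Index), Ps, Set0, Set1),
    foldl(subtract_neg(Index), Neg, Set1, Set),
    ord_union(Acc0, Set, Acc).
eval_conj(_, _, Acc, Acc).               % purely negative: adds nothing

is_negated(not(_)).

lookup(Index, Token, Set) :-
    (   memberchk(Token-Set0, Index)
    ->  Set = Set0
    ;   Set = []
    ).

intersect(Index, Token, Set0, Set) :-
    lookup(Index, Token, S),
    ord_intersection(Set0, S, Set).

subtract_neg(Index, not(Token), Set0, Set) :-
    lookup(Index, Token, S),
    ord_subtract(Set0, S, Set).
```

With `Index = [a-[l1,l2,l3], b-[l2,l3], c-[l3]]`, the formula `and(a, or(b, not(c)))` becomes `[[a,b],[a,not(c)]]` and evaluates to `[l1,l2,l3]`.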
token_index(-Map) [private]
Get the index of tokens. If not present, create one from the current database. Once created, the map is kept up-to-date using a monitor hook.
make_literal_index [private]
Create the initial literal index.
clean_token_index [private]
Clean after a reset.
rdf_delete_literal_index(+Type)
Fully delete a literal index
create_update_literal_thread(+Threads) [private]
Set up literal monitoring using threads. While loading databases through rdf_attach_db/2 from rdf_persistency.pl, most of the time is spent updating the literal token database. While loading the RDF triples, most of the time is spent updating the AVL tree holding the literals. Updating the token index depends mainly on updating the AVL trees holding the tokens. The two tasks can, however, run concurrently.
check_index_workers(+Queue, +Keys) [private]
Increase the number of workers indexing literals sent to Queue if the queue gets too full.
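
The general pattern of worker threads draining a queue of literals can be sketched with SWI-Prolog message queues. This is a generic illustration, not the module's implementation: index_concurrently/3 and worker/2 are hypothetical names, the number of workers is fixed rather than grown on demand, and atom_length/2 stands in for the real tokenise-and-register step.

```prolog
:- use_module(library(lists)).

%!  index_concurrently(+Literals, +NWorkers, -Results) is det.
%
%   Results is a sorted list of Literal-Length pairs computed by
%   NWorkers threads.
index_concurrently(Literals, NWorkers, Results) :-
    message_queue_create(Jobs),
    message_queue_create(Done),
    findall(Id,
            ( between(1, NWorkers, _),
              thread_create(worker(Jobs, Done), Id, [])
            ),
            Ids),
    forall(member(L, Literals), thread_send_message(Jobs, job(L))),
    forall(member(_, Ids), thread_send_message(Jobs, stop)),  % one per worker
    forall(member(Id, Ids), thread_join(Id, _)),
    length(Literals, N),
    findall(R,
            ( between(1, N, _),
              thread_get_message(Done, result(R))
            ),
            Results0),
    msort(Results0, Results),
    message_queue_destroy(Jobs),
    message_queue_destroy(Done).

% Process jobs until a stop message arrives.
worker(Jobs, Done) :-
    thread_get_message(Jobs, Msg),
    (   Msg = job(Literal)
    ->  atom_length(Literal, Len),           % stand-in for real indexing
        thread_send_message(Done, result(Literal-Len)),
        worker(Jobs, Done)
    ;   true                                 % stop
    ).
```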
register_literal(+Literal) [private]
Associate the tokens of a literal with the literal itself.
unregister_literal(+Literal) [private]
Literal is removed from the database. As we abstract from language and type qualifiers, we first have to check that this is the last literal with this text to be destroyed.
rdf_tokenize_literal(+Literal, -Tokens) is semidet
Tokenize a literal. We make this hookable as tokenization is generally domain dependent.
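
A minimal call might look like this. The sample text and the helper name sample_tokens/1 are hypothetical, and the exact token list depends on the tokenizer and on any hook that is installed, so no result is shown.

```prolog
:- use_module(library(semweb/rdf_litindex)).

% Tokens of a plain literal text.
sample_tokens(Tokens) :-
    rdf_tokenize_literal('amsterdam harbour cruise', Tokens).
```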
rdf_stopgap_token(-Token) is nondet
True when Token is a stopgap token. Currently, this implies one of:
  • exclude_from_index(token, Token) is true
  • default_stopgap(Token) is true
  • Token is an atom of length 1
  • Token was added to the dynamic stopgap token set because it appeared in more than stopgap_threshold literals.
default_stopgap(?Token) [private]
Tokens we do not wish to index, as they create huge amounts of data with little or no value. Is there a more general way to describe this? Experience shows that word count alone is not a good criterion, as it often rules out popular domain terms.
text_of(+LiteralArg, -Lang, -Text) is semidet [private]
Get the textual or (integer) numerical information from a literal value. Lang is the language to use for stemming. Currently we use English for untyped plain literals or literals typed xsd:string. Formally, these should not be tokenized, but a lot of data out there does not tag strings with their language.
stem_index(-Map) is det [private]
Get the stemming literal index. This index is created on demand. If some thread is creating the index, other threads wait for its completion.
rdf_literal_index(+Type, -Index) is det
True when Index is a literal map containing the index of Type. Type is one of:
token
Tokens are basically words of literal values. See rdf_tokenize_literal/2. The token map maps tokens to full literal texts.
stem
Index of stemmed tokens. If the language is available, the tokens are stemmed using the matching snowball stemmer. The stem map maps stems to full tokens.
metaphone
Phonetic index of tokens. The metaphone map maps phonetic keys to tokens.
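
As a sketch of how such an index might be inspected, the example below lists the tokens in the token index that start with a given prefix. It assumes that rdf_keys_in_literal_map/3 from library(semweb/rdf_db) accepts a prefix(Prefix) query on the returned map; tokens_with_prefix/2 is a hypothetical helper name.

```prolog
:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdf_litindex)).

% Tokens in the token index that start with Prefix.
% Assumption: the literal map supports prefix(Prefix) key queries.
tokens_with_prefix(Prefix, Tokens) :-
    rdf_literal_index(token, Map),
    rdf_keys_in_literal_map(Map, prefix(Prefix), Tokens).
```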