4 library(isub): isub: a string similarity measure

author: Giorgos Stoilos
See also: "A string metric for ontology alignment" by Giorgos Stoilos, 2005 - http://www.image.ece.ntua.gr/papers/378.pdf

The library(isub) implements a similarity measure between strings, comparable in purpose to the Levenshtein distance. The measure is based on the length of the substrings that the two strings have in common.

[det]isub(+Text1:text, +Text2:text, -Similarity:float, +Options:list)

Similarity is a measure of how similar Text1 and Text2 are. For example:

?- isub('E56.Language', 'languange', D, [normalize(true)]).
D = 0.4226950354609929.                       % [-1,1] range

?- isub('E56.Language', 'languange', D, [normalize(true),zero_to_one(true)]).
D = 0.7113475177304964.                       % [0,1] range

?- isub('E56.Language', 'languange', D, []).  % without normalization
D = 0.19047619047619047.                      % [-1,1] range

?- isub(aa, aa, D, []).  % does not work for short substrings
D = -0.8.

?- isub(aa, aa, D, [substring_threshold(0)]). % works with short substrings
D = 1.0.                                      % but may give unwanted values
                                              % between e.g. 'store' and 'spore'.

?- isub(joe, hoe, D, [substring_threshold(0)]).
D = 0.5315315315315314.

?- isub(joe, hoe, D, []).
D = -1.0.
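
As a hedged illustration of how the queries above might be used in a program, the sketch below picks the candidate that is most similar to a given text. best_match/4 is a hypothetical helper and not part of library(isub); only isub/4 and its substring_threshold option come from the library.

```prolog
:- use_module(library(isub)).
:- use_module(library(lists)).

%!  best_match(+Text, +Candidates, -Best, -Score) is semidet.
%
%   Hypothetical helper (an assumption, not a library predicate).
%   Best is the element of Candidates with the highest isub/4
%   similarity to Text and Score is that similarity.
%   Fails if Candidates is empty.
best_match(Text, Candidates, Best, Score) :-
    findall(S-C,
            (   member(C, Candidates),
                % threshold 0 so that short strings are compared too
                isub(Text, C, S, [substring_threshold(0)])
            ),
            Scored),
    % Pairs are compared in the standard order of terms,
    % i.e., by score first.
    max_member(Score-Best, Scored).
```

A call such as best_match(joe, [hoe, joe, store], Best, Score) would be expected to select joe, given the values shown in the examples above.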

This is a new version of isub/4 which replaces the old version while providing backwards compatibility. This new version allows several options to tweak the algorithm.

Text1 and Text2 are either an atom, string or a list of characters or character codes.

Similarity is a float in the range [-1.0,1.0], where 1.0 means most similar. The range can be changed to [0.0,1.0] with the zero_to_one option described below.
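
The two normalized example values shown earlier (0.4226950354609929 and 0.7113475177304964) are consistent with a linear rescaling of the form (S+1)/2. This is an inference from those examples, not a statement about how the library computes the value. Under that assumption, a result in the [-1,1] range could be rescaled by the caller as sketched below; unit_range/2 is a hypothetical name.

```prolog
%!  unit_range(+Similarity, -Unit) is det.
%
%   Hypothetical helper (an assumption, not a library predicate).
%   Maps a value in [-1,1] linearly onto [0,1]:
%   -1.0 becomes 0.0 and 1.0 becomes 1.0.
unit_range(Similarity, Unit) :-
    Unit is (Similarity + 1) / 2.
```

In practice, passing zero_to_one(true) to isub/4 is the documented way to obtain this range.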

Options is a list with the elements described below. Please note that the options are processed at compile time using goal_expansion, which makes calls considerably faster. Supported options are:

normalize(+Boolean): Applies string normalization as implemented by the original authors: Text1 and Text2 are mapped to lowercase and the characters ".", "_" and " " (space) are removed. Lowercase mapping is done with the C-library function towlower(). In general, the required normalization is domain dependent and is better left to the caller. See, e.g., unaccent_atom/2. The default is to skip normalization (false).
zero_to_one(+Boolean): The old isub implementation deviated from the original algorithm by returning a value in the [0,1] range. This new isub/4 implementation defaults to the original range of [-1,1], but this option can be set to true to set the output range to [0,1].
substring_threshold(+Nonneg): The original algorithm was meant to compare terms in semantic web ontologies, and it had a hard-coded parameter so that only common substrings longer than 2 characters were considered. As a result, comparing, for example, 'aa' and 'aa' returns -0.8, which is not expected. This option allows the user to set any threshold, such as 0, so that the similarity between short strings is properly recognized. The default value is 2, which is what the original algorithm used.
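
The options above can be combined into a simple yes/no similarity test. The sketch below is a minimal example under stated assumptions: similar_text/3 and its Threshold argument are hypothetical and not part of library(isub). The options are written as a literal list in the clause body, so that they can be processed at compile time as described above.

```prolog
:- use_module(library(isub)).

%!  similar_text(+Text1, +Text2, +Threshold) is semidet.
%
%   Hypothetical helper (an assumption, not a library predicate).
%   True if the normalized similarity of Text1 and Text2, expressed
%   in the [0,1] range, is at least Threshold.
similar_text(Text1, Text2, Threshold) :-
    isub(Text1, Text2, Similarity,
         [normalize(true), zero_to_one(true)]),
    Similarity >= Threshold.
```

With the value from the earlier example (about 0.71 for 'E56.Language' and 'languange'), similar_text('E56.Language', languange, 0.7) would succeed, while a threshold of 0.8 would make it fail. A suitable threshold is domain dependent.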

Index

atom_to_stem_list/2
double_metaphone/2
double_metaphone/3
isub/4
porter_stem/2
read/1
snowball/3
snowball_current_algorithm/1
tokenize_atom/2
tokenize_atom/3
unaccent_atom/2