Pack tokenize -- prolog/tokenize.pl

This module offers a simple tokenizer with some basic options.

It may be improved towards a cleaner and more extensible tool if there is enough interest (from myself or others).

author
- Shon Feder
license
- http://unlicense.org/

Rationale:

tokenize_atom/2, in library(porter_stem), is inflexible, in that it doesn't allow for the preservation of white space or control characters, and it only tokenizes into a list of atoms. This module allows for options to include or exclude things like spaces and punctuation, and for packing tokens.

It also provides a simple predicate for reading lists of tokens back into text.

Ideally, provided I have the drive and/or there is any interest in this package, this would become an extensible, easily configurable tokenizing utility.

 untokenize(+Tokens:list(term), -Untokens:list(codes)) is semidet
True when Untokens is unified with a code list representation of each token in Tokens.
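As an illustrative sketch (the token list shown assumes the default options described under tokenize/3), a round trip through the tokenizer and back might look like:

```prolog
?- tokenize(`Hello, world!`, Tokens),
   untokenize(Tokens, Codes),
   format("~s~n", [Codes]).
```

Here `untokenize/2` reassembles the code list from the tokens produced by `tokenize/2`, so the printed text should correspond to the original input (modulo any case normalization applied by the tokenizing options).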
 tokenize_file(+File:atom, -Tokens:list(term)) is semidet
See also
- tokenize_file/3 is called with an empty list of options: thus, with defaults.
 tokenize_file(+File:atom, -Tokens:list(term), +Options:list(term)) is semidet
True when Tokens is unified with a list of tokens representing the text of File.
See also
- tokenize/3 which has the same available options and behavior.
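A minimal usage sketch (the file name `corpus.txt` is a hypothetical example; the option is one of those documented under tokenize/3):

```prolog
?- tokenize_file('corpus.txt', Tokens, [spaces(false)]).
```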
 tokenize(+Text:list(code), -Tokens:list(term)) is semidet
See also
- tokenize/3 is called with an empty list of options: thus, with defaults.
 tokenize(+Text:list(code), -Tokens:list(term), +Options:list(term)) is semidet
True when Tokens is unified with a list of tokens representing the text from Text, according to the options specified in Options.

NOTE: this predicate currently fails if an option is given an invalid argument and, worse, succeeds silently if an invalid option is supplied.

A token is one of:

  • a word (contiguous alpha-numeric chars): word(W)
  • a punctuation mark (determined by char_type(C, punct)): punct(P)
  • a control character (determined by char_type(C, cntrl)): cntrl(C)
  • a space (the single space character ' '): spc(S).
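The token types above can be observed directly; a sketch of such a query (illustrative, assuming the default options):

```prolog
?- tokenize(`Tokenize me, please!`, Tokens).
```

With the defaults, the result is a list mixing word/1, spc/1, and punct/1 terms in source order, e.g. word tokens for "Tokenize", "me", and "please", spc tokens for the spaces, and punct tokens for the comma and exclamation mark.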

Valid options are:

  • cased(+bool) : Determines whether tokens preserve the case of the source text.
  • spaces(+bool) : Determines whether spaces are represented as tokens or discarded.
  • cntrl(+bool) : Determines whether control characters are represented as tokens or discarded.
  • punct(+bool) : Determines whether punctuation characters are represented as tokens or discarded.
  • to(+one_of([strings,atoms,chars,codes])) : Determines the representation format used for the tokens.
  • pack(+bool) : Determines whether tokens are packed or repeated.
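Options can be combined freely; a sketch of one such combination (illustrative query using only the options listed above):

```prolog
%% Preserve the original casing, discard space tokens,
%% and represent each token's content as a string.
?- tokenize(`Some example text`, Tokens,
            [cased(true), spaces(false), to(strings)]).
```

This should yield only word tokens (the spaces being discarded), each carrying a string rather than the default representation.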