Pack tokenize -- design_notes.md

Design Notes

Initially extracted from a conversation with [@Anniepoo](https://github.com/Anniepoo) and [@nicoabie](https://github.com/nicoabie) in ##prolog on freenode.

The library started as a very simple and lightweight set of predicates for a common, but very limited, form of lexing. As we extend it, we aim to maintain a modest scope in order to achieve a sweet spot between ease of use and powerful flexibility.

Scope and Aims

tokenize does not aspire to become an industrial-strength lexer generator. We aim to serve most users' needs between raw input and a structured form ready for parsing by a DCG.

If a user is parsing a language with keywords such as class, module, etc., and wants to distinguish these from variable names, tokenize won't do this out of the box. It should, however, provide an easy means of achieving this result through a subsequent lexing pass.
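A subsequent lexing pass of this kind might look like the sketch below. It is an illustration only, not part of tokenize: the predicates keyword/1, relex/2 and relex_all/2, the keyword set, and the token shapes word(W) and identifier(W) are all assumptions made for the example. The input is a hand-written token list rather than actual library output.

```prolog
:- use_module(library(apply)).   % maplist/3

% Hypothetical keyword set for an imaginary language.
keyword(class).
keyword(module).

% relex(+Token0, -Token)
% Reclassify a single token. Assumes (for this sketch) that the first
% pass emits identifiers-or-keywords as word(Atom).
relex(word(W), keyword(W)) :- keyword(W), !.
relex(word(W), identifier(W)) :- !.
relex(Token, Token).             % every other token passes through unchanged

% relex_all(+Tokens0, -Tokens)
% Second lexing pass over a whole token list.
relex_all(Tokens0, Tokens) :-
    maplist(relex, Tokens0, Tokens).
```

For example:

```prolog
?- relex_all([word(class), space(' '), word(point)], Ts).
Ts = [keyword(class), space(' '), identifier(point)].
```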

Some Model Users

  • somebody making a computer language
    • needs to be able to distinguish keywords, variables and literals
    • needs to be able to identify comments
  • somebody making a parser for an interactive fiction game
    • needs to handle stuff like "William O. N'mutu-O'Connell went to the market"
  • somebody wanting to analyze human texts
    • to analyze New York Times articles, they first need to process the articles into meaningful tokens

Design Rules

  • We don't parse.
  • Every token generated is callable (i.e., an atom or compound).
    • Example of a possible compound token: space(' ').
    • Example of a possible atom token: escape. (Open question: does tokenization need to return tokens represented with the same arity?)
  • Users should be able to determine the kind of token by unification.
  • Users should be able to clearly see and specify the precedence for tokenization.
    • E.g., given "-12.3", numbers, punctuation should yield [pnct('-'), number(12), pnct('.'), number(3)] while punctuation, numbers should yield [number(-12.3)].
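
The rules about callable tokens and unification can be illustrated with a small sketch. The predicates valid_tokens/1, token_kind/2 and numbers_in/2 are hypothetical helpers written for this example, not part of tokenize, and the token list is written by hand using the token shapes mentioned above.

```prolog
:- use_module(library(apply)).   % maplist/2
:- use_module(library(lists)).   % member/2

% valid_tokens(+Tokens)
% True if every token is callable, i.e. an atom or a compound term.
valid_tokens(Tokens) :-
    maplist(callable, Tokens).

% token_kind(?Token, ?Kind)
% The kind of a token is determined purely by unifying with its shape.
token_kind(space(_),  whitespace).
token_kind(number(_), number).
token_kind(pnct(_),   punctuation).
token_kind(escape,    escape).

% numbers_in(+Tokens, -Ns)
% Collect the payloads of all number/1 tokens, selected by unification.
numbers_in(Tokens, Ns) :-
    findall(N, member(number(N), Tokens), Ns).
```

For example:

```prolog
?- numbers_in([pnct('-'), number(12), pnct('.'), number(3)], Ns).
Ns = [12, 3].
```

Because each kind of token has a distinct functor, consumers such as a DCG can select the tokens they care about in a clause head, without inspecting the token's contents.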