SWI-Prolog -- library(unicode_security): Unicode security helpers (UTS #39, UAX #24)

2 library(unicode_security): Unicode security helpers (UTS #39, UAX #24)

This library implements helpers from UTS #39 (Unicode Security Mechanisms) and the script properties of UAX #24. It is intended for linters, identifier validators and any other code that needs to reason about confusable look-alike text or mixed-script identifiers. It does not alter the Prolog reader: UTS #39 support is deliberately provided as a library-level facility.

The library ships its own UCD-derived tables and is independent of library(unicode) (which wraps libutf8proc for normalisation and per-code-point properties). See etc/gen_uts39.pl in the package directory to regenerate the tables on a Unicode-version bump.

Predicates fall into three groups:

Per-code-point lookups: unicode_script/2, unicode_script_extensions/2, unicode_identifier_status/2, unicode_identifier_type/2.
Skeleton and confusable test (UTS #39 §4): unicode_skeleton/2, unicode_confusable/2, unicode_confusable/3.
String-level identifier checks (UTS #39 §5): unicode_resolved_scripts/2, unicode_restriction_level/2.

[semidet]unicode_script(+Code:integer, -Script:atom)

True when Script is the UAX #24 Script property value of Code. Script is the long property value as a lower-case atom (latin, cyrillic, han, common, inherited, ...). Fails for code points outside the Unicode range or with no entry in Scripts.txt.
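A minimal sketch of a per-character lookup built on unicode_script/2. The helper script_of_char/2 and the fallback atom unknown are assumptions made for illustration; they are not part of the library.

```prolog
:- use_module(library(unicode_security)).

% script_of_char(+Char, -Script)
% Hypothetical helper: maps a one-character atom to its script atom.
% Because unicode_script/2 is semidet, the failure case is mapped to
% the illustrative atom `unknown` so the helper always succeeds.
script_of_char(Char, Script) :-
    char_code(Char, Code),
    (   unicode_script(Code, Found)
    ->  Script = Found
    ;   Script = unknown
    ).

% Example queries:
% ?- script_of_char(a, S).          % Latin small letter a
% ?- script_of_char('\u0430', S).   % Cyrillic small letter a
% ?- script_of_char('1', S).        % digit one
```

Given the naming rule described above, the three queries should report latin, cyrillic and common respectively.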

[semidet]unicode_script_extensions(+Code:integer, -Scripts:list(atom))

Scripts is the sorted list of UAX #24 Script_Extensions of Code. For most code points this is a singleton [Script]. Fails for code points outside the Unicode range and for code points with no entry in either ScriptExtensions.txt or Scripts.txt.
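As a sketch, Script_Extensions can answer whether a code point may appear in text of a given script. The helper code_fits_script/2 is hypothetical, and the code assumes that the atoms in Scripts follow the same lower-case long-name convention as unicode_script/2.

```prolog
:- use_module(library(unicode_security)).

% code_fits_script(+Code, +Script)
% Hypothetical helper: true when Code may be used in text written in
% Script. A code point fits when Script is one of its extensions, or
% when its only extension is common or inherited (assumed atom names).
code_fits_script(Code, Script) :-
    unicode_script_extensions(Code, Scripts),
    (   memberchk(Script, Scripts)
    ->  true
    ;   Scripts = [Only],
        memberchk(Only, [common, inherited])
    ).

% Example queries:
% ?- code_fits_script(0'a, latin).
% ?- code_fits_script(0'a, cyrillic).
```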

[semidet]unicode_identifier_status(+Code:integer, -Status:atom)

Succeeds, unifying Status with allowed, when Code is listed in UTS #39 IdentifierStatus.txt with status Allowed. Fails otherwise: UTS #39 treats every code point not listed there as Restricted, so this predicate fails instead of returning restricted for all remaining code points.
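Callers that prefer an explicit value over failure can wrap the predicate. The following is a minimal sketch; identifier_status_or_restricted/2 is a hypothetical name, and the atom restricted is chosen here to mirror the UTS #39 default rather than returned by the library.

```prolog
:- use_module(library(unicode_security)).

% identifier_status_or_restricted(+Code, -Status)
% Hypothetical wrapper: always succeeds, mapping the failure of
% unicode_identifier_status/2 to the atom `restricted`.
identifier_status_or_restricted(Code, Status) :-
    (   unicode_identifier_status(Code, Found)
    ->  Status = Found
    ;   Status = restricted
    ).

% Example query:
% ?- identifier_status_or_restricted(0'a, Status).
```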

[semidet]unicode_identifier_type(+Code:integer, -Types:list(atom))

Types is the sorted list of UTS #39 Identifier_Type atoms for Code (recommended, inclusion, technical, obsolete, limited_use, exclusion, not_nfkc, not_xid, default_ignorable, deprecated, uncommon_use). Fails for code points outside the Unicode range or with no entry in IdentifierType.txt.
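One possible use is to collect the code points of a text that are not Recommended. This is a sketch under the assumption that Text is an atom, string or code list; recommended_code/1 and non_recommended_codes/2 are hypothetical helpers.

```prolog
:- use_module(library(unicode_security)).
:- use_module(library(apply)).

% recommended_code(+Code)
% Hypothetical helper: true when `recommended` is among the
% Identifier_Type atoms of Code.
recommended_code(Code) :-
    unicode_identifier_type(Code, Types),
    memberchk(recommended, Types).

% non_recommended_codes(+Text, -Codes)
% Codes are the code points of Text that are not Recommended.
% Code points for which unicode_identifier_type/2 fails are reported
% as well, because recommended_code/1 fails for them.
non_recommended_codes(Text, Codes) :-
    text_to_string(Text, String),
    string_codes(String, All),
    exclude(recommended_code, All, Codes).

% Example query:
% ?- non_recommended_codes("abc", Codes).
```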

[det]unicode_skeleton(+Text, -Skeleton:atom)

Compute the UTS #39 §4 skeleton of Text: apply NFD, substitute each code point with its confusables.txt prototype string, then apply NFD again. Two strings are confusable iff their skeletons compare equal.
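Since confusable strings share a skeleton, the skeleton can serve as a grouping key. The sketch below groups a list of names by skeleton and keeps only groups with more than one member; confusable_groups/2 and has_several/1 are hypothetical helpers, not library predicates.

```prolog
:- use_module(library(unicode_security)).
:- use_module(library(lists)).
:- use_module(library(pairs)).
:- use_module(library(apply)).

% confusable_groups(+Names, -Groups)
% Groups is a list of lists; each inner list holds two or more
% distinct names from Names that share one skeleton.
confusable_groups(Names, Groups) :-
    sort(Names, Unique),                      % drop exact duplicates
    findall(Skeleton-Name,
            ( member(Name, Unique),
              unicode_skeleton(Name, Skeleton)
            ),
            Pairs),
    keysort(Pairs, Sorted),                   % bring equal skeletons together
    group_pairs_by_key(Sorted, ByKey),
    pairs_values(ByKey, AllGroups),
    include(has_several, AllGroups, Groups).

% True for lists with at least two elements.
has_several([_,_|_]).

% Example query (U+0430 is Cyrillic small letter a):
% ?- confusable_groups([paypal, 'p\u0430yp\u0430l', example], Groups).
```

In the example query, paypal and its look-alike with Cyrillic letters should end up in the same group, while example should be dropped.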

[semidet]unicode_confusable(+T1, +T2)

True when the skeletons of T1 and T2, as computed by unicode_skeleton/2, are equal.

[semidet]unicode_confusable(+T1, +T2, +Options)

As unicode_confusable/2. Options:

ignore_intentional(+Bool): If true, skip the per-character substitution when the source and target form a pair listed in UTS #39 intentional.txt (e.g. Latin A versus Greek capital Alpha). Default false.
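The two arities can be combined to separate look-alikes that rest only on intentional pairs from other confusables. This is a sketch based on the option description above; pair_verdict/3 and its verdict atoms are hypothetical.

```prolog
:- use_module(library(unicode_security)).

% pair_verdict(+T1, +T2, -Verdict)
% Hypothetical helper. Verdict is one of:
%   confusable        equal skeletons even when intentional pairs
%                     are left unsubstituted (identical texts also
%                     land here)
%   intentional_only  equal skeletons only under the default mapping
%   distinct          skeletons differ
pair_verdict(T1, T2, Verdict) :-
    (   unicode_confusable(T1, T2, [ignore_intentional(true)])
    ->  Verdict = confusable
    ;   unicode_confusable(T1, T2)
    ->  Verdict = intentional_only
    ;   Verdict = distinct
    ).

% Example queries (U+0391 is Greek capital Alpha):
% ?- pair_verdict('A', '\u0391', Verdict).
% ?- pair_verdict(paypal, 'p\u0430yp\u0430l', Verdict).
```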

[det]unicode_resolved_scripts(+Text, -Scripts:list(atom))

Scripts is the UTS #39 §5.1 resolved augmented Script_Extensions set of Text: the intersection of augscx(c) over all non-Common/non-Inherited characters, with the augmentation rules for Han, Hiragana, Katakana, Hangul and Bopomofo applied. The empty list signals a mixed-script string.
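A mixed-script test follows directly from the empty-list convention. The sketch below uses the hypothetical helpers mixed_script/1 and script_report/2; how the library represents a text made up only of Common or Inherited characters is not stated here, so the sketch passes that result through unchanged.

```prolog
:- use_module(library(unicode_security)).

% mixed_script(+Text)
% Hypothetical helper: true when the resolved script set is empty.
mixed_script(Text) :-
    unicode_resolved_scripts(Text, []).

% script_report(+Text, -Report)
% Report is `mixed` for a mixed-script text and scripts(List)
% otherwise, where List is whatever the library returned.
script_report(Text, Report) :-
    unicode_resolved_scripts(Text, Scripts),
    (   Scripts == []
    ->  Report = mixed
    ;   Report = scripts(Scripts)
    ).

% Example queries (U+0430 is Cyrillic small letter a):
% ?- mixed_script('p\u0430ypal').
% ?- script_report(hello, Report).
```

The first query combines Latin and Cyrillic letters, so it would be expected to succeed.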

[det]unicode_restriction_level(+Text, -Level:atom)

Classify Text under UTS #39 §5.2 at the most restrictive level for which it qualifies. Level is one of:

ascii_only: every code point is in U+0020..U+007E and Allowed.
single_script: the augmented resolved script set is non-empty and every code point is Allowed.
highly_restrictive: covered by Latin plus one of Hanb, Jpan or Kore (UTS #39 §5.1 augmented profiles).
moderately_restrictive: covered by Latin plus a single other Recommended script, excluding Cyrillic (Cyrl) and Greek (Grek).
minimally_restrictive: every code point has Identifier_Type in {recommended, inclusion}.
unrestricted: otherwise.

A linter that walks source clauses and reports atoms with the confusability issues above is registered in library(check) itself (predicate list_confusable_identifiers/0); see the library(check) documentation for details.
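An identifier validator can turn the restriction levels into a policy by ranking them. The following is a minimal sketch; level_rank/2 and identifier_within/2 are hypothetical helpers, and the numeric ranks only encode the order in which the levels are listed above.

```prolog
:- use_module(library(unicode_security)).

% level_rank(?Level, ?Rank)
% Lower rank means more restrictive.
level_rank(ascii_only,             0).
level_rank(single_script,          1).
level_rank(highly_restrictive,     2).
level_rank(moderately_restrictive, 3).
level_rank(minimally_restrictive,  4).
level_rank(unrestricted,           5).

% identifier_within(+Text, +MaxLevel)
% True when Text classifies at MaxLevel or at a more restrictive level.
identifier_within(Text, MaxLevel) :-
    unicode_restriction_level(Text, Level),
    level_rank(Level, Rank),
    level_rank(MaxLevel, MaxRank),
    Rank =< MaxRank.

% Example queries:
% ?- identifier_within(hello, single_script).
% ?- identifier_within('p\u0430ypal', highly_restrictive).
```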