library(unicode_security): Unicode security helpers (UTS #39, UAX #24)
This library implements helpers from UTS #39 (Unicode Security Mechanisms) and the script properties of UAX #24. It is intended for linters, identifier validators and any code that needs to reason about confusable look-alike text or mixed-script identifiers. It does not alter the Prolog reader; UTS #39 is deliberately a library-level facility.
The library ships its own UCD-derived tables and is independent of
library(unicode) (which wraps libutf8proc for normalisation
and per-code-point properties). See etc/gen_uts39.pl in the
package directory to regenerate the tables on a Unicode-version bump.
Predicates fall into three groups:
latin, cyrillic, han, common, inherited, ...). Fails for code points outside the Unicode range or with no entry in Scripts.txt.
[Script]. Fails for code points outside the Unicode range and for code points with no entry in either ScriptExtensions.txt or Scripts.txt.
allowed, when Code is listed in UTS #39 IdentifierStatus.txt with status Allowed. Fails otherwise: per UTS #39 every code point not listed there is Restricted by default, and rather than return restricted for everything else, this predicate simply fails.
recommended, inclusion, technical, obsolete, limited_use, exclusion, not_nfkc, not_xid, default_ignorable, deprecated, uncommon_use). Fails for code points outside the Unicode range or with no entry in IdentifierType.txt.
confusables.txt prototype string, then apply NFD again. Two strings are confusable iff their skeletons compare equal.
true, skip the per-character substitution when the source and target form a pair listed in UTS #39 intentional.txt (e.g. Latin A versus Greek capital Alpha). Default false.
augscx(c) over all non-Common/non-Inherited characters, with the augmentation rules for Han, Hiragana, Katakana, Hangul and Bopomofo applied. The empty list signals a mixed-script string.
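The resolved-script-set computation described above can be illustrated with a minimal sketch. It does not call this library: toy_augscx/2 is a hypothetical, hand-written stand-in for the augmented script-extension table, the script tags are illustrative, and returning the atom all for a string with no script-specific character is a choice of this sketch, not documented behaviour.

```prolog
:- use_module(library(ordsets)).   % ord_intersection/3
:- use_module(library(apply)).     % foldl/4

% toy_augscx(?Char, ?Set): HYPOTHETICAL stand-in for the augmented
% script-extension set of a character. Set is the atom common (for
% Common/Inherited characters) or a sorted list of script tags.
toy_augscx(a, [latn]).
toy_augscx(p, [latn]).
toy_augscx('1', common).
toy_augscx('\u0430', [cyrl]).                    % CYRILLIC SMALL LETTER A
toy_augscx('\u4E2D', [hanb, hani, jpan, kore]).  % Han, with augmentation
toy_augscx('\u306E', [hira, jpan]).              % Hiragana, with augmentation

% toy_resolved_scripts(+Text, -Resolved): intersect the augmented sets of
% all non-Common characters. Fails if a character is not in the toy table.
toy_resolved_scripts(Text, Resolved) :-
    atom_chars(Text, Chars),
    foldl(narrow, Chars, all, Resolved).

% narrow(+Char, +Acc0, -Acc): Common characters leave the running set
% untouched; the first script-specific character replaces the initial
% 'all'; every later one intersects with the running set.
narrow(Char, Acc0, Acc) :-
    toy_augscx(Char, Set),
    (   Set == common
    ->  Acc = Acc0
    ;   Acc0 == all
    ->  Acc = Set
    ;   ord_intersection(Acc0, Set, Acc)
    ).
```

With this toy table, toy_resolved_scripts('pa1', S) gives S = [latn], toy_resolved_scripts('p\u0430', S) gives S = [] (Latin mixed with Cyrillic), and toy_resolved_scripts('\u4E2D\u306E', S) gives S = [jpan], because the augmentation puts both the Han and the Hiragana character in the Japanese profile.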
ascii_only: every code point in U+0020..U+007E and Allowed.
single_script: augmented resolved-script-set non-empty and every code point Allowed.
highly_restrictive: covered by Latin plus one of Hanb, Jpan or Kore (UTS #39 §5.1 augmented profiles).
moderately_restrictive: covered by Latin plus a single non-Latin Recommended script (Cyrl or Grek).
minimally_restrictive: every code point has Identifier_Type in {recommended, inclusion}.
unrestricted: otherwise.
A linter that walks source clauses and reports atoms with the confusability issues above is registered in library(check) itself (predicate list_confusable_identifiers/0); see the library(check) documentation for details.
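The skeleton comparison described earlier (NFD, per-character substitution by the confusables.txt prototype, NFD again) can be sketched as follows. This is an illustration under stated assumptions, not this library's implementation: toy_prototype/2 is a hypothetical three-entry stand-in for confusables.txt, the toy_* names are invented for the example, and normalisation is delegated to unicode_nfd/2 from library(unicode), which must be installed.

```prolog
:- use_module(library(unicode)).   % unicode_nfd/2 (utf8proc-based)
:- use_module(library(apply)).     % maplist/3

% toy_prototype(?Char, ?Prototype): HYPOTHETICAL excerpt standing in for
% the confusables.txt mapping; the real table is far larger.
toy_prototype('\u0430', a).   % CYRILLIC SMALL LETTER A  -> Latin a
toy_prototype('\u0440', p).   % CYRILLIC SMALL LETTER ER -> Latin p
toy_prototype('1', l).        % DIGIT ONE                -> Latin small l

% toy_skeleton(+Text, -Skeleton): NFD, substitute prototypes, NFD again.
toy_skeleton(Text, Skeleton) :-
    unicode_nfd(Text, NFD),
    atom_chars(NFD, Chars),
    maplist(prototype_or_self, Chars, Prototypes),
    atomic_list_concat(Prototypes, Mapped),
    unicode_nfd(Mapped, Skeleton).

% Characters without a table entry map to themselves.
prototype_or_self(Char, Proto) :-
    (   toy_prototype(Char, Proto0)
    ->  Proto = Proto0
    ;   Proto = Char
    ).

% Two texts are confusable iff their skeletons compare equal.
toy_confusable(A, B) :-
    toy_skeleton(A, SA),
    toy_skeleton(B, SB),
    SA == SB.
```

Under these assumptions toy_confusable(paypal, 'p\u0430yp\u0430l') and toy_confusable(paypal, paypa1) both succeed, while toy_confusable(paypal, paypol) fails. The second NFD pass matters because a prototype may itself contain characters that decompose further.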