Pack pinyin -- prolog/pinyin.pl

This module implements a grammar that parses and generates words written in Hanyu Pinyin, the standard romanization system for Mandarin Chinese. It also provides a utility to convert whole texts between diacritics and numbers for writing tones.

num_dia(?Num, ?Dia)

Converts between Pinyin that marks tones with numbers (Num) and Pinyin that marks tones with diacritics (Dia). Num and Dia are code lists.

Maximal substrings consisting of Pinyin letters, the digits 1-4, and the characters ' and - are converted if they can be parsed as lower-case, capitalized, or all-caps Pinyin words in the input format; everything else is left unchanged. Case is preserved.

The direction of conversion depends on Num: if Num is unbound, Dia is taken as the input (diacritics); otherwise Num is taken as the input (numbers).

Example usage:

?- set_prolog_flag(double_quotes, codes).
true.

?- num_dia("Wo3 xian4zai4 dui4 jing1ju4 hen3 gan3 xing4qu4.",
|    Codes), atom_codes(Atom, Codes).
Codes = [87, 466, 32, 120, 105, 224, 110, 122, 224|...],
Atom = 'Wǒ xiànzài duì jīngjù hěn gǎn xìngqù.'.

?- num_dia(Codes, "Nǐ ne?"), atom_codes(Atom, Codes).
Codes = [78, 105, 51, 32, 110, 101, 63],
Atom = 'Ni3 ne?'.
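
Because the direction depends only on whether Num is bound, the same predicate serves both conversions. As a minimal sketch, the two wrappers below work on atoms, so they do not depend on the double_quotes flag. The names num_to_dia_atom/2 and dia_to_num_atom/2 are hypothetical and not part of the pack, and the sketch assumes the pack is installed and loadable as library(pinyin).

```prolog
:- use_module(library(pinyin)).   % assumption: the pinyin pack is installed

% num_to_dia_atom(+NumAtom, -DiaAtom)
% Num is bound when num_dia/2 is called, so the input is read as
% Pinyin with tone numbers.
num_to_dia_atom(NumAtom, DiaAtom) :-
    atom_codes(NumAtom, Num),
    num_dia(Num, Dia),
    atom_codes(DiaAtom, Dia).

% dia_to_num_atom(+DiaAtom, -NumAtom)
% Num is unbound when num_dia/2 is called, so the input is read as
% Pinyin with diacritics.
dia_to_num_atom(DiaAtom, NumAtom) :-
    atom_codes(DiaAtom, Dia),
    num_dia(Num, Dia),
    atom_codes(NumAtom, Num).
```

With these definitions, dia_to_num_atom('Nǐ ne?', A) should give A = 'Ni3 ne?', matching the second query above.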

word(?Morphs, ?ND)//

A DCG that parses or generates a single word written in Hanyu Pinyin.

Morphs is a list of "morphs" (not in the strictest linguistic sense) that make up the word. They take one of three forms:

Initial-Final-Tone, where Initial and Final are atoms and Tone is an integer from 0 to 4. Final is given in its "underlying" form, which may differ from the written form; for example, the written syllable yì corresponds to ''-i-4, with the empty atom as its initial.
r (for the erhuayin suffix)
- (for word-internal hyphens)

ND is either num or dia, depending on how tones are represented (numbers or diacritics).
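
Since syllable morphs are compound terms while the r and - morphs are plain atoms, a Morphs list is easy to take apart by unification. The following is a minimal sketch that collects the tones of a word; the predicate name morph_tones/2 is hypothetical and not part of the pack.

```prolog
:- use_module(library(lists)).

% morph_tones(+Morphs, -Tones)
% Collect the tone of every syllable morph (Initial-Final-Tone), in order.
% The erhuayin morph r and the hyphen morph - are atoms, so they do not
% unify with the pattern _-_-Tone and are skipped.
morph_tones(Morphs, Tones) :-
    findall(Tone, member(_-_-Tone, Morphs), Tones).
```

For instance, morph_tones([n-ü-3, ''-er-2], Tones) gives Tones = [3, 2].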

The grammar implements the following tricky aspects of Hanyu Pinyin:

placement of w/y before i/u/ü as required
abbreviation of triphthongs into two vowel letters as required
change of ü to u after j/q/x/y
placement of tone marks
placement of apostrophes between syllables as required
the possibility of word-internal hyphens
the possibility of erhuayin, placing r at the end of a word
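
Of the rules above, tone-mark placement is easy to state in isolation. The following is a minimal sketch of the usual convention: the mark goes on a or e if either is present, on the o of ou, and otherwise on the last vowel letter. It is not taken from the pack's source, and the predicate name mark_index/2 is hypothetical.

```prolog
:- use_module(library(lists)).

% mark_index(+Vowels, -Index)
% Vowels is the list of vowel letters of one written syllable, in order,
% as atoms (a, e, i, o, u, ü). Index is the 0-based position of the
% letter that carries the tone mark.
mark_index(Vowels, Index) :-
    (   nth0(I, Vowels, a) -> true      % a always takes the mark
    ;   nth0(I, Vowels, e) -> true      % otherwise e does
    ;   Vowels = [o, u]    -> I = 0     % in ou, the o takes the mark
    ;   length(Vowels, N),              % otherwise the last vowel letter
        N > 0,
        I is N - 1
    ),
    Index = I.
```

For instance, mark_index([i, u], I) gives I = 1 (as in liú), while mark_index([o, u], I) gives I = 0 (as in gǒu).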

The following aspects are currently not supported:

Case. This DCG only handles lower-case letters.
The spelling variant where ü is written as v.
The spelling variant where the neutral tone is written as 0 or 5.
Phonological restrictions. Not every combination of initial, final and tone is a syllable of Mandarin, but this grammar will parse and generate any combination. It enforces only two restrictions: the erhuayin suffix may not appear after the e final, and j/q/x may not be followed by a u sound. Allowing either would make the grammar ambiguous, since written er could then be read as the final er or as e plus the suffix, and written u after j/q/x could stand for either u or ü.
Dropping apostrophes when using numbers instead of tone marks.
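
Some of these gaps can be worked around by normalizing the input before it reaches the grammar. As a minimal sketch for the v spelling variant, the helper below replaces every lower-case v in a code list with ü. The name v_to_u_umlaut/2 is hypothetical and not part of the pack, and the sketch assumes the input is standard Hanyu Pinyin, which does not otherwise use the letter v.

```prolog
:- use_module(library(apply)).

% v_to_u_umlaut(+Codes0, -Codes)
% Replace every lower-case v (code 118) with ü (code 252); all other
% codes are copied unchanged.
v_to_u_umlaut(Codes0, Codes) :-
    maplist(v_code, Codes0, Codes).

v_code(C0, C) :-
    (   C0 =:= 0'v
    ->  C = 252          % 252 is the character code of ü
    ;   C = C0
    ).
```

The resulting code list can then be parsed with word//2 in the usual way.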

Example usage:

?- set_prolog_flag(double_quotes, codes).
true.

?- phrase(word([n-ü-3, ''-er-2], ND), Codes), atom_codes(Atom, Codes).
ND = dia,
Codes = [110, 474, 39, 233, 114],
Atom = 'nǚ\'ér' ;
ND = num,
Codes = [110, 252, 51, 39, 101, 114, 50],
Atom = 'nü3\'er2' ;
false.

?- phrase(word(Morphs, ND), "yìjué").
Morphs = [''-i-4, j-üe-2],
ND = dia ;
false.
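
Because Morphs does not depend on how tones are written, parsing with one value of ND and generating with the other converts a single word between the two notations. The following is a minimal sketch, assuming the pack is installed and loadable as library(pinyin); the name word_dia_to_num/2 is hypothetical and not part of the pack.

```prolog
:- use_module(library(pinyin)).   % assumption: the pinyin pack is installed

% word_dia_to_num(+Dia, -Num)
% Parse a single lower-case word written with diacritics, then generate
% the same morphs with tone numbers. Dia and Num are code lists.
word_dia_to_num(Dia, Num) :-
    phrase(word(Morphs, dia), Dia),
    phrase(word(Morphs, num), Num).
```

For whole texts, or where case must be preserved, num_dia/2 is the better fit, since word//2 handles only a single lower-case word.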