Wide character encodings on streams
Although characters are uniquely coded using the Unicode standard internally, streams and files are byte (8-bit) oriented and there are a variety of ways to represent the larger Unicode codes in an 8-bit octet stream. The most popular one, especially in the context of the web, is UTF-8. Bytes 0 ... 127 represent simply the corresponding US-ASCII character, while bytes 128 ... 255 are used for multi-byte encoding of characters placed higher in the Unicode space. Especially on MS-Windows the 16-bit UTF-16 standard, represented by pairs of bytes, is also popular.
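The byte layout described above can be inspected directly. The following is a minimal sketch, assuming the utf8_codes//1 grammar from SWI-Prolog's library(utf8); the helper name utf8_bytes/2 is introduced here for illustration only.

```prolog
:- use_module(library(utf8)).

% utf8_bytes(+Codes, -Bytes): Bytes is the UTF-8 byte sequence that
% encodes the list of Unicode code points Codes.
% (utf8_bytes/2 is an illustrative helper, not part of SWI-Prolog.)
utf8_bytes(Codes, Bytes) :-
    phrase(utf8_codes(Codes), Bytes).

% ?- utf8_bytes([0'a], Bytes).       % US-ASCII: a single byte below 128
% Bytes = [97].
% ?- utf8_bytes([0xE9], Bytes).      % e-acute: two bytes, both above 127
% Bytes = [195, 169].
% ?- utf8_bytes([0x20AC], Bytes).    % euro sign: three bytes
% Bytes = [226, 130, 172].
```

Code points up to 127 map to themselves, while every byte of a multi-byte sequence lies in the range 128 ... 255.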
The default encoding for files is derived from the Prolog flag
encoding, which is initialised from
setlocale(LC_CTYPE, NULL) to one of
text, utf8 or iso_latin_1. One of the latter two is used if the encoding
name is recognized, while
text is used as default. Using
text, the translation is left to the wide-character
functions of the C library. (The
Prolog native UTF-8 mode is considerably faster than the generic
mbrtowc() one.) The encoding can be specified explicitly in load_files/2
for loading Prolog source with an alternative encoding, open/4
when opening files or using set_stream/2
on any open stream. For Prolog source files we also provide the encoding/1
directive that can be used to switch between encodings that are
compatible with US-ASCII (ascii, iso_latin_1, utf8
and many locales). See also section
3.1.3 for writing Prolog files with non-US-ASCII characters and section
126.96.36.199 for syntax issues. For additional information and Unicode
resources, please visit the Unicode Consortium website.
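As an illustration of the places where an encoding can be chosen, here is a minimal sketch that reads the flag, passes encoding(utf8) to open/4, and sets the encoding on an already open stream with set_stream/2. The predicate names default_encoding/1, write_utf8/2 and read_utf8/2 are hypothetical helpers, not part of SWI-Prolog.

```prolog
:- use_module(library(readutil)).

% default_encoding(-Enc): the default taken from the Prolog flag.
default_encoding(Enc) :-
    current_prolog_flag(encoding, Enc).

% write_utf8(+File, +Text): choose the encoding when opening the file.
write_utf8(File, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8)]),
        write(Out, Text),
        close(Out)).

% read_utf8(+File, -Codes): open with the default encoding, then
% switch the open stream to UTF-8 before reading anything from it.
read_utf8(File, Codes) :-
    setup_call_cleanup(
        open(File, read, In),
        ( set_stream(In, encoding(utf8)),
          read_stream_to_codes(In, Codes)
        ),
        close(In)).
```

Text written with write_utf8/2 should come back from read_utf8/2 as the same list of code points, independent of the locale the process runs in.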
SWI-Prolog currently defines and supports the following encodings:
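octet
Default encoding for binary streams. This causes the stream to be read and written fully untranslated.
ascii
7-bit encoding in 8-bit bytes. Equivalent to iso_latin_1, but generates errors and warnings on encountering values above 127.
iso_latin_1
8-bit encoding supporting many Western languages. This causes the stream to be read and written fully untranslated.
text
C library default locale encoding for text files. Files are read and written using the C library functions mbrtowc() and wcrtomb(). This may be the same as one of the other encodings; notably, it may be the same as iso_latin_1 for Western languages and utf8 in a UTF-8 context.
utf8
Multi-byte encoding of full Unicode, compatible with ascii. See above. The above is the SWI-Prolog native name. This encoding may be specified using the official IANA name UTF-8.
utf16be, utf16le
UTF-16 encoding. utf16be is Big Endian, putting the most significant byte first and utf16le is Little Endian, putting the most significant byte second. UTF-16 can represent full Unicode using surrogate pairs. The above are the SWI-Prolog native names. These encodings may be specified using the official IANA names UTF-16BE and UTF-16LE. For backward compatibility we also support unicode_be and unicode_le.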
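To compare these encodings, the sketch below writes the same text under a given encoding and reads the file back as raw bytes. It is an illustration only: encoded_bytes/3 is a hypothetical helper, the temporary file comes from tmp_file/2, and bom(false) is passed explicitly so that only the bytes of the text itself appear.

```prolog
:- use_module(library(readutil)).

% encoded_bytes(+Text, +Encoding, -Bytes): Bytes is the list of bytes
% produced by writing Text to a stream that uses Encoding.
% (encoded_bytes/3 is an illustrative helper, not part of SWI-Prolog.)
encoded_bytes(Text, Encoding, Bytes) :-
    tmp_file(enc_demo, File),
    setup_call_cleanup(
        open(File, write, Out, [encoding(Encoding), bom(false)]),
        write(Out, Text),
        close(Out)),
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),   % read untranslated bytes
        read_stream_to_codes(In, Bytes),
        close(In)),
    delete_file(File).

% ?- encoded_bytes('\u00E9', iso_latin_1, Bytes).
% Bytes = [233].
% ?- encoded_bytes('\u00E9', utf8, Bytes).
% Bytes = [195, 169].
% ?- encoded_bytes('\u00E9', utf16be, Bytes).
% Bytes = [0, 233].
% ?- encoded_bytes('\u00E9', utf16le, Bytes).
% Bytes = [233, 0].
```

The text is given with a \u escape so that the example does not depend on the encoding of the source file itself.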
Note that not all encodings can represent all characters. This implies that writing text to a stream may cause errors because the stream cannot represent these characters. The behaviour of a stream on these errors can be controlled using set_stream/2. Initially the terminal stream writes the characters using Prolog escape sequences while other streams generate an I/O exception.
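A minimal sketch of controlling this behaviour, assuming the representation_errors(Mode) property of set_stream/2 with Mode one of error, prolog or xml. The helper write_latin1/3 is hypothetical and only serves the example.

```prolog
% write_latin1(+File, +Text, +Mode): write Text to File as ISO Latin 1.
% Mode selects what happens to characters the encoding cannot represent:
%   error  - raise an I/O exception
%   prolog - write a Prolog escape sequence instead
%   xml    - write an XML character reference instead
% (write_latin1/3 is an illustrative helper, not part of SWI-Prolog.)
write_latin1(File, Text, Mode) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(iso_latin_1)]),
        ( set_stream(Out, representation_errors(Mode)),
          write(Out, Text)
        ),
        close(Out)).

% ?- write_latin1('euro.txt', '\u20AC', xml).
```

The euro sign (U+20AC) has no ISO Latin 1 representation, so with Mode set to xml the file is expected to contain a character reference rather than the character itself.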
From section 2.19.1, you may have got the impression that text files are
complicated. This section deals with a related topic, making life often
easier for the user, but providing another worry to the programmer.
BOM, or Byte Order Marker, is a technique for identifying
Unicode text files as well as the encoding they use. Such files start
with the Unicode character 0xFEFF, a non-breaking, zero-width space
character. This is a pretty unique sequence that is not likely to be the
start of a non-Unicode file and uniquely distinguishes the various
Unicode file formats. As it is a zero-width blank, it does not even
produce any output. This solves all problems, or ... Some formats start
off as US-ASCII and may contain some encoding mark to switch to UTF-8,
such as the
encoding="UTF-8" in an XML header. Such formats
often explicitly forbid the use of a UTF-8 BOM. In other cases there is
additional information revealing the encoding, making the use of a BOM
redundant or even illegal.
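To make the idea concrete, the character 0xFEFF is encoded as a different byte sequence in each Unicode format, so the first bytes of a file identify the format. The sketch below classifies a list of leading bytes; bom_encoding/2 is a hypothetical helper written for this illustration, and it deliberately ignores UTF-32.

```prolog
% bom_encoding(+Bytes, -Encoding): Encoding is the Unicode format
% indicated by a BOM at the start of Bytes, or none if there is no BOM.
% (bom_encoding/2 is an illustrative helper, not part of SWI-Prolog.)
bom_encoding([0xEF, 0xBB, 0xBF|_], utf8)    :- !.  % 0xFEFF in UTF-8
bom_encoding([0xFE, 0xFF|_],       utf16be) :- !.  % most significant byte first
bom_encoding([0xFF, 0xFE|_],       utf16le) :- !.  % least significant byte first
bom_encoding(_,                    none).

% ?- bom_encoding([0xEF, 0xBB, 0xBF, 0'h, 0'i], Enc).
% Enc = utf8.
% ?- bom_encoding([0'<, 0'?, 0'x, 0'm, 0'l], Enc).
% Enc = none.
```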
The BOM is handled by the SWI-Prolog open/4
predicate. By default, text files are probed for the BOM when opened for
reading. If a BOM is found, the encoding is set accordingly and the
property bom(true) is available through stream_property/2.
When opening a file for writing, writing a BOM can be requested using
the option bom(true) with open/4.
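A short sketch of both directions, assuming the bom(true) option of open/4 for writing and the default probing on reading. The predicates write_with_bom/2 and probe_bom/3 are hypothetical helpers introduced for this example.

```prolog
% write_with_bom(+File, +Text): write Text as UTF-8, preceded by a BOM.
write_with_bom(File, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8), bom(true)]),
        write(Out, Text),
        close(Out)).

% probe_bom(+File, -Encoding, -HasBom): open File for reading with the
% default options and report the encoding the stream ended up with and
% whether a BOM was detected.
probe_bom(File, Encoding, HasBom) :-
    setup_call_cleanup(
        open(File, read, In, []),
        ( stream_property(In, encoding(Encoding)),
          (   stream_property(In, bom(true))
          ->  HasBom = true
          ;   HasBom = false
          )
        ),
        close(In)).

% ?- write_with_bom('bom.txt', hello), probe_bom('bom.txt', Enc, Bom).
```

No encoding is passed when reading: for a file written by write_with_bom/2, the probe is expected to select utf8 from the BOM alone.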