Vicare Scheme: iklib chars unicode utf16

6.22.3.3 Unicode’s UTF-16 encoding

UTF-16 is a multioctet character encoding for Unicode which can represent every character in the Unicode set: it can represent every code point in the ranges ‘[0, #xD800)’ and ‘(#xDFFF, #x10FFFF]’.

Code points in the range ‘[0, #x10000)’, excluding the surrogate range ‘[#xD800, #xDFFF]’, are encoded with a single 16-bit word; code points in the range ‘[#x10000, #x10FFFF]’ are encoded as a surrogate pair of two 16-bit words.

Given a 16-bit word in a UTF-16 stream, represented in Scheme as a fixnum in the range ‘[#x0000, #xFFFF]’, we can classify it on the following axis:

0000        D7FF D800    DBFF DC00      DFFF E000       FFFF
 |-------------||-----------||-------------||------------|
  single word    first in     second in      single word
  character      pair         pair           character

or the following logic:

word in [#x0000, #xD7FF] => single word character
word in [#xD800, #xDBFF] => first in surrogate pair
word in [#xDC00, #xDFFF] => second in surrogate pair
word in [#xE000, #xFFFF] => single word character
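The classification above can be sketched in portable R6RS Scheme. The procedure name `classify-utf-16-word` and the returned symbols are hypothetical, chosen only for illustration; they are not bindings of (vicare unsafe unicode).

```scheme
(import (rnrs))

;; Hypothetical helper, not a library binding: classify a 16-bit word
;; given as an exact integer in the range [#x0000, #xFFFF].
(define (classify-utf-16-word word)
  (cond ((<= #x0000 word #xD7FF) 'single-word)
        ((<= #xD800 word #xDBFF) 'first-in-pair)
        ((<= #xDC00 word #xDFFF) 'second-in-pair)
        ((<= #xE000 word #xFFFF) 'single-word)
        (else
         (assertion-violation 'classify-utf-16-word
           "expected an exact integer in [0, #xFFFF]" word))))
```

For example, `(classify-utf-16-word #xD83D)` evaluates to `first-in-pair`, while `(classify-utf-16-word #x0041)` evaluates to `single-word`.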

A UTF-16 stream may start with a Byte Order Mark (BOM). A UTF-16 BOM is either:

A sequence of bytes ‘#xFE’, ‘#xFF’ specifying “big endianness” and UTF-16BE encoding.
A sequence of bytes ‘#xFF’, ‘#xFE’ specifying “little endianness” and UTF-16LE encoding.
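As a minimal sketch of BOM detection, the following R6RS procedure inspects the first two bytes of a bytevector. The name `utf-16-bom-endianness` is hypothetical; the returned symbols `big` and `little` are the same ones produced by the standard `(endianness big)` and `(endianness little)` forms, so the result can be passed to procedures such as `bytevector-u16-ref`.

```scheme
(import (rnrs))

;; Hypothetical helper: return the symbol big or little when the
;; bytevector starts with a UTF-16 BOM, #f otherwise.
(define (utf-16-bom-endianness bv)
  (if (< (bytevector-length bv) 2)
      #f
      (let ((b0 (bytevector-u8-ref bv 0))
            (b1 (bytevector-u8-ref bv 1)))
        (cond ((and (= b0 #xFE) (= b1 #xFF)) 'big)
              ((and (= b0 #xFF) (= b1 #xFE)) 'little)
              (else #f)))))
```

With this sketch, `(utf-16-bom-endianness (u8-list->bytevector '(#xFE #xFF #x00 #x41)))` evaluates to `big`. What to do when no BOM is present is left to the caller.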

The following syntactic bindings are exported by the library (vicare unsafe unicode). These macros assume that the word arguments are fixnums representing 16-bit words, in the range ‘[0, #xFFFF]’, and that the code-point arguments are fixnums representing Unicode code points, in the range ‘[0, #x10FFFF]’ but outside the range ‘[#xD800, #xDFFF]’.
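Since the macros assume their arguments are already in range, a caller handling untrusted data may want to validate values first. The following R6RS predicates are a sketch of such guards; the names `valid-utf-16-word?` and `valid-code-point?` are hypothetical and are not exported by the library.

```scheme
(import (rnrs))

;; Hypothetical guard: true if OBJ is a fixnum in [0, #xFFFF].
(define (valid-utf-16-word? obj)
  (and (fixnum? obj)
       (fx<=? 0 obj #xFFFF)))

;; Hypothetical guard: true if OBJ is a fixnum in [0, #x10FFFF]
;; and outside the surrogate range [#xD800, #xDFFF].
(define (valid-code-point? obj)
  (and (fixnum? obj)
       (fx<=? 0 obj #x10FFFF)
       (not (fx<=? #xD800 obj #xDFFF))))
```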

1-word decoding

Syntax: utf-16-single-word? word0: Evaluate to #t if word0 is valid as single 16-bit word UTF-16 encoding of a Unicode character; otherwise evaluate to #f.

Syntax: utf-16-decode-single-word word0: Decode the integer representation of a Unicode character from a 16-bit single word UTF-16 encoding.

2-word decoding

Syntax: utf-16-first-of-two-words? word0: Evaluate to #t if word0 is valid as first 16-bit word in a surrogate pair UTF-16 encoding of a Unicode character; otherwise evaluate to #f.

Syntax: utf-16-second-of-two-words? word1: Evaluate to #t if word1 is valid as second 16-bit word in a surrogate pair UTF-16 encoding of a Unicode character; otherwise evaluate to #f.

Syntax: utf-16-decode-surrogate-pair word0 word1: Decode the integer representation of a Unicode character from a surrogate pair UTF-16 encoding.
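The arithmetic behind surrogate pair decoding is defined by the UTF-16 encoding itself: the low 10 bits of the first word become the high bits of an offset, the low 10 bits of the second word become its low bits, and #x10000 is added. The sketch below uses generic R6RS arithmetic under the hypothetical name `decode-surrogate-pair`; it illustrates the computation and is not the implementation of the macro documented above.

```scheme
(import (rnrs))

;; Hypothetical helper: FIRST must be in [#xD800, #xDBFF] and SECOND in
;; [#xDC00, #xDFFF]; no validation is performed here.
(define (decode-surrogate-pair first second)
  (+ #x10000
     (bitwise-ior
      (bitwise-arithmetic-shift-left (bitwise-and first #x3FF) 10)
      (bitwise-and second #x3FF))))
```

For example, the pair #xD83D, #xDE00 decodes to #x1F600, and the extreme pairs #xD800, #xDC00 and #xDBFF, #xDFFF decode to #x10000 and #x10FFFF respectively.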

1-word encoding

Syntax: utf-16-single-word-code-point? code-point: Evaluate to #t if code-point is the fixnum representation of a Unicode code point representable as single 16-bit word UTF-16 encoding; otherwise evaluate to #f.

Syntax: utf-16-encode-single-word code-point: Encode code-point as single 16-bit word UTF-16 encoding.

2-word encoding

Syntax: utf-16-two-words-code-point? code-point: Evaluate to #t if code-point is the fixnum representation of a Unicode code point representable as surrogate pair of two 16-bit words UTF-16 encoding; otherwise evaluate to #f.

Syntax: utf-16-encode-first-of-two-words code-point: Encode code-point as first 16-bit word in a surrogate pair UTF-16 encoding.

Syntax: utf-16-encode-second-of-two-words code-point: Encode code-point as second 16-bit word in a surrogate pair UTF-16 encoding.
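To show how the single-word and surrogate pair cases fit together on the encoding side, here is a sketch in portable R6RS Scheme. The procedure `encode-code-point` is hypothetical: it returns a list of one or two 16-bit words and uses generic arithmetic rather than the macros documented above.

```scheme
(import (rnrs))

;; Hypothetical helper: encode CODE-POINT as a list of 16-bit words.
(define (encode-code-point code-point)
  (cond ((or (<= 0 code-point #xD7FF)
             (<= #xE000 code-point #xFFFF))
         ;; Single word: the word equals the code point.
         (list code-point))
        ((<= #x10000 code-point #x10FFFF)
         ;; Surrogate pair: split the 20-bit offset into two 10-bit halves.
         (let ((offset (- code-point #x10000)))
           (list (+ #xD800 (bitwise-arithmetic-shift-right offset 10))
                 (+ #xDC00 (bitwise-and offset #x3FF)))))
        (else
         (assertion-violation 'encode-code-point
           "not a Unicode code point outside the surrogate range"
           code-point))))
```

With this sketch, `(encode-code-point #x41)` evaluates to the one-element list holding #x41, and `(encode-code-point #x1F600)` evaluates to the list of #xD83D and #xDE00.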