UTF-16 is a multioctet character encoding for Unicode that can represent every character in the Unicode set, that is, every code point in the ranges ‘[0, #xD800)’ and ‘(#xDFFF, #x10FFFF]’.
Code points in the range ‘[0, #x10000)’, excluding the surrogate range ‘[#xD800, #xDFFF]’, are encoded with a single 16-bit word; code points in the range ‘[#x10000, #x10FFFF]’ are encoded as a surrogate pair of two 16-bit words.
Given a 16-bit word in a UTF-16 stream, represented in Scheme as a fixnum in the range ‘[#x0000, #xFFFF]’, we can classify it on the following axis:
    0000        D7FF D800     DBFF DC00       DFFF E000      FFFF
    |-------------||-----------||-------------||------------|
     single word     first in     second in      single word
     character       pair         pair           character
or the following logic:
    word in [#x0000, #xD7FF] => single word character
    word in [#xD800, #xDBFF] => first in surrogate pair
    word in [#xDC00, #xDFFF] => second in surrogate pair
    word in [#xE000, #xFFFF] => single word character
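The classification above can be sketched in portable R6RS Scheme. The procedure name `classify-utf16-word` and the returned symbols are hypothetical, chosen only for illustration; they are not bindings of (vicare unsafe unicode). The argument is assumed to be a fixnum in the range ‘[#x0000, #xFFFF]’.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; WORD is assumed to be a fixnum in the range [#x0000, #xFFFF].
;; The clauses are tested in order, so each upper bound is enough.
(define (classify-utf16-word word)
  (cond ((fx<=? word #xD7FF) 'single-word)     ; [#x0000, #xD7FF]
        ((fx<=? word #xDBFF) 'first-in-pair)   ; [#xD800, #xDBFF]
        ((fx<=? word #xDFFF) 'second-in-pair)  ; [#xDC00, #xDFFF]
        (else                'single-word)))   ; [#xE000, #xFFFF]
```

For instance, `(classify-utf16-word #xD83D)` selects the second clause, because #xD83D lies in the first-of-pair range.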
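A UTF-16 stream may start with a Byte Order Mark (BOM). A UTF-16 BOM is either the octet sequence ‘#xFE #xFF’, which signals big-endian word order, or the octet sequence ‘#xFF #xFE’, which signals little-endian word order.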
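A BOM check can be sketched as follows, taking the BOM to be the code point #xFEFF serialized in the byte order of the stream (octets FE FF for big endian, FF FE for little endian). The procedure name `utf16-bom-endianness` is hypothetical and the input is assumed to be an R6RS bytevector holding the raw octets of the stream.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; BV is a bytevector holding the octets of a UTF-16 stream.
;; Return the symbol BIG or LITTLE when the stream starts with a BOM,
;; otherwise return #f.
(define (utf16-bom-endianness bv)
  (and (>= (bytevector-length bv) 2)
       (let ((b0 (bytevector-u8-ref bv 0))
             (b1 (bytevector-u8-ref bv 1)))
         (cond ((and (= b0 #xFE) (= b1 #xFF)) 'big)
               ((and (= b0 #xFF) (= b1 #xFE)) 'little)
               (else #f)))))
```

When the result is #f the stream carries no BOM and the byte order has to be known by other means.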
The following syntactic bindings are exported by the library (vicare unsafe unicode). The following macros assume the word arguments are fixnums representing 16-bit words: they must be in the range ‘[0, #xFFFF]’; while the code-point arguments are fixnums representing Unicode code points (they are in the range ‘[0, #x10FFFF]’, but outside the range ‘[#xD800, #xDFFF]’).
Evaluate to #t if word0 is valid as single 16-bit word UTF-16 encoding of a Unicode character; otherwise evaluate to #f.
Decode the integer representation of a Unicode character from a 16-bit single word UTF-16 encoding.
Evaluate to #t if word0 is valid as first 16-bit word in a surrogate pair UTF-16 encoding of a Unicode character; otherwise evaluate to #f.
Evaluate to #t if word1 is valid as second 16-bit word in a surrogate pair UTF-16 encoding of a Unicode character; otherwise evaluate to #f.
Decode the integer representation of a Unicode character from a surrogate pair UTF-16 encoding.
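As an illustration of the arithmetic behind surrogate pair decoding, here is a minimal sketch in portable R6RS Scheme. The name `decode-surrogate-pair` is hypothetical and this is not the library's macro; it only shows the standard UTF-16 rule: the low 10 bits of each word are concatenated and #x10000 is added. Both arguments are assumed to have already been validated as first and second words of a pair.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; WORD0 is assumed to be in [#xD800, #xDBFF],
;; WORD1 is assumed to be in [#xDC00, #xDFFF].
(define (decode-surrogate-pair word0 word1)
  (fx+ #x10000
       (fxior (fxarithmetic-shift-left (fxand word0 #x3FF) 10) ; high 10 bits
              (fxand word1 #x3FF))))                           ; low 10 bits
```

For example, the pair #xD83D #xDE00 yields #x3D shifted left by 10 bits (#xF400), combined with #x200 to give #xF600, plus #x10000, that is the code point #x1F600.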
Evaluate to #t if code-point is the fixnum representation of a Unicode code point representable as single 16-bit word UTF-16 encoding; otherwise evaluate to #f.
Encode code-point as single 16-bit word UTF-16 encoding.
Evaluate to #t if code-point is the fixnum representation of a Unicode code point representable as surrogate pair of two 16-bit words UTF-16 encoding; otherwise evaluate to #f.
Encode code-point as first 16-bit word in a surrogate pair UTF-16 encoding.
Encode code-point as second 16-bit word in a surrogate pair UTF-16 encoding.
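The encoding side can be sketched in the same way. All names below are hypothetical and written in portable R6RS Scheme; they mirror the roles of the macros described above without being the library's implementation. The sketch assumes the standard UTF-16 rule: subtract #x10000 from the code point, put the high 10 bits of the result in the first word and the low 10 bits in the second word.

```scheme
(import (rnrs))

;; Hypothetical helpers, not part of (vicare unsafe unicode).

;; True if CP fits in a single 16-bit word and is not a surrogate.
(define (single-word-code-point? cp)
  (or (and (fx<=? 0 cp) (fx<? cp #xD800))
      (and (fx<? #xDFFF cp) (fx<? cp #x10000))))

;; True if CP needs a surrogate pair.
(define (two-words-code-point? cp)
  (and (fx<=? #x10000 cp) (fx<=? cp #x10FFFF)))

;; First word of the pair: #xD800 plus the high 10 bits of CP - #x10000.
(define (encode-first-of-pair cp)
  (fxior #xD800 (fxarithmetic-shift-right (fx- cp #x10000) 10)))

;; Second word of the pair: #xDC00 plus the low 10 bits of CP - #x10000.
(define (encode-second-of-pair cp)
  (fxior #xDC00 (fxand (fx- cp #x10000) #x3FF)))

;; Return a list of one or two 16-bit words, or #f when CP is a
;; surrogate or lies outside the Unicode range.
(define (encode-utf16 cp)
  (cond ((single-word-code-point? cp)
         (list cp))
        ((two-words-code-point? cp)
         (list (encode-first-of-pair cp) (encode-second-of-pair cp)))
        (else #f)))
```

With these definitions, `(encode-utf16 #x1F600)` produces the two words #xD83D and #xDE00, while a code point such as #x41 is returned unchanged as a single word.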