UTF-8 is a multioctet character encoding for Unicode which can represent every character in the Unicode set: it can represent every code point in the ranges ‘[0, #xD800)’ and ‘(#xDFFF, #x10FFFF]’.
A stream of UTF-8 encoded characters is meant to be stored octet by octet in fixed order (and so without the need to specify the endianness of words).
The encoding scheme uses sequences of 1, 2, 3 or 4 octets to encode each code point, as shown in the following table. The first octet in a sequence has a unique bit pattern in its most significant bits, which determines the sequence length. Every octet carries a number of payload bits (marked ‘x’); to reconstruct the integer representation of a code point, the payload bits must be concatenated with the first octet most significant (shift each group into position, then combine with bitwise inclusive OR):
 # of octets | 1st octet  | 2nd octet  | 3rd octet  | 4th octet
-------------+------------+------------+------------+------------
      1      | #b0xxxxxxx |            |            |
      2      | #b110xxxxx | #b10xxxxxx |            |
      3      | #b1110xxxx | #b10xxxxxx | #b10xxxxxx |
      4      | #b11110xxx | #b10xxxxxx | #b10xxxxxx | #b10xxxxxx
 # of octets | # of payload bits  | hex range
-------------+--------------------+----------------------
      1      |                  7 | [#x0000, #x007F]
      2      |         5 + 6 = 11 | [#x0080, #x07FF]
      3      |     4 + 6 + 6 = 16 | [#x0800, #xFFFF]
      4      | 3 + 6 + 6 + 6 = 21 | [#x010000, #x10FFFF]
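As a rough illustration of how the first octet announces the sequence length, here is a minimal sketch in Python (not Scheme, and not one of the macros documented in this section); the function name `utf8_sequence_length` is hypothetical.

```python
def utf8_sequence_length(octet0):
    """Return the sequence length announced by a first octet, or None."""
    if octet0 & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if octet0 & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if octet0 & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if octet0 & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    # Either a continuation octet (10xxxxxx) or a pattern not in the table.
    return None
```

This checks bit patterns only. A matching pattern does not by itself guarantee a valid sequence: the decoded value must still fall in the range listed for that length.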
Note that octets ‘#xFE’ (‘#b11111110’) and ‘#xFF’ (‘#b11111111’) cannot appear in a valid stream of UTF-8 encoded characters.
Only the sequence of 3 octets could encode (but must not) values in the forbidden range ‘[#xD800, #xDFFF]’, which is reserved for UTF-16 surrogates and does not represent Unicode characters. So the table of valid encoded code points is:
 # of octets | # of payload bits  | code point range
-------------+--------------------+----------------------
      1      |                  7 | [#x0000, #x007F]
      2      |         5 + 6 = 11 | [#x0080, #x07FF]
      3      |     4 + 6 + 6 = 16 | [#x0800, #xD7FF]
      3      |     4 + 6 + 6 = 16 | [#xE000, #xFFFF]
      4      | 3 + 6 + 6 + 6 = 21 | [#x010000, #x10FFFF]
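The table of valid ranges can be read as a lookup from code point to number of octets. The following Python sketch spells that out; the name `utf8_octet_count` is hypothetical and is not part of the library described here.

```python
def utf8_octet_count(code_point):
    """Return how many octets encode code_point, or None if not encodable."""
    if 0x0000 <= code_point <= 0x007F:
        return 1
    if 0x0080 <= code_point <= 0x07FF:
        return 2
    # The 3-octet range has a hole: [#xD800, #xDFFF] must not be encoded.
    if 0x0800 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0xFFFF:
        return 3
    if 0x010000 <= code_point <= 0x10FFFF:
        return 4
    return None
```

For every encodable code point the result should agree with the length of the octet sequence produced by Python's built-in encoder, `len(chr(code_point).encode("utf-8"))`.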
The first 128 characters of the Unicode character set correspond one–to–one with ASCII and are encoded using a single octet with the same binary value as the corresponding ASCII character, making valid ASCII text valid UTF-8 encoded Unicode text as well. Such encoded octets have the Most Significant Bit (MSB) set to zero.
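The ASCII compatibility can be checked directly. This small Python sketch uses only the built-in codecs; the sample string is arbitrary.

```python
text = "Hello, world!"

ascii_octets = text.encode("ascii")
utf8_octets = text.encode("utf-8")

# ASCII text produces exactly the same octets under both encodings.
same_octets = ascii_octets == utf8_octets

# Every such octet has its most significant bit set to zero.
msb_clear = all(octet & 0b10000000 == 0 for octet in utf8_octets)
```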
Although the Unicode standard neither requires nor recommends it, many programs start a UTF-8 stream with a Byte Order Mark (BOM) composed of the 3 octets: ‘#xEF’, ‘#xBB’, ‘#xBF’, which are the UTF-8 encoding of the code point ‘#xFEFF’.
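A consumer that wants to tolerate the BOM can strip it before decoding. The following Python sketch shows one way to do so; `strip_utf8_bom` is a hypothetical helper name.

```python
import codecs

# The three BOM octets; Python's standard library exposes the same
# value as codecs.BOM_UTF8.
UTF8_BOM = bytes([0xEF, 0xBB, 0xBF])

def strip_utf8_bom(octets):
    """Return octets without a leading UTF-8 BOM, if one is present."""
    if octets.startswith(UTF8_BOM):
        return octets[len(UTF8_BOM):]
    return octets
```

Python's `"utf-8-sig"` codec performs the same stripping when decoding.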
The following syntactic bindings are exported by the library (vicare unsafe unicode). All the macros are unsafe: no validation on the type of the arguments is performed. For all the macros: the argument octet is meant to be a fixnum representing 1 octet (in the range ‘[0, 255]’); the argument code-point is meant to be a fixnum representing a Unicode code point (in the range ‘[0, #x10FFFF]’, but outside the range ‘[#xD800, #xDFFF]’).
Evaluate to #t if octet has a value that must never appear in a valid UTF-8 stream; otherwise evaluate to #f.
Evaluate to #t if octet is valid as 1-octet UTF-8 encoding of a Unicode character; otherwise evaluate to #f.
Decode the code point of a Unicode character from a 1-octet UTF-8 encoding.
Evaluate to #t if code-point is a valid fixnum representation for a code point decoded from a 1-octet UTF-8 sequence.
Evaluate to #t if octet0 is valid as first of 2-octet UTF-8 encoding of a Unicode character.
Evaluate to #t if octet1 is valid as second of 2-octet UTF-8 encoding of a Unicode character.
Decode the code point of a Unicode character from a 2-octet UTF-8 encoding.
Evaluate to #t if code-point is a valid fixnum representation for a code point decoded from a 2-octet UTF-8 sequence.
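To make the 2-octet decoding steps concrete, here is a sketch in Python. It mirrors the operations described above but is not the library's code: all function names are hypothetical, and the library's predicates may apply stricter checks than the plain bit-pattern tests used here.

```python
def is_first_of_two(octet0):
    return octet0 & 0b11100000 == 0b11000000      # 110xxxxx

def is_continuation(octet):
    return octet & 0b11000000 == 0b10000000       # 10xxxxxx

def decode_two_octets(octet0, octet1):
    # 5 payload bits from the first octet, 6 from the second.
    return ((octet0 & 0b00011111) << 6) | (octet1 & 0b00111111)

def valid_code_point_from_two(code_point):
    # Values below #x0080 would be overlong encodings.
    return 0x0080 <= code_point <= 0x07FF

# "é" (code point #xE9) is stored as the octets #xC3 #xA9.
code_point = decode_two_octets(0xC3, 0xA9)
```

The final range check is what rejects overlong sequences such as #xC0 #x80, which pass the bit-pattern tests but decode to 0.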
Evaluate to #t if octet0 is valid as first of 3-octet UTF-8 encoding of a Unicode character; otherwise evaluate to #f.
Evaluate to #t if octet1 and octet2 are valid as second and third of 3-octet UTF-8 encoding of a Unicode character.
Decode the code point of a Unicode character from a 3-octet UTF-8 encoding.
Evaluate to #t if code-point is a valid integer representation for a code point decoded from a 3-octet UTF-8 sequence.
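The 3-octet case is the one where the decoded value must also be checked against the surrogate range. The Python sketch below illustrates this; the function names are hypothetical and do not belong to the library.

```python
def decode_three_octets(octet0, octet1, octet2):
    # 4 payload bits from the first octet, 6 from each of the others.
    return (((octet0 & 0b00001111) << 12)
            | ((octet1 & 0b00111111) << 6)
            | (octet2 & 0b00111111))

def valid_code_point_from_three(code_point):
    # Excludes overlong encodings (< #x0800) and [#xD800, #xDFFF].
    return (0x0800 <= code_point <= 0xD7FF
            or 0xE000 <= code_point <= 0xFFFF)

# The euro sign (code point #x20AC) is stored as #xE2 #x82 #xAC.
euro = decode_three_octets(0xE2, 0x82, 0xAC)

# #xED #xA0 #x80 has valid bit patterns but decodes to #xD800.
surrogate = decode_three_octets(0xED, 0xA0, 0x80)
```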
Evaluate to #t if octet0 is valid as first of 4-octet UTF-8 encoding of a Unicode character.
Evaluate to #t if octet1, octet2 and octet3 are valid as second, third and fourth of 4-octet UTF-8 encoding of a Unicode character.
Decode the code point of a Unicode character from a 4-octet UTF-8 encoding.
Evaluate to #t if code-point is a valid integer representation for a code point decoded from a 4-octet UTF-8 sequence.
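For 4-octet sequences the 21 payload bits can express values above #x10FFFF, so the range check after decoding matters here as well. This is a Python sketch with hypothetical function names, not the library's implementation.

```python
def decode_four_octets(octet0, octet1, octet2, octet3):
    # 3 payload bits from the first octet, 6 from each of the others.
    return (((octet0 & 0b00000111) << 18)
            | ((octet1 & 0b00111111) << 12)
            | ((octet2 & 0b00111111) << 6)
            | (octet3 & 0b00111111))

def valid_code_point_from_four(code_point):
    return 0x010000 <= code_point <= 0x10FFFF

# Code point #x1F600 is stored as #xF0 #x9F #x98 #x80.
emoji = decode_four_octets(0xF0, 0x9F, 0x98, 0x80)

# #xF4 #x90 #x80 #x80 decodes to #x110000, one past the last code point.
too_big = decode_four_octets(0xF4, 0x90, 0x80, 0x80)
```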
Evaluate to #t if code-point is a Unicode code point representable as 1-octet UTF-8 encoding; otherwise evaluate to #f.
Encode the code point of a Unicode character to a 1-octet UTF-8 encoding.
Evaluate to #t if code-point is a Unicode code point representable as 2-octet UTF-8 encoding; otherwise evaluate to #f.
Encode the code point of a Unicode character to the first octet in a 2-octet UTF-8 encoding.
Encode the code point of a Unicode character to the second octet in a 2-octet UTF-8 encoding.
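Encoding reverses the decoding steps: each octet takes its slice of the code point's bits and adds the marker bits from the table. A Python sketch for the 2-octet case, with hypothetical function names:

```python
def encode_first_of_two(code_point):
    # Marker 110 followed by the 5 most significant payload bits.
    return 0b11000000 | (code_point >> 6)

def encode_second_of_two(code_point):
    # Marker 10 followed by the 6 least significant payload bits.
    return 0b10000000 | (code_point & 0b00111111)

# Code point #xE9 ("é") encodes to #xC3 #xA9.
octets = bytes([encode_first_of_two(0xE9), encode_second_of_two(0xE9)])
```

Like the unsafe macros, these helpers assume the code point is already known to lie in ‘[#x0080, #x07FF]’; they do no validation of their own.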
Evaluate to #t if code-point is a Unicode code point representable as 3-octet UTF-8 encoding; otherwise evaluate to #f.
Encode the code point of a Unicode character to the first octet in a 3-octet UTF-8 encoding.
Encode the code point of a Unicode character to the second octet in a 3-octet UTF-8 encoding.
Encode the code point of a Unicode character to the third octet in a 3-octet UTF-8 encoding.
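The 3-octet encoding splits the code point into groups of 4, 6 and 6 bits. The following Python sketch shows the three steps; the function names are hypothetical, and the caller is assumed to have checked that the code point is representable in 3 octets.

```python
def encode_first_of_three(code_point):
    return 0b11100000 | (code_point >> 12)                  # 1110xxxx

def encode_second_of_three(code_point):
    return 0b10000000 | ((code_point >> 6) & 0b00111111)    # 10xxxxxx

def encode_third_of_three(code_point):
    return 0b10000000 | (code_point & 0b00111111)           # 10xxxxxx

# Code point #x20AC (euro sign) encodes to #xE2 #x82 #xAC.
octets = bytes([encode_first_of_three(0x20AC),
                encode_second_of_three(0x20AC),
                encode_third_of_three(0x20AC)])
```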
Evaluate to #t if code-point is a Unicode code point representable as 4-octet UTF-8 encoding; otherwise evaluate to #f.
Encode the code point of a Unicode character to the first octet in a 4-octet UTF-8 encoding.
Encode the code point of a Unicode character to the second octet in a 4-octet UTF-8 encoding.
Encode the code point of a Unicode character to the third octet in a 4-octet UTF-8 encoding.
Encode the code point of a Unicode character to the fourth octet in a 4-octet UTF-8 encoding.
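The 4-octet encoding splits the code point into groups of 3, 6, 6 and 6 bits. As a sketch only, the four steps could be combined in Python as follows; `encode_four_octets` is a hypothetical name, and the code point is assumed to lie in ‘[#x010000, #x10FFFF]’.

```python
def encode_four_octets(code_point):
    """Return the 4 UTF-8 octets for a code point in [#x010000, #x10FFFF]."""
    return bytes([
        0b11110000 | (code_point >> 18),                    # 11110xxx
        0b10000000 | ((code_point >> 12) & 0b00111111),     # 10xxxxxx
        0b10000000 | ((code_point >> 6) & 0b00111111),      # 10xxxxxx
        0b10000000 | (code_point & 0b00111111),             # 10xxxxxx
    ])

# Code point #x1F600 encodes to #xF0 #x9F #x98 #x80.
octets = encode_four_octets(0x1F600)
```

The result can be compared against Python's built-in encoder, `chr(code_point).encode("utf-8")`, as a sanity check.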