Vicare Scheme: iklib chars unicode utf8

6.22.3.2 Unicode’s UTF-8 encoding

UTF-8 is a variable-length, multi-octet character encoding for Unicode that can represent every character in the Unicode set, that is, every code point in the ranges ‘[0, #xD800)’ and ‘(#xDFFF, #x10FFFF]’.

A stream of UTF-8 encoded characters is meant to be stored octet by octet in fixed order (and so without the need to specify the endianness of words).

The encoding scheme uses sequences of 1, 2, 3 or 4 octets to encode each code point, as shown in the following table. The first octet in a sequence has a unique bit pattern in its most significant bits, which determines the sequence length. Every octet carries a number of payload bits (marked ‘x’); to reconstruct the integer representation of a code point, the payload bits are shifted into position and combined with a bitwise inclusive OR:

# of octets | 1st octet  | 2nd octet  | 3rd octet  | 4th octet  |
------------+------------+------------+------------+------------|
     1        #b0xxxxxxx
     2        #b110xxxxx   #b10xxxxxx
     3        #b1110xxxx   #b10xxxxxx   #b10xxxxxx
     4        #b11110xxx   #b10xxxxxx   #b10xxxxxx   #b10xxxxxx

 # of octets | # of payload bits  |       hex range
-------------+--------------------+----------------------
     1                          7     [#x0000,   #x007F]
     2                 5 + 6 = 11     [#x0080,   #x07FF]
     3             4 + 6 + 6 = 16     [#x0800,   #xFFFF]
     4         3 + 6 + 6 + 6 = 21   [#x010000, #x10FFFF]
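As an illustration of how the first octet announces the sequence length, here is a minimal sketch in portable R6RS Scheme. The procedure name ‘sketch-utf-8-sequence-length’ is hypothetical and is not exported by any Vicare library; the sketch only matches the bit patterns in the first table and performs no further validation.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; Return the sequence length announced by OCTET when it is used as
;; first octet, or #f when OCTET matches none of the four first-octet
;; bit patterns (for example when it is a continuation octet).
(define (sketch-utf-8-sequence-length octet)
  (cond ((= #b00000000 (bitwise-and octet #b10000000)) 1)
        ((= #b11000000 (bitwise-and octet #b11100000)) 2)
        ((= #b11100000 (bitwise-and octet #b11110000)) 3)
        ((= #b11110000 (bitwise-and octet #b11111000)) 4)
        (else #f)))

(sketch-utf-8-sequence-length #x41)   ; evaluates to 1
(sketch-utf-8-sequence-length #xE2)   ; evaluates to 3
(sketch-utf-8-sequence-length #x80)   ; evaluates to #f, continuation octet
```

Masking with one more bit than the pattern itself (for example ‘#b11100000’ for the 2-octet pattern ‘#b110xxxxx’) is what makes the four patterns mutually exclusive.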

Note that octets ‘#xFE’ (‘#b11111110’) and ‘#xFF’ (‘#b11111111’) cannot appear in a valid stream of UTF-8 encoded characters.

The 3-octet sequence is the only one whose payload could represent values in the forbidden range ‘[#xD800, #xDFFF]’; these values are not valid Unicode characters and must not be encoded. So the table of valid encoded code points is:

 # of octets |  # of payload bits |    code point range
-------------+--------------------+----------------------
     1       |                  7 |   [#x0000,   #x007F]
     2       |         5 + 6 = 11 |   [#x0080,   #x07FF]
     3       |     4 + 6 + 6 = 16 |   [#x0800,   #xD7FF]
     3       |     4 + 6 + 6 = 16 |   [#xE000,   #xFFFF]
     4       | 3 + 6 + 6 + 6 = 21 | [#x010000, #x10FFFF]
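The table of valid ranges can be read as a decision procedure for the number of octets a code point needs. The following is a sketch in portable R6RS Scheme; the name ‘sketch-utf-8-encoded-length’ is hypothetical and only mirrors the five rows above.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; Return the number of octets needed to encode CODE-POINT in UTF-8,
;; or #f when CODE-POINT is outside every valid range of the table
;; (negative, in [#xD800, #xDFFF], or above #x10FFFF).
(define (sketch-utf-8-encoded-length code-point)
  (cond ((<= 0 code-point #x7F) 1)
        ((<= #x80 code-point #x7FF) 2)
        ((or (<= #x800  code-point #xD7FF)
             (<= #xE000 code-point #xFFFF)) 3)
        ((<= #x10000 code-point #x10FFFF) 4)
        (else #f)))

(sketch-utf-8-encoded-length #x20AC)  ; evaluates to 3
(sketch-utf-8-encoded-length #xD800)  ; evaluates to #f, forbidden range
```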

The first 128 characters of the Unicode character set correspond one-to-one with ASCII and are encoded using a single octet with the same binary value as the corresponding ASCII character, so valid ASCII text is also valid UTF-8 encoded Unicode text. Such encoded octets have the Most Significant Bit (MSB) set to zero.

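Although UTF-8 has no byte order issue and the Unicode standard neither requires nor recommends it, many programs start a UTF-8 stream with a Byte Order Mark (BOM) composed of the 3 octets: ‘#xEF’, ‘#xBB’, ‘#xBF’. These octets are the 3-octet UTF-8 encoding of the code point ‘#xFEFF’.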
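A reader of UTF-8 data may want to recognise and skip a leading BOM. The following is a minimal sketch in portable R6RS Scheme operating on a bytevector; the name ‘sketch-utf-8-bom?’ is hypothetical and not part of any Vicare library.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; Return #t if the bytevector BV starts with the 3 octets
;; #xEF #xBB #xBF, otherwise return #f.
(define (sketch-utf-8-bom? bv)
  (and (<= 3 (bytevector-length bv))
       (= #xEF (bytevector-u8-ref bv 0))
       (= #xBB (bytevector-u8-ref bv 1))
       (= #xBF (bytevector-u8-ref bv 2))))

(sketch-utf-8-bom? (u8-list->bytevector '(#xEF #xBB #xBF #x41)))  ; evaluates to #t
(sketch-utf-8-bom? (u8-list->bytevector '(#x41 #x42)))            ; evaluates to #f
```

The length test comes first so that short bytevectors are rejected before any octet is accessed.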

The following syntactic bindings are exported by the library (vicare unsafe unicode). All the macros are unsafe: no validation on the type of the arguments is performed. For all the macros: the argument octet is meant to be a fixnum representing 1 octet (in the range ‘[0, 255]’); the argument code-point is meant to be a fixnum representing a Unicode code point (in the range ‘[0, #x10FFFF]’, but outside the range ‘[#xD800, #xDFFF]’).

Syntax: utf-8-invalid-octet? octet: Evaluate to #t if octet has a value that must never appear in a valid UTF-8 stream; otherwise evaluate to #f.
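To illustrate the idea of octets that can never appear, here is a sketch in portable R6RS Scheme. It assumes the rules of RFC 3629: ‘#xC0’ and ‘#xC1’ could only start 2-octet sequences for code points below ‘#x80’ (overlong forms), and octets in ‘[#xF5, #xFF]’ either start sequences above ‘#x10FFFF’ or match no bit pattern at all. Whether ‘utf-8-invalid-octet?’ tests exactly this set is an assumption not confirmed here; the name ‘sketch-utf-8-never-valid-octet?’ is hypothetical.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; Return #t if OCTET can never appear in a UTF-8 stream under the
;; rules of RFC 3629, otherwise return #f.
(define (sketch-utf-8-never-valid-octet? octet)
  (or (= octet #xC0)
      (= octet #xC1)
      (<= #xF5 octet #xFF)))

(sketch-utf-8-never-valid-octet? #xFE)  ; evaluates to #t
(sketch-utf-8-never-valid-octet? #xC3)  ; evaluates to #f
```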

Decoding 1-octet UTF-8 to code points

Syntax: utf-8-single-octet? octet: Evaluate to #t if octet is valid as 1-octet UTF-8 encoding of a Unicode character; otherwise evaluate to #f.

Syntax: utf-8-decode-single-octet octet: Decode the code point of a Unicode character from a 1-octet UTF-8 encoding.

Syntax: utf-8-valid-code-point-from-1-octet? code-point: Evaluate to #t if code-point is a valid fixnum representation for a code point decoded from a 1-octet UTF-8 sequence.

Decoding 2-octets UTF-8 to code points

Syntax: utf-8-first-of-two-octets? octet0: Evaluate to #t if octet0 is valid as first of 2-octets UTF-8 encoding of a Unicode character.

Syntax: utf-8-second-of-two-octets? octet1: Evaluate to #t if octet1 is valid as second of 2-octets UTF-8 encoding of a Unicode character.

Syntax: utf-8-decode-two-octets octet0 octet1: Decode the code point of a Unicode character from a 2-octets UTF-8 encoding.

Syntax: utf-8-valid-code-point-from-2-octets? code-point: Evaluate to #t if code-point is a valid fixnum representation for a code point decoded from a 2-octets UTF-8 sequence.
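The arithmetic behind a 2-octet decoding can be sketched in portable R6RS Scheme as follows. All the ‘sketch-’ names are hypothetical; how the library macros are actually implemented is not stated here, so this only illustrates the bit layout of the table.

```scheme
(import (rnrs))

;; Hypothetical helpers, not part of (vicare unsafe unicode).

;; #t if OCTET has the continuation pattern #b10xxxxxx.
(define (sketch-continuation-octet? octet)
  (= #b10000000 (bitwise-and octet #b11000000)))

;; Combine the 5 payload bits of OCTET0 with the 6 payload bits
;; of OCTET1.  No validation of the arguments is performed.
(define (sketch-decode-two-octets octet0 octet1)
  (bitwise-ior
   (bitwise-arithmetic-shift-left (bitwise-and octet0 #b00011111) 6)
   (bitwise-and octet1 #b00111111)))

;; #t if CODE-POINT is in the range reserved to 2-octet sequences;
;; smaller values would be overlong encodings.
(define (sketch-valid-from-two-octets? code-point)
  (<= #x80 code-point #x7FF))

(sketch-decode-two-octets #xC3 #xA9)  ; evaluates to #xE9
```

Checking the decoded value against the range is what rejects overlong forms such as the pair ‘#xC0’, ‘#x80’, which decodes to zero.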

Decoding 3-octets UTF-8 to code points

Syntax: utf-8-first-of-three-octets? octet0: Evaluate to #t if octet0 is valid as first of 3-octets UTF-8 encoding of a Unicode character; otherwise evaluate to #f.

Syntax: utf-8-second-and-third-of-three-octets? octet1 octet2: Evaluate to #t if octet1 and octet2 are valid as second and third of 3-octets UTF-8 encoding of a Unicode character.

Syntax: utf-8-decode-three-octets octet0 octet1 octet2: Decode the code point of a Unicode character from a 3-octets UTF-8 encoding.

Syntax: utf-8-valid-code-point-from-3-octets? code-point: Evaluate to #t if code-point is a valid fixnum representation for a code point decoded from a 3-octets UTF-8 sequence.
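A 3-octet decoding combines 4 + 6 + 6 payload bits and must then exclude both overlong values and the forbidden range. The following portable R6RS sketch shows one way to do it; the ‘sketch-’ names are hypothetical and the code is not taken from the library.

```scheme
(import (rnrs))

;; Hypothetical helpers, not part of (vicare unsafe unicode).

;; Combine the payload bits of the three octets.  No validation of
;; the arguments is performed.
(define (sketch-decode-three-octets octet0 octet1 octet2)
  (bitwise-ior
   (bitwise-arithmetic-shift-left (bitwise-and octet0 #b00001111) 12)
   (bitwise-arithmetic-shift-left (bitwise-and octet1 #b00111111) 6)
   (bitwise-and octet2 #b00111111)))

;; #t if CODE-POINT is in one of the two ranges valid for 3-octet
;; sequences: [#x0800, #xD7FF] or [#xE000, #xFFFF].
(define (sketch-valid-from-three-octets? code-point)
  (or (<= #x0800 code-point #xD7FF)
      (<= #xE000 code-point #xFFFF)))

(sketch-decode-three-octets #xE2 #x82 #xAC)  ; evaluates to #x20AC
(sketch-valid-from-three-octets?
 (sketch-decode-three-octets #xED #xA0 #x80)) ; evaluates to #f, decodes to #xD800
```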

Decoding 4-octets UTF-8 to code points

Syntax: utf-8-first-of-four-octets? octet0: Evaluate to #t if octet0 is valid as first of 4-octets UTF-8 encoding of a Unicode character.

Syntax: utf-8-second-third-and-fourth-of-four-octets? octet1 octet2 octet3: Evaluate to #t if octet1, octet2 and octet3 are valid as second, third and fourth of 4-octets UTF-8 encoding of a Unicode character.

Syntax: utf-8-decode-four-octets octet0 octet1 octet2 octet3: Decode the code point of a Unicode character from a 4-octets UTF-8 encoding.

Syntax: utf-8-valid-code-point-from-4-octets? code-point: Evaluate to #t if code-point is a valid fixnum representation for a code point decoded from a 4-octets UTF-8 sequence.
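For 4-octet sequences the payload is 3 + 6 + 6 + 6 bits, which can represent values above ‘#x10FFFF’, so a range check after decoding is needed. A sketch in portable R6RS Scheme, with hypothetical ‘sketch-’ names that are not part of the library:

```scheme
(import (rnrs))

;; Hypothetical helpers, not part of (vicare unsafe unicode).

;; Combine the payload bits of the four octets.  No validation of
;; the arguments is performed.
(define (sketch-decode-four-octets octet0 octet1 octet2 octet3)
  (bitwise-ior
   (bitwise-arithmetic-shift-left (bitwise-and octet0 #b00000111) 18)
   (bitwise-arithmetic-shift-left (bitwise-and octet1 #b00111111) 12)
   (bitwise-arithmetic-shift-left (bitwise-and octet2 #b00111111) 6)
   (bitwise-and octet3 #b00111111)))

;; #t if CODE-POINT is in the range valid for 4-octet sequences.
(define (sketch-valid-from-four-octets? code-point)
  (<= #x010000 code-point #x10FFFF))

(sketch-decode-four-octets #xF0 #x9F #x98 #x80)  ; evaluates to #x1F600
(sketch-decode-four-octets #xF4 #x8F #xBF #xBF)  ; evaluates to #x10FFFF
```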

Encoding code points to 1-octet UTF-8

Syntax: utf-8-single-octet-code-point? code-point: Evaluate to #t if code-point is a Unicode code point representable as 1-octet UTF-8 encoding; otherwise evaluate to #f.

Syntax: utf-8-encode-single-octet code-point: Encode the code point of a Unicode character to a 1-octet UTF-8 encoding.

Encoding code points to 2-octet UTF-8

Syntax: utf-8-two-octets-code-point? code-point: Evaluate to #t if code-point is a Unicode code point representable as 2-octet UTF-8 encoding; otherwise evaluate to #f.

Syntax: utf-8-encode-first-of-two-octets code-point: Encode the code point of a Unicode character to the first octet in a 2-octet UTF-8 encoding.

Syntax: utf-8-encode-second-of-two-octets code-point: Encode the code point of a Unicode character to the second octet in a 2-octet UTF-8 encoding.
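Encoding reverses the decoding: the code point is split into groups of payload bits and each group is combined with the fixed prefix of its octet. A portable R6RS sketch for the 2-octet case follows; the ‘sketch-’ names are hypothetical and the code assumes the code point is already known to be in ‘[#x0080, #x07FF]’.

```scheme
(import (rnrs))

;; Hypothetical helpers, not part of (vicare unsafe unicode).
;; Both assume CODE-POINT is in the range [#x0080, #x07FF].

;; First octet: prefix #b110 followed by the 5 most significant
;; payload bits.
(define (sketch-encode-first-of-two code-point)
  (bitwise-ior #b11000000
               (bitwise-arithmetic-shift-right code-point 6)))

;; Second octet: prefix #b10 followed by the 6 least significant
;; payload bits.
(define (sketch-encode-second-of-two code-point)
  (bitwise-ior #b10000000
               (bitwise-and code-point #b00111111)))

(sketch-encode-first-of-two  #xE9)  ; evaluates to #xC3
(sketch-encode-second-of-two #xE9)  ; evaluates to #xA9
```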

Encoding code points to 3-octet UTF-8

Syntax: utf-8-three-octets-code-point? code-point: Evaluate to #t if code-point is a Unicode code point representable as 3-octet UTF-8 encoding; otherwise evaluate to #f.

Syntax: utf-8-encode-first-of-three-octets code-point: Encode the code point of a Unicode character to the first octet in a 3-octet UTF-8 encoding.

Syntax: utf-8-encode-second-of-three-octets code-point: Encode the code point of a Unicode character to the second octet in a 3-octet UTF-8 encoding.

Syntax: utf-8-encode-third-of-three-octets code-point: Encode the code point of a Unicode character to the third octet in a 3-octet UTF-8 encoding.
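The 3-octet case splits the code point into groups of 4, 6 and 6 bits. The following portable R6RS sketch uses a hypothetical procedure, ‘sketch-encode-three-octets’, that returns the three octets as a list; it assumes the code point is already known to be in ‘[#x0800, #xD7FF]’ or ‘[#xE000, #xFFFF]’.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; Return a list of the 3 octets encoding CODE-POINT, which is
;; assumed to be valid for a 3-octet sequence.
(define (sketch-encode-three-octets code-point)
  (list
   ;; prefix #b1110 and the 4 most significant payload bits
   (bitwise-ior #b11100000
                (bitwise-arithmetic-shift-right code-point 12))
   ;; prefix #b10 and the 6 middle payload bits
   (bitwise-ior #b10000000
                (bitwise-and (bitwise-arithmetic-shift-right code-point 6)
                             #b00111111))
   ;; prefix #b10 and the 6 least significant payload bits
   (bitwise-ior #b10000000
                (bitwise-and code-point #b00111111))))

(sketch-encode-three-octets #x20AC)  ; evaluates to the list (#xE2 #x82 #xAC)
```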

Encoding code points to 4-octet UTF-8

Syntax: utf-8-four-octets-code-point? code-point: Evaluate to #t if code-point is a Unicode code point representable as 4-octet UTF-8 encoding; otherwise evaluate to #f.

Syntax: utf-8-encode-first-of-four-octets code-point: Encode the code point of a Unicode character to the first octet in a 4-octet UTF-8 encoding.

Syntax: utf-8-encode-second-of-four-octets code-point: Encode the code point of a Unicode character to the second octet in a 4-octet UTF-8 encoding.

Syntax: utf-8-encode-third-of-four-octets code-point: Encode the code point of a Unicode character to the third octet in a 4-octet UTF-8 encoding.

Syntax: utf-8-encode-fourth-of-four-octets code-point: Encode the code point of a Unicode character to the fourth octet in a 4-octet UTF-8 encoding.
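The 4-octet case splits the code point into groups of 3, 6, 6 and 6 bits. As a closing illustration, here is a portable R6RS sketch; ‘sketch-encode-four-octets’ is a hypothetical name, and the code assumes the code point is already known to be in ‘[#x010000, #x10FFFF]’.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of (vicare unsafe unicode).
;; Return a list of the 4 octets encoding CODE-POINT, which is
;; assumed to be in the range [#x010000, #x10FFFF].
(define (sketch-encode-four-octets code-point)
  ;; Extract 6 payload bits starting at bit SHIFT and add the
  ;; continuation prefix #b10.
  (define (continuation shift)
    (bitwise-ior #b10000000
                 (bitwise-and (bitwise-arithmetic-shift-right code-point shift)
                              #b00111111)))
  (list
   ;; prefix #b11110 and the 3 most significant payload bits
   (bitwise-ior #b11110000
                (bitwise-arithmetic-shift-right code-point 18))
   (continuation 12)
   (continuation 6)
   (continuation 0)))

(sketch-encode-four-octets #x1F600)  ; evaluates to the list (#xF0 #x9F #x98 #x80)
```

Returning a list keeps the sketch short; code that writes into a bytevector would store each octet at consecutive indexes instead.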