Vicare Scheme: unicode

As required by R6RS, Scheme implementations must support Unicode characters, and the input/output libraries must implement transcoders for textual ports that encode and decode between Scheme characters and UTF-8 and UTF-16.

The mandatory starting points to learn about this stuff are the following (URLs last verified on Sep 9, 2011):

Here we give only a brief overview of the main definitions, drawing text from those pages.

The Universal Character Set (UCS) is a standard set of characters upon which many character encodings are based; it contains abstract characters, each identified by an unambiguous name and an integer number called its “code point”.

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems.

UCS and Unicode have an identical repertoire and numbers: the same characters with the same numbers exist in both standards. UCS is a simple character map; Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts.

The Unicode Consortium, the nonprofit organization that coordinates Unicode’s development, has the goal of eventually replacing existing character encoding schemes with Unicode and its standard “Unicode Transformation Format” alias “UCS Transformation Format” (UTF) schemes.

By convention a Unicode code point is referred to by writing U+ followed by its hexadecimal number with at least 4 digits (U+0044 is fine, U+12 is not).

In practice, Unicode code points are exact integers in the range [0, #x10FFFF], but outside the range [#xD800, #xDFFF] which has special meaning in UTF schemes. A code point can be stored in 21 bits:

(string-length (number->string #x10FFFF 2)) ⇒ 21

R6RS defines fixnums to have at least 24 bits, so a fixnum is wide enough to hold a code point:

(fixnum? #x10FFFF) ⇒ #t

(integer->char #x10FFFF) ⇒ #\x10FFFF
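
The range constraints above can be sketched as a predicate. This is only an illustration: the name `unicode-code-point?` is hypothetical and is not part of R6RS or of Vicare.

```scheme
(import (rnrs))

;; Hypothetical helper, not part of R6RS or Vicare: true when N is an
;; exact integer in [0, #x10FFFF] and outside [#xD800, #xDFFF].
(define (unicode-code-point? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xD7FF)
           (<= #xE000 n #x10FFFF))))

(unicode-code-point? #x0044)    ; ⇒ #t
(unicode-code-point? #xD800)    ; ⇒ #f, inside the excluded range
(unicode-code-point? #x110000)  ; ⇒ #f, beyond the last code point
```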

UTF-8 scheme

UTF-8 is a multioctet character encoding for Unicode which can represent every character in the Unicode set; that is, it can represent every code point in the ranges [0, #xD800) and (#xDFFF, #x10FFFF].

A stream of UTF-8 encoded characters is meant to be stored octet by octet in fixed order (and so without the need to specify the endianness of words).

The encoding scheme uses sequences of 1, 2, 3 or 4 octets to encode each code point, as shown in the following table. The first octet in a sequence has a unique bit pattern in its most significant bits, which allows the determination of the sequence length. Every octet contains a number of payload bits (marked x), which must be concatenated (shift, then bitwise inclusive OR) to reconstruct the integer representation of a code point:

| # of octets | 1st octet  | 2nd octet  | 3rd octet  | 4th octet  |
|-------------+------------+------------+------------+------------|
|      1      | #b0xxxxxxx |            |            |            |
|      2      | #b110xxxxx | #b10xxxxxx |            |            |
|      3      | #b1110xxxx | #b10xxxxxx | #b10xxxxxx |            |
|      4      | #b11110xxx | #b10xxxxxx | #b10xxxxxx | #b10xxxxxx |
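
As an illustration of how the first octet determines the sequence length, here is a minimal sketch. The name `utf8-sequence-length` is hypothetical. The function inspects only the leading bit pattern, so it does not reject first octets that match a pattern but can never occur in valid UTF-8.

```scheme
(import (rnrs))

;; Hypothetical helper: number of octets in the UTF-8 sequence that
;; starts with OCTET, or #f when the bit pattern of OCTET is not the
;; one of a first octet.
(define (utf8-sequence-length octet)
  (cond ((= 0 (bitwise-and octet #b10000000)) 1)
        ((= #b11000000 (bitwise-and octet #b11100000)) 2)
        ((= #b11100000 (bitwise-and octet #b11110000)) 3)
        ((= #b11110000 (bitwise-and octet #b11111000)) 4)
        (else #f)))

(utf8-sequence-length #x41)  ; ⇒ 1
(utf8-sequence-length #xCE)  ; ⇒ 2
(utf8-sequence-length #xE2)  ; ⇒ 3
(utf8-sequence-length #xF4)  ; ⇒ 4
(utf8-sequence-length #x80)  ; ⇒ #f, pattern of a 2nd, 3rd or 4th octet
```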

| # of octets | # of payload bits  | hex range            |
|-------------+--------------------+----------------------|
|      1      |                  7 | [#x0000, #x007F]     |
|      2      |         5 + 6 = 11 | [#x0080, #x07FF]     |
|      3      |     4 + 6 + 6 = 16 | [#x0800, #xFFFF]     |
|      4      | 3 + 6 + 6 + 6 = 21 | [#x010000, #x10FFFF] |
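
The two tables above can be turned into a small encoder. The following is a sketch under stated assumptions, not the way Vicare implements its transcoders: the name `code-point->utf8-octets` is hypothetical, and the argument is assumed to be a valid code point outside [#xD800, #xDFFF].

```scheme
(import (rnrs))

;; Hypothetical helper: encode the code point CP as a list of octets.
;; Assumes CP is in [0, #x10FFFF] and outside [#xD800, #xDFFF].
(define (code-point->utf8-octets cp)
  ;; Build an octet #b10xxxxxx from the 6 bits of CP that start at
  ;; bit position SHIFT.
  (define (continuation shift)
    (bitwise-ior #b10000000
                 (bitwise-and #b00111111
                              (bitwise-arithmetic-shift-right cp shift))))
  (cond ((<= cp #x7F)
         (list cp))
        ((<= cp #x7FF)
         (list (bitwise-ior #b11000000 (bitwise-arithmetic-shift-right cp 6))
               (continuation 0)))
        ((<= cp #xFFFF)
         (list (bitwise-ior #b11100000 (bitwise-arithmetic-shift-right cp 12))
               (continuation 6)
               (continuation 0)))
        (else
         (list (bitwise-ior #b11110000 (bitwise-arithmetic-shift-right cp 18))
               (continuation 12)
               (continuation 6)
               (continuation 0)))))

(code-point->utf8-octets #x03BB)    ; ⇒ (#xCE #xBB)
(code-point->utf8-octets #x20AC)    ; ⇒ (#xE2 #x82 #xAC)
(code-point->utf8-octets #x10FFFF)  ; ⇒ (#xF4 #x8F #xBF #xBF)
```

The results can be compared with the standard R6RS function `string->utf8`, for example with `(bytevector->u8-list (string->utf8 "\x3BB;"))`.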

Note that octets #xFE and #xFF cannot appear in a valid stream of UTF-8 encoded characters. The sequence of 3 octets is the one that could encode (but must not) the forbidden range [#xD800, #xDFFF].

The first 128 characters of the Unicode character set correspond one–to–one with ASCII and are encoded using a single octet with the same binary value as the corresponding ASCII character, making valid ASCII text valid UTF-8 encoded Unicode text as well. Such encoded octets have the Most Significant Bit (MSB) set to zero.
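
A short sketch of the ASCII compatibility, using the standard R6RS function `string->utf8`. The helper name `ascii-octets?` is hypothetical.

```scheme
(import (rnrs))

;; Hypothetical helper: true when every octet in the bytevector BV has
;; the most significant bit set to zero.
(define (ascii-octets? bv)
  (for-all (lambda (octet) (< octet #x80))
           (bytevector->u8-list bv)))

(string->utf8 "ABC")                    ; ⇒ #vu8(65 66 67)
(ascii-octets? (string->utf8 "Scheme")) ; ⇒ #t
(ascii-octets? (string->utf8 "\x3BB;")) ; ⇒ #f, U+03BB needs 2 octets
```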

Although the Unicode standard neither requires nor recommends it, many programs start a UTF-8 stream with a Byte Order Mark (BOM) composed of the 3 octets: #xEF, #xBB, #xBF.
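
A reader of UTF-8 data may want to detect the mark before decoding. The following is a minimal sketch; the name `utf8-bom?` is hypothetical. The 3 octets are the UTF-8 encoding of the code point U+FEFF.

```scheme
(import (rnrs))

;; Hypothetical helper: true when the bytevector BV starts with the
;; octets #xEF #xBB #xBF.
(define (utf8-bom? bv)
  (and (<= 3 (bytevector-length bv))
       (= #xEF (bytevector-u8-ref bv 0))
       (= #xBB (bytevector-u8-ref bv 1))
       (= #xBF (bytevector-u8-ref bv 2))))

(utf8-bom? #vu8(#xEF #xBB #xBF #x41))  ; ⇒ #t
(utf8-bom? (string->utf8 "A"))         ; ⇒ #f
(string->utf8 "\xFEFF;")               ; ⇒ #vu8(#xEF #xBB #xBF)
```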

UTF-16 decoding

Given a 16-bit word in a UTF-16 stream, represented in Scheme as a fixnum in the range [#x0000, #xFFFF], we can classify it on the following axis:

0000        D7FF D800    DBFF DC00      DFFF E000       FFFF
 |-------------||-----------||-------------||------------|
  single word    first in     second in      single word
  character      pair         pair           character

word in [#x0000, #xD7FF] => single word character
word in [#xD800, #xDBFF] => first in surrogate pair
word in [#xDC00, #xDFFF] => second in surrogate pair
word in [#xE000, #xFFFF] => single word character
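
The classification above can be sketched as follows. The names `utf16-classify` and `utf16-decode-pair` are hypothetical. The second function uses the standard UTF-16 formula for combining a surrogate pair, which is not spelled out in the text above, and it assumes its arguments have already been classified as the first and second word of a pair.

```scheme
(import (rnrs))

;; Hypothetical helper: classify a 16-bit WORD from a UTF-16 stream.
(define (utf16-classify word)
  (cond ((<= #x0000 word #xD7FF) 'single-word-character)
        ((<= #xD800 word #xDBFF) 'first-in-pair)
        ((<= #xDC00 word #xDFFF) 'second-in-pair)
        ((<= #xE000 word #xFFFF) 'single-word-character)
        (else
         (assertion-violation 'utf16-classify
           "expected a fixnum in the range [#x0000, #xFFFF]" word))))

;; Hypothetical helper: combine the two words of a surrogate pair into
;; a code point in the range [#x10000, #x10FFFF].  Assumes WORD0 is in
;; [#xD800, #xDBFF] and WORD1 is in [#xDC00, #xDFFF].
(define (utf16-decode-pair word0 word1)
  (+ #x10000
     (bitwise-arithmetic-shift-left (- word0 #xD800) 10)
     (- word1 #xDC00)))

(utf16-classify #x0041)            ; ⇒ single-word-character
(utf16-classify #xD834)            ; ⇒ first-in-pair
(utf16-classify #xDD1E)            ; ⇒ second-in-pair
(utf16-decode-pair #xD834 #xDD1E)  ; ⇒ #x1D11E
```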

Latin-1 uses 1 octet per character. The first 256 Unicode code points are identical to the content of Latin-1, and the first 128 Latin-1 code points are identical to ASCII. For an introduction see:

Latin-1 code points in the range [0, 127] are identical to the same code points encoded in both ASCII and in UTF-8.

Latin-1 code points in the range [128, 255] are encoded differently in UTF-8: Latin-1 stores each of them in a single octet, while UTF-8 uses a sequence of 2 octets.

Every octet (that is: every fixnum in the range [0, 255]) can be interpreted as a character in Latin-1 encoding.
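
The difference between the two encodings can be observed with the standard R6RS codecs. This is only a sketch; the binding name `latin-1` is chosen here for illustration.

```scheme
(import (rnrs))

;; Transcoder built on the standard R6RS Latin-1 codec.
(define latin-1 (make-transcoder (latin-1-codec)))

;; Code points in [0, 127]: same octet in both encodings.
(string->bytevector "A" latin-1)        ; ⇒ #vu8(65)
(string->utf8 "A")                      ; ⇒ #vu8(65)

;; Code points in [128, 255]: 1 octet in Latin-1, 2 octets in UTF-8.
(string->bytevector "\xE9;" latin-1)    ; ⇒ #vu8(233)
(string->utf8 "\xE9;")                  ; ⇒ #vu8(195 169)

;; Any octet decodes to a character in Latin-1.
(bytevector->string #vu8(233) latin-1)  ; ⇒ "\xE9;"
```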
