Vicare Scheme: iklib chars unicode intro

6.22.3.1 Introduction to Unicode

The mandatory starting points for learning about Unicode are the following URLs:

http://www.unicode.org/faq/utf_bom.html

http://en.wikipedia.org/wiki/Universal_Character_Set

http://en.wikipedia.org/wiki/Unicode

http://en.wikipedia.org/wiki/Byte_order_mark

http://en.wikipedia.org/wiki/UTF-8

http://en.wikipedia.org/wiki/UTF-16

http://en.wikipedia.org/wiki/UTF-32

Here we give only a brief overview of the main definitions, drawing text from those pages. Let’s not forget the main source:

http://www.unicode.org/

The Universal Character Set (UCS) is a standard set of characters upon which many character encodings are based; it contains abstract characters, each identified by an unambiguous name and an integer number called its code point.

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems.

UCS and Unicode have identical repertoires and numbering: the same characters with the same numbers exist in both standards. UCS is a simple character map; Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts.

The Unicode Consortium, the non-profit organization that coordinates Unicode’s development, has the goal of eventually replacing existing character encoding schemes with Unicode and its standard “Unicode Transformation Format” (UTF) schemes, also known as “UCS Transformation Format”.

By convention a Unicode code point is referred to by writing ‘U+’ followed by its hexadecimal number with at least 4 digits (‘U+0044’ is fine, ‘U+12’ is not).
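
As an illustration of this convention, here is a minimal sketch of a formatter that pads the hexadecimal number to at least 4 digits. The name ‘code-point->u+’ is hypothetical: it is not part of Vicare or R6RS. The call to ‘string-upcase’ assumes that the letter case produced by ‘number->string’ may differ between implementations.

```scheme
(import (rnrs))

;; Hypothetical helper: format an exact non-negative integer
;; in the conventional U+XXXX notation.
(define (code-point->u+ cp)
  (let* ((hex (string-upcase (number->string cp 16)))
         ;; Number of leading zeros needed to reach 4 digits.
         (pad (max 0 (- 4 (string-length hex)))))
    (string-append "U+" (make-string pad #\0) hex)))

(code-point->u+ #x44)      ; ⇒ "U+0044"
(code-point->u+ #x12)      ; ⇒ "U+0012"
(code-point->u+ #x10FFFF)  ; ⇒ "U+10FFFF"
```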

In practice, Unicode code points are exact integers in the range ‘[0, #x10FFFF]’, excluding the range ‘[#xD800, #xDFFF]’, which is reserved for UTF-16 surrogates. A code point can be stored in 21 bits:

(string-length (number->string #x10FFFF 2)) ⇒ 21
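
The range constraint described above can be sketched as a predicate. The name ‘unicode-scalar-value?’ is hypothetical and chosen only for this illustration; it is not claimed to be a binding exported by Vicare.

```scheme
(import (rnrs))

;; Hypothetical predicate: true if N is an exact integer in
;; [0, #x10FFFF] but outside [#xD800, #xDFFF].
(define (unicode-scalar-value? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xD7FF)
           (<= #xE000 n #x10FFFF))))

(unicode-scalar-value? #x44)      ; ⇒ #t
(unicode-scalar-value? #xD800)    ; ⇒ #f
(unicode-scalar-value? #x10FFFF)  ; ⇒ #t
(unicode-scalar-value? #x110000)  ; ⇒ #f
```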

R6RS defines fixnums to have at least 24 bits, so a fixnum is wide enough to hold a code point:

(fixnum? #x10FFFF) ⇒ #t

and indeed Scheme characters are a disjoint type of value holding such fixnums:

(integer->char #x10FFFF) ⇒ #\x10FFFF
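
As a further check of the two claims above, the following sketch uses only standard R6RS bindings: ‘fixnum-width’ reports the fixnum width of the running implementation, and ‘char->integer’ is the inverse of ‘integer->char’. The helper ‘round-trip’ is a hypothetical name introduced for this example.

```scheme
(import (rnrs))

;; The implementation's fixnum width is at least the 24 bits
;; required by R6RS, hence more than the 21 bits needed.
(>= (fixnum-width) 24)  ; ⇒ #t

;; Hypothetical helper: convert a code point to a character
;; and back to an integer.
(define (round-trip cp)
  (char->integer (integer->char cp)))

(round-trip #x44)                  ; ⇒ 68
(round-trip #x10FFFF)              ; ⇒ 1114111
(char=? (integer->char #x44) #\D)  ; ⇒ #t
```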