

12.10 Character objects

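A Scheme character has two representations: a standalone character is a machine word of type ikptr, while a character stored in a string is a 32-bit integer of type ikchar. The least significant 32 bits of the two representations are equal. See unicode for details on Unicode.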

Let's say that machine words are 32-bit values, which means the word size is 4 bytes; then the representation of a character is:

     |    Unicode code point    | char tag
     
     |--------|--------|--------|--------|
       byte3    byte2    byte1    byte0

the least significant byte is set to #x0F: this “tags” the machine words which embed characters. On 64-bit machines, the layout is:

             Unused              |Unicode code point  |char tag
     |...........................|....................|......|
     
     |------|------|------|------|------|------|------|------|
      byte7  byte6  byte5  byte4  byte3  byte2  byte1  byte0

At the Scheme level: standalone characters are moved around as ikptr values, but when characters are stored in a string the ikptr value is converted to a 32-bit integer of type ikchar.
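The relation between the two representations can be sketched in C as follows. This is only an illustration: the type names `sketch_ikptr` and `sketch_ikchar`, the `SKETCH_*` constants and both functions are hypothetical stand-ins, and the shift of 8 bits is inferred from the diagrams above rather than taken from the real headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for ikptr and ikchar; not the real definitions. */
typedef uintptr_t sketch_ikptr;   /* one machine word */
typedef uint32_t  sketch_ikchar;  /* one string element */

/* Values inferred from the layout diagrams (assumptions). */
#define SKETCH_CHAR_TAG    ((sketch_ikptr)0x0F)
#define SKETCH_CHAR_SHIFT  8

/* Build the machine word representing a standalone character. */
static sketch_ikptr sketch_standalone_char(unsigned long code_point)
{
  assert(code_point <= 0x10FFFFUL);  /* largest Unicode code point */
  return (((sketch_ikptr)code_point) << SKETCH_CHAR_SHIFT) | SKETCH_CHAR_TAG;
}

/* Narrow a standalone character to the form stored in strings:
   only the least significant 32 bits are kept. */
static sketch_ikchar sketch_string_char(sketch_ikptr s_char)
{
  return (sketch_ikchar)s_char;
}
```

Under these assumptions the code point #x41 becomes the word #x410F, and narrowing it to 32 bits leaves the value unchanged on both 32-bit and 64-bit machines.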

Basic operations

Standalone characters are encoded into ikptr values as follows:

     unsigned long   unicode_code_point = the_code_point;
     ikptr           s_char;
     
     s_char = (unicode_code_point << char_shift) | char_tag;

decoded to unsigned long values as follows:

     ikptr           s_char = the_character;
     unsigned long   unicode_code_point;
     
     unicode_code_point = s_char >> char_shift;

and identified as follows:

     ikptr   X = the_value;
     
     if (char_tag == (char_mask & X))
       it_is_a_character();
     else
       it_is_not();
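The identification and decoding steps can be combined into a single guarded operation. The sketch below assumes a tag of #x0F, a mask of #xFF and a shift of 8 bits, consistent with the layout described above; the `sketch_` names are hypothetical and do not belong to the C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t sketch_ikptr;   /* hypothetical stand-in for ikptr */

/* Assumed values, inferred from the layout diagrams. */
#define SKETCH_CHAR_TAG    ((sketch_ikptr)0x0F)
#define SKETCH_CHAR_MASK   ((sketch_ikptr)0xFF)
#define SKETCH_CHAR_SHIFT  8

/* Return 1 if the tag bits of X mark it as a character, else 0. */
static int sketch_is_char(sketch_ikptr x)
{
  return SKETCH_CHAR_TAG == (SKETCH_CHAR_MASK & x);
}

/* Decode X only when it is tagged as a character.  Return 1 and store
   the code point on success; return 0 and leave *code_point untouched
   otherwise. */
static int sketch_decode_if_char(sketch_ikptr x, unsigned long *code_point)
{
  assert(code_point != NULL);
  if (!sketch_is_char(x))
    return 0;
  *code_point = (unsigned long)(x >> SKETCH_CHAR_SHIFT);
  return 1;
}
```

Checking the tag before shifting matters because other kinds of machine words would also produce some integer when shifted right, and that integer would be meaningless as a code point.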

Characters from a Scheme string are decoded from ikchar to unsigned long as follows:

     ikchar          ch = the_32bit_character;
     unsigned long   unicode_code_point;
     
     unicode_code_point = ch >> char_shift;

and encoded from unsigned long to ikchar as follows:

     unsigned long   unicode_code_point = the_code_point;
     ikchar          ch;
     
     ch = (ikchar)((unicode_code_point << char_shift) | char_tag);

— Type Definition: ikchar

An alias for uint32_t used to store a Unicode code point tagged as a character.
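As an illustration of how tagged 32-bit characters might sit in a buffer, the following sketch fills an array of string elements from an array of code points and reads one back. It does not reproduce the layout of a real Scheme string object; `sketch_ikchar`, the `SKETCH_*` constants and the three functions are hypothetical, and the tag and shift values are assumptions inferred from the layout above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t sketch_ikchar;   /* hypothetical stand-in for ikchar */

/* Assumed values, inferred from the layout diagrams. */
#define SKETCH_CHAR_TAG    0x0FU
#define SKETCH_CHAR_SHIFT  8

/* Tag a code point as a 32-bit character suitable for a string slot. */
static sketch_ikchar sketch_char32_from_code_point(unsigned long cp)
{
  assert(cp <= 0x10FFFFUL);  /* largest Unicode code point */
  return (sketch_ikchar)((cp << SKETCH_CHAR_SHIFT) | SKETCH_CHAR_TAG);
}

/* Recover the code point from a 32-bit tagged character. */
static unsigned long sketch_char32_to_code_point(sketch_ikchar ch)
{
  return (unsigned long)(ch >> SKETCH_CHAR_SHIFT);
}

/* Fill LEN string slots in DST from the code points in SRC. */
static void sketch_fill_chars(sketch_ikchar *dst, const unsigned long *src,
                              size_t len)
{
  for (size_t i = 0; i < len; ++i)
    dst[i] = sketch_char32_from_code_point(src[i]);
}
```

Because the largest code point, #x10FFFF, shifted left by 8 bits still fits in 32 bits, no information is lost when the tagged value is stored in a uint32_t slot.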

— Preprocessor Symbol: char_mask
— Preprocessor Symbol: char_tag

Integer values used to tag and recognise ikptr values representing characters. char_mask isolates the tag bits from an ikptr and char_tag represents the tag bits.

— Preprocessor Symbol: char_shift

Integer value representing the number of bits we must shift left to turn a C language long into a machine word ready to be tagged as a Scheme character.

Convenience preprocessor macros

— Preprocessor Macro: int IK_IS_CHAR (ikptr X)

Evaluate to true if X is a Scheme character.

— Preprocessor Macro: ikptr IK_CHAR_FROM_INTEGER (unsigned long X)
— Preprocessor Macro: unsigned long IK_CHAR_TO_INTEGER (ikptr X)

IK_CHAR_FROM_INTEGER converts an unsigned long value representing a Unicode code point into a Scheme character; IK_CHAR_TO_INTEGER performs the inverse conversion.

— Preprocessor Macro: ikchar IK_CHAR32_FROM_INTEGER (unsigned long X)

Convert an unsigned long value representing the Unicode code point into a 32-bit integer representing a Scheme character to be stored into a string.
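To show how macros of this kind might be written and combined, here is a minimal sketch. The `SKETCH_*` macros imitate the documented ones but are not their actual definitions; the types, the tag, mask and shift values, and the helper `sketch_retag_for_string` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for ikptr and ikchar. */
typedef uintptr_t sketch_ikptr;
typedef uint32_t  sketch_ikchar;

/* Assumed values, inferred from the layout diagrams. */
#define SKETCH_CHAR_TAG    ((sketch_ikptr)0x0F)
#define SKETCH_CHAR_MASK   ((sketch_ikptr)0xFF)
#define SKETCH_CHAR_SHIFT  8

/* Illustrative counterparts of the documented macros. */
#define SKETCH_IS_CHAR(X) \
  (SKETCH_CHAR_TAG == (SKETCH_CHAR_MASK & (sketch_ikptr)(X)))
#define SKETCH_CHAR_FROM_INTEGER(X) \
  ((sketch_ikptr)((((sketch_ikptr)(X)) << SKETCH_CHAR_SHIFT) | SKETCH_CHAR_TAG))
#define SKETCH_CHAR_TO_INTEGER(X) \
  ((unsigned long)(((sketch_ikptr)(X)) >> SKETCH_CHAR_SHIFT))
#define SKETCH_CHAR32_FROM_INTEGER(X) \
  ((sketch_ikchar)((((sketch_ikptr)(X)) << SKETCH_CHAR_SHIFT) | SKETCH_CHAR_TAG))

/* Convert a standalone character into the 32-bit form used in strings
   by going through its code point. */
static sketch_ikchar sketch_retag_for_string(sketch_ikptr s_char)
{
  assert(SKETCH_IS_CHAR(s_char));
  return SKETCH_CHAR32_FROM_INTEGER(SKETCH_CHAR_TO_INTEGER(s_char));
}
```

Every macro argument is parenthesised and cast before use, so that expressions such as `SKETCH_CHAR_FROM_INTEGER(a + b)` expand as intended.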