Libraries for Vicare Scheme: srfi strings ratio i18n

2.8.2.2 Internationalisation and super-ASCII character types

The major issue confronting this SRFI is the existence of super–ASCII character encodings, such as eight–bit Latin–1 or 16–bit and 32–bit Unicode. It is a design goal of this SRFI for the API to be portable across string implementations based on at least these three standard encodings. Unfortunately, this places strong limitations on the API design. Here are some relevant issues. Be warned that life in a super–ASCII world is significantly more complex; there are no easy answers for many of these issues.

Case mapping and case–folding

Upper–casing and lower–casing characters is complex in super–ASCII encodings.

- Some characters case–map to more than one character. For example, the Latin–1 German eszet character upper–cases to SS.
  - This means that the R5RS function char-upcase is not well–defined, since it is defined to produce a (single) character result.
  - It means that an in–place string-upcase! procedure cannot be reliably defined, since the original string may not be long enough to contain the result; an N–character string might upcase to a 2N–character result.
  - It means that case–insensitive string–matching or searching is quite tricky. For example, an N–character string s might match a 2N–character string s’.
- Some characters case–map in different ways depending upon their surrounding context. For example, the Unicode Greek capital sigma character downcases differently depending upon whether or not it is the final character in a word. Again, this spells trouble for the simple R5RS char-downcase function.
- Unicode defines three cases: lowercase, uppercase and titlecase. The distinction between uppercase and titlecase arises in the presence of Unicode’s compound characters. For example, Unicode has a single character representing the compound pair dz. Uppercasing the dz character produces the compound character DZ, while titlecasing (or, as Americans say, capitalizing) it produces the compound character Dz.
- Turkish has different case–mappings from other languages; for example, its dotted and dotless i are distinct letters, each with its own upper–case form.
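
The cases listed above can be observed in any runtime that implements Unicode's full case mappings. The following is a minimal sketch in Python, chosen here only for illustration; it is not part of this library. Python's built-in str.upper, str.lower and str.title apply full, context-sensitive mappings, which SRFI-13 deliberately does not do.

```python
# Illustration only: these are Python built-ins, not SRFI-13 procedures.

# One character can upcase to two, so the result outgrows the input.
eszet_word = "Stra\u00dfe"             # "Strasse" spelled with eszet
upcased = eszet_word.upper()
print(len(eszet_word), len(upcased))   # lengths differ

# Capital sigma downcases by context: a final form at the end of a word,
# the ordinary small sigma elsewhere.
greek_word = "\u039f\u0394\u039f\u03a3"
lowered = greek_word.lower()
isolated_sigma = "\u03a3".lower()
print(lowered[-1] == "\u03c2", isolated_sigma == "\u03c3")

# The dz digraph (U+01F3) has distinct uppercase and titlecase forms.
dz = "\u01f3"
dz_upper = dz.upper()   # U+01F1, DZ
dz_title = dz.title()   # U+01F2, Dz
print(dz_upper != dz_title)
```

Because the upcased string is longer than the original, an in-place upcase over a fixed-length string could not hold this result.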

The Unicode Consortium’s web site:

http://www.unicode.org/

has detailed discussions of the issues. See in particular technical report 21 on case mappings:

http://www.unicode.org/unicode/reports/tr21/

SRFI-13 makes no attempt to deal with these issues; it uses a simple one–to–one locale–independent and context–independent case–mapping, specifically Unicode’s one–to–one case–mappings given in:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

The format of this file is explained in:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html

Note that this means that German eszet upper–cases to itself, not SS.
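
As a rough sketch of what a one-to-one mapping looks like in practice, the Python fragment below builds an upcase table from lines written in the UnicodeData.txt layout (semicolon-separated fields, with the simple uppercase mapping in field 12, counting from zero). The names SAMPLE_LINES, build_simple_upcase and simple_upcase are hypothetical, and the three sample lines are a tiny excerpt in that layout rather than the full file; this is not how any particular Scheme implementation stores its tables.

```python
# Illustration only: a tiny one-to-one upcase table. SAMPLE_LINES is an
# abbreviated excerpt in the UnicodeData.txt field layout.
SAMPLE_LINES = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
    "LATIN SMALL LETTER E ACUTE;;00C9;;00C9",
]

def build_simple_upcase(lines):
    """Map each character to its single-character uppercase, if it has one."""
    table = {}
    for line in lines:
        fields = line.split(";")
        simple_upper = fields[12]          # empty when there is no mapping
        if simple_upper:
            table[chr(int(fields[0], 16))] = chr(int(simple_upper, 16))
    return table

def simple_upcase(text, table):
    """Upcase character by character; unmapped characters stay as they are."""
    return "".join(table.get(ch, ch) for ch in text)

UPCASE = build_simple_upcase(SAMPLE_LINES)
result = simple_upcase("a\u00df\u00e9", UPCASE)
print(len(result))   # same length as the input
```

Since the eszet line has an empty uppercase field, the character has no entry in the table and is returned unchanged, and the output always has the same length as the input.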

Case–mapping and case–folding operations in SRFI-13 are locale–independent so that shifting locales won’t wreck hash tables, b–trees, symbol tables, etc.

String equality and string normalisation

Comparing strings for equality is complicated because in some cases Unicode actually provides multiple encodings for the “same” character, and because what we usually think of as a “character” can be represented in Unicode as a sequence of several code–points. For example, consider the letter e with an acute accent. There is a single Unicode character for this. However, Unicode also allows one to represent this with a two–character sequence: the e character followed by a zero–width acute–accent character. As another example, Unicode provides some Asian characters in “narrow” and “full” widths.

There are multiple ways we might want to compare strings for equality. In (roughly) decreasing order of precision:

- we might want a precise comparison of the actual encoding, so that <e-acute> would not compare equal to <e, acute>;
- we might want a “normalised” comparison, where these two sequences would compare equal;
- we might want an even more–permissive normalisation, where visually–distinct properties of “the same” character would be ignored; for example, we might want narrow/full–width versions of the same Asian character to compare equal;
- we might want comparisons that are insensitive to accents and diacritical marks;
- we might want comparisons that are case–insensitive;
- we might want comparisons that are insensitive to several of the above properties;
- we might want ways to “normalise” strings into various canonical forms.
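
To make these levels concrete, here is a minimal sketch in Python using its standard unicodedata module. It is given for illustration only and is not part of this library. The helper strip_accents is a hypothetical name, and dropping combining marks after decomposition is only one crude way to approximate an accent-insensitive comparison.

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # e followed by a combining acute accent

def strip_accents(text):
    """Decompose, then drop combining marks (a crude approximation)."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))

# Precise comparison of the encoding: the two spellings differ.
exact_equal = composed == decomposed

# Normalised comparison: both spellings have the same canonical form.
canonical_equal = (unicodedata.normalize("NFC", composed)
                   == unicodedata.normalize("NFC", decomposed))

# More permissive: compatibility normalisation folds full-width forms.
fullwidth_a = "\uff21"
compat_equal = unicodedata.normalize("NFKC", fullwidth_a) == "A"

# Accent-insensitive and case-insensitive comparisons.
accent_equal = strip_accents(composed) == "e"
caseless_equal = "STRASSE".casefold() == "Stra\u00dfe".casefold()

print(exact_equal, canonical_equal, compat_equal, accent_equal, caseless_equal)
```

Only the first comparison reports a difference; each later one discards more information about how the text was written.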

This library does not address these complexities. SRFI-13 string equality is simply based upon comparing the encoding values used for the characters. Accent–insensitive and other types of comparison are not provided; only a simple form of case–insensitive comparison is provided, which uses the one–to–one case mappings specified by Unicode in:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

These are adequate for “program” or “systems” use of strings (e.g. to manipulate program identifiers and operating–system filenames).

String inequality

Above and beyond the issues arising in string–equality, when we attempt to order strings there are even further considerations.

- French orders accents with right–to–left significance, the reverse of the significance of the characters.
- Case–insensitive ordering is not well defined by simple code–point considerations, even for simple ASCII: there are punctuation characters between ASCII’s upper–case range of letters and its lower–case range (left–bracket, backslash, right–bracket, caret, underscore and backquote). Does left–bracket compare less–than or greater–than a in a case–insensitive comparison?
- The German eszet character should sort as if it were the pair of letters ss.
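
The left-bracket question can be checked directly. The sketch below, in Python and for illustration only, compares the two characters by code point after folding both to lower case and then after folding both to upper case; compare_folded is a hypothetical helper, not an SRFI-13 procedure.

```python
def compare_folded(a, b, fold):
    """Three-way code-point comparison after applying a case fold to both."""
    x, y = fold(a), fold(b)
    return (x > y) - (x < y)

# Left-bracket is code point 91; "A" is 65 and "a" is 97.
via_lower = compare_folded("[", "a", str.lower)   # compares 91 with 97
via_upper = compare_folded("[", "a", str.upper)   # compares 91 with 65

print(via_lower, via_upper)
```

The two results have opposite signs, so the answer depends entirely on which direction the implementation chooses to fold.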

Unicode defines a complex set of machinery for ordering or “collating” strings, which involves mapping each string to a multi–byte sort key, and then doing simple lexicographic sorting with these keys. These rules can be overlaid by additional domain–specific or language–specific rules. Again, this SRFI does not address these issues. SRFI-13 string ordering is strictly based upon a character–by–character comparison of the values used for representing the string.
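
The difference between ordering by raw character values and ordering by a derived sort key can be sketched as follows. This Python fragment is for illustration only; naive_sort_key is a hypothetical stand-in that strips accents and folds case, and it is far simpler than the Unicode collation machinery described above.

```python
import unicodedata

def naive_sort_key(text):
    """Hypothetical key: drop combining marks, then fold case."""
    parts = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in parts if not unicodedata.combining(ch))
    return base.casefold()

words = ["zebra", "\u00c4pfel", "apple"]   # the second word starts with A-umlaut

# Character-by-character comparison of code points: A-umlaut (196) sorts
# after every ASCII letter, so that word lands last.
by_code_point = sorted(words)

# Map each string to a key, then compare the keys lexicographically.
by_key = sorted(words, key=naive_sort_key)

print(by_code_point)
print(by_key)
```

Ordering by code point is cheap and stable across locales, which suits identifiers and filenames, but it gives results that readers of a natural language would not expect.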