

2.8.2.2 Internationalisation and super-ASCII character types

The major issue confronting this SRFI is the existence of super–ASCII character encodings, such as eight–bit Latin–1 or 16–bit and 32–bit Unicode. It is a design goal of this SRFI for the API to be portable across string implementations based on at least these three standard encodings. Unfortunately, this places strong limitations on the API design. Here are some relevant issues. Be warned that life in a super–ASCII world is significantly more complex; there are no easy answers for many of these issues.

Case mapping and case–folding

Upper–casing and lower–casing characters is complex in super–ASCII encodings.

The Unicode Consortium’s web site:

http://www.unicode.org/

has detailed discussions of the issues. See in particular technical report 21 on case mappings:

http://www.unicode.org/unicode/reports/tr21/

SRFI-13 makes no attempt to deal with these issues; it uses a simple one–to–one, locale–independent and context–independent case–mapping, specifically Unicode’s one–to–one case–mappings given in:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

The format of this file is explained in:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html

Note that this means that the German eszett (ß) upper–cases to itself, not to SS, because a one–to–one mapping cannot turn one character into two.

Case–mapping and case–folding operations in SRFI-13 are locale–independent so that shifting locales won’t wreck hash tables, b–trees, symbol tables, etc.
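
The difference between a one–to–one case mapping and a full case mapping can be sketched in a few lines. The example is written in Python rather than Scheme, purely as an illustration; it is not part of the SRFI-13 API. The helper names simple_upcase and simple_string_upcase are hypothetical, and the helper only approximates the one–to–one mappings of UnicodeData.txt by assuming that any character whose full upper–case form is longer than one character maps to itself.

```python
def simple_upcase(ch: str) -> str:
    """Approximate a one-to-one upper-case mapping for a single character.

    Assumption: if Python's full case mapping yields exactly one character,
    that character is used; otherwise the input character maps to itself.
    """
    mapped = ch.upper()
    return mapped if len(mapped) == 1 else ch


def simple_string_upcase(s: str) -> str:
    """Upper-case a string character by character, preserving its length."""
    return "".join(simple_upcase(ch) for ch in s)


full = "straße".upper()                   # full mapping: the string grows
simple = simple_string_upcase("straße")   # one-to-one mapping: eszett is kept

print(full)    # STRASSE
print(simple)  # STRAßE
```

Because the mapping is applied per character and ignores the locale, the result always has the same length as the input and does not change when the locale does.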

String equality and string normalisation

Comparing strings for equality is complicated because in some cases Unicode actually provides multiple encodings for the “same” character, and because what we usually think of as a “character” can be represented in Unicode as a sequence of several code–points. For example, consider the letter e with an acute accent. There is a single Unicode character for this. However, Unicode also allows one to represent this with a two–character sequence: the e character followed by a zero–width acute–accent character. As another example, Unicode provides some Asian characters in “narrow” and “full” widths.
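
The two representations of the accented e, and the narrow and full widths, can be observed directly. The following is a small sketch in Python (chosen only for illustration, not taken from the SRFI) that uses the standard unicodedata module; the variable names are arbitrary.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT, two code points

# Compared by code points, the two spellings are different strings.
same_code_points = precomposed == decomposed

# After canonical composition (NFC) they become identical.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Width variants are only unified by a compatibility normalisation (NFKC).
halfwidth_ka = "\uff76"  # HALFWIDTH KATAKANA LETTER KA
fullsize_ka = "\u30ab"   # KATAKANA LETTER KA
same_after_nfkc = unicodedata.normalize("NFKC", halfwidth_ka) == fullsize_ka

print(same_code_points, same_after_nfc, same_after_nfkc)  # False True True
```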

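There are multiple ways we might want to compare strings for equality, ranging, in roughly decreasing order of precision, from exact comparison of the encoding values, through comparison after normalisation, to accent–insensitive and case–insensitive comparison.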

This library does not address these complexities. SRFI-13 string equality is simply based upon comparing the encoding values used for the characters. Accent–insensitive and other types of comparison are not provided; only a simple form of case–insensitive comparison is provided, which uses the one–to–one case mappings specified by Unicode in:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

These are adequate for “program” or “systems” use of strings (e.g. to manipulate program identifiers and operating–system filenames).
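
As a rough model of the two comparisons described above, the sketch below (Python, for illustration only) compares strings by their code points and, for the case–insensitive variant, folds each character with a one–to–one lower–casing first. The function names are hypothetical, and the folding step is an assumption: it approximates the one–to–one mappings by leaving a character unchanged whenever its full lower–case form is longer than one character.

```python
def code_point_equal(a: str, b: str) -> bool:
    """Equality on the sequence of code point values, with no normalisation."""
    return [ord(c) for c in a] == [ord(c) for c in b]


def simple_downcase(ch: str) -> str:
    """Approximate one-to-one lower-casing (assumption, see lead-in)."""
    mapped = ch.lower()
    return mapped if len(mapped) == 1 else ch


def simple_ci_equal(a: str, b: str) -> bool:
    """Case-insensitive equality using per-character folding only."""
    return len(a) == len(b) and all(
        simple_downcase(x) == simple_downcase(y) for x, y in zip(a, b)
    )


print(code_point_equal("\u00e9", "e\u0301"))   # False: no normalisation
print(simple_ci_equal("Hello", "hELLO"))       # True
print(simple_ci_equal("straße", "STRASSE"))    # False: lengths differ
```

A comparison of this kind is predictable and cheap, which suits identifiers and filenames, but it treats canonically equivalent spellings as different strings.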

String inequality

Above and beyond the issues arising in string–equality, when we attempt to order strings there are even further considerations.

Unicode defines a complex set of machinery for ordering or “collating” strings, which involves mapping each string to a multi–byte sort key, and then doing simple lexicographic sorting with these keys. These rules can be overlaid by additional domain–specific or language–specific rules. Again, this SRFI does not address these issues. SRFI-13 string ordering is strictly based upon a character–by–character comparison of the values used for representing the string.
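
To make the consequence concrete, here is a small Python sketch (illustrative only, not SRFI-13 code) that orders strings character by character on their code point values. No collation rules are applied, so every upper–case ASCII letter sorts before every lower–case one, and accented letters sort after both.

```python
words = ["apple", "Zebra", "\u00c4pfel"]  # the last word starts with A-diaeresis

# Order by the list of code point values, compared element by element.
by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])

# Z is 90, a is 97 and A-diaeresis is 196, which fixes the order below.
print(by_code_point)  # ['Zebra', 'apple', 'Äpfel']
```

A language–aware collation would usually place the third word next to “apple”; the simple ordering trades that for independence from locale and language.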

