The major issue confronting this SRFI is the existence of super–ASCII character encodings, such as eight–bit Latin–1 or 16–bit and 32–bit Unicode. It is a design goal of this SRFI for the API to be portable across string implementations based on at least these three standard encodings. Unfortunately, this places strong limitations on the API design. Here are some relevant issues. Be warned that life in a super–ASCII world is significantly more complex; there are no easy answers for many of these issues.
Upper–casing and lower–casing characters is complex in super–ASCII encodings.
- Some characters case–map to more than one character. For example, the German eszet character upper–cases to SS.
  - This means that char-upcase is not well–defined, since it is defined to produce a (single) character result.
  - It also means that an in–place string-upcase! procedure cannot be reliably defined, since the original string may not be long enough to contain the result; an N–character string might upcase to a 2N–character result.
- Unicode distinguishes titlecase from uppercase for its compound characters, such as the single character representing the digraph dz. Uppercasing the dz character produces the compound character DZ, while titlecasing (or, as Americans say, capitalizing) it produces compound character Dz.
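The one–to–many and titlecase mappings described above can be observed in any language that implements full Unicode case mapping. The following is a small illustrative sketch in Python (not Scheme, and not part of SRFI-13); it uses only built-in str methods, and the variable names are chosen here for illustration.

```python
# Full (one-to-many) Unicode case mapping as implemented by Python's str methods.
# This is an illustration of the problem, not of SRFI-13 behaviour.

eszet = "\u00df"                 # LATIN SMALL LETTER SHARP S
eszet_upper = eszet.upper()      # "SS": one character becomes two

word = "stra\u00dfe"
word_upper = word.upper()        # the result is longer than the input

dz = "\u01f3"                    # LATIN SMALL LETTER DZ, a single code point
dz_upper = dz.upper()            # U+01F1, the compound character DZ
dz_title = dz.title()            # U+01F2, the compound character Dz

print(len(eszet), len(eszet_upper))
print(len(word), len(word_upper))
print(hex(ord(dz_upper)), hex(ord(dz_title)))
```

Because the upper–cased word is longer than the original, an in–place operation over a fixed–length string has nowhere to put the extra character.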
The Unicode Consortium’s web site has detailed discussions of the issues. See in particular technical report 21 on case mappings.
SRFI-13 makes no attempt to deal with these issues; it uses a simple one–to–one locale–independent and context–independent case–mapping, specifically the one–to–one case–mappings given in Unicode’s character data files, which are accompanied by a description of their format.
Note that this means that German eszet upper–cases to itself, not SS.
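A length–preserving, one–to–one mapping of this kind can be approximated as follows. This is a hedged sketch in Python rather than Scheme: simple_upcase is a hypothetical helper introduced here, and it only approximates Unicode's one–to–one mappings by leaving a character unchanged whenever its full upper–case form is longer than one character. Some characters have a one–to–one mapping in the Unicode data that differs from their full mapping; this sketch leaves those unchanged.

```python
def simple_upcase(text: str) -> str:
    """Length-preserving upcase (hypothetical helper, an approximation).

    A character is replaced by its upper-case form only when that form
    is exactly one character long; otherwise it is kept as-is.
    """
    result = []
    for ch in text:
        up = ch.upper()
        result.append(up if len(up) == 1 else ch)
    return "".join(result)


original = "stra\u00dfe"
upcased = simple_upcase(original)
print(upcased, len(original) == len(upcased))
```

Under this scheme the eszet maps to itself, so the result always has the same length as the input and an in–place variant becomes possible.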
Case–mapping and case–folding operations in SRFI-13 are locale–independent so that shifting locales won’t wreck hash tables, b–trees, symbol tables, etc.
Comparing strings for equality is complicated because in some cases Unicode actually provides multiple encodings for the “same” character, and because what we usually think of as a “character” can be represented in Unicode as a sequence of several code–points. For example, consider the letter e with an acute accent. There is a single Unicode character for this. However, Unicode also allows one to represent this with a two–character sequence: the e character followed by a zero–width acute–accent character. As another example, Unicode provides some Asian characters in “narrow” and “full” widths.
There are multiple ways we might want to compare strings for equality. In (roughly) decreasing order of precision:
- we might want a precise comparison of the actual encoding, so that <e-acute> would not compare equal to <e, acute>;
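The difference between an exact comparison and progressively more permissive ones can be sketched with Unicode normalization forms. The example below is illustrative Python, not SRFI-13 code; it uses the standard unicodedata module, and equal_under is a hypothetical helper name introduced here.

```python
import unicodedata


def equal_under(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

exact = precomposed == decomposed                        # encoding comparison
canonical = equal_under("NFC", precomposed, decomposed)  # normalized comparison

halfwidth_ka = "\uff76"    # HALFWIDTH KATAKANA LETTER KA ("narrow")
katakana_ka = "\u30ab"     # KATAKANA LETTER KA
width_nfc = equal_under("NFC", halfwidth_ka, katakana_ka)
width_nfkc = equal_under("NFKC", halfwidth_ka, katakana_ka)

print(exact, canonical, width_nfc, width_nfkc)
```

The exact comparison treats the two spellings of e-acute as different, canonical normalization treats them as equal, and only the more permissive compatibility normalization also equates the narrow and regular forms of the katakana letter.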
This library does not address these complexities. SRFI-13 string equality is simply based upon comparing the encoding values used for the characters. Accent–insensitive and other types of comparison are not provided; only a simple form of case–insensitive comparison is provided, which uses the one–to–one case mappings specified by Unicode.
These are adequate for “program” or “systems” use of strings (e.g. to manipulate program identifiers and operating–system filenames).
Above and beyond the issues arising in string–equality, when we attempt to order strings there are even further considerations.
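- Case–insensitive ordering is not well defined by code–point values alone, even in ASCII, where punctuation characters such as the left bracket lie between the upper–case and lower–case letters: does such a character come before or after a in a case–insensitive comparison?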
- The German eszet character should sort as if it were the pair of letters ss.
Unicode defines a complex set of machinery for ordering or “collating” strings, which involves mapping each string to a multi–byte sort key, and then doing simple lexicographic sorting with these keys. These rules can be overlaid by additional domain–specific or language–specific rules. Again, this SRFI does not address these issues. SRFI-13 string ordering is strictly based upon a character–by–character comparison of the values used for representing the string.
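The consequences of ordering strictly by character values can be seen in a short sketch. The example is illustrative Python (whose built-in string comparison is by code point), not SRFI-13 code. The toy_key function is a hypothetical stand-in for a sort key and is far simpler than Unicode's collation machinery; it only expands the eszet to ss.

```python
words = ["strata", "stra\u00dfe", "strasse", "[x", "Zebra", "apple"]

# Ordering by code point: "[" (91) falls between "Z" (90) and "a" (97),
# and the eszet (223) sorts after every ASCII letter.
by_code_point = sorted(words)


def toy_key(word: str) -> str:
    """Hypothetical sort key: treat the eszet as the letter pair "ss"."""
    return word.replace("\u00df", "ss")


# Map each string to a key, then sort the keys lexicographically.
by_toy_key = sorted(words, key=toy_key)

print(by_code_point)
print(by_toy_key)
```

With code–point ordering the eszet spelling sorts after "strata"; with the toy key it sorts alongside "strasse", which is closer to what the collation rules described above call for.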