Next: char-sets sets basic, Previous: char-sets sets intro, Up: char-sets sets [Index]
The following methods were employed to define the character sets.
(vicare containers char-sets) library itself.
(vicare containers char-sets blocks) library (there are a lot of them).
Each general category (Ll, Lu, etc.) has been converted to a set; they are exported by the (vicare containers char-sets categories) library (there are a lot of them). Notice that not all the code points accepted by char=? are part of a general category, so the union of the category sets is different from the full character set.
On a Unix–like system, the following commands can be used to inspect the UnicodeData.txt database:
wc -l <UnicodeData.txt
Count the lines in the database. Most lines describe a single code point, but not every code point has a line of its own: the file format allows ranges, so some pairs of lines give the first and last code point of a range of characters.
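As a rough illustration of why the line count is smaller than the number of code points, the sketch below computes how many code points a single pair of range lines covers. It uses the private use range E000 to F8FF, which appears later in this section; the variable names are illustrative only.

```shell
# Limits of one range, as found in the first field of a First/Last pair of lines.
first=E000
last=F8FF

# Two lines in the database, but many code points (both limits are inclusive).
range_size=$(( 0x$last - 0x$first + 1 ))
echo "lines: 2, code points: $range_size"
```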
cut -d';' -f3 <UnicodeData.txt | sort | uniq
Print the category codes in the database (use sort first, because uniq removes repeated lines only if they are adjacent). Notice that the Cs category (surrogate characters) is present in the database, but excluded from the Vicare libraries, because it describes the range [#xD800, #xDFFF] forbidden by R6RS.
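The pipeline above can be tried on a small stand-in for the database. The sketch below is only an illustration: sample_db is a hypothetical helper that prints a handful of lines in the UnicodeData.txt format, not the real file.

```shell
# A tiny stand-in for UnicodeData.txt (sample lines only, not the real database).
sample_db() {
cat <<'EOF'
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
EOF
}

# The third field is the general category; sort first so that uniq sees
# repeated codes on adjacent lines.
category_codes=$(sample_db | cut -d';' -f3 | sort | uniq)

# The same list with the surrogate category dropped.
category_codes_no_cs=$(printf '%s\n' "$category_codes" | grep -v Cs)
printf '%s\n' "$category_codes_no_cs"
```

With the real database, replacing sample_db with a redirection from UnicodeData.txt should give the full list of codes.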
grep ';Cs;' <UnicodeData.txt
Print only the lines describing the surrogate characters. They are six lines representing the three ranges:
[#xD800, #xDB7F]
Non private use high surrogate.
[#xDB80, #xDBFF]
Private use high surrogate.
[#xDC00, #xDFFF]
Low surrogate.
Notice that these ranges are adjacent and their union is the range [#xD800, #xDFFF].
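The adjacency of the three surrogate ranges can be checked with plain shell arithmetic. This is a minimal sketch using only the limits listed above; the variable names are illustrative.

```shell
# Each range must start exactly one code point after the previous one ends.
adjacent=yes
[ $(( 0xDB7F + 1 )) -eq $(( 0xDB80 )) ] || adjacent=no
[ $(( 0xDBFF + 1 )) -eq $(( 0xDC00 )) ] || adjacent=no

# Sizes of the three ranges (inclusive limits) and size of their union.
size_sum=$(( (0xDB7F - 0xD800 + 1) + (0xDBFF - 0xDB80 + 1) + (0xDFFF - 0xDC00 + 1) ))
union_size=$(( 0xDFFF - 0xD800 + 1 ))
echo "adjacent: $adjacent, sum of sizes: $size_sum, union size: $union_size"
```

If the ranges are adjacent and do not overlap, the sum of their sizes equals the size of [#xD800, #xDFFF].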
grep -v ';Cs;' <UnicodeData.txt | wc -l
Count the lines excluding the surrogate characters. The count should be 19330 (last verified with the database downloaded Wed Jun 23, 2009).
grep ';Ll;' <UnicodeData.txt
Extract all the lines describing the Ll category.
grep ', *\(First\|Last\)>' <UnicodeData.txt
Extract all the lines describing an inclusive limit (the first or the last code point) of a range of characters.
grep -v ', *\(First\|Last\)>' <UnicodeData.txt
Extract all the lines describing a single code point, excluding the lines describing the limit of a range of characters.
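To see how the two grep commands partition the file, the sketch below counts both kinds of lines in a small sample. sample_db is a hypothetical stand-in for the real database, and grep -E is used here as a portable substitute for the \| alternation in the commands above.

```shell
# A tiny stand-in for UnicodeData.txt (sample lines only, not the real database).
sample_db() {
cat <<'EOF'
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
EOF
}

# Lines that mark the first or last code point of a range.
range_limit_lines=$(sample_db | grep -c -E ', (First|Last)>')

# Lines that describe exactly one code point.
single_lines=$(sample_db | grep -c -v -E ', (First|Last)>')

echo "range limits: $range_limit_lines, single code points: $single_lines"
```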
The following Bourne shell script processes the UnicodeData.txt database and prints a Scheme program that, when evaluated, prints the definitions of the category character sets. The generated program makes use of the (vicare containers char-sets) library itself.
# unicode-database-extract-category-code-points.sh --
#

DATABASE=${1:?'missing UnicodeData.txt pathname'}
CATEGORY_CODES=$(cut -d';' -f3 <"$DATABASE" | sort | uniq | grep -v Cs)

echo '(import (rnrs) (vicare containers char-sets))'

for CATEGORY in $CATEGORY_CODES
do
    echo processing category $CATEGORY >&2
    echo -n "(define category-$CATEGORY (quote ("
    {
        grep ";$CATEGORY;" <"$DATABASE" | \
            grep -v ', *\(First\|Last\)>' | \
            cut -d';' -f1 | \
            while read
            do echo -n "#\x$REPLY "
            done
        grep ";$CATEGORY;" <"$DATABASE" | \
            grep ', *\(First\|Last\)>' | \
            cut -d';' -f1 | \
            while read
            do
                FIRST=$REPLY
                read
                LAST=$REPLY
                echo -n "(#\x$FIRST . #\x$LAST) "
            done
    }
    echo ')))'
    echo "(display \"(define char-set:category/$CATEGORY\")(newline)"
    echo "(char-set-write (apply char-set category-$CATEGORY))(newline)"
    echo '(display ")")(newline)'
    echo
done

### end of file
For example, the output for the Co category, which has only ranges, is (reformatted to look human readable):
(define category-Co
  (quote ((#\xE000 . #\xF8FF)
          (#\xF0000 . #\xFFFFD)
          (#\x100000 . #\x10FFFD))))
(display "(define char-set:category/Co")
(newline)
(char-set-write (apply char-set category-Co))
(newline)
(display ")")
(newline)
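The list of pairs in the Co output can be reproduced from the six range lines of that category alone. The sketch below isolates the pairing step under a few assumptions: co_lines is a hypothetical helper holding the Co lines in the UnicodeData.txt format, grep -E replaces the \| alternation, and named read variables with printf are used instead of REPLY and echo -n.

```shell
# The Co lines in the UnicodeData.txt format (sample input, not the whole database).
co_lines() {
cat <<'EOF'
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
EOF
}

# Keep only the range limits, take the code point field, then consume the
# lines two at a time: the first of each pair is the start, the second the end.
pairs=$(co_lines | grep -E ', (First|Last)>' | cut -d';' -f1 |
    while read -r first && read -r last
    do
        printf '(#\\x%s . #\\x%s) ' "$first" "$last"
    done)

printf '%s\n' "$pairs"
```

The pairing relies on every First line being immediately followed by its Last line, which is how the range lines appear in the database.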