Next: char-sets sets basic, Previous: char-sets sets intro, Up: char-sets sets [Index]
The following methods were employed to define the character sets.
(vicare containers
char-sets) library itself.
(vicare
containers char-sets blocks) library (there are a lot of them).
Ll, Lu, etc.) has been converted to
a set; they are exported by the (vicare containers char-sets
categories) library (there are a lot of them). Notice that not all the
code points accepted by char=? are part of a general category, so
the union of the category sets is different from the full character set.
On a Unix-like system, the following commands can be used to inspect the UnicodeData.txt database:
wc -l <UnicodeData.txt
    Count the lines in the database. Most lines describe a single code point, but not every code point has a line of its own: the file format allows ranges, each represented by a pair of lines giving the first and last code point of the range.
cut -d';' -f3 <UnicodeData.txt | sort | uniq
    Print the category codes in the database (use sort first, because uniq removes repeated lines only if they are adjacent). Notice that the Cs category (surrogate characters) is present in the database, but excluded from the Vicare libraries, because it describes the range [#xD800, #xDFFF] forbidden by R6RS.
grep ';Cs;' <UnicodeData.txt
    Print only the lines describing the surrogate characters. They are six lines representing the three ranges:
        [#xD800, #xDB7F]  Non private use high surrogate.
        [#xDB80, #xDBFF]  Private use high surrogate.
        [#xDC00, #xDFFF]  Low surrogate.
    Notice that these ranges are adjacent and their union is the range [#xD800, #xDFFF].
grep -v ';Cs;' <UnicodeData.txt | wc -l
    Count the lines excluding the surrogate characters. The count should be 19330 (last verified with the database downloaded Wed Jun 23, 2009).
grep ';Ll;' <UnicodeData.txt
    Extract all the lines describing the Ll category.
grep ', *\(First\|Last\)>' <UnicodeData.txt
    Extract all the lines describing the inclusive limit of a range of characters.
grep -v ', *\(First\|Last\)>' <UnicodeData.txt
    Extract all the lines describing a single code point, excluding the lines describing the limit of a range of characters.
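The difference between counting lines and counting code points can be checked with a short sketch. It is only an illustration: the input is a hand-written excerpt in the UnicodeData.txt format (an assumption standing in for the real database), and the helper name count_code_points is hypothetical. A single-code-point line counts as one; a First/Last pair counts as the size of its inclusive range.

```shell
#!/bin/sh
# Hand-written excerpt in the UnicodeData.txt format; an assumption
# standing in for the real database file.
SAMPLE=$(mktemp)
cat >"$SAMPLE" <<'EOF'
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
EOF

# count_code_points (hypothetical helper): read database lines from
# standard input and print how many code points they cover.
count_code_points () {
    total=0
    first=
    while IFS=';' read -r code name rest
    do
        case $name in
            *', First>') first=$code ;;                              # remember range start
            *', Last>')  total=$((total + 0x$code - 0x$first + 1)) ;; # inclusive range size
            *)           total=$((total + 1)) ;;                     # single code point
        esac
    done
    echo "$total"
}

# Exclude the surrogates, as the Vicare libraries do.
NLINES=$(grep -cv ';Cs;' "$SAMPLE")
NPOINTS=$(grep -v ';Cs;' "$SAMPLE" | count_code_points)
rm -f "$SAMPLE"

echo "lines: $NLINES, code points: $NPOINTS"
```

On this excerpt the sketch reports 4 lines but 6402 code points, because the two Co lines stand for the whole range [#xE000, #xF8FF].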
The following Bourne shell script processes the UnicodeData.txt
database and prints a Scheme program that, when evaluated, prints the
definitions of the category character sets. The generated program
uses the (vicare containers char-sets) library itself.
# unicode-database-extract-category-code-points.sh --
#
# Usage: unicode-database-extract-category-code-points.sh /path/to/UnicodeData.txt

DATABASE=${1:?'missing UnicodeData.txt pathname'}

# All the category codes in the database, excluding the surrogates.
CATEGORY_CODES=$(cut -d';' -f3 <"$DATABASE" | sort | uniq | grep -v Cs)

echo '(import (rnrs) (vicare containers char-sets))'

for CATEGORY in $CATEGORY_CODES
do
    echo "processing category $CATEGORY" >&2
    echo -n "(define category-$CATEGORY (quote ("
    {
        # Lines describing a single code point: print one character each.
        grep ";$CATEGORY;" <"$DATABASE" | \
            grep -v ', *\(First\|Last\)>' | \
            cut -d';' -f1 | \
            while read
            do echo -n "#\x$REPLY "
            done

        # Lines describing the limits of a range: they come in pairs,
        # first then last; print one dotted pair for each range.
        grep ";$CATEGORY;" <"$DATABASE" | \
            grep ', *\(First\|Last\)>' | \
            cut -d';' -f1 | \
            while read
            do
                FIRST=$REPLY
                read
                LAST=$REPLY
                echo -n "(#\x$FIRST . #\x$LAST) "
            done
    }
    echo ')))'

    # Code that, when evaluated, prints the definition of the set.
    echo "(display \"(define char-set:category/$CATEGORY\")(newline)"
    echo "(char-set-write (apply char-set category-$CATEGORY))(newline)"
    echo '(display ")")(newline)'
    echo
done

### end of file
For example, the output for the Co category, which has only
ranges, is (reformatted to look human readable):
(define category-Co (quote ((#\xE000 . #\xF8FF)
                            (#\xF0000 . #\xFFFFD)
                            (#\x100000 . #\x10FFFD))))
(display "(define char-set:category/Co")
(newline)
(char-set-write (apply char-set category-Co))
(newline)
(display ")")
(newline)
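The range-pairing step can be tried in isolation. The sketch below is not the script itself: it reads from a hand-written excerpt rather than the real database, the function names excerpt and pair_ranges are hypothetical, and it uses printf and grep -E in place of echo -n and the \| pattern, an assumption made because the behaviour of those two varies between shells and grep implementations.

```shell
#!/bin/sh
# Hand-written excerpt with the three Co ranges in the UnicodeData.txt
# format; an assumption standing in for the real database file.
excerpt () {
    cat <<'EOF'
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
EOF
}

# pair_ranges (hypothetical name): read code points two at a time from
# standard input and print each pair as a Scheme dotted pair of characters.
pair_ranges () {
    while read -r first && read -r last
    do
        # In a printf format, \\ produces one literal backslash.
        printf '(#\\x%s . #\\x%s) ' "$first" "$last"
    done
}

# Keep only the range-limit lines, extract the code point field, pair them.
PAIRS=$(excerpt | grep -E ', *(First|Last)>' | cut -d';' -f1 | pair_ranges)
printf '%s\n' "$PAIRS"
```

The pairing relies on each First line being immediately followed by its Last line, which holds for the excerpt and is the same assumption the script above makes about the database.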