

26.6.2 How predefined sets were defined

The following methods were employed to define the character sets.

Inspecting the database

On a Unix-like system, the following commands can be used to inspect the UnicodeData.txt database:

wc -l <UnicodeData.txt

Count the lines in the database. Most lines describe a single code point, but not every code point has its own line: the file format allows ranges, in which a pair of lines gives the first and last code points of a range of characters.

cut -d';' -f3 <UnicodeData.txt | sort | uniq

Print the category codes in the database (use sort first, because uniq removes repeated lines only if they are adjacent). Notice that the Cs category (surrogate characters) is present in the database, but excluded from the Vicare libraries, because it describes the range [#xD800, #xDFFF] forbidden by R6RS.

grep ';Cs;' <UnicodeData.txt

Print only the lines describing the surrogate characters. They are six lines representing the three ranges:

[#xD800, #xDB7F]

Non private use high surrogate.

[#xDB80, #xDBFF]

Private use high surrogate.

[#xDC00, #xDFFF]

Low surrogate.

Notice that these ranges are adjacent and their union is the range [#xD800, #xDFFF].
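The adjacency claim can be checked with shell arithmetic. This is only a minimal sketch; the helper name ranges_adjacent is hypothetical and not part of the Vicare sources.

```shell
# Hypothetical helper: succeed when a range ending at $1 is immediately
# followed by a range starting at $2 (both given as 0x... hexadecimal).
ranges_adjacent () {
    [ $(( $1 + 1 )) -eq $(( $2 )) ]
}

ranges_adjacent 0xDB7F 0xDB80 && echo 'high surrogate ranges are adjacent'
ranges_adjacent 0xDBFF 0xDC00 && echo 'high and low surrogate ranges are adjacent'

# Size of the union [#xD800, #xDFFF]; prints 2048.
echo $(( 0xDFFF - 0xD800 + 1 ))
```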

grep -v ';Cs;' <UnicodeData.txt | wc -l

Count the lines excluding the surrogate characters. The count should be 19330 (last verified with the database downloaded Wed Jun 23, 2009).

grep ';Ll;' <UnicodeData.txt

Extract all the lines describing the Ll category.

grep ', *\(First\|Last\)>' <UnicodeData.txt

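Extract all the lines describing the inclusive limits of the ranges of characters. The \| alternation is an extension to POSIX basic regular expressions; it is supported by GNU grep.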

grep -v ', *\(First\|Last\)>' <UnicodeData.txt

Extract all the lines describing a single code point, excluding the lines describing the limits of the ranges of characters.
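Because each First/Last pair stands for many code points, the line counts above understate the number of code points the database describes. The following is a minimal sketch of a count that expands the ranges. The function name count_code_points is hypothetical, and the sketch assumes that every First line is immediately followed by its Last line, as in the published file.

```shell
# Hypothetical helper: count the code points described by a file in
# UnicodeData.txt format, expanding every First/Last pair of lines.
count_code_points () {
    awk -F';' '
        # Convert a hexadecimal string to a number (portable awk has no strtonum).
        function hex(s,    i, n) {
            n = 0
            s = toupper(s)
            for (i = 1; i <= length(s); i++)
                n = n * 16 + index("0123456789ABCDEF", substr(s, i, 1)) - 1
            return n
        }
        $2 ~ /, *First>$/ { first = hex($1); next }              # start of a range
        $2 ~ /, *Last>$/  { total += hex($1) - first + 1; next } # inclusive size
        { total++ }                                              # single code point
        END { print total + 0 }
    ' "$1"
}

# Usage: count_code_points UnicodeData.txt
```

The result depends on the version of the database, so no expected figure is given here.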

The following shell script processes the UnicodeData.txt database and prints a Scheme program that, when evaluated, prints the definitions of the category character sets. The script is written for Bash: it relies on read storing each input line in the REPLY variable. The generated program itself uses the (vicare containers char-sets) library.

#!/bin/bash
# unicode-database-extract-category-code-points.sh --
#
# Print a Scheme program defining one character set for each general
# category found in the UnicodeData.txt database.

DATABASE=${1:?'missing UnicodeData.txt pathname'}

# All the category codes (third field), excluding the surrogates.
CATEGORY_CODES=$(cut -d';' -f3 <"$DATABASE" | sort | uniq | grep -v Cs)

echo '(import (rnrs) (vicare containers char-sets))'

for CATEGORY in $CATEGORY_CODES
do
    echo "processing category $CATEGORY" >&2
    printf '(define category-%s (quote (' "$CATEGORY"
    {
        # Lines describing a single code point.
        grep ";$CATEGORY;" <"$DATABASE"   | \
            grep -v ', *\(First\|Last\)>' | \
            cut -d';' -f1                 | \
            while read -r
        do printf '#\\x%s ' "$REPLY"
        done

        # Pairs of lines describing the limits of a range.
        grep ";$CATEGORY;" <"$DATABASE"   | \
            grep ', *\(First\|Last\)>'    | \
            cut -d';' -f1                 | \
            while read -r
        do
            FIRST=$REPLY
            read -r
            LAST=$REPLY
            printf '(#\\x%s . #\\x%s) ' "$FIRST" "$LAST"
        done
    }
    echo ')))'
    echo "(display \"(define char-set:category/$CATEGORY\")(newline)"
    echo "(char-set-write (apply char-set category-$CATEGORY))(newline)"
    echo '(display ")")(newline)'
    echo
done

### end of file

For example, the output for the Co category, which has only ranges, is (reformatted for readability):

(define category-Co (quote ((#\xE000 . #\xF8FF)
                            (#\xF0000 . #\xFFFFD)
                            (#\x100000 . #\x10FFFD))))

(display "(define char-set:category/Co")
(newline)
(char-set-write (apply char-set category-Co))
(newline)
(display ")")
(newline)
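The step that turns First/Last lines into pairs like the ones above can be tried in isolation. The following is a minimal sketch, not the script itself: the function name pair_ranges is hypothetical, and the sample lines are modelled on the private use entries of UnicodeData.txt.

```shell
# Hypothetical helper: read code points from standard input two at a
# time and print each pair in the notation used by the generated program.
pair_ranges () {
    while read -r first && read -r last
    do printf '(#\\x%s . #\\x%s) ' "$first" "$last"
    done
    echo
}

# Sample lines modelled on the private use entries of the database.
printf '%s\n' \
    'E000;<Private Use, First>;Co;0;L;;;;;N;;;;;' \
    'F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;' |
    cut -d';' -f1 | pair_ranges
```

For the sample input this prints the single pair (#\xE000 . #\xF8FF), matching the first range in the Co output.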
