Libraries for Vicare Scheme: srfi regexps syntax charsets

2.39.7.5 Character sets

A character set pattern matches a single character.

<char>

A singleton char set.

(regexp-matches '(* #\-) "---")  ⇒ #<regexp-match>
(regexp-matches '(* #\-) "-_-")  ⇒ #f

"<char>"

A singleton char set written as a string of length one rather than as a character. It is equivalent to a literal match of that string, but is listed here to make clear that it can be composed in cset-sres.
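As a minimal sketch of that composition, the one-character string ‘"-"’ can be passed to the complement operator ‘~’, which accepts only cset-sres. The import line assumes that the library is available as ‘(srfi :115)’; adjust it if your installation uses another name.

```scheme
;; Assumption: SRFI-115 is importable under this name.
(import (rnrs) (srfi :115))

;; "-" acts as a singleton char set, so it can be complemented.
(regexp-matches '(* (~ "-")) "abc")   ; ⇒ #<regexp-match>
(regexp-matches '(* (~ "-")) "a-c")   ; ⇒ #f
```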

<char-set>

A SRFI-14 character set, which matches any character in the set. SRFI-14 character sets currently have no portable written representation, so this pattern is typically generated programmatically, for example with a quasiquoted expression.

(regexp-partition `(+ ,char-set:vowels) "vowels")
⇒ ("v" "o" "w" "e" "ls")

RATIONALE Many useful character sets are likely to be available as SRFI-14 char-sets, so it is desirable to reuse them in regular expressions. Since many Unicode character sets are extremely large, converting back and forth between an internal and external representation can be expensive, so the option of direct embedding is necessary. When a readable external representation is needed, char-set->sre can be used.
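The following sketch shows both routes mentioned above: embedding a char-set object directly with quasiquote, and converting it to a readable SRE with char-set->sre. The name ‘vowels’ is hypothetical and defined here only for illustration, and the library names ‘(srfi :14)’ and ‘(srfi :115)’ are assumptions. The exact form returned by char-set->sre depends on the implementation, so it is not shown.

```scheme
;; Assumption: both SRFIs are importable under these names.
(import (rnrs) (srfi :14) (srfi :115))

;; Hypothetical char set built with SRFI-14.
(define vowels (string->char-set "aeiou"))

;; Direct embedding of the char-set object.
(regexp-matches `(* ,vowels) "oui")     ; ⇒ #<regexp-match>
(regexp-matches `(* ,vowels) "ouais")   ; ⇒ #f

;; Readable external representation matching the same characters.
(define vowels-sre (char-set->sre vowels))
(regexp-matches `(* ,vowels-sre) "oui") ; ⇒ #<regexp-match>
```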

(char-set <string>)

(<string>)

The set of characters in ‘<string>’, as formed by the SRFI-14 expression:

(string->char-set <string>)

Note that char-sets contain code points, not grapheme clusters, so any combining characters in ‘<string>’ will be inserted separately from any preceding base characters by string->char-set.

(regexp-matches '(* ("aeiou")) "oui")         ⇒ #<regexp-match>
(regexp-matches '(* ("aeiou")) "ouais")       ⇒ #f
(regexp-matches '(* ("e\x0301;")) "e\x0301;") ⇒ #<regexp-match>
(regexp-matches '("e\x0301;") "e\x0301;")     ⇒ #f
(regexp-matches '("e\x0301;") "e")            ⇒ #<regexp-match>
(regexp-matches '("e\x0301;") "\x0301;")      ⇒ #<regexp-match>
(regexp-matches '("e\x0301;") "\x00E9;")      ⇒ #f

(char-range <range-spec> ...)

(/ <range-spec> ...)

Ranged char set. Each ‘<range-spec>’ is a string or a character. The arguments are flattened into a single sequence of characters and grouped into pairs, and all ranges formed by the pairs are included in the char set. For example, ‘(/ "AZ09")’ covers ‘A’ through ‘Z’ and ‘0’ through ‘9’.

(regexp-matches '(* (/ "AZ09")) "R2D2")  ⇒ #<regexp-match>
(regexp-matches '(* (/ "AZ09")) "C-3PO") ⇒ #f
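Because the range specs are flattened before pairing, strings and characters can in principle be mixed freely. The sketch below writes the same two ranges in a different way; it follows the description above, assumes the library name ‘(srfi :115)’, and has not been checked against every implementation.

```scheme
;; Assumption: SRFI-115 is importable under this name.
(import (rnrs) (srfi :115))

;; #\A #\Z and "09" flatten to A Z 0 9, that is the ranges A-Z and 0-9.
(regexp-matches '(* (/ #\A #\Z "09")) "R2D2")   ; ⇒ #<regexp-match>

;; Matching is case-sensitive by default: r and d are outside A-Z.
(regexp-matches '(* (/ #\A #\Z "09")) "r2d2")   ; ⇒ #f
```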

(or <cset-sre> ...)

(|\|| <cset-sre> ...)

Char set union. The single vertical bar form is provided for consistency and compatibility with SCSH, although in R7RS the symbol must be written escaped, as ‘|\||’.

NOTE The syntax ‘|\||’ is not supported by Vicare.
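A short illustration of union using the ‘or’ form, which is the one to use under Vicare. The import line assumes the library name ‘(srfi :115)’.

```scheme
;; Assumption: SRFI-115 is importable under this name.
(import (rnrs) (srfi :115))

;; Union of lower-case letters and digits.
(regexp-matches '(* (or (/ "az") (/ "09"))) "r2d2")   ; ⇒ #<regexp-match>
(regexp-matches '(* (or (/ "az") (/ "09"))) "R2D2")   ; ⇒ #f
```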

(complement <cset-sre> ...)

(~ <cset-sre> ...)

Char set complement (‘[^...]’ in PCRE notation). ‘(~ x)’ is equivalent to ‘(- any x)’, so in an ASCII context the complement is always ASCII.
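The equivalence can be sketched as follows, with the complement of the vowels written both ways. The import line assumes the library name ‘(srfi :115)’.

```scheme
;; Assumption: SRFI-115 is importable under this name.
(import (rnrs) (srfi :115))

;; Complement form: any character that is not a vowel.
(regexp-matches '(* (~ ("aeiou"))) "xyzzy")      ; ⇒ #<regexp-match>
(regexp-matches '(* (~ ("aeiou"))) "vowels")     ; ⇒ #f

;; The same set written as a difference from the named set any.
(regexp-matches '(* (- any ("aeiou"))) "xyzzy")  ; ⇒ #<regexp-match>
(regexp-matches '(* (- any ("aeiou"))) "vowels") ; ⇒ #f
```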

(difference <cset-sre> ...)

(- <cset-sre> ...)

Char set difference: the characters of the first ‘<cset-sre>’ that are in none of the following ones.

(regexp-matches '(* (- (/ "az") ("aeiou"))) "xyzzy")
⇒ #<regexp-match>

(regexp-matches '(* (- (/ "az") ("aeiou"))) "vowels")
⇒ #f

(and <cset-sre> ...)

(& <cset-sre> ...)

Char set intersection.

(regexp-matches '(* (& (/ "az") (~ ("aeiou")))) "xyzzy")
⇒ #<regexp-match>

(regexp-matches '(* (& (/ "az") (~ ("aeiou")))) "vowels")
⇒ #f