Typically a character in the regexp matches the same character in the
text string. Sometimes it is necessary or convenient to use a regexp
metasequence to refer to a single character. Thus, metasequences
\n, \r, \t, and \. match the newline, return, tab and period
characters respectively.
The metacharacter period (.) matches any character other than
newline.
(pregexp-match "p.t" "pet") ⇒ ("pet")
It also matches pat, pit, pot, put, and p8t but not peat or pfffft.
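The difference between the metacharacter . and the metasequence \. can be seen in a small sketch. It assumes pregexp-match is already in scope (how the library is loaded depends on your Scheme and is not shown here), and the backslash is doubled because the regexp is written as a Scheme string:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "p.t" "pat")     ; ⇒ ("pat")   . matches the a
(pregexp-match "p.t" "peat")    ; ⇒ #f        two characters between p and t
(pregexp-match "p\\.t" "p.t")   ; ⇒ ("p.t")   \. matches only a literal period
(pregexp-match "p\\.t" "pet")   ; ⇒ #f
```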
A character class matches any one character from a set of characters. A
typical format for this is the bracketed character class [...],
which matches any one character from the non-empty sequence of
characters enclosed within the brackets.(20)
Thus p[aeiou]t matches pat, pet, pit, pot, put and nothing else.
Inside the brackets, a hyphen (-) between two characters
specifies the ASCII range between the characters. Eg,
ta[b-dgn-p] matches tab, tac, tad, tag, tan, tao, and tap.
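A short sketch of explicit sets and ranges, again assuming pregexp-match is in scope. The test strings are ours, chosen to fall on either side of the ranges:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "p[aeiou]t" "pyt")     ; ⇒ #f       y is not in the set
(pregexp-match "ta[b-dgn-p]" "tac")   ; ⇒ ("tac")  c lies in b-d
(pregexp-match "ta[b-dgn-p]" "tag")   ; ⇒ ("tag")  g is listed on its own
(pregexp-match "ta[b-dgn-p]" "tao")   ; ⇒ ("tao")  o lies in n-p
(pregexp-match "ta[b-dgn-p]" "tam")   ; ⇒ #f       m is in neither range
```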
An initial caret (^) after the left bracket inverts the set
specified by the rest of the contents, ie, it specifies the set of
characters other than those identified in the brackets. Eg,
do[^g] matches all three-character sequences starting with do
except dog.
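As a sketch of the inverted set (assuming pregexp-match is in scope), note that an inverted class still has to consume a character, so do on its own does not match:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "do[^g]" "dot")   ; ⇒ ("dot")
(pregexp-match "do[^g]" "dog")   ; ⇒ #f   g is excluded
(pregexp-match "do[^g]" "do")    ; ⇒ #f   no third character to match
```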
Note that the metacharacter ^ inside brackets means something
quite different from what it means outside. Most other metacharacters
(., *, +, ?, etc.) cease to be metacharacters when inside brackets,
although we may still escape them for peace of mind. - is a
metacharacter only when it's inside brackets, and neither the first
nor the last character.
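The following sketch illustrates these rules, assuming pregexp-match is in scope. The patterns and test strings are ours and are not taken from the pregexp distribution:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "[.*+?]" "a+b")   ; ⇒ ("+")  + is an ordinary character here
(pregexp-match "[.]" "abc")      ; ⇒ #f     . no longer means "any character"
(pregexp-match "[a-]" "x-y")     ; ⇒ ("-")  trailing hyphen stands for itself
(pregexp-match "[-a]" "x-y")     ; ⇒ ("-")  so does a leading hyphen
(pregexp-match "[a-c]" "x-y")    ; ⇒ #f     here the hyphen forms the range a-c
```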
Bracketed character classes cannot contain other bracketed character
classes (although they can contain certain other types of character
classes; see below). Thus a left bracket ([) inside a bracketed
character class doesn't have to be a metacharacter; it can stand for
itself. Eg, [a[b] matches a, [, and b.
Furthermore, since empty bracketed character classes are disallowed, a
right bracket (]) immediately occurring after the opening left
bracket also doesn't need to be a metacharacter. Eg, []ab] matches
], a, and b.
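Both bracket rules can be tried out as follows. This is a sketch that assumes pregexp-match is in scope; the test strings are ours:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "[a[b]" "x[y")   ; ⇒ ("[")  a left bracket inside the class is literal
(pregexp-match "[]ab]" "x]y")   ; ⇒ ("]")  a right bracket in first position is literal
(pregexp-match "[]ab]" "xby")   ; ⇒ ("b")
```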
Some standard character classes can be conveniently represented as
metasequences instead of as explicit bracketed expressions. \d
matches a digit ([0-9]); \s matches a whitespace character; \w
matches a character that could be part of a “word”.(21)
The upper-case versions of these metasequences stand for the inversions
of the corresponding character classes. Thus \D matches a
non-digit, \S a non-whitespace character, and \W a
non-“word” character.
Remember to include a double backslash when putting these metasequences in a Scheme string:
(pregexp-match "\\d\\d" "0 dear, 1 have 2 read catch 22 before 9") ⇒ ("22")
These character classes can be used inside a bracketed expression. Eg,
[a-z\\d] matches a lower-case letter or a digit.
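A sketch of the inverted metasequences and of a metasequence used inside brackets, assuming pregexp-match is in scope. Each backslash is doubled because the regexp is a Scheme string, and the test strings are ours:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "\\D" "22a")       ; ⇒ ("a")  first non-digit
(pregexp-match "\\S" "  x")       ; ⇒ ("x")  first non-whitespace character
(pregexp-match "\\W" "ab_c!")     ; ⇒ ("!")  underscore counts as a "word" character
(pregexp-match "[a-z\\d]" "A7")   ; ⇒ ("7")  A is upper-case, so the digit matches first
(pregexp-match "[a-z\\d]" "AB")   ; ⇒ #f
```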
A POSIX character class is a special metasequence of the form
[:...:] that can be used only inside a bracketed expression. The
POSIX classes supported are:
[:alnum:]
Letters and digits.
[:alpha:]
Letters.
[:algor:]
The letters c, h, a and d.
[:ascii:]
7-bit ASCII characters.
[:blank:]
Widthful whitespace, ie, space and tab.
[:cntrl:]
“Control” characters, viz, those with code < 32.
[:digit:]
Digits, same as \d.
[:graph:]
Characters that use ink.
[:lower:]
Lower-case letters.
[:print:]
Ink-users plus widthful whitespace.
[:space:]
Whitespace, same as \s.
[:upper:]
Upper-case letters.
[:word:]
Letters, digits, and underscore, same as \w.
[:xdigit:]
Hex digits.
For example, the regexp [[:alpha:]_] matches a letter or
underscore.
(pregexp-match "[[:alpha:]_]" "--x--") ⇒ ("x")
(pregexp-match "[[:alpha:]_]" "--_--") ⇒ ("_")
(pregexp-match "[[:alpha:]_]" "--:--") ⇒ #f
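Several POSIX classes can sit side by side in one bracketed expression. A brief sketch, assuming pregexp-match is in scope; the patterns and test strings are ours:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "[[:upper:][:digit:]]" "abC")   ; ⇒ ("C")  upper-case letter or digit
(pregexp-match "[[:xdigit:]]" "xyzF")          ; ⇒ ("F")  x, y and z are not hex digits
(pregexp-match "[[:lower:]]" "ABC")            ; ⇒ #f
```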
The POSIX class notation is valid only inside a bracketed expression.
For instance, [:alpha:], when not inside a bracketed expression,
will not be read as the letter class. Rather it is (from previous
principles) the character class containing the characters :, a, l, p, h.
(pregexp-match "[:alpha:]" "--a--") ⇒ ("a")
(pregexp-match "[:alpha:]" "--_--") ⇒ #f
By placing a caret (^) immediately after [:, we get the
inversion of that POSIX character class. Thus, [:^alpha:] is
the class containing all characters except the letters.
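Like any POSIX class, the inverted form must itself appear inside a bracketed expression. A minimal sketch, assuming pregexp-match is in scope; the test strings are ours:

```scheme
;; Assumes pregexp-match is in scope (e.g. pregexp.scm has been loaded).
(pregexp-match "[[:^alpha:]]" "ab1")   ; ⇒ ("1")  first character that is not a letter
(pregexp-match "[[:^alpha:]]" "abc")   ; ⇒ #f     every character is a letter
(pregexp-match "[[:^digit:]]" "42x")   ; ⇒ ("x")
```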
(20) Requiring a bracketed character class to be non-empty is not a limitation, since an empty character class can be more easily represented by an empty string.
(21) Following regexp custom, we identify “word” characters as
[A-Za-z0-9_], although these are too restrictive for what a
Schemer might consider a “word”.