A regexp is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which we will call the text string. The text string is treated as raw text and not as a pattern.
Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern abc matches a string that contains the characters a, b, c in succession.
In the regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern a.c, the characters a and c do stand for themselves but the metacharacter . can match any character (other than newline). Therefore, the pattern a.c matches an a, followed by any character, followed by a c.
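The two cases above, literal characters and the . metacharacter, can be tried out with a short sketch. It assumes the pregexp library has been loaded from a file named pregexp.scm (the file name and load path are assumptions about your installation) and that pregexp-match returns a list of matched strings on success and #f on failure. The variable names are only illustrative.

```scheme
;; Assumption: the pregexp library lives in a file called pregexp.scm
;; that is reachable from the current directory or load path.
(load "pregexp.scm")

;; Literal characters match themselves, anywhere in the text string.
(define literal-hit  (pregexp-match "abc" "xxabcxx"))  ; => ("abc")
(define literal-miss (pregexp-match "abc" "abd"))      ; => #f

;; The metacharacter . stands for any one character other than newline,
;; so it accepts a letter as readily as a literal dot.
(define dot-hit-letter (pregexp-match "a.c" "abc"))    ; => ("abc")
(define dot-hit-dot    (pregexp-match "a.c" "a.c"))    ; => ("a.c")
```

Note that the match need not span the whole text string: the pattern only has to match some portion of it.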
If we needed to match the character . itself, we escape it, i.e., precede it with a backslash (\). The character sequence \. is thus a metasequence, since it doesn’t match itself but rather just the . character. So, to match a followed by a literal . followed by c, we use the regexp pattern a\\.c. Another example of a metasequence is \t, which is a readable way to represent the tab character.
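To see the effect of escaping the dot, here is a minimal sketch. As before, it assumes the pregexp library is available in a file named pregexp.scm (an assumption about your setup) and that pregexp-match returns a list of matches or #f. The variable names are hypothetical.

```scheme
;; Assumption: the pregexp library lives in a file called pregexp.scm.
(load "pregexp.scm")

;; With the dot escaped, only a literal . is accepted between a and c.
(define escaped-hit  (pregexp-match "a\\.c" "a.c"))  ; => ("a.c")
(define escaped-miss (pregexp-match "a\\.c" "abc"))  ; => #f
```

Compare this with the unescaped pattern a.c, which would match both text strings.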
We will call the string representation of a regexp the U-regexp, where U can be taken to mean Unix-style or universal, because this notation for regexps is universally familiar. Our implementation uses an intermediate tree-like representation called the S-regexp, where S can stand for Scheme, symbolic, or s-expression. S-regexps are more verbose and less readable than U-regexps, but they are much easier for Scheme’s recursive procedures to navigate.
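As a rough illustration of the two representations, the sketch below converts a U-regexp to an S-regexp with the library's pregexp procedure and then matches with each form. It assumes the library is loaded from pregexp.scm (an assumed file name) and that the matcher accepts either representation. The exact shape of the S-regexp tree is an implementation detail, so the sketch checks only that it is a list structure and does not print it.

```scheme
;; Assumption: the pregexp library lives in a file called pregexp.scm.
(load "pregexp.scm")

(define u-regexp "a.c")               ; U-regexp: an ordinary string
(define s-regexp (pregexp u-regexp))  ; S-regexp: a nested list (a tree)

;; The S-regexp is a pair, not a string.
(define s-regexp-is-tree (pair? s-regexp))

;; Matching with either representation should give the same result.
(define via-u (pregexp-match u-regexp "abc"))
(define via-s (pregexp-match s-regexp "abc"))
```

Converting once and reusing the S-regexp can be worthwhile when the same pattern is matched against many text strings, since the string form would otherwise be translated on each call.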
The double backslash in a\\.c is an artifact of Scheme strings, not the regexp pattern itself. When we want a literal backslash inside a Scheme string, we must escape it so that it shows up in the string at all. Scheme strings use backslash as the escape character, so we end up with two backslashes: one Scheme-string backslash to escape the regexp backslash, which then escapes the dot. Another character that would need escaping inside a Scheme string is the double quote (").
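The string-escaping point can be checked without any regexp library at all, using only standard Scheme string procedures. The variable names below are hypothetical.

```scheme
;; Written with two backslashes in source, the string holds only one.
(define pattern "a\\.c")

(define pattern-length (string-length pattern))  ; => 4
(define pattern-chars  (string->list pattern))   ; => (#\a #\\ #\. #\c)

;; A double quote must likewise be escaped to appear inside a string.
(define quote-string "\"")
(define quote-length (string-length quote-string))  ; => 1
```

So the regexp matcher sees the four-character pattern a\.c, which is the pattern we intended.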