

51.1 Introduction

A regexp is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which we will call the text string. The text string is treated as raw text and not as a pattern.

Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern abc matches a string that contains the characters a, b, c in succession.

In the regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern a.c, the characters a and c do stand for themselves but the metacharacter . can match any character (other than newline). Therefore, the pattern a.c matches an a, followed by any character, followed by a c.
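The following is a small sketch of the two patterns just described. It assumes that the pregexp procedures, in particular pregexp-match, are already loaded; how to load them depends on your Scheme implementation. The variable names are made up for illustration.

```scheme
;; Assumes the pregexp procedures are already loaded; how to load them
;; depends on your Scheme implementation.

;; Ordinary characters match themselves, anywhere in the text string.
(define literal-match (pregexp-match "abc" "xxabcxx"))   ; ("abc")

;; The metacharacter . matches any one character other than newline.
(define dot-match (pregexp-match "a.c" "a-c"))           ; ("a-c")

;; Here no character sits between a and c, so the match fails.
(define no-match (pregexp-match "a.c" "ac"))             ; #f
```

A successful match returns a list whose first element is the matched portion of the text string; a failed match returns #f.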

If we need to match the character . itself, we escape it, that is, precede it with a backslash (\). The character sequence \. is thus a metasequence, since it doesn’t match itself but rather just the single character . alone. So, to match a followed by a literal . followed by c, we use the regexp pattern a\\.c (see footnote 19). Another example of a metasequence is \t, which is a readable way to represent the tab character.
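As a rough illustration of the escaped dot, the sketch below writes the pattern as a Scheme string and tries it against two text strings. It again assumes pregexp-match is already loaded; the variable names are hypothetical.

```scheme
;; Assumes the pregexp procedures are already loaded.

;; Written as a Scheme string, the pattern needs two backslashes, but the
;; string itself holds only four characters: a \ . c
(define escaped-dot "a\\.c")
(define pattern-length (string-length escaped-dot))             ; 4

;; The escaped dot matches a literal . and nothing else.
(define literal-dot-match (pregexp-match escaped-dot "a.c"))    ; ("a.c")
(define other-char-match (pregexp-match escaped-dot "abc"))     ; #f
```

Checking the string length is a quick way to confirm that the regexp matcher sees a single backslash, even though the source text shows two.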

We will call the string representation of a regexp the U-regexp, where U can be taken to mean Unix-style or universal, because this notation for regexps is universally familiar. Our implementation uses an intermediate tree-like representation called the S-regexp, where S can stand for Scheme, symbolic, or s-expression. S-regexps are more verbose and less readable than U-regexps, but they are much easier for Scheme’s recursive procedures to navigate.
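The two representations can be compared with a short sketch. It assumes the library's pregexp procedure, which converts a U-regexp into an S-regexp, and assumes that pregexp-match accepts either form; both are taken from the pregexp library rather than from this section. The printed shape of the S-regexp may vary between versions, so it is not shown here.

```scheme
;; Assumes the pregexp procedures are already loaded.

;; Convert the U-regexp "a.c" into its tree-like S-regexp once.
(define compiled (pregexp "a.c"))

;; Matching with the string pattern and with the converted pattern
;; should give the same result.
(define from-string (pregexp-match "a.c" "xabcx"))
(define from-compiled (pregexp-match compiled "xabcx"))
```

Converting once and reusing the S-regexp avoids re-parsing the pattern string when the same pattern is matched many times.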


Footnotes

(19)

The double backslash is an artifact of Scheme strings, not of the regexp pattern itself. Scheme strings use backslash as their own escape character, so a literal backslash inside a Scheme string must itself be escaped in order to appear in the string at all. We therefore end up with two backslashes: one Scheme-string backslash to escape the regexp backslash, which in turn escapes the dot. Another character that would need escaping inside a Scheme string is ".

