Clustering, i.e., enclosure within parens (...), identifies the enclosed subpattern as a single entity. It causes the matcher to capture the submatch, or the portion of the string matching the subpattern, in addition to the overall match.
(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970") ⇒ ("jan 1, 1970" "jan" "1" "1970")
Clustering also causes a following quantifier to treat the entire enclosed subpattern as an entity.
(pregexp-match "(poo )*" "poo poo platter") ⇒ ("poo poo " "poo ")
The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.
(pregexp-match "([a-z ]+;)*" "lather; rinse; repeat;") ⇒ ("lather; rinse; repeat;" " repeat;")
Here the *-quantified subpattern matches three times, but it is the last submatch that is returned.
It is also possible for a quantified subpattern to fail to match, even if the overall pattern matches. In such cases, the failing submatch is represented by #f.
(define date-re
  ;match `month year' or `month day, year'.
  ;subpattern matches day, if present
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))
(pregexp-match date-re "jan 1, 1970") ⇒ ("jan 1, 1970" "jan" "1," "1970")
(pregexp-match date-re "jan 1970") ⇒ ("jan 1970" "jan" #f "1970")
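Code that consumes submatches therefore has to allow for #f. The following is a minimal sketch, assuming the pregexp procedures are already loaded; date-day is a hypothetical helper written for illustration and is not part of pregexp.

```scheme
;; Same pattern as date-re above; the second subpattern is optional.
(define date-re
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))

;; Hypothetical helper (not part of pregexp): returns the day as a
;; number, or #f when the string has no day or does not match at all.
(define (date-day str)
  (let ((m (pregexp-match date-re str)))
    (and m
         (let ((day (caddr m)))   ; e.g. "1," or #f
           (and day
                ;; drop the trailing comma before converting
                (string->number
                 (substring day 0 (- (string-length day) 1))))))))

(date-day "jan 1, 1970") ; ⇒ 1
(date-day "jan 1970")    ; ⇒ #f
```

The two and forms guard against a failed overall match and a failed submatch respectively.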
Submatches can be used in the insert string argument of the procedures pregexp-replace and pregexp-replace*. The insert string can use \n as a backreference to refer back to the n-th submatch, i.e., the substring that matched the n-th subpattern. \0 refers to the entire match, and it can also be specified as \&.
(pregexp-replace "_(.+?)_" "the _nina_, the _pinta_, and the _santa maria_" "*\\1*")
⇒ "the *nina*, the _pinta_, and the _santa maria_"
(pregexp-replace* "_(.+?)_" "the _nina_, the _pinta_, and the _santa maria_" "*\\1*")
⇒ "the *nina*, the *pinta*, and the *santa maria*"
Recall that \S stands for a non-whitespace character:
(pregexp-replace "(\\S+) (\\S+) (\\S+)" "eat to live" "\\3 \\2 \\1") ⇒ "live to eat"
Use \\ in the insert string to specify a literal backslash. Also, \$ stands for an empty string, and is useful for separating a backreference \n from an immediately following number.
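The insert-string escapes described above can be tried out as follows. This is a small sketch, assuming the pregexp procedures are loaded; the input strings are made up for illustration, and the results shown are what the rules above imply. Note that each backslash meant for pregexp must itself be doubled inside a Scheme string literal.

```scheme
;; \& (equivalently \0) inserts the entire match.
(pregexp-replace* "[0-9]+" "room 12, floor 3" "<\\&>")
;; ⇒ "room <12>, floor <3>"

;; \\ in the insert string (written "\\\\" in Scheme source)
;; yields one literal backslash.
(pregexp-replace "/" "usr/bin" "\\\\")
;; ⇒ "usr\\bin"   ; i.e., the seven characters usr\bin

;; \$ keeps the backreference \1 apart from the digit 0 after it;
;; without it, the insert string would be read as backreference \10.
(pregexp-replace "([0-9])" "a1b" "\\1\\$0")
;; ⇒ "a10b"
```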
Backreferences can also be used within the regexp pattern to refer back to an already matched subpattern in the pattern. \n stands for an exact repeat of the n-th submatch.[22]
(pregexp-match "([a-z]+) and \\1" "billions and billions") ⇒ ("billions and billions" "billions")
Note that the backreference is not simply a repeat of the previous subpattern. Rather it is a repeat of the particular substring already matched by the subpattern.
In the above example, the backreference can only match billions. It will not match millions, even though the subpattern it harks back to, ([a-z]+), would have had no problem doing so:
(pregexp-match "([a-z]+) and \\1" "billions and millions") ⇒ #f
The following corrects doubled words:
(pregexp-replace* "(\\S+) \\1" "now is the the time for all good men to to come to the aid of of the party" "\\1") ⇒ "now is the time for all good men to come to the aid of the party"
The following marks all immediately repeating patterns in a number string:
(pregexp-replace* "(\\d+)\\1" "123340983242432420980980234" "{\\1,\\1}") ⇒ "12{3,3}40983{24,24}3242{098,098}0234"
It is often required to specify a cluster (typically for quantification) but without triggering the capture of submatch information. Such clusters are called non-capturing. In such cases, use (?: instead of ( as the cluster opener. In the following example, the non-capturing cluster eliminates the “directory” portion of a given pathname, and the capturing cluster identifies the basename.
(pregexp-match "^(?:[a-z]*/)*([a-z]+)$" "/usr/local/bin/mzscheme") ⇒ ("/usr/local/bin/mzscheme" "mzscheme")
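For contrast, here is a sketch of the same pathname match with a capturing and a non-capturing directory cluster, assuming the pregexp procedures are loaded. With the capturing cluster the result carries an extra submatch, which, by the last-submatch rule given earlier, holds only the final directory component.

```scheme
;; Capturing directory cluster: an extra submatch appears, holding
;; the last substring that the *-quantified subpattern matched.
(pregexp-match "^([a-z]*/)*([a-z]+)$" "/usr/local/bin/mzscheme")
;; ⇒ ("/usr/local/bin/mzscheme" "bin/" "mzscheme")

;; Non-capturing directory cluster: only the basename is captured,
;; so it is always the second element of the result.
(pregexp-match "^(?:[a-z]*/)*([a-z]+)$" "/usr/local/bin/mzscheme")
;; ⇒ ("/usr/local/bin/mzscheme" "mzscheme")
```

Using (?: keeps the position of the basename in the result independent of how the directory part is grouped.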
The location between the ? and the : of a non-capturing cluster is called a cloister.[23] We can put modifiers there that will cause the enclustered subpattern to be treated specially. The modifier i causes the subpattern to match case-insensitively:
(pregexp-match "(?i:hearth)" "HeartH") ⇒ ("HeartH")
The modifier x causes the subpattern to match space-insensitively, i.e., spaces and comments within the subpattern are ignored. Comments are introduced as usual with a semicolon (;) and extend till the end of the line. If we need to include a literal space or semicolon in a space-insensitized subpattern, escape it with a backslash.
(pregexp-match "(?x: a lot)" "alot") ⇒ ("alot")
(pregexp-match "(?x: a \\ lot)" "a lot") ⇒ ("a lot")
(pregexp-match "(?x:
   a \\ man \\; \\   ; ignore
   a \\ plan \\; \\  ; me
   a \\ canal        ; completely
 )"
 "a man; a plan; a canal")
⇒ ("a man; a plan; a canal")
The parameter pregexp-comment-char contains the comment character (#\;). For Perl-like comments,
(parameterise ((pregexp-comment-char #\#)) ---)
We can put more than one modifier in the cloister.
(pregexp-match "(?ix:
   a \\ man \\; \\   ; ignore
   a \\ plan \\; \\  ; me
   a \\ canal        ; completely
 )"
 "A Man; a Plan; a Canal")
⇒ ("A Man; a Plan; a Canal")
A minus sign before a modifier inverts its meaning. Thus, we can use -i and -x in a subcluster to overturn the insensitivities caused by an enclosing cluster.
(pregexp-match "(?i:the (?-i:TeX)book)" "The TeXbook") ⇒ ("The TeXbook")
This regexp will allow any casing for the and book but insists that TeX not be differently cased.
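A quick way to see both halves of that claim, assuming the pregexp procedures are loaded (the test strings are made up for illustration):

```scheme
;; "the " and "book" sit under the enclosing i modifier,
;; so any casing of them is accepted.
(pregexp-match "(?i:the (?-i:TeX)book)" "THE TeXBOOK")
;; ⇒ ("THE TeXBOOK")

;; The inner -i cluster makes TeX case-sensitive again,
;; so a differently cased "tex" is rejected.
(pregexp-match "(?i:the (?-i:TeX)book)" "The texbook")
;; ⇒ #f
```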
[22] \0, which is useful in an insert string, makes no sense within the regexp pattern, because the entire regexp has not yet matched, so there is nothing to refer back to.
[23] A useful, if terminally cute, coinage from the abbots of Perl.