The following utility regular expressions are also provided for common patterns that people are eternally reinventing. They are not necessarily the official patterns matching the RFC definitions of the given data, because of the way that such patterns tend to be used. There are three general usages for regexps:
Search for a pattern matching a desired object in a larger text.
Determine whether an entire string matches a pattern.
Given a string already known to be valid, extract certain fields from it as submatches.
In some cases, but not always, these will overlap. When they are different, irregex-search will naturally always want the searching version, so (vicare irregex) provides that version.
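As a rough illustration of the three usages, the sketch below applies the named pattern integer in each role. It assumes that (vicare irregex) exports the usual irregex procedures irregex-search, irregex-match and irregex-match-substring, and that named submatches written with => are supported; the sample strings and the submatch names major and minor are made up for this example.

```scheme
(import (rnrs) (vicare irregex))

;; Usage 1: search for an integer somewhere inside a larger text.
;; irregex-search returns a match object, or #f when nothing is found.
(define found (irregex-search 'integer "order 42 has shipped"))
(when found
  (display (irregex-match-substring found))
  (newline))

;; Usage 2: test whether an entire string is an integer.
;; irregex-match succeeds only if the whole string matches.
(display (if (irregex-match 'integer "42") "valid" "invalid"))
(newline)
(display (if (irregex-match 'integer "42abc") "valid" "invalid"))
(newline)

;; Usage 3: extract fields from a string already known to be valid.
;; The submatch names are hypothetical, chosen for this sketch.
(define version
  (irregex-match '(: (=> major integer) "." (=> minor integer)) "3.14"))
(when version
  (display (irregex-match-substring version 'major))
  (newline)
  (display (irregex-match-substring version 'minor))
  (newline))
```

Referring to the submatches by name rather than by number keeps the extraction independent of any submatches a named pattern might contain internally.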
As an example where these might be different, consider a URL. If we want to match all the URLs in some arbitrary text, we probably want to exclude a period or comma at the tail end of a URL, since it’s more likely being used as punctuation rather than part of the URL, despite the fact that it would be valid URL syntax.
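One way to check how the searching version treats trailing punctuation is to run the http-url pattern over a sentence and print what it captured. This is only a sketch: the sample text is invented, and because the pattern definitions are subject to change, the exact substring returned should be verified against the installed version rather than assumed.

```scheme
(import (rnrs) (vicare irregex))

;; Made-up sample text in which a comma directly follows the URL.
(define text "See http://example.com/docs, or ask on the mailing list.")

;; Print the URL found by the searching pattern, if any.
;; Inspect the output to see whether the trailing comma was excluded.
(let ((m (irregex-search 'http-url text)))
  (if m
      (begin
        (display (irregex-match-substring m))
        (newline))
      (begin
        (display "no URL found")
        (newline))))
```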
Another problem with the RFC definitions is that the standard itself may have become irrelevant. For example, the pattern (vicare irregex) provides for email addresses doesn’t match quoted local parts (e.g. "first last"@domain.com) because these are increasingly rare, and unsupported by enough software that it’s better to discourage their use. Conversely, consecutive periods (e.g. first..last@domain.com) are technically not allowed in email addresses, but most email software does allow them, and in fact such addresses are quite common in Japan.
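The behaviour described above can be probed with a small loop over sample addresses. This is a sketch under the assumption that irregex-match accepts the named pattern email directly; the plain address user@domain.com is added here for comparison, and the results for each address are printed instead of being asserted, since the pattern definition may change.

```scheme
(import (rnrs) (vicare irregex))

;; Sample addresses: an ordinary one (added for comparison), a quoted
;; local part, and a local part with consecutive periods.
(define addresses
  '("user@domain.com"
    "\"first last\"@domain.com"
    "first..last@domain.com"))

;; Report whether each whole string matches the email pattern.
(for-each
 (lambda (addr)
   (display addr)
   (display " => ")
   (display (if (irregex-match 'email addr) "accepted" "rejected"))
   (newline))
 addresses)
```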
The current patterns provided are:
newline
General newline pattern (crlf, cr, lf).
integer
An integer.
real
A real number (including scientific).
string
A “quoted” string.
symbol
An R6RS Scheme symbol.
ipv4-address
A numeric decimal IPv4 address.
ipv6-address
A numeric hexadecimal IPv6 address.
domain
A domain name.
email
An email address.
http-url
A URL beginning with https?://.
Because of these issues, the exact definitions of these patterns are subject to change, but will be documented clearly when they are finalized. More common patterns are also planned, but as the data to be matched grows more complex it’s probably better to use a real parser.
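Since the utility patterns are ordinary named SREs, they can presumably be embedded in larger expressions like any other SRE. The sketch below combines ipv4-address and domain to pull two fields out of a log-style line; the line format and the submatch names addr and host are hypothetical, and named submatches with => are assumed to be available.

```scheme
(import (rnrs) (vicare irregex))

;; A hypothetical log line, made up for this example.
(define line "client=192.168.0.7 host=example.org")

;; Embed the utility patterns inside a larger SRE and name the fields.
(define m
  (irregex-search '(: "client=" (=> addr ipv4-address)
                      " host=" (=> host domain))
                  line))

;; Print each extracted field when the search succeeds.
(when m
  (display (irregex-match-substring m 'addr))
  (newline)
  (display (irregex-match-substring m 'host))
  (newline))
```

For input much more structured than this, the advice above still applies: a real parser is likely to be easier to maintain than a growing SRE.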