

50.9.6 Utility patterns

The following utility regular expressions are also provided for common patterns that people are eternally reinventing. They do not necessarily follow the RFC definitions of the data they describe, because of the way such patterns tend to be used in practice. There are three general usages for regexps:

Searching

Search for a pattern matching a desired object in a larger text.

Validation

Determine whether an entire string matches a pattern.

Extraction

Given a string already known to be valid, extract certain fields from it as submatches.

In some cases, but not always, these will overlap. When they differ, irregex-search naturally wants the searching version, so that is the version (vicare irregex) provides.
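The three usages can be sketched with a small hand-written pattern. This is only an illustration: date-sre is a hypothetical SRE invented for the example, and the import line assumes the library name (vicare irregex) used in this section. Searching uses irregex-search, validation uses irregex-match (which must match the entire string), and extraction reads named submatches from a match object.

```scheme
(import (rnrs) (vicare irregex))

;; Hypothetical pattern for an ISO-style date; not one of the utility patterns.
(define date-sre
  '(seq (=> year (= 4 numeric)) "-"
        (=> month (= 2 numeric)) "-"
        (=> day (= 2 numeric))))

;; Searching: find the pattern somewhere inside a larger text.
(define found
  (irregex-search date-sre "released on 2024-01-15, patched later"))
(display (irregex-match-substring found))   ; prints 2024-01-15
(newline)

;; Validation: the whole string must match, so trailing text is rejected.
(display (if (irregex-match date-sre "2024-01-15") #t #f))          ; prints #t
(newline)
(display (if (irregex-match date-sre "2024-01-15 (draft)") #t #f))  ; prints #f
(newline)

;; Extraction: pull fields out of a string already known to be valid.
(define m (irregex-match date-sre "2024-01-15"))
(display (irregex-match-substring m 'year))  ; prints 2024
(newline)
```

The same SRE serves all three purposes here; the utility patterns differ precisely in the cases where one pattern cannot serve searching and validation equally well.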

As an example where these might be different, consider a URL. If we want to match all the URLs in some arbitrary text, we probably want to exclude a period or comma at the tail end of a URL, since it is more likely being used as punctuation than as part of the URL, despite the fact that it would be valid URL syntax.

Another problem with the RFC definitions is that the standard itself may have become irrelevant in practice. For example, the pattern (vicare irregex) provides for email addresses doesn’t match quoted local parts (e.g. "first last"@domain.com) because these are increasingly rare, and unsupported by enough software that it’s better to discourage their use. Conversely, consecutive periods (e.g. first..last@domain.com) are technically not allowed in email addresses, but most email software does allow them, and in fact such addresses are quite common in Japan.
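One way to see how the email pattern treats these cases is to try it on a few addresses. The sketch below assumes that the symbol email can be passed directly as an SRE, and that (vicare irregex) is the library to import; email-accepted? is a hypothetical helper written for this example. No expected output is shown, because the exact definition of the pattern is subject to change.

```scheme
(import (rnrs) (vicare irregex))

;; Hypothetical helper: does the whole string match the email utility pattern?
(define (email-accepted? str)
  (if (irregex-match 'email str) #t #f))

;; Print each sample address followed by the result of the check.
(for-each
 (lambda (addr)
   (display addr)
   (display " => ")
   (display (email-accepted? addr))
   (newline))
 '("first.last@domain.com"         ; ordinary address
   "first..last@domain.com"        ; consecutive periods
   "\"first last\"@domain.com"))   ; quoted local part
```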

The current patterns provided are:

newline

General newline pattern (crlf, cr, lf).

integer

An integer.

real

A real number (including scientific notation).

string

A “quoted” string.

symbol

An R6RS Scheme symbol.

ipv4-address

A numeric decimal IPv4 address.

ipv6-address

A numeric hexadecimal IPv6 address.

domain

A domain name.

email

An email address.

http-url

A URL beginning with https?://.
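Since these patterns are named SREs, they can presumably be embedded in larger expressions like any other SRE. The following sketch combines domain, ipv4-address and integer to split a host:port string into named submatches. endpoint-sre and the sample input are hypothetical, and the import line assumes the library name (vicare irregex) used in this section.

```scheme
(import (rnrs) (vicare irregex))

;; Hypothetical composite pattern: a host (domain name or IPv4 address),
;; a colon, then a port number, each captured as a named submatch.
(define endpoint-sre
  '(seq (=> host (or domain ipv4-address))
        ":"
        (=> port integer)))

;; irregex-match requires the whole string to match the pattern.
(define m (irregex-match endpoint-sre "example.com:8080"))

;; Display the captured fields only when the input was accepted.
(when m
  (display (irregex-match-substring m 'host))
  (newline)
  (display (irregex-match-substring m 'port))
  (newline))
```

Using irregex-match rather than irregex-search makes this a validation plus extraction step; for scanning free text, the same SRE could be passed to irregex-search instead.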

Because of these issues, the exact definitions of these patterns are subject to change, but they will be documented clearly once they are finalized. Additional common patterns are also planned, but as the data to be matched grows in complexity, it is probably better to use a real parser.

