Libraries for Vicare Scheme: srfi regexps procs

2.39.6 Library procedures and syntax

Function: regexp regexp re

Compiles a regexp if given an object whose structure matches the SRE syntax. This may be written as a literal or partial literal with quote or quasiquote, or may be generated entirely programmatically. Return re unmodified if it is already a regexp. Raise an error if re is neither a regexp nor a valid representation of an SRE.

Mutating re may invalidate the resulting regexp, causing unspecified results if subsequently used for matching.
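As a minimal sketch of the constructor, the program below compiles one SRE from a quoted literal and one built with quasiquote. The import name (srfi :115) and the names digits, sep and dotted are assumptions made for illustration only.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

;; Compile an SRE written as a quoted literal.
(define digits (regexp '(+ numeric)))

;; Build an SRE programmatically; sep is a hypothetical variable
;; holding a string, which is itself a valid SRE.
(define sep ".")
(define dotted (regexp `(: (+ numeric) ,sep (+ numeric))))

(write (regexp? digits)) (newline)                  ; prints #t
(write (regexp-matches? dotted "3.14")) (newline)   ; prints #t
(write (regexp-matches? dotted "3,14")) (newline)   ; prints #f

;; Passing an already compiled regexp returns it unmodified.
(write (regexp? (regexp digits))) (newline)         ; prints #t
```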

Syntax: regexp rx sre …

Macro shorthand for (regexp `(: sre ...)). The implementation may perform some or all of the computation at compile time if sre contains no unquoted expressions.

NOTE Because of this equivalence with the procedural constructor regexp, the semantics of unquote differs from the original SCSH implementation in that unquoted expressions can expand into any object matching the SRE syntax, but not a compiled regexp object. Further, unquote and unquote-splicing both expand all matches.
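The equivalence between rx and regexp can be sketched as follows. The helper tagged and the import name (srfi :115) are hypothetical; the unquoted expression evaluates to a string, which is a valid SRE rather than a compiled regexp.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

;; These two definitions are expected to denote equivalent regexps.
(define version-a (rx (+ numeric) "." (+ numeric)))
(define version-b (regexp `(: (+ numeric) "." (+ numeric))))

;; Hypothetical helper: tag must evaluate to an SRE (here a string),
;; not to a compiled regexp object.
(define (tagged tag)
  (rx "<" ,tag ">"))

(write (regexp-matches? version-a "2.39")) (newline)      ; prints #t
(write (regexp-matches? version-b "2.39")) (newline)      ; prints #t
(write (regexp-matches? (tagged "b") "<b>")) (newline)    ; prints #t
(write (regexp-matches? (tagged "b") "<i>")) (newline)    ; prints #f
```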

RATIONALE A procedural interface offers greater flexibility without giving up potential compile-time optimizations, because the syntactic shorthand is preserved. The alternative is to rely on eval to generate regular expressions dynamically. However, regexps in many cases come from untrusted sources, such as search parameters to a server, or from serialized sources such as config files or command-line arguments. Moreover, many applications may want to keep many thousands of regexps in memory at once. Given the relatively heavy cost and insecurity of eval, and the frequency with which SREs are read and written as text, we prefer the procedural interface.

Function: sre regexp->sre re

Return an SRE corresponding to the given regexp re. The SRE will be equivalent to (will match the same strings) but not necessarily equal? to the SRE originally used to compile re. Mutating the result may invalidate re, causing unspecified results if subsequently used for matching.

Function: sre char-set->sre char-set

Return an SRE corresponding to the given SRFI-14 character set. The resulting SRE expands the character set into notation which does not make use of embedded SRFI-14 character sets, and so is suitable for writing portably.
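A rough sketch of both conversions follows. Because the exact shape of the returned SREs is unspecified, the example only checks that they match the same strings. The import names (srfi :14) and (srfi :115), and the names re, sre and vowels-sre, are assumptions.

```scheme
(import (rnrs) (srfi :14) (srfi :115))   ; library names are assumptions

;; Round trip: the recovered SRE should match the same strings as re.
(define re  (regexp '(: "v" (+ numeric))))
(define sre (regexp->sre re))            ; not necessarily equal? to the original

(write (regexp-matches? sre "v42")) (newline)
(write (regexp-matches? sre "w42")) (newline)

;; A SRFI-14 character set turned into a portable SRE,
;; then embedded in a larger SRE with quasiquote.
(define vowels-sre (char-set->sre (char-set #\a #\e #\i #\o #\u)))

(write (regexp-matches? `(+ ,vowels-sre) "aeiou")) (newline)
(write (regexp-matches? `(+ ,vowels-sre) "xyz")) (newline)
```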

Function: boolean valid-sre? obj

Return true if, and only if, obj can be safely passed to regexp.

Function: boolean regexp? obj

Return true if, and only if, obj is a regexp.
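When an SRE arrives from an untrusted or serialized source, valid-sre? can guard the call to regexp so that no error is raised. The sketch below assumes the import name (srfi :115); the helper sre->regexp-or-false is hypothetical.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

;; Hypothetical helper: compile obj when it is safe to do so,
;; otherwise return #f instead of raising an error.
(define (sre->regexp-or-false obj)
  (and (valid-sre? obj)
       (regexp obj)))

(define ok (sre->regexp-or-false '(+ alphabetic)))

(write (regexp? ok)) (newline)                    ; prints #t
(write (regexp-matches? ok "hello")) (newline)    ; prints #t
```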

Function: regexp-match-or-false regexp-matches re str
Function: regexp-match-or-false regexp-matches re str start
Function: regexp-match-or-false regexp-matches re str start end

Return a regexp-match object if re successfully matches the entire string str from start (inclusive) to end (exclusive), or #f if the match fails. The regexp-match object will contain information needed to extract any submatches.

Function: boolean regexp-matches? re str
Function: boolean regexp-matches? re str start
Function: boolean regexp-matches? re str start end

Return #t if re matches str as in regexp-matches, or #f otherwise. May be faster than regexp-matches since it doesn’t need to return submatch data.

Function: regexp-match-or-false regexp-search re str
Function: regexp-match-or-false regexp-search re str start
Function: regexp-match-or-false regexp-search re str start end

Return a regexp-match object if re successfully matches a substring of str between start (inclusive) and end (exclusive), or #f if the match fails. The regexp-match object will contain information needed to extract any submatches.
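The difference between matching the whole range and searching within it can be sketched as follows, assuming the library is importable as (srfi :115); the names digits and m are for illustration only.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

(define digits '(+ numeric))

;; regexp-matches? requires the whole range to match.
(write (regexp-matches? digits "abc123")) (newline)     ; prints #f
(write (regexp-matches? digits "abc123" 3)) (newline)   ; prints #t, only "123" is considered

;; regexp-search looks for a match anywhere in the range.
(define m (regexp-search digits "abc123def"))
(write (regexp-match-submatch m 0)) (newline)           ; prints "123"

;; Restricting the range to "abc" leaves nothing to find.
(write (regexp-search digits "abc123def" 0 3)) (newline) ; prints #f
```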

Function: obj regexp-fold re kons knil str

Function: obj regexp-fold re kons knil str finish

Function: obj regexp-fold re kons knil str finish start

Function: obj regexp-fold re kons knil str finish start end

The fundamental regexp matching iterator. Repeatedly searches str for the regexp re as long as a match can be found. On each successful match, applies:

(kons i regexp-match str acc)

where i is the index just past the previous match (start for the first match), regexp-match is the resulting match, and acc is the result of the previous kons application, beginning with knil. When no more matches can be found, calls finish with the same arguments, except that regexp-match is #f.

By default finish just returns acc.

(regexp-fold 'word
   (lambda (i m str acc)
     (let ((s (regexp-match-submatch m 0)))
       (cond ((assoc s acc)
              => (lambda (x) (set-cdr! x (+ 1 (cdr x))) acc))
             (else `((,s . 1) ,@acc)))))
   '()
   "to be or not to be")
⇒ (("not" . 1) ("or" . 1) ("be" . 2) ("to" . 2))
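The optional finish argument is useful for post-processing the accumulator once no more matches are found. The sketch below, which assumes the import name (srfi :115) and a hypothetical helper words, collects matches in reverse and lets finish restore their original order.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

;; Hypothetical helper: list every word of str in order of appearance.
(define (words str)
  (regexp-fold 'word
               ;; kons: push each matched word onto the accumulator.
               (lambda (i m str acc)
                 (cons (regexp-match-submatch m 0) acc))
               '()
               str
               ;; finish: m is #f here; undo the reversal caused by cons.
               (lambda (i m str acc)
                 (reverse acc))))

(write (words "to be or not to be")) (newline)
;; prints ("to" "be" "or" "not" "to" "be")
```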

Function: list regexp-extract re str

Function: list regexp-extract re str start

Function: list regexp-extract re str start end

Extract all the non-empty substrings of str which match re between start and end as a list of strings.

(regexp-extract '(+ numeric) "192.168.0.1")
⇒ ("192" "168" "0" "1")

Function: list regexp-split re str

Function: list regexp-split re str start

Function: list regexp-split re str start end

Split str into a list of strings separated by matches of re.

(regexp-split '(+ space) " fee fi  fo\tfum\n")
⇒ ("fee" "fi" "fo" "fum")

Function: list regexp-partition re str

Function: list regexp-partition re str start

Function: list regexp-partition re str start end

Partition str into a list of non-empty strings matching re, interspersed with the unmatched portions of the string str. The first and every odd element is an unmatched substring, which will be the empty string if re matches at the beginning of the string or end of the previous match. The second and every even element will be a substring matching re. If the final match ends at the end of the string, no trailing empty string will be included. Thus, in the degenerate case where str is the empty string, the result is ‘("")’.

(regexp-partition '(+ (or space punct)) "")
⇒ ("")

(regexp-partition '(+ (or space punct)) "Hello, world!\n")
⇒ ("Hello" ", " "world" "!\n")

(regexp-partition '(+ (or space punct)) "¿Dónde Estás?")
⇒ ("" "¿" "Dónde" " " "Estás" "?")

Function: string regexp-replace re str subst

Function: string regexp-replace re str subst start

Function: string regexp-replace re str subst start end

Function: string regexp-replace re str subst start end count

Return a new string in which the match of re in str at index count is replaced with subst, where the zero-indexed count defaults to zero (i.e. the first match). If str has no match at index count, return the selected substring unmodified.

subst can be a string, an integer or symbol indicating the contents of a numbered or named submatch of re, ‘pre’ for the substring to the left of the match, or ‘post’ for the substring to the right of the match.

The optional parameters start and end restrict both the matching and the substitution to the given indices, such that the result is equivalent to omitting these parameters and replacing on (substring str start end). As a convenience, a value of #f for end is equivalent to (string-length str).

(regexp-replace '(+ space) "one two three" "_")
⇒ "one_two three"

(regexp-replace '(+ space) "one two three" "_" 0 #f 0)
⇒ "one_two three"

(regexp-replace '(+ space) "one two three" "_" 0 #f 1)
⇒ "one two_three"

(regexp-replace '(+ space) "one two three" "_" 0 #f 2)
⇒ "one two three"
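The examples above only use a string for subst. As a sketch of the other forms, the program below uses an integer to select a numbered submatch and the symbol pre to select the text left of the match. The import name (srfi :115) and the name assignment are assumptions.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

;; Two numbered submatches: the key and the value.
(define assignment '(: ($ (+ alphabetic)) "=" ($ (+ numeric))))

;; An integer subst inserts the contents of that submatch.
(write (regexp-replace assignment "set port=8080 now" 1)) (newline)
;; prints "set port now"
(write (regexp-replace assignment "set port=8080 now" 2)) (newline)
;; prints "set 8080 now"

;; The symbol pre inserts the text to the left of the match,
;; so the space in "ab cd" is replaced by "ab".
(write (regexp-replace '(+ space) "ab cd" 'pre)) (newline)
;; prints "ababcd"
```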

Function: string regexp-replace-all re str subst

Function: string regexp-replace-all re str subst start

Function: string regexp-replace-all re str subst start end

Equivalent to regexp-replace, but replaces all occurrences of re in str.

(regexp-replace-all '(+ space) "one two three" "_")
⇒ "one_two_three"

Function: boolean regexp-match? obj

Return true if, and only if, obj is a successful match from regexp-matches or regexp-search.

(regexp-match? (regexp-matches "x" "x"))  ⇒ #t
(regexp-match? (regexp-matches "x" "y"))  ⇒ #f

Function: integer regexp-match-count regexp-match

Return the number of submatches of regexp-match, regardless of whether they matched or not. The implicit full match at index 0 is not included in the count.

(regexp-match-count (regexp-matches "x" "x"))       ⇒ 0
(regexp-match-count (regexp-matches '($ "x") "x"))  ⇒ 1

Function: string-or-false regexp-match-submatch regexp-match field

Return the substring matched in regexp-match corresponding to field, either an integer or a symbol for a named submatch. Index ‘0’ refers to the entire match, index ‘1’ to the first lexicographic submatch, and so on. If there are multiple submatches with the same name, the first which matched is returned. If passed an integer outside the range of matches, or a symbol which does not correspond to a named submatch of the pattern, an error is raised. If the corresponding submatch did not match, return #f.

The result of extracting a submatch after the original matched string has been mutated is unspecified.

(regexp-match-submatch (regexp-search 'word "**foo**") 0)
⇒ "foo"

(regexp-match-submatch
  (regexp-search '(: "*" ($ word) "*") "**foo**") 0)
⇒ "*foo*"

(regexp-match-submatch
  (regexp-search '(: "*" ($ word) "*") "**foo**") 1)
⇒ "foo"
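Named submatches and submatches that did not participate in the match can be sketched as follows. The example assumes the (-> name sre …) named-submatch form of the SRE syntax, which is not described in this section, and the import name (srfi :115); the names m and alt are for illustration only.

```scheme
(import (rnrs) (srfi :115))   ; library name is an assumption

;; Named submatches, assuming the (-> name sre ...) SRE form.
(define m
  (regexp-search '(: (-> key (+ alphabetic)) "=" (-> val (+ numeric)))
                 "port=8080"))

(write (regexp-match-submatch m 'key)) (newline)   ; prints "port"
(write (regexp-match-submatch m 'val)) (newline)   ; prints "8080"

;; Only the second alternative matches, so submatch 1 is #f,
;; yet both submatches are counted.
(define alt (regexp-search '(or ($ "a") ($ "b")) "b"))

(write (regexp-match-count alt)) (newline)         ; prints 2
(write (regexp-match-submatch alt 1)) (newline)    ; prints #f
(write (regexp-match-submatch alt 2)) (newline)    ; prints "b"
```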

Function: integer-or-false regexp-match-submatch-start regexp-match field

Return the start index in regexp-match corresponding to field, as in regexp-match-submatch.

(regexp-match-submatch-start
  (regexp-search 'word "**foo**") 0)
⇒ 2

(regexp-match-submatch-start
  (regexp-search '(: "*" ($ word) "*") "**foo**") 0)
⇒ 1

(regexp-match-submatch-start
  (regexp-search '(: "*" ($ word) "*") "**foo**") 1)
⇒ 2

Function: integer-or-false regexp-match-submatch-end regexp-match field

Return the end index in regexp-match corresponding to field, as in regexp-match-submatch.

(regexp-match-submatch-end
  (regexp-search 'word "**foo**") 0)
⇒ 5

(regexp-match-submatch-end
  (regexp-search '(: "*" ($ word) "*") "**foo**") 0)
⇒ 6

(regexp-match-submatch-end
  (regexp-search '(: "*" ($ word) "*") "**foo**") 1)
⇒ 5

Function: list regexp-match->list regexp-match

Return a list of all submatches in regexp-match, each either a string or #f, beginning with the entire match ‘0’.

(regexp-match->list
  (regexp-search '(: ($ word) (+ (or space punct)) ($ word))
                 "cats & dogs"))
⇒ ("cats & dogs" "cats" "dogs")