Libraries for Vicare Scheme: srfi regexps syntax boundary

2.39.7.7 Boundary assertions

bos

eos

bos matches at the beginning of the string and eos at the end of the string, in both cases without consuming any characters (a zero–width assertion). If the search was initiated with start/end parameters, those positions are treated as the end points, rather than the bounds of the full string.
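
As a small illustration of the start/end rule, the following sketch lists the results that the description above implies. It assumes the SRFI 115 bindings are in scope; the import form shown is an assumption about how the library is loaded, and the index arguments are the optional start and end parameters of regexp-search.

;; Assumed import form; adjust to the installed library name.
(import (vicare) (srfi :115))

(regexp-search '(: bos "foo") "foobar")      ⇒ #<regexp-match>
(regexp-search '(: bos "foo") "snafoo")      ⇒ #f
(regexp-search '(: "foo" eos) "snafoo")      ⇒ #<regexp-match>
;; With a start index of 3, index 3 counts as the beginning of the string.
(regexp-search '(: bos "foo") "snafoo" 3)    ⇒ #<regexp-match>
;; With an end index of 3, index 3 counts as the end of the string.
(regexp-search '(: "sna" eos) "snafoo" 0 3)  ⇒ #<regexp-match>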

bol

eol

Matches at the beginning/end of a line without consuming any characters (a zero–width assertion). A line is a possibly empty sequence of characters followed by an end of line sequence as understood by the R7RS read-line procedure, specifically any of a linefeed character, carriage return character, or a carriage return followed by a linefeed character. The string is assumed to contain end of line sequences before the start and after the end of the string, even if the search was made on a substring and the actual surrounding characters differ.
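
The following sketch, which assumes regexp-search from SRFI 115 is in scope, shows the results one would expect from the definition of a line given above. The "\n" escape denotes a linefeed character.

(regexp-search '(: bol "foo") "bar\nfoo")  ⇒ #<regexp-match>
(regexp-search '(: bol "foo") "barfoo")    ⇒ #f
(regexp-search '(: "bar" eol) "bar\nfoo")  ⇒ #<regexp-match>
(regexp-search '(: "bar" eol) "barfoo")    ⇒ #f
;; The start and end of the string themselves count as line boundaries.
(regexp-search '(: bol "bar" eol) "bar")   ⇒ #<regexp-match>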

bow

eow

Matches at the beginning/end of a word without consuming any characters (a zero–width assertion). A word is a contiguous sequence of characters that are either alphanumeric or the underscore character, i.e. ‘(or alphanumeric "_")’, with the definition of alphanumeric depending on the Unicode or ASCII context. The string is assumed to contain non–word characters immediately before the start and after the end, even if the search was made on a substring and word constituent characters appear immediately before the beginning or after the end.

(regexp-search '(: bow "foo") "foo")    ⇒ #<regexp-match>
(regexp-search '(: bow "foo") "<foo>>") ⇒ #<regexp-match>
(regexp-search '(: bow "foo") "snafoo") ⇒ #f
(regexp-search '(: "foo" eow) "foo")    ⇒ #<regexp-match>
(regexp-search '(: "foo" eow) "foo!")   ⇒ #<regexp-match>
(regexp-search '(: "foo" eow) "foobar") ⇒ #f
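
The substring rule can be illustrated in the same way. This sketch assumes the optional start and end arguments of regexp-search; under the rule stated above, characters outside the searched range are ignored when word boundaries are decided, so both searches would be expected to succeed.

;; "foo" is preceded by "sna", but the search starts at index 3.
(regexp-search '(: bow "foo") "snafoo" 3)    ⇒ #<regexp-match>
;; "foo" is followed by "bar", but the search ends at index 3.
(regexp-search '(: "foo" eow) "foobar" 0 3)  ⇒ #<regexp-match>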

nwb

Matches a non–word–boundary (i.e. ‘\B’ in PCRE). Equivalent to ‘(neg-look-ahead (or bow eow))’.
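
A brief sketch of the expected behaviour, assuming the implementation supports nwb and that regexp-search is in scope. Each result follows from the definitions of bow and eow above: nwb succeeds only at positions where neither of them would.

;; Between "a" and "f" there is no word boundary.
(regexp-search '(: nwb "foo") "snafoo")  ⇒ #<regexp-match>
;; The start of the string is a word beginning.
(regexp-search '(: nwb "foo") "foo")     ⇒ #f
;; Between "o" and "b" there is no word boundary.
(regexp-search '(: "foo" nwb) "foobar")  ⇒ #<regexp-match>
;; Between "o" and "!" there is a word end.
(regexp-search '(: "foo" nwb) "foo!")    ⇒ #f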

(word sre ...)

Anchors a sequence to word boundaries. Equivalent to ‘(: bow sre ... eow)’.
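
For illustration, and assuming regexp-search is in scope, the stated equivalence with ‘(: bow sre ... eow)’ implies results such as these:

(regexp-search '(word "foo") "a foo b")       ⇒ #<regexp-match>
(regexp-search '(word "foo") "foobar")        ⇒ #f
(regexp-search '(word "foo") "snafoo")        ⇒ #f
;; Several SREs are matched in sequence between the two boundaries.
(regexp-search '(word "foo" "bar") "foobar")  ⇒ #<regexp-match>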

(word+ cset-sre ...)

Matches a single word composed of characters in the intersection of the given cset-sre and the word constituent characters. Equivalent to:

(word (+ (and (or alphanumeric "_") (or cset-sre ...))))
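
A short sketch of how this expansion behaves, assuming regexp-matches?, regexp-search and regexp-match-submatch from SRFI 115 are in scope. The character set numeric is used as the example cset-sre, so only words made up entirely of digits qualify.

(regexp-matches? '(word+ numeric) "123")  ⇒ #t
;; "12" is followed by the word constituent "a", so eow fails.
(regexp-matches? '(word+ numeric) "12a")  ⇒ #f
;; Submatch 0 is the text of the whole match.
(regexp-match-submatch
  (regexp-search '(word+ numeric) "abc 123 a1") 0)  ⇒ "123"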

word

A shorthand for ‘(word+ any)’.
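
As a usage sketch, assuming regexp-extract from SRFI 115 is in scope, the shorthand can be used to pull the words out of a string. The underscore counts as a word constituent, while the punctuation and the space do not.

(regexp-extract 'word "hello, big_world!")  ⇒ ("hello" "big_world")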

bog

eog

Matches at the beginning/end of a single extended grapheme cluster without consuming any characters (a zero–width assertion). Grapheme cluster boundaries are defined in Unicode TR29. The string is assumed to contain non–combining code–points immediately before the start and after the end. These always succeed in an ASCII context.

grapheme

Matches a single grapheme cluster (i.e. ‘\X’ in PCRE). This is what the end–user typically thinks of as a single character, composed of a base non–combining code–point followed by zero or more combining marks. In an ASCII context this is equivalent to ‘any’.
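
A minimal sketch, assuming the implementation supports grapheme and that regexp-extract is in scope. For plain ASCII letters every character is its own cluster, so the expected result is one string per character.

(regexp-extract 'grapheme "abc")  ⇒ ("a" "b" "c")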

Assuming char-set:mark contains all characters with the ‘Extend’ or ‘SpacingMark’ properties defined in TR29, and char-set:control, char-set:regional-indicator and char-set:hangul-* are defined similarly, then the following SRE can be used with regexp-extract to define grapheme:

`(or (: (* ,char-set:hangul-l) (+ ,char-set:hangul-v)
        (* ,char-set:hangul-t))
     (: (* ,char-set:hangul-l) ,char-set:hangul-v
        (* ,char-set:hangul-v) (* ,char-set:hangul-t))
     (: (* ,char-set:hangul-l) ,char-set:hangul-lvt
        (* ,char-set:hangul-t))
     (+ ,char-set:hangul-l)
     (+ ,char-set:hangul-t)
     (+ ,char-set:regional-indicator)
     (: "\r\n")
     (: (~ control ("\r\n"))
        (* ,char-set:mark))
     control)