It’s often desirable to perform regular expression matching over sequences of characters not represented as a single string. The most obvious example is a text-buffer data structure, but we may also want to match over lists or trees of strings (i.e. ropes), over only certain ranges within a string, over an input port, etc.
With existing regular expression libraries, the only way to accomplish this is by converting the abstract sequence into a freshly allocated string. This can be expensive, or even impossible if the object is a text-buffer opened onto a 500 MB file.
(vicare irregex) provides a chunked string API specifically for this purpose.
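make-irregex-chunker defines a chunking API from the procedures listed below. Only get-next and get-string are required; the remaining procedures are optional.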
(get-next chunk)
Return the next chunk, or #f if there are no more chunks.
(get-string chunk)
A string source for the chunk.
(get-start chunk)
The start index within the string returned by get-string (defaults to 0).
(get-end chunk)
The end (exclusive) of the string (defaults to string-length of the source string).
(get-substring cnk1 i cnk2 j)
The substring spanning the range that starts at index i in chunk cnk1 and ends at index j in chunk cnk2.
(get-subchunk cnk1 i cnk2 j)
As above but returns a new chunked data type instead of a string (optional).
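As an illustration of how these procedures fit together, the following is a minimal sketch of a chunk type covering a single range within a string, one of the use cases mentioned in the introduction. The vector representation and all range-chunk-* names are hypothetical, chosen for this example only; they are not part of (vicare irregex).

```scheme
(import (rnrs))

;; Hypothetical chunk representation: #(string start end).
(define (make-range-chunk str start end)
  (vector str start end))

;; get-next: a range chunk stands alone, so there is never a next chunk.
(define (range-chunk-next chunk) #f)

;; get-string: the full underlying string.
(define (range-chunk-string chunk) (vector-ref chunk 0))

;; get-start and get-end: the bounds of the range, end exclusive.
(define (range-chunk-start chunk) (vector-ref chunk 1))
(define (range-chunk-end chunk) (vector-ref chunk 2))

;; get-substring: with a single chunk, cnk1 and cnk2 are the same object.
(define (range-chunk-substring cnk1 i cnk2 j)
  (substring (range-chunk-string cnk1) i j))

(define chunk (make-range-chunk "hello world!" 6 11))

(range-chunk-substring chunk
                       (range-chunk-start chunk)
                       chunk
                       (range-chunk-end chunk))   ; => "world"
```

Following the argument order listed above, these procedures would presumably be handed to make-irregex-chunker as get-next, get-string, get-start, get-end and get-substring, in that order.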
There are two important constraints on the get-next procedure. It must return an eq?-identical object when called multiple times on the same chunk, and it must not return a chunk with an empty string (start == end). The second constraint exists for performance reasons: the work of filtering out empty chunks is pushed to the chunker, since there are many chunk types for which empty strings aren’t possible and which therefore need no such filtering. Note that the initial chunk passed in for matching is allowed to be empty.
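For a flat list of strings, both constraints can be met by a get-next that walks the list and skips empty strings. This is only a sketch: the name rope-next/skip-empty is hypothetical, and it assumes each chunk is a pair whose car is the chunk's string. Because it returns existing pairs of the list rather than allocating new ones, repeated calls on the same chunk yield eq?-identical results.

```scheme
(import (rnrs))

;; Hypothetical get-next for a list of strings that may contain "".
;; Returns the next pair holding a non-empty string, or #f at the end.
(define (rope-next/skip-empty rope)
  (let loop ((rest (cdr rope)))
    (cond ((null? rest) #f)
          ((zero? (string-length (car rest))) (loop (cdr rest)))
          (else rest))))

(define rope (list "ab" "" "" "cd"))

(car (rope-next/skip-empty rope))            ; => "cd"
(eq? (rope-next/skip-empty rope)
     (rope-next/skip-empty rope))            ; => #t
```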
get-substring is provided for possible performance improvements; without it, a default is used.
get-subchunk is optional, but without it we cannot use irregex-match-subchunk.
irregex-match-subchunk generates a chunked data type for the given match item, of the same type as the underlying chunk type. It is only available if the chunk type specifies the get-subchunk API; otherwise an error is raised.
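For a chunk type in which each chunk is a pair in a flat list of strings, a get-subchunk procedure could be sketched as follows. The name rope-subchunk is hypothetical, and this version is deliberately simple: it makes no attempt to drop an empty trailing string when j is 0, which a production chunker would likely need to handle.

```scheme
(import (rnrs))

;; Hypothetical get-subchunk for a list-of-strings rope: returns a new
;; list covering index START of ROPE1 up to index END of ROPE2.
(define (rope-subchunk rope1 start rope2 end)
  (if (eq? rope1 rope2)
      (list (substring (car rope1) start end))
      (cons (substring (car rope1) start (string-length (car rope1)))
            (let loop ((rope (cdr rope1)))
              (if (eq? rope rope2)
                  (list (substring (car rope) 0 end))
                  (cons (car rope) (loop (cdr rope))))))))

(define rope (list "foo" "bar" "baz"))

(rope-subchunk rope 1 (cddr rope) 2)   ; => ("oo" "bar" "ba")
```

The result has the same shape as the input rope, in line with the requirement that the subchunk be of the same type as the underlying chunk type.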
The chunked matching procedures, such as irregex-search/chunked used in the example below, return normal match-data objects.
Example: To match against a simple, flat list of strings use:
    ;; Extract the text from index START of ROPE1 to index END of ROPE2.
    (define (rope->string rope1 start rope2 end)
      (if (eq? rope1 rope2)
          (substring (car rope1) start end)
          (let loop ((rope (cdr rope1))
                     ;; explicit end index: R6RS substring takes three arguments
                     (res  (list (substring (car rope1) start
                                            (string-length (car rope1))))))
            (if (eq? rope rope2)
                (string-concatenate-reverse ; from SRFI-13
                 (cons (substring (car rope) 0 end) res))
                (loop (cdr rope) (cons (car rope) res))))))

    (define rope-chunker
      (make-irregex-chunker
       (lambda (x) (and (pair? (cdr x)) (cdr x)))  ; get-next
       car                                         ; get-string
       (lambda (x) 0)                              ; get-start
       (lambda (x) (string-length (car x)))        ; get-end
       rope->string))                              ; get-substring

    (irregex-search/chunked <pat> rope-chunker <list-of-strings>)
Here we are just using the default start, end and substring behaviors, so the above chunker could simply be defined as:
    (define rope-chunker
      (make-irregex-chunker
       (lambda (x) (and (pair? (cdr x)) (cdr x)))
       car))
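The rope->string procedure in the example depends on string-concatenate-reverse from SRFI-13. Where that library is not at hand, the same result can be obtained with standard procedures only. The following is a sketch under that assumption; the name rope->string/portable is made up for this example.

```scheme
(import (rnrs))

;; Same contract as rope->string, using only string-append and reverse.
(define (rope->string/portable rope1 start rope2 end)
  (if (eq? rope1 rope2)
      (substring (car rope1) start end)
      (let loop ((rope (cdr rope1))
                 (res  (list (substring (car rope1) start
                                        (string-length (car rope1))))))
        (if (eq? rope rope2)
            (apply string-append
                   (reverse (cons (substring (car rope) 0 end) res)))
            (loop (cdr rope) (cons (car rope) res))))))

(define rope (list "foo" "bar" "baz"))

(rope->string/portable rope 1 (cddr rope) 2)   ; => "oobarba"
(rope->string/portable rope 1 rope 3)          ; => "oo"
```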
Chunked version of irregex-fold.