

50.6 Chunked string matching

It’s often desirable to perform regular expression matching over sequences of characters not represented as a single string. The most obvious example is a text-buffer data structure, but we may also want to match over lists or trees of strings (i.e. ropes), over only certain ranges within a string, over an input port, etc.

With existing regular expression libraries, the only way to accomplish this is to convert the abstract sequence into a freshly allocated string. This can be expensive, or even impossible if the object is a text buffer opened onto a 500MB file.

(vicare irregex) provides a chunked string API specifically for this purpose.

Function: make-irregex-chunker get-next get-string
Function: make-irregex-chunker get-next get-string get-start
Function: make-irregex-chunker get-next get-string get-start get-end
Function: make-irregex-chunker get-next get-string get-start get-end get-substring
Function: make-irregex-chunker get-next get-string get-start get-end get-substring get-subchunk

Define a chunking API from the given procedures, which are called as follows:

(get-next chunk)

Return the next chunk, or #f if there are no more chunks.

(get-string chunk)

A string source for the chunk.

(get-start chunk)

The start index of the result of get-string (defaults to always 0).

(get-end chunk)

The end (exclusive) of the string (defaults to string-length of the source string).

(get-substring cnk1 i cnk2 j)

Return the substring spanning the range that starts at index i in chunk cnk1 and ends at index j in chunk cnk2.

(get-subchunk cnk1 i cnk2 j)

As above but returns a new chunked data type instead of a string (optional).

There are two important constraints on the get-next procedure. It must return an eq? identical object when called multiple times on the same chunk, and it must not return a chunk with an empty string (start == end). The second constraint exists for performance reasons: the work of filtering out empty chunks is pushed to the chunker, because for many chunk types empty strings cannot occur and the check would be wasted. Note that the initial chunk passed to the matching procedures is allowed to be empty.
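
As an illustration of both constraints, here is a minimal sketch of a get-next procedure for a flat list of strings that may contain empty strings. The name rope-next/skip-empty is hypothetical and not part of (vicare irregex); a chunk is assumed to be a pair of the list whose car is the current string.

```scheme
;; Hypothetical helper, not part of (vicare irregex).
;; CHUNK is a pair of a list of strings; its car is the current string.
(define (rope-next/skip-empty chunk)
  (let loop ((rest (cdr chunk)))
    (cond ((not (pair? rest))
           #f)                          ; no more chunks
          ((zero? (string-length (car rest)))
           (loop (cdr rest)))           ; never hand out an empty chunk
          (else
           rest))))                     ; an existing pair of the list
```

Because the result is always an existing pair of the list rather than a freshly allocated object, calling the procedure twice on the same chunk yields eq? results.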

get-substring is provided for possible performance improvements; without it a default is used.

get-subchunk is optional, but without it we cannot use irregex-match-subchunk.

Function: irregex-match-subchunk match-obj
Function: irregex-match-subchunk match-obj index-or-name

Generate a chunked data type for the given match item, of the same type as the underlying chunk type. This is only available if the chunker was built with a get-subchunk procedure; otherwise an error is raised.
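
For chunks that are the pairs of a flat list of strings, a get-subchunk procedure could be sketched as follows. The name rope-subchunk is hypothetical; it is assumed that a fresh list of strings counts as "the same type" as the underlying chunks, and that the end index is exclusive, as with get-end.

```scheme
;; Hypothetical get-subchunk for chunks that are pairs of a list of strings.
;; Returns a fresh list of strings covering the range from index START in
;; ROPE1 up to (but not including) index END in ROPE2.
(define (rope-subchunk rope1 start rope2 end)
  (if (eq? rope1 rope2)
      ;; The range lies within a single chunk.
      (list (substring (car rope1) start end))
      ;; Tail of the first chunk, whole middle chunks, head of the last.
      (let loop ((rope (cdr rope1))
                 (res  (list (substring (car rope1) start
                                        (string-length (car rope1))))))
        (if (eq? rope rope2)
            (reverse (cons (substring (car rope) 0 end) res))
            (loop (cdr rope) (cons (car rope) res))))))

;; It would be passed as the sixth argument of make-irregex-chunker,
;; after get-next, get-string, get-start, get-end and get-substring.
```

Note that this sketch can produce an empty string at either end of the result (for example when end is 0), so it should be paired with a get-next that skips empty strings if the result is to be matched against again.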

Function: irregex-search/chunked irx chunker chunk
Function: irregex-search/chunked irx chunker chunk start
Function: irregex-match/chunked irx chunker chunk
Function: irregex-match/chunked irx chunker chunk start

Chunked versions of irregex-search and irregex-match. These return normal match-data objects.

Example: To match against a simple, flat list of strings use:

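(define (rope->string rope1 start rope2 end)
  (if (eq? rope1 rope2)
      ;; The range lies within a single chunk.
      (substring (car rope1) start end)
      ;; Otherwise collect the tail of the first chunk, every whole
      ;; chunk in between, and the head of the last chunk.
      (let loop ((rope (cdr rope1))
                 (res  (list (substring (car rope1) start
                                        (string-length (car rope1))))))
        (if (eq? rope rope2)
            (string-concatenate-reverse      ; from SRFI-13
             (cons (substring (car rope) 0 end) res))
            (loop (cdr rope) (cons (car rope) res))))))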

(define rope-chunker
  (make-irregex-chunker (lambda (x)
                          (and (pair? (cdr x)) (cdr x)))
                        car
                        (lambda (x)
                          0)
                        (lambda (x)
                          (string-length (car x)))
                        rope->string))

(irregex-search/chunked <pat> rope-chunker <list-of-strings>)

Here we are just using the default start, end and substring behaviors, so the above chunker could simply be defined as:

(define rope-chunker
  (make-irregex-chunker (lambda (x)
                          (and (pair? (cdr x)) (cdr x)))
                        car))
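
A concrete use of the two-argument chunker might look like the sketch below. The import form, the sample rope and the pattern are assumptions made for illustration, and it is assumed that (vicare irregex) exports irregex-match-substring as upstream irregex does.

```scheme
(import (rnrs) (vicare irregex))

;; Same two-argument chunker as above: chunks are the pairs of a list
;; of strings, and the start, end and substring defaults are used.
(define rope-chunker
  (make-irregex-chunker (lambda (x)
                          (and (pair? (cdr x)) (cdr x)))
                        car))

;; Illustrative input: the text "hello world" split in the middle of a word.
(define rope (list "hello wo" "rld"))

(define m (irregex-search/chunked "w[a-z]+" rope-chunker rope))

;; M is a normal match object (or #f), so the usual accessors apply.
(when m
  (display (irregex-match-substring m))
  (newline))
```

The point of the sketch is that the match is allowed to span the boundary between the two strings without the rope ever being flattened into a single string by the caller.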

Function: irregex-fold/chunked irx kons knil chunker chunk
Function: irregex-fold/chunked irx kons knil chunker chunk finish
Function: irregex-fold/chunked irx kons knil chunker chunk finish start-index

Chunked version of irregex-fold.

