

53.6.2 Matching the rules

All lexical analysers generated by SILex are interactive: they read as few characters as possible to obtain the longest match. This is a useful property when the input comes from a terminal. A lexical analyser is normally based on a finite automaton, and this is the case for the analysers generated by SILex. A non-interactive analyser always needs an extra character to provoke an invalid transition in the automaton; the longest match is detected this way. An interactive analyser does not need the extra character when it is impossible to obtain a longer match.
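The difference can be sketched with a hand-built automaton. The following Python code is an illustrative sketch, not SILex's implementation: it matches the single fixed pattern ‘abc’ and stops reading as soon as the current state has no outgoing transitions, that is, as soon as no longer match is possible.

```python
# Hand-built DFA for the single fixed pattern "abc" (illustrative only,
# not the SILex implementation).  TRANS maps state -> {char: next-state}.
TRANS = {0: {"a": 1}, 1: {"b": 2}, 2: {"c": 3}, 3: {}}
ACCEPT = {3}

def match_interactive(getc):
    """Interactive matching: only read another character while the
    current state has outgoing transitions, i.e. while a longer match
    is still possible.  Returns (lexeme-or-None, characters-read)."""
    state, lexeme, reads = 0, "", 0
    while TRANS[state]:                 # a longer match is still possible
        c = getc()
        if c is None:
            break
        reads += 1
        if c not in TRANS[state]:
            break
        state = TRANS[state][c]
        lexeme += c
    return (lexeme if state in ACCEPT else None), reads

# The fourth character 'd' is never read: after "abc" the automaton has
# no transitions left, so no extra character is needed to detect the
# longest match.  A non-interactive analyser would have read it.
chars = iter("abcd")
match_interactive(lambda: next(chars, None))   # → ('abc', 3)
```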

A lexical analyser generated by SILex does not impose any a priori limit on the size of lexemes: the internal buffer is extended whenever necessary.

Each time the analyser is asked to return a token, it tries to match a prefix of the input with a pattern. There may be more than one possible match; when that is the case, we say there is a conflict. For example, suppose we have these regular expressions:

begin
[a-z]*

and the input is ‘beginning1 ’. We have one match with the first expression and many different matches with the second. To resolve such a conflict, the longest match is chosen; so the chosen match is the one between the lexeme ‘beginning’ and the second pattern.

Now suppose we have the same regular expressions but the input is ‘begin+ ’. We have two longest matches. This conflict is resolved by choosing the first pattern that allows the longest match; so the chosen match is between the lexeme ‘begin’ and the first pattern.
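The two resolution rules can be mimicked with Python's re module. The rule names and driver function below are hypothetical, introduced only for illustration; they are not part of the SILex API:

```python
import re

# Hypothetical rule table mirroring the two patterns in the text.
RULES = [("BEGIN", re.compile(r"begin")),    # first pattern: begin
         ("IDENT", re.compile(r"[a-z]*"))]   # second pattern: [a-z]*

def next_token(text, pos=0):
    """Return (rule-name, lexeme): the longest match wins, and a tie
    between rules is broken in favour of the earliest rule."""
    best = ("", None)
    for name, rx in RULES:
        m = rx.match(text, pos)
        # strict '>' keeps the earliest rule on equal-length matches
        if m and (best[1] is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

next_token("beginning1 ")   # → ('IDENT', 'beginning'): longest match
next_token("begin+ ")       # → ('BEGIN', 'begin'): tie, first rule wins
```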

The analyser generated by SILex allows the empty lexeme to be matched when there is no longer match. However, we should take care not to call the analyser again without first consuming at least one character of the input: doing so would cause an infinite loop.

The pattern ‘<<EOF>>’ is matched when the analyser is called and the input system is at end of input. In this situation, the marker is matched even if there is a pattern that matches the empty lexeme. The analyser can be called again and again and the ‘<<EOF>>’ pattern will be matched each time, causing its corresponding action to be evaluated each time, too.
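The end-of-input and empty-lexeme behaviour can be illustrated with a small driver loop. Again, this is a Python sketch of the semantics described above, not the SILex API; the token names are invented for the example:

```python
import re

IDENT = re.compile(r"[a-z]*")    # this pattern can match the empty lexeme

def tokenize(text):
    """Illustrative driver: at end of input an EOF token is produced
    (and would be produced again on every further call); after an empty
    match at least one character must be consumed, otherwise the loop
    would never terminate."""
    tokens, pos = [], 0
    while pos < len(text):
        lexeme = IDENT.match(text, pos).group(0)
        if lexeme == "":
            # Empty lexeme matched: consume one character to avoid an
            # infinite loop, reporting it as an error token.
            tokens.append(("ERROR", text[pos]))
            pos += 1
        else:
            tokens.append(("IDENT", lexeme))
            pos += len(lexeme)
    tokens.append(("EOF", ""))
    return tokens

tokenize("ab1cd")
# → [('IDENT', 'ab'), ('ERROR', '1'), ('IDENT', 'cd'), ('EOF', '')]
```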

The pattern ‘<<ERROR>>’ is matched when the input system is not at end of input and no other match is possible. Depending on the action associated with this pattern, our program may choose to stop or to try to recover from the error. To recover, the program has to read some characters from the input before it can call the analyser again.

As an example of error recovery, consider the following code, which just shows the mechanism:

#!r6rs
(import (nausicaa)
  (prefix (nausicaa parser-tools lexical-tokens) lt.)
  (prefix (vicare parser-tools silex) lex.)
  (prefix (vicare parser-tools silex lexer) lex.))

(define description "%%
A          (lt.<lexical-token>
             ((lt.category: 'A)
              (lt.location: (lt.<source-location>
                              ((lt.input:  #f)
                               (lt.line:   yyline)
                               (lt.column: yycolumn)
                               (lt.offset: yyoffset))))
              (lt.value:    yytext)
              (lt.length:   (string-length yytext))))

<<EOF>>    (lt.<lexical-token>
             ((lt.category: '*eoi*)
              (lt.location: (lt.<source-location>
                              ((lt.input:  #f)
                               (lt.line:   yyline)
                               (lt.column: yycolumn)
                               (lt.offset: yyoffset))))
              (lt.value:    (eof-object))
              (lt.length:   1)))

<<ERROR>>  (lt.<lexical-token>
             ((lt.category: '*lexer-error*)
              (lt.location: (lt.<source-location>
                              ((lt.input:  #f)
                               (lt.line:   yyline)
                               (lt.column: yycolumn)
                               (lt.offset: yyoffset))))
              (lt.value:    yytext)
              (lt.length:   (string-length yytext))))
")

(define table
  (lex.lex
    (lex.input-string:     description)
    (lex.counters:         'all)
    (lex.library-language: '(vicare))
    (lex.library-imports:
      '((prefix (nausicaa parser-tools lexical-tokens) lt.)))
    (lex.output-value:     #t)
    (lex.lexer-format:     'decision-tree)))

;; correct string
(let* ((IS    (lex.make-IS
                (lex.string: "AAA")
                (lex.counters: 'all)))
       (lexer (lex.make-lexer table IS)))
  (let (((T1 lt.<lexical-token>) (lexer))
        ((T2 lt.<lexical-token>) (lexer))
        ((T3 lt.<lexical-token>) (lexer))
        ((T4 lt.<lexical-token>) (lexer)))
    (list (T1 category) (T2 category)
          (T3 category) (T4 category))))
⇒ (A A A *eoi*)

;; lexer error
(let* ((IS    (lex.make-IS
                (lex.string: "AAAB")
                (lex.counters: 'all)))
       (lexer (lex.make-lexer table IS)))
  (let (((T1 lt.<lexical-token>) (lexer))
        ((T2 lt.<lexical-token>) (lexer))
        ((T3 lt.<lexical-token>) (lexer))
        ((T4 lt.<lexical-token>) (lexer)))
    (list (T1 category) (T2 category)
          (T3 category) (T4 category))))
⇒ (A A A *lexer-error*)

;; lexer error and recovery
(let* ((IS    (lex.make-IS
                (lex.string: "AAABBAA")
                (lex.counters: 'all)))
       (lexer (lex.make-lexer table IS)))
  (let (((T1 lt.<lexical-token>) (lexer))
        ((T2 lt.<lexical-token>) (lexer))
        ((T3 lt.<lexical-token>) (lexer))
        ((T4 lt.<lexical-token>) (lexer)))
    ;; discard invalid characters,
    ;; we know there are 2 of them
    (let ((getc (lex.lexer-get-func-getc IS)))
      (getc)
      (getc))
    (let (((T5 lt.<lexical-token>) (lexer))
          ((T6 lt.<lexical-token>) (lexer))
          ((T7 lt.<lexical-token>) (lexer)))
      (list (T1 category) (T2 category) (T3 category)
            (T4 category)
            (T5 category) (T6 category) (T7 category)))))
⇒ (A A A *lexer-error* A A *eoi*)
