Using a modal lexer for parsing sublanguages
How do you write a lexer for print "one plus two is ${1+2}";
? The string interpolation part makes this tricky, as the string forms a sublanguage that is different from the main language. To make things worse, interpolation can be nested. Textbook lexers won’t cut it, and we need something stronger.
Enter modal lexers. These lexers can switch between different modes for lexing different languages. In this article, I’ll run you through an example of how to build such a modal lexer.
For this article, I’m using Ruby as the programming language. The same technique will work in other programming languages too. No external dependencies (gems) are used.
Ruby’s string scanner
Ruby provides a StringScanner
class, which is an excellent building block for lexers. It offers a clean API for splitting up a string into individual parts.
I’ll run you through a simple example, so that you can get a feel of how it works. First, we require strscan
, and then we instantiate the StringScanner
with the string to scan:
require "strscan"
str = "let amount = 5;"
scanner = StringScanner.new(str)
The #scan
method, when given a string part, will attempt to match this given string part:
scanner.scan("let") # => "let"
The scanner starts at the beginning of the string. When a match is successful, the scanner will move forward. Because we just matched "let"
, the scanner has moved past this "let"
, and attempting to match it again will fail:
scanner.scan("let") # => nil
The #scan
method returns the matched string part, or nil
if there is no match. In the latter case, the scanner stays at its current position.
This method can also take a regular expression:
scanner.scan(/\s+/) # => " "
The regular expression \s+
matches one or more whitespace characters, such as spaces, tabs, and newlines.
Now that we’ve matched "let"
and a space character, we can match the next part: an identifier, consisting of alphanumeric characters. For this, the regular expression \w+
comes in handy, which matches one or more alphanumeric characters:
scanner.scan(/\w+/) # => "amount"
Another useful method is #matched
, which returns the matched string part:
scanner.matched # => "amount"
The StringScanner
provides more functionality, but #scan
and #matched
, together with #eos? (an end-of-string check) and #getch (which reads a single character), are all we need to write a lexer.
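Put together, the scanning session above reads as one short script:

```ruby
require "strscan"

# The snippets above, combined: the scanner walks forward through
# the string with each successful match.
scanner = StringScanner.new("let amount = 5;")
scanner.scan("let")   # => "let"    (scanner advances past "let")
scanner.scan("let")   # => nil      (no match; scanner stays put)
scanner.scan(/\s+/)   # => " "      (one or more whitespace characters)
scanner.scan(/\w+/)   # => "amount"
scanner.matched       # => "amount" (the most recent match)
```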
A basic lexer
Before writing a modal lexer, we’ll start with a basic lexer:
class BasicLexer
def initialize(str)
@scanner = StringScanner.new(str)
end
This BasicLexer
is constructed with a string. The string itself isn’t important; we’ll only use the StringScanner
.
def next_token
The only public method for this lexer is #next_token
, which will return a single token, or nil
when there are no more tokens.
if @scanner.eos?
nil
If the scanner is at the end of the string (eos
stands for end-of-string), we’ll return nil
.
elsif @scanner.scan(/\s+/)
[:WHITESPACE, nil]
If we match one or more whitespace characters (\s+
), we return a WHITESPACE
token.
Tokens are arrays with two elements: first, the token type as an uppercase symbol, and second, an optional value. The value for this WHITESPACE
token is nil
; we won’t use the value for anything.
elsif @scanner.scan("+")
[:PLUS, nil]
elsif @scanner.scan("*")
[:TIMES, nil]
elsif @scanner.scan("-")
[:MINUS, nil]
elsif @scanner.scan("/")
[:DIVIDE, nil]
elsif @scanner.scan(";")
[:SEMICOLON, nil]
elsif @scanner.scan("=")
[:EQUAL, nil]
Here, we match against some single-character syntax elements.
elsif @scanner.scan(/\d+/)
[:NUMBER, @scanner.matched]
We match against the regular expression \d+
, which matches one or more digits. For this token, we record the value: @scanner.matched
is the string containing the digits, e.g. "42"
.
The next bit deals with keywords and identifiers:
elsif @scanner.scan(/\w+/)
case @scanner.matched
when "let"
[:KW_LET, nil]
when "print"
[:KW_PRINT, nil]
else
[:IDENTIFIER, @scanner.matched]
end
This code reads a sequence of alphanumeric characters — anything that could be a keyword or an identifier. Then, we check whether the matched string part is one of the predefined keywords.
Note that this correctly handles identifiers that start with a keyword. For example, "letter"
is an identifier, not the keyword "let"
followed by the identifier "ter"
.
else
char = @scanner.getch
raise "Unknown character: #{char}"
end
end
end
Lastly, in case the scanner did not match anything, we raise an exception. It’s a crude way of reporting errors, but it’ll do for this basic lexer.
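For reference, here is the BasicLexer assembled from the fragments above into one runnable class:

```ruby
require "strscan"

# The complete BasicLexer, assembled from the fragments above.
class BasicLexer
  def initialize(str)
    @scanner = StringScanner.new(str)
  end

  # Returns the next token as [TYPE, value], or nil at end of input.
  def next_token
    if @scanner.eos?
      nil
    elsif @scanner.scan(/\s+/)
      [:WHITESPACE, nil]
    elsif @scanner.scan("+") then [:PLUS, nil]
    elsif @scanner.scan("*") then [:TIMES, nil]
    elsif @scanner.scan("-") then [:MINUS, nil]
    elsif @scanner.scan("/") then [:DIVIDE, nil]
    elsif @scanner.scan(";") then [:SEMICOLON, nil]
    elsif @scanner.scan("=") then [:EQUAL, nil]
    elsif @scanner.scan(/\d+/)
      [:NUMBER, @scanner.matched]
    elsif @scanner.scan(/\w+/)
      # Keywords and identifiers share the same shape; disambiguate
      # after matching, so that "letter" is not lexed as "let" + "ter".
      case @scanner.matched
      when "let"   then [:KW_LET, nil]
      when "print" then [:KW_PRINT, nil]
      else [:IDENTIFIER, @scanner.matched]
      end
    else
      raise "Unknown character: #{@scanner.getch}"
    end
  end
end
```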
Testing it out
To test out this basic lexer, construct it with a string, and keep calling #next_token
until that method returns nil
:
str = "let amount = 42;"
lexer = BasicLexer.new(str)
loop do
token = lexer.next_token
break unless token
p token
end
This code spits out the following, as expected:
[:KW_LET, nil]
[:WHITESPACE, nil]
[:IDENTIFIER, "amount"]
[:WHITESPACE, nil]
[:EQUAL, nil]
[:WHITESPACE, nil]
[:NUMBER, "42"]
[:SEMICOLON, nil]
A modal lexer
We want to be able to parse code like this:
print "one plus two is ${1+2}";
In this hypothetical programming language, the expression between ${
and }
is evaluated and embedded in the string. This code would thus effectively be the same as print "one plus two is 3";
.
The contents of the string are in a different language. Outside of a string, ${
would be a lexer error. Also, one
and plus
are string literal parts, not identifiers. To make our lexer support this code, we'll allow it to switch between two modes.1 First, we'll have a main lexer mode, which will do pretty much the same as the earlier BasicLexer
. Secondly, we’ll have a string lexer mode, which will be used for handling the contents of strings.
However, toggling between these two modes is not enough. It is possible to have string interpolation nested within string interpolation:
print "why ${"${2+6}" * 5}";
The easiest way to support this in our ModalLexer
class is using a stack of lexer modes:
class ModalLexer
def initialize(str)
scanner = StringScanner.new(str)
mode = MainLexerMode.new(scanner)
@mode_stack = [mode]
end
The ModalLexer
class will, like the previous BasicLexer
class, be constructed with a string. It creates a stack of lexer modes, which will contain just a MainLexerMode
to begin with. This string scanner will be shared across lexer modes.
def next_token
mode = @mode_stack.last
mode.next_token(@mode_stack)
end
end
Just like with BasicLexer
, the only public method for this lexer is #next_token
. However, the implementation here delegates to the current lexer mode, which sits at the top of the stack.
The #next_token
method for lexer modes takes the mode stack as an argument. This will allow the lexer modes to modify this stack.
The main lexer mode
The implementation of MainLexerMode
is similar to that of BasicLexer
:
class MainLexerMode
def initialize(scanner)
@scanner = scanner
end
A MainLexerMode
is constructed with a string scanner, rather than a string. The string scanner needs to be shared across lexer modes. This way, when switching lexer modes, the new lexer mode can continue where the previous one left off.
def next_token(mode_stack)
The #next_token
method takes the stack of lexer modes as a parameter.
if @scanner.eos?
nil
As with BasicLexer
, if the scanner is at the end of the string, we return nil
.
elsif @scanner.scan('"')
mode = StringLexerMode.new(@scanner)
mode_stack.push(mode)
[:STRING_START, nil]
When we find a "
character, we push a new StringLexerMode
onto the lexer mode stack, and return a STRING_START
token. This way, the next call to #next_token
will be delegated to this new StringLexerMode
instance, at the top of the stack.
The rest of the MainLexerMode
implementation is identical to BasicLexer
.
The string lexer mode
The StringLexerMode
class is the lexer mode that is used inside string literals.
class StringLexerMode
def initialize(scanner)
@scanner = scanner
end
As with MainLexerMode
, a StringLexerMode
is constructed with a string scanner.
def next_token(mode_stack)
if @scanner.eos?
nil
The #next_token
method starts identical to MainLexerMode
, returning nil
when the scanner is at the end of the string.
elsif @scanner.scan('"')
mode_stack.pop
[:STRING_END, nil]
The "
character indicates the end of the string. We pop the current lexer mode off the stack, and return a STRING_END
token.
elsif @scanner.scan("${")
mode = InterpolationLexerMode.new(@scanner)
mode_stack.push(mode)
[:STRING_INTERP_START, nil]
When we find ${
, we push a new lexer mode: InterpolationLexerMode
. This is a to-be-implemented lexer mode which will be similar to MainLexerMode
. We return a STRING_INTERP_START
token.
elsif @scanner.scan("$")
[:STRING_PART_LIT, @scanner.matched]
When we find a $
character, we return a STRING_PART_LIT
token with that $
character as the content of the token. A STRING_PART_LIT
token is used for literal string content. We know that this $
won’t be followed by a {
, because otherwise it’d have matched the earlier ${
case.
elsif @scanner.scan(/[^"$]+/)
[:STRING_PART_LIT, @scanner.matched]
When we find a character sequence that includes neither "
nor $
, we return a STRING_PART_LIT
token.
else
char = @scanner.getch
raise "Unknown character: #{char}"
end
end
end
Lastly, if the scanner did not match anything, we raise an exception.
The string interpolation lexer mode
The string interpolation lexer mode is used between ${
and }
inside a string. This lexer mode is identical to MainLexerMode
, with the difference that it needs to pop the lexer mode stack when it finds }
.
class InterpolationLexerMode
def initialize(scanner)
@scanner = scanner
@delegate = MainLexerMode.new(@scanner)
end
The InterpolationLexerMode
is constructed with a scanner, similar to the other lexer modes. It also creates a MainLexerMode
, to which it will delegate.
def next_token(mode_stack)
if @scanner.scan("}")
mode_stack.pop
[:STRING_INTERP_END, nil]
If we find a }
character, we pop the lexer mode stack. Afterwards, the lexer mode at the top of the stack will be a StringLexerMode
. We return a STRING_INTERP_END
token.
else
@delegate.next_token(mode_stack)
end
end
end
Otherwise, we delegate the #next_token
call to the MainLexerMode
instance.
Testing it out
As before, we can test out this modal lexer by constructing it with a string and calling #next_token
until that method returns nil
:
string = 'print "one plus two is ${1+2}.";'
lexer = ModalLexer.new(string)
loop do
token = lexer.next_token
break unless token
p token
end
This code spits out the following:
[:KW_PRINT, nil]
[:WHITESPACE, nil]
[:STRING_START, nil]
[:STRING_PART_LIT, "one plus two is "]
[:STRING_INTERP_START, nil]
[:NUMBER, "1"]
[:PLUS, nil]
[:NUMBER, "2"]
[:STRING_INTERP_END, nil]
[:STRING_PART_LIT, "."]
[:STRING_END, nil]
[:SEMICOLON, nil]
Note how the string starts with STRING_START
and ends with STRING_END
. Inside, we have literal parts (STRING_PART_LIT
), as well as interpolated parts, which start with STRING_INTERP_START
and end with STRING_INTERP_END
.
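For reference, here are all four classes from this article, assembled into one runnable script:

```ruby
require "strscan"

# The complete modal lexer: a shared StringScanner, plus a stack of
# lexer modes that push and pop each other as the input changes language.
class MainLexerMode
  def initialize(scanner)
    @scanner = scanner
  end

  def next_token(mode_stack)
    if @scanner.eos?
      nil
    elsif @scanner.scan('"')
      # A string starts: switch to the string sublanguage.
      mode_stack.push(StringLexerMode.new(@scanner))
      [:STRING_START, nil]
    elsif @scanner.scan(/\s+/)
      [:WHITESPACE, nil]
    elsif @scanner.scan("+") then [:PLUS, nil]
    elsif @scanner.scan("*") then [:TIMES, nil]
    elsif @scanner.scan("-") then [:MINUS, nil]
    elsif @scanner.scan("/") then [:DIVIDE, nil]
    elsif @scanner.scan(";") then [:SEMICOLON, nil]
    elsif @scanner.scan("=") then [:EQUAL, nil]
    elsif @scanner.scan(/\d+/)
      [:NUMBER, @scanner.matched]
    elsif @scanner.scan(/\w+/)
      case @scanner.matched
      when "let"   then [:KW_LET, nil]
      when "print" then [:KW_PRINT, nil]
      else [:IDENTIFIER, @scanner.matched]
      end
    else
      raise "Unknown character: #{@scanner.getch}"
    end
  end
end

class StringLexerMode
  def initialize(scanner)
    @scanner = scanner
  end

  def next_token(mode_stack)
    if @scanner.eos?
      nil
    elsif @scanner.scan('"')
      mode_stack.pop
      [:STRING_END, nil]
    elsif @scanner.scan("${")
      mode_stack.push(InterpolationLexerMode.new(@scanner))
      [:STRING_INTERP_START, nil]
    elsif @scanner.scan("$")
      # A lone "$" not followed by "{" is literal string content.
      [:STRING_PART_LIT, @scanner.matched]
    elsif @scanner.scan(/[^"$]+/)
      [:STRING_PART_LIT, @scanner.matched]
    else
      raise "Unknown character: #{@scanner.getch}"
    end
  end
end

class InterpolationLexerMode
  def initialize(scanner)
    @scanner = scanner
    @delegate = MainLexerMode.new(@scanner)
  end

  def next_token(mode_stack)
    if @scanner.scan("}")
      mode_stack.pop
      [:STRING_INTERP_END, nil]
    else
      @delegate.next_token(mode_stack)
    end
  end
end

class ModalLexer
  def initialize(str)
    scanner = StringScanner.new(str)
    @mode_stack = [MainLexerMode.new(scanner)]
  end

  def next_token
    @mode_stack.last.next_token(@mode_stack)
  end
end
```

The nested example from earlier, print "why ${"${2+6}" * 5}";, lexes correctly with this script as well: each " and ${ pushes a mode onto the stack, and each closing token pops it again.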
Now everything is in place for writing a parser.
Hints for the parser
Writing a parser isn’t quite in scope for this article, but it wouldn’t be complete without touching on the subject at least briefly.
In a parser, you’d define a string as the collection of parts between a STRING_START
and STRING_END
token:
string -> STRING_START string_part* STRING_END
A string part can be a literal part, or an interpolated part.
string_part -> STRING_PART_LIT
string_part -> STRING_INTERP_START
expression
STRING_INTERP_END
For string parts that are STRING_PART_LIT
, you’d extract the token value. For interpolated string parts, you’d evaluate the expression and convert it to a string. This way, parsing simple strings works:
print "ab5yz";
(print
(string
(lit "ab5yz")))
Parsing strings with interpolation works as well:
print "ab${2+3}yz";
(print
(string
(lit "ab")
(expr
(+ (num 2) (num 3)))
(lit "yz")))
To simplify this AST, you might consider de-sugaring this syntax. Here’s an example where I’ve replaced the string
and expr
AST nodes with calls to my concat()
and toString()
functions:
(print
(funcall
"concat"
(lit "ab")
(funcall
"toString"
(+ (num 2) (num 3)))
(lit "yz")))
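To make the grammar rules above a bit more concrete, here is a minimal sketch of the string production as a recursive-descent function over an array of tokens. The expect helper and the parse_expression stub are hypothetical scaffolding of mine; a real parser would build a proper expression AST instead of collecting raw tokens.

```ruby
# Hypothetical helper: consume one token and check its type.
def expect(tokens, type)
  token = tokens.shift
  raise "expected #{type}, got #{token.inspect}" unless token && token[0] == type
  token
end

# Stub: a real parser would build an expression AST here. This version
# just collects the raw tokens until the interpolation closes.
def parse_expression(tokens)
  expr = []
  expr << tokens.shift until tokens.first[0] == :STRING_INTERP_END
  expr
end

# string -> STRING_START string_part* STRING_END
def parse_string(tokens)
  expect(tokens, :STRING_START)
  parts = []
  until tokens.first[0] == :STRING_END
    type, value = tokens.shift
    case type
    when :STRING_PART_LIT
      parts << [:lit, value]
    when :STRING_INTERP_START
      parts << [:expr, parse_expression(tokens)]
      expect(tokens, :STRING_INTERP_END)
    else
      raise "unexpected token #{type}"
    end
  end
  expect(tokens, :STRING_END)
  [:string, parts]
end
```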
Wrapping up
This wraps up the modal lexing example, though plenty can be done to improve this lexer. These are the bits I left out, for the sake of simplicity:

- The StringLexerMode does not handle escape sequences, like \", \\, or \$. Escape sequences are indispensable in a real-world string lexer.
- The lexer does not track source location (line number, column number, or filename). This too is indispensable for reporting errors.
- Speaking of errors: the lexer's error handling leaves much to be desired.
- The lexer emits WHITESPACE tokens, which might not be desirable. These could be handled by the parser, or stripped entirely beforehand.
- I've not paid attention to the performance characteristics of this lexer. My assumption is that StringScanner is fast enough, especially with pre-compiled regular expressions. However, it might be worth looking into other approaches and tools for lexing.
- The InterpolationLexerMode will pop the lexer mode stack as soon as it finds a }. This will cause problems if the } token is used for other purposes as well, such as defining functions (e.g. fn a() { … }) or creating maps (e.g. let thing = { value: 123 };). The trick here is to push a MainLexerMode when finding a {, and to let MainLexerMode pop the stack when finding a }. At that point, InterpolationLexerMode is redundant and can be substituted with MainLexerMode.
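As a taste of the escape-sequence point: one way to handle escapes (a sketch; which escapes to support is my own choice here) is an extra branch in StringLexerMode that recognizes a backslash followed by ", \, or $, and emits the unescaped character as literal content. The catch-all /[^"$]+/ pattern would then also need to exclude backslashes (/[^"$\\]+/), so that the escape branch gets a chance to match.

```ruby
require "strscan"

# Sketch of an escape-sequence branch for StringLexerMode. The regex
# matches a backslash followed by ", \ or $; the token value is the
# character without its backslash.
scanner = StringScanner.new('\\" and more')
if scanner.scan(/\\[\\"$]/)
  token = [:STRING_PART_LIT, scanner.matched[1]]
end
token # => [:STRING_PART_LIT, '"']
```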
This modal lexing technique is not just usable for handling string interpolation. It can handle nested comments as well (e.g. /* left /* inner comment */ right */
), and it even allows embedding entire different languages.
The possibilities are limited only by your imagination.
References
This article was inspired by How OSH Uses Lexer Modes (2016), which introduces the concept of modal lexing.
This change turns the lexer from a finite automaton into a pushdown automaton. This means that the lexer no longer recognizes a regular language but a context-free language. The textbook theory says that lexers should be regular. But modal lexers are such an elegant solution that I’m quite happy to drop the theory and go with what actually works. ↩