Using a modal lexer for parsing sublanguages
How do you write a lexer for print "one plus two is ${1+2}";
? The string interpolation part makes this tricky, as the string forms a sublanguage that is different from the main language. To make things worse, interpolation can be nested. Textbook lexers won’t cut it, and we need something stronger.
Enter modal lexers. These lexers can switch between different modes for lexing different languages. In this article, I’ll run you through an example of how to build such a modal lexer.
For this article, I’m using Ruby as the programming language. The same technique will work in other programming languages too. No external dependencies (gems) are used.
Ruby’s string scanner
Ruby provides a StringScanner
class, which is an excellent building block for lexers. It offers a clean API for splitting up a string into individual parts.
I’ll run you through a simple example, so that you can get a feel of how it works. First, we require strscan
, and then we instantiate the StringScanner
with the string to scan:
require "strscan"
str = "let amount = 5;"
scanner = StringScanner.new(str)
The #scan
method, when given a string part, will attempt to match this given string part:
scanner.scan("let") # => "let"
The scanner starts at the beginning of the string. When a match is successful, the scanner will move forward. Because we just matched "let"
, the scanner has moved past this "let"
, and attempting to match it again will fail:
scanner.scan("let") # => nil
The #scan
method returns the matched string part, or nil
if there is no match. In the latter case, the scanner stays at its current position.
This method can also take a regular expression:
scanner.scan(/\s+/) # => " "
The regular expression \s+
matches one or more whitespace characters, such as spaces, tabs, and newlines.
Now that we’ve matched "let"
and a space character, we can match the next part: an identifier, consisting of alphanumeric characters. For this, the regular expression \w+
comes in handy, which matches one or more alphanumeric characters:
scanner.scan(/\w+/) # => "amount"
Another useful method is #matched
, which returns the matched string part:
scanner.matched # => "amount"
The StringScanner
provides more functionality, but #scan
and #matched
, together with #eos? (an end-of-string check) and #getch (which reads a single character), are all we need to write a lexer.
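Put together, the scanning session above reads as one short script:

```ruby
require "strscan"

# The snippets above, combined: the scanner walks forward through
# the string with each successful match.
scanner = StringScanner.new("let amount = 5;")
scanner.scan("let")   # => "let"    (scanner advances past "let")
scanner.scan("let")   # => nil      (no match; scanner stays put)
scanner.scan(/\s+/)   # => " "      (one or more whitespace characters)
scanner.scan(/\w+/)   # => "amount"
scanner.matched       # => "amount" (the most recent match)
```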
A basic lexer
Before writing a modal lexer, we’ll start with a basic lexer:
class BasicLexer
def initialize(str)
@scanner = StringScanner.new(str)
end
This BasicLexer
is constructed with a string. The string itself isn’t important; we’ll only use the StringScanner
.
def next_token
The only public method for this lexer is #next_token
, which will return a single token, or nil
when there are no more tokens.
if @scanner.eos?
nil
If the scanner is at the end of the string (eos
stands for end-of-string), we’ll return nil
.
elsif @scanner.scan(/\s+/)
[:WHITESPACE, nil]
If we match one or more whitespace characters (\s+
), we return a WHITESPACE
token.
Tokens are arrays with two elements: first, the token type as an uppercase symbol, and second, an optional value. The value for this WHITESPACE
token is nil
; we won’t use the value for anything.
elsif @scanner.scan("+")
[:PLUS, nil]
elsif @scanner.scan("*")
[:TIMES, nil]
elsif @scanner.scan("-")
[:MINUS, nil]
elsif @scanner.scan("/")
[:DIVIDE, nil]
elsif @scanner.scan(";")
[:SEMICOLON, nil]
elsif @scanner.scan("=")
[:EQUAL, nil]
Here, we match against some single-character syntax elements.
elsif @scanner.scan(/\d+/)
[:NUMBER, @scanner.matched]
We match against the regular expression \d+
, which matches one or more digits. For this token, we record the value: @scanner.matched
is the string containing the digits, e.g. "42"
.
The next bit deals with keywords and identifiers:
elsif @scanner.scan(/\w+/)
case @scanner.matched
when "let"
[:KW_LET, nil]
when "print"
[:KW_PRINT, nil]
else
[:IDENTIFIER, @scanner.matched]
end
This code reads a sequence of alphanumeric characters — anything that could be a keyword or an identifier. Then, we check whether the matched string part is one of the predefined keywords.
Note that this correctly handles identifiers that start with a keyword. For example, "letter"
is an identifier, not the keyword "let"
followed by the identifier "ter"
.
else
char = @scanner.getch
raise "Unknown character: #{char}"
end
end
end
Lastly, in case the scanner did not match anything, we raise an exception. It’s a crude way of reporting errors, but it’ll do for this basic lexer.
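For reference, here is the BasicLexer assembled from the fragments above into one runnable class:

```ruby
require "strscan"

# The complete BasicLexer, assembled from the fragments above.
class BasicLexer
  def initialize(str)
    @scanner = StringScanner.new(str)
  end

  # Returns the next token as [TYPE, value], or nil at end of input.
  def next_token
    if @scanner.eos?
      nil
    elsif @scanner.scan(/\s+/)
      [:WHITESPACE, nil]
    elsif @scanner.scan("+") then [:PLUS, nil]
    elsif @scanner.scan("*") then [:TIMES, nil]
    elsif @scanner.scan("-") then [:MINUS, nil]
    elsif @scanner.scan("/") then [:DIVIDE, nil]
    elsif @scanner.scan(";") then [:SEMICOLON, nil]
    elsif @scanner.scan("=") then [:EQUAL, nil]
    elsif @scanner.scan(/\d+/)
      [:NUMBER, @scanner.matched]
    elsif @scanner.scan(/\w+/)
      # Keywords and identifiers share the same shape; disambiguate
      # after matching, so that "letter" is not lexed as "let" + "ter".
      case @scanner.matched
      when "let"   then [:KW_LET, nil]
      when "print" then [:KW_PRINT, nil]
      else [:IDENTIFIER, @scanner.matched]
      end
    else
      raise "Unknown character: #{@scanner.getch}"
    end
  end
end
```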
Testing it out
To test out this basic lexer, construct it with a string, and keep calling #next_token
until that method returns nil
:
str = "let amount = 42;"
lexer = BasicLexer.new(str)
loop do
token = lexer.next_token
break unless token
p token
end
This code spits out the following, as expected:
[:KW_LET, nil]
[:WHITESPACE, nil]
[:IDENTIFIER, "amount"]
[:WHITESPACE, nil]
[:EQUAL, nil]
[:WHITESPACE, nil]
[:NUMBER, "42"]
[:SEMICOLON, nil]
A modal lexer
We want to be able to parse code like this:
print "one plus two is ${1+2}";
In this hypothetical programming language, the expression between ${
and }
is evaluated and embedded in the string. This code would thus effectively be the same as print "one plus two is 3";
.
The contents of the string are in a different language. Outside of a string, ${
would be a lexer error. Also, one
and plus
are string literal parts, not identifiers. To make our lexer support this code, we'll allow it to switch between two modes.1 First, we'll have a main lexer mode, which will do pretty much the same as the earlier BasicLexer
. Secondly, we’ll have a string lexer mode, which will be used for handling the contents of strings.
However, toggling between these two modes is not enough. It is possible to have string interpolation nested within string interpolation:
print "why ${"${2+6}" * 5}";
The easiest way to support this in our ModalLexer
class is using a stack of lexer modes:
class ModalLexer
def initialize(str)
scanner = StringScanner.new(str)
mode = MainLexerMode.new(scanner)
@mode_stack = [mode]
end
The ModalLexer
class will, like the previous BasicLexer
class, be constructed with a string. It creates a stack of lexer modes, which will contain just a MainLexerMode
to begin with. This string scanner will be shared across lexer modes.
def next_token
mode = @mode_stack.last
mode.next_token(@mode_stack)
end
end
Just like with BasicLexer
, the only public method for this lexer is #next_token
. However, the implementation here delegates to the current lexer mode, which sits at the top of the stack.
The #next_token
method for lexer modes takes the mode stack as an argument. This will allow the lexer modes to modify this stack.
The main lexer mode
The implementation of MainLexerMode
is similar to that of BasicLexer
:
class MainLexerMode
def initialize(scanner)
@scanner = scanner
end
A MainLexerMode
is constructed with a string scanner, rather than a string. The string scanner needs to be shared across lexer modes. This way, when switching lexer modes, the new lexer mode can continue where the previous one left off.
def next_token(mode_stack)
The #next_token
method takes the stack of lexer modes as a parameter.
if @scanner.eos?
nil
As with BasicLexer
, if the scanner is at the end of the string, we return nil
.
elsif @scanner.scan('"')
mode = StringLexerMode.new(@scanner)
mode_stack.push(mode)
[:STRING_START, nil]
When we find a "
character, we push a new StringLexerMode
onto the lexer mode stack, and return a STRING_START
token. This way, the next call to #next_token
will be delegated to this new StringLexerMode
instance, at the top of the stack.
The rest of the MainLexerMode
implementation is identical to BasicLexer
.
The string lexer mode
The StringLexerMode
class is the lexer mode that is used inside string literals.
class StringLexerMode
def initialize(scanner)
@scanner = scanner
end
As with MainLexerMode
, a StringLexerMode
is constructed with a string scanner.
def next_token(mode_stack)
if @scanner.eos?
nil
The #next_token
method starts identical to MainLexerMode
, returning nil
when the scanner is at the end of the string.
elsif @scanner.scan('"')
mode_stack.pop
[:STRING_END, nil]
The "
character indicates the end of the string. We pop the current lexer mode off the stack, and return a STRING_END
token.
elsif @scanner.scan("${")
mode = InterpolationLexerMode.new(@scanner)
mode_stack.push(mode)
[:STRING_INTERP_START, nil]
When we find ${
, we push a new lexer mode: InterpolationLexerMode
. This is a to-be-implemented lexer mode which will be similar to MainLexerMode
. We return a STRING_INTERP_START
token.
elsif @scanner.scan("$")
[:STRING_PART_LIT, @scanner.matched]
When we find a $
character, we return a STRING_PART_LIT
token with that $
character as the content of the token. A STRING_PART_LIT
token is used for literal string content. We know that this $
won’t be followed by a {
, because otherwise it’d have matched the earlier ${
case.
elsif @scanner.scan(/[^"$]+/)
[:STRING_PART_LIT, @scanner.matched]
When we find a character sequence that includes neither "
nor $
, we return a STRING_PART_LIT
token.
else
char = @scanner.getch
raise "Unknown character: #{char}"
end
end
end
Lastly, if the scanner did not match anything, we raise an exception.
The string interpolation lexer mode
The string interpolation lexer mode is used between ${
and }
inside a string. This lexer mode is identical to MainLexerMode
, with the difference that it needs to pop the lexer mode stack when it finds }
.
class InterpolationLexerMode
def initialize(scanner)
@scanner = scanner
@delegate = MainLexerMode.new(@scanner)
end
The InterpolationLexerMode
is constructed with a scanner, similar to the other lexer modes. It also creates a MainLexerMode
, to which it will delegate.
def next_token(mode_stack)
if @scanner.scan("}")
mode_stack.pop
[:STRING_INTERP_END, nil]
If we find a }
character, we pop the lexer mode stack. Afterwards, the lexer mode at the top of the stack will be a StringLexerMode
. We return a STRING_INTERP_END
token.
else
@delegate.next_token(mode_stack)
end
end
end
Otherwise, we delegate the #next_token
call to the MainLexerMode
instance.
Testing it out
As before, we can test out this modal lexer by constructing it with a string and calling #next_token
until that method returns nil
:
string = 'print "one plus two is ${1+2}.";'
lexer = ModalLexer.new(string)
loop do
token = lexer.next_token
break unless token
p token
end
This code spits out the following:
[:KW_PRINT, nil]
[:WHITESPACE, nil]
[:STRING_START, nil]
[:STRING_PART_LIT, "one plus two is "]
[:STRING_INTERP_START, nil]
[:NUMBER, "1"]
[:PLUS, nil]
[:NUMBER, "2"]
[:STRING_INTERP_END, nil]
[:STRING_PART_LIT, "."]
[:STRING_END, nil]
[:SEMICOLON, nil]
Note how the string starts with STRING_START
and ends with STRING_END
. Inside, we have literal parts (STRING_PART_LIT
), as well as interpolated parts, which start with STRING_INTERP_START
and end with STRING_INTERP_END
.
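For reference, here are all four classes from this article, assembled into one runnable script:

```ruby
require "strscan"

# The complete modal lexer: a shared StringScanner, plus a stack of
# lexer modes that push and pop each other as the input changes language.
class MainLexerMode
  def initialize(scanner)
    @scanner = scanner
  end

  def next_token(mode_stack)
    if @scanner.eos?
      nil
    elsif @scanner.scan('"')
      # A string starts: switch to the string sublanguage.
      mode_stack.push(StringLexerMode.new(@scanner))
      [:STRING_START, nil]
    elsif @scanner.scan(/\s+/)
      [:WHITESPACE, nil]
    elsif @scanner.scan("+") then [:PLUS, nil]
    elsif @scanner.scan("*") then [:TIMES, nil]
    elsif @scanner.scan("-") then [:MINUS, nil]
    elsif @scanner.scan("/") then [:DIVIDE, nil]
    elsif @scanner.scan(";") then [:SEMICOLON, nil]
    elsif @scanner.scan("=") then [:EQUAL, nil]
    elsif @scanner.scan(/\d+/)
      [:NUMBER, @scanner.matched]
    elsif @scanner.scan(/\w+/)
      case @scanner.matched
      when "let"   then [:KW_LET, nil]
      when "print" then [:KW_PRINT, nil]
      else [:IDENTIFIER, @scanner.matched]
      end
    else
      raise "Unknown character: #{@scanner.getch}"
    end
  end
end

class StringLexerMode
  def initialize(scanner)
    @scanner = scanner
  end

  def next_token(mode_stack)
    if @scanner.eos?
      nil
    elsif @scanner.scan('"')
      mode_stack.pop
      [:STRING_END, nil]
    elsif @scanner.scan("${")
      mode_stack.push(InterpolationLexerMode.new(@scanner))
      [:STRING_INTERP_START, nil]
    elsif @scanner.scan("$")
      # A lone "$" not followed by "{" is literal string content.
      [:STRING_PART_LIT, @scanner.matched]
    elsif @scanner.scan(/[^"$]+/)
      [:STRING_PART_LIT, @scanner.matched]
    else
      raise "Unknown character: #{@scanner.getch}"
    end
  end
end

class InterpolationLexerMode
  def initialize(scanner)
    @scanner = scanner
    @delegate = MainLexerMode.new(@scanner)
  end

  def next_token(mode_stack)
    if @scanner.scan("}")
      mode_stack.pop
      [:STRING_INTERP_END, nil]
    else
      @delegate.next_token(mode_stack)
    end
  end
end

class ModalLexer
  def initialize(str)
    scanner = StringScanner.new(str)
    @mode_stack = [MainLexerMode.new(scanner)]
  end

  def next_token
    @mode_stack.last.next_token(@mode_stack)
  end
end
```

The nested example from earlier, print "why ${"${2+6}" * 5}";, lexes correctly with this script as well: each " and ${ pushes a mode onto the stack, and each closing token pops it again.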
Now everything is in place for writing a parser.
Hints for the parser
Writing a parser isn’t quite in scope for this article, but it wouldn’t be complete without touching on the subject at least briefly.
In a parser, you’d define a string as the collection of parts between a STRING_START
and STRING_END
token:
string -> STRING_START string_part* STRING_END
A string part can be a literal part, or an interpolated part.
string_part -> STRING_PART_LIT
string_part -> STRING_INTERP_START
expression
STRING_INTERP_END
For string parts that are STRING_PART_LIT
, you’d extract the token value. For interpolated string parts, you’d evaluate the expression and convert it to a string. This way, parsing simple strings works:
print "ab5yz";
(print
(string
(lit "ab5yz")))
Parsing strings with interpolation works as well:
print "ab${2+3}yz";
(print
(string
(lit "ab")
(expr
(+ (num 2) (num 3)))
(lit "yz")))
To simplify this AST, you might consider de-sugaring this syntax. Here’s an example where I’ve replaced the string
and expr
AST nodes with calls to my concat()
and toString()
functions:
(print
(funcall
"concat"
(lit "ab")
(funcall
"toString"
(+ (num 2) (num 3)))
(lit "yz")))
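To make the grammar rules above a bit more concrete, here is a minimal sketch of the string production as a recursive-descent function over an array of tokens. The expect helper and the parse_expression stub are hypothetical scaffolding of mine; a real parser would build a proper expression AST instead of collecting raw tokens.

```ruby
# Hypothetical helper: consume one token and check its type.
def expect(tokens, type)
  token = tokens.shift
  raise "expected #{type}, got #{token.inspect}" unless token && token[0] == type
  token
end

# Stub: a real parser would build an expression AST here. This version
# just collects the raw tokens until the interpolation closes.
def parse_expression(tokens)
  expr = []
  expr << tokens.shift until tokens.first[0] == :STRING_INTERP_END
  expr
end

# string -> STRING_START string_part* STRING_END
def parse_string(tokens)
  expect(tokens, :STRING_START)
  parts = []
  until tokens.first[0] == :STRING_END
    type, value = tokens.shift
    case type
    when :STRING_PART_LIT
      parts << [:lit, value]
    when :STRING_INTERP_START
      parts << [:expr, parse_expression(tokens)]
      expect(tokens, :STRING_INTERP_END)
    else
      raise "unexpected token #{type}"
    end
  end
  expect(tokens, :STRING_END)
  [:string, parts]
end
```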
Wrapping up
This wraps up the modal lexing example, though plenty can be done to improve this lexer. These are the bits I left out, for the sake of simplicity:

- The StringLexerMode does not handle escape sequences, like \", \\, or \$. Escape sequences are indispensable in a real-world string lexer.
- The lexer does not track source location (line number, column number, or filename). This too is indispensable for reporting errors.
- Speaking of errors: the lexer's error handling leaves much to be desired.
- The lexer emits WHITESPACE tokens, which might not be desirable. These could be handled by the parser, or stripped entirely beforehand.
- I've not paid attention to the performance characteristics of this lexer. My assumption is that StringScanner is fast enough, especially with pre-compiled regular expressions. However, it might be worth looking into other approaches and tools for lexing.
- The InterpolationLexerMode will pop the lexer mode stack as soon as it finds a }. This will cause problems if the } token is used for other purposes as well, such as defining functions (e.g. fn a() { … }) or creating maps (e.g. let thing = { value: 123 };). The trick here is to push a MainLexerMode when finding a {, and to let MainLexerMode pop the stack when finding a }. At that point, InterpolationLexerMode is redundant and can be substituted with MainLexerMode.
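As a taste of the escape-sequence point: one way to handle escapes (a sketch; which escapes to support is my own choice here) is an extra branch in StringLexerMode that recognizes a backslash followed by ", \, or $, and emits the unescaped character as literal content. The catch-all /[^"$]+/ pattern would then also need to exclude backslashes (/[^"$\\]+/), so that the escape branch gets a chance to match.

```ruby
require "strscan"

# Sketch of an escape-sequence branch for StringLexerMode. The regex
# matches a backslash followed by ", \ or $; the token value is the
# character without its backslash.
scanner = StringScanner.new('\\" and more')
if scanner.scan(/\\[\\"$]/)
  token = [:STRING_PART_LIT, scanner.matched[1]]
end
token # => [:STRING_PART_LIT, '"']
```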
This modal lexing technique is not just usable for handling string interpolation. It can handle nested comments as well (e.g. /* left /* inner comment */ right */
), and it even allows embedding entire different languages.
The possibilities are limited only by your imagination.
References
This article was inspired by How OSH Uses Lexer Modes (2016), which introduces the concept of modal lexing.
This change turns the lexer from a finite automaton into a pushdown automaton. This means that the lexer no longer recognizes a regular language but a context-free language. The textbook theory says that lexers should be regular. But modal lexers are such an elegant solution that I’m quite happy to drop the theory and go with what actually works. ↩