Writing a JSON parser (part 1)
In the previous post, I explained that I was writing a JSON parser. Never having written a parser, I thought about the problem a bit and imagined transforming a string input into some kind of data structure that represented the parsed JSON.
In that initial brainstorm, I thought that I would likely only need to visit each part of the file once, and that I could transform the characters into the final representation in a single pass. While I still think that's correct, it seems that there's a simple way for me to break this problem into two steps: lexing (aka scanning) and parsing.
In Crafting Interpreters (in fact, on its cover as an image of a mountain), Robert Nystrom breaks down the task of transforming source code into a stream of tokens via lexing or scanning, and then transforming the tokens into an AST (abstract syntax tree) via parsing.
Armed with this knowledge, I decided to split my JSON parser into two parts: a lexer and a parser. The lexer's job is to split the source code (a string with some JSON in it) into tokens. The parser's job will be to take those tokens and interpret them as useful JSON values / objects (or throw a JsonParseError if anything goes wrong).
I now have (at least) two ways my program can fail on its input: the lexer can fail, which should result in a JsonLexError, or the parser can fail, which should result in a JsonParseError. Of course, I could (and probably will/should) make these variants of the same type, so that users of my library (read: me, next week) only need to deal with one error type. I've also learned a bit about idiomatic ways of handling errors, as well as some of the typical crates people reach for to generate boilerplate code for transforming errors of one type into another. I'll do that, but not right away. First things first, I want to get my lexer to recognize and output some Tokens.
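For reference, the combined error type could eventually look something like this sketch (names and shapes are placeholders for code I haven't written yet):
#[derive(Debug)]
pub struct JsonLexError;

#[derive(Debug)]
pub struct JsonParseError;

// One public error type wrapping both failure modes.
#[derive(Debug)]
pub enum JsonError {
    Lex(JsonLexError),
    Parse(JsonParseError),
}

// From impls let `?` convert either error into JsonError automatically.
impl From<JsonLexError> for JsonError {
    fn from(e: JsonLexError) -> Self {
        JsonError::Lex(e)
    }
}

impl From<JsonParseError> for JsonError {
    fn from(e: JsonParseError) -> Self {
        JsonError::Parse(e)
    }
}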
Writing the Lexer
If I were parsing a programming language, I'd have quite a few tokens. Here is an excerpt from Crafting Interpreters (4.2.1):
enum TokenType {
// Single-character tokens.
LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,
// One or two character tokens.
BANG, BANG_EQUAL, EQUAL, EQUAL_EQUAL,
GREATER, GREATER_EQUAL, LESS, LESS_EQUAL,
// Literals.
IDENTIFIER, STRING, NUMBER,
// Keywords.
AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,
EOF
}
Luckily for me, JSON doesn't need so many tokens. JSON's grammar in McKeeman Form is:
json
   element

value
   object
   array
   string
   number
   "true"
   "false"
   "null"

object
   '{' ws '}'
   '{' members '}'

members
   member
   member ',' members

member
   ws string ws ':' element

array
   '[' ws ']'
   '[' elements ']'

elements
   element
   element ',' elements

element
   ws value ws

string
   '"' characters '"'

characters
   ""
   character characters

character
   '0020' . '10FFFF' - '"' - '\'
   '\' escape

escape
   '"'
   '\'
   '/'
   'b'
   'f'
   'n'
   'r'
   't'
   'u' hex hex hex hex

hex
   digit
   'A' . 'F'
   'a' . 'f'

number
   integer fraction exponent

integer
   digit
   onenine digits
   '-' digit
   '-' onenine digits

digits
   digit
   digit digits

digit
   '0'
   onenine

onenine
   '1' . '9'

fraction
   ""
   '.' digits

exponent
   ""
   'E' sign digits
   'e' sign digits

sign
   ""
   '+'
   '-'

ws
   ""
   '0020' ws
   '000A' ws
   '000D' ws
   '0009' ws
From here, we can start to recognize and pull out the tokens. Left and right curly braces will need to be tokens (they mark the beginning and end of an object, which will have optional members), as will left and right square brackets (marking the beginning and end of arrays).
I tricked myself into thinking that quotation marks (") should also be tokens. After all, they mark the beginning and end of a string. I even thought that, like braces or square brackets, I might want a left quote and a right quote. I went ahead and implemented that, but after thinking about it I realized that this was the only time in my program that I would care about quotation marks. Everything after this (either the parser or the consumer of the parsed values) will only care about the actual string value. So instead of having TokenType::Quote, I have TokenType::String, which I recognize by looking for quotation marks during lexing.
In the end, I identified 12 token types, though one of them (EOF) is not strictly necessary. I did not make use of Rust's fancy enums (variants that carry data). Perhaps later on, as I move to simplify my program, I will. (In fact, I would not be surprised if I end up combining lexing and parsing into one step, as I had initially imagined.) But, at least for this initial version, my token types are:
pub enum TokenType {
LeftBrace,
RightBrace,
LeftBracket,
RightBracket,
Colon,
Comma,
String,
Number,
True,
False,
Null,
EOF,
}
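For illustration, the "fancy" data-carrying version might eventually look something like this (a sketch of a possible refactor, with an illustrative name, not what I've implemented):
// Variants carry their values, so the parser wouldn't need to slice back
// into the source to recover strings and numbers.
pub enum TokenKind {
    LeftBrace,
    RightBrace,
    LeftBracket,
    RightBracket,
    Colon,
    Comma,
    String(String), // the string contents, with quotes stripped
    Number(f64),
    True,
    False,
    Null,
    Eof,
}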
And my lexer outputs Tokens of the form:
pub struct Token {
pub token_type: TokenType,
start: usize,
end: usize,
line_number: usize,
}
Here, start and end are indices into the source code. Since my source code is UTF-8, indices into the raw buffer do not act as indices into a char array. (The reason is that UTF-8 is a variable-width encoding. If you have no idea what I'm talking about, it's actually an interesting bit of useful information to absorb. I recommend reading about it, and maybe even implementing some small programs that manually chunk a byte stream as UTF-8, or inspect each byte of a UTF-8 byte stream. To my surprise, I found this seemingly dull topic pretty engaging.)
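As a tiny illustration of the mismatch (not code from my lexer):
fn main() {
    let s = "héllo";
    // 'é' is two bytes in UTF-8, so byte length and char count differ.
    assert_eq!(s.len(), 6);
    assert_eq!(s.chars().count(), 5);
    // Byte index 2 falls in the middle of 'é', so it isn't a char boundary.
    assert!(!s.is_char_boundary(2));
}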
Anyway, I store the start and end of the tokens as offsets into the source code's Vec<char>, which feels at least a little bit wrong. What I really want is not chars but grapheme clusters. From the standard library function std::str::chars:
/// Returns an iterator over the [`char`]s of a string slice.
///
/// As a string slice consists of valid UTF-8, we can iterate through a
/// string slice by [`char`]. This method returns such an iterator.
///
/// It's important to remember that [`char`] represents a Unicode Scalar
/// Value, and might not match your idea of what a 'character' is. Iteration
/// over grapheme clusters may be what you actually want. This functionality
/// is not provided by Rust's standard library, check crates.io instead.
///
/// # Examples
///
/// ...
///
/// Remember, [`char`]s might not match your intuition about characters:
///
/// [`char`]: prim@char
///
let y = "y̆";
let mut chars = y.chars();
assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());
Still, I don't think this will lead to any issues in my program, because no Unicode scalar value (which is what Rust's 4-byte char type is) will overlap with my lexing of any of the tokens. The only place non-ASCII characters will show up at all (for valid JSON) is when I'm lexing a TokenType::String. At some point, though, if there's an invalid character in the input, I'll want to return line and column numbers to the caller. It seems that Emacs treats "y̆" as two characters: y and ̆. Therefore, perhaps keeping indices to the chars is the right thing after all, rather than indices of grapheme clusters, which might end up translating to column numbers in a way that's not very convenient to the user.
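If I ever do want grapheme clusters, the doc comment points to crates.io; the unicode-segmentation crate is the usual choice. A quick sketch, assuming that dependency is added:
// Requires the unicode-segmentation crate in Cargo.toml.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let y = "y̆";
    // One grapheme cluster, but two Unicode scalar values (chars).
    assert_eq!(y.graphemes(true).count(), 1);
    assert_eq!(y.chars().count(), 2);
}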
In any case, as I am writing this, I realize that I have not really handled column numbers anywhere in the program yet, so I'll add that to my TODO list:
- TODO Combine lexer / parser error types into one. Look into relevant crates and idiomatic code.
- TODO Provide useful debugging info to the caller (like column number of errors).
- TODO Catch multiple errors at once, if possible (to prevent whack-a-mole debugging).
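For the column-number item, one option would be to derive line and column lazily from a char offset when an error is reported, rather than tracking columns during lexing. A rough sketch (this helper doesn't exist in my code yet):
// Hypothetical helper: walk the chars up to `offset` and count newlines.
// Columns are counted in chars, matching how Emacs counted "y̆" above.
fn line_and_column(source: &[char], offset: usize) -> (usize, usize) {
    let mut line = 0;
    let mut column = 0;
    for &c in source.iter().take(offset) {
        if c == '\n' {
            line += 1;
            column = 0;
        } else {
            column += 1;
        }
    }
    (line, column)
}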
Another thing that's in the back of my mind is that I should maybe not be storing indices at all, but rather string slices. But given that storing string slices in my Token structs would require me to introduce explicit lifetime annotations to the code, I decided not to try to deal with those just yet.
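For the curious, that alternative might look roughly like this (a sketch with a placeholder name, not what I've built):
// A hypothetical Token that borrows its lexeme from the source string.
// The 'src lifetime ties each token to the input it was lexed from.
pub struct BorrowedToken<'src> {
    pub token_type: TokenType,
    pub lexeme: &'src str,
    pub line_number: usize,
}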
In any case, I've written an initial version of my lexer without thinking too hard about how the code might be made better. I figure I'll save that for when I have a more complete program. Perhaps later on I'll put all of the code I've written as part of the Computer Enhance course on GitHub, or at least my JSON parser.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TokenType {
LeftBrace,
RightBrace,
LeftBracket,
RightBracket,
Colon,
Comma,
String,
Number,
True,
False,
Null,
EOF,
}
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token {
pub token_type: TokenType,
start: usize,
end: usize,
line_number: usize,
}
pub struct Lexer {
pub tokens: Vec<Token>,
source_string: String,
source: Vec<char>,
start: usize,
current: usize,
line: usize,
}
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LexerError;
type Result<T> = std::result::Result<T, LexerError>;
impl Lexer {
pub fn new(s: &str) -> Self {
Lexer {
source_string: s.to_string(),
source: s.chars().collect(),
tokens: vec![],
start: 0,
current: 0,
line: 0,
}
}
fn scan_token(&mut self) -> Result<Token> {
// Eat whitespace
while self.start < self.source.len() && self.source[self.start].is_whitespace() {
if self.source[self.start] == '\n' {
self.line += 1;
}
self.start += 1;
self.current = self.start;
}
// EOF
if self.start == self.source.len() {
let token = Token {
token_type: TokenType::EOF,
start: self.start,
end: self.start,
line_number: self.line,
};
self.tokens.push(token.clone());
return Ok(token);
}
// Symbols
let token_type = match self.source[self.current] {
'{' => Some(TokenType::LeftBrace),
'}' => Some(TokenType::RightBrace),
'[' => Some(TokenType::LeftBracket),
']' => Some(TokenType::RightBracket),
':' => Some(TokenType::Colon),
',' => Some(TokenType::Comma),
_ => None,
};
if let Some(token_type) = token_type {
self.current += 1;
let token = Token {
token_type,
start: self.start,
end: self.current,
line_number: self.line,
};
self.tokens.push(token.clone());
self.start = self.current;
return Ok(token);
}
match self.source[self.current] {
// String
'"' => {
self.current += 1;
while self.current < self.source.len() {
if self.source[self.current] == '"' {
let token = Token {
token_type: TokenType::String,
start: self.start + 1,
end: self.current - 1,
line_number: self.line,
};
self.tokens.push(token.clone());
self.current += 1;
self.start = self.current;
return Ok(token);
} else {
if self.source[self.current] == '\n' {
self.line += 1;
}
self.current += 1;
}
}
}
// Number
c if c.is_numeric() => {
let mut dot = false;
let mut after_dot = false;
self.current += 1;
while self.current < self.source.len() {
match self.source[self.current] {
next if next.is_numeric() => {
if dot {
after_dot = true;
}
self.current += 1;
}
'.' => {
if !dot {
dot = true;
self.current += 1;
}
}
_next => {
break;
}
}
}
// error if saw dot without followup
if dot && !after_dot {
return Err(LexerError);
} else {
let token = Token {
token_type: TokenType::Number,
start: self.start,
end: self.current,
line_number: self.line,
};
self.tokens.push(token.clone());
self.start = self.current;
return Ok(token);
}
}
_ => {}
}
// True
if self.source.len() >= self.start + 4
&& self.source[self.start..self.start + 4] == vec!['t', 'r', 'u', 'e']
{
let token = Token {
token_type: TokenType::True,
start: self.start,
end: self.start + 4,
line_number: self.line,
};
self.tokens.push(token.clone());
self.current = self.start + 4;
self.start = self.current;
return Ok(token);
}
// False
if self.source.len() >= self.start + 5
&& self.source[self.start..self.start + 5] == vec!['f', 'a', 'l', 's', 'e']
{
let token = Token {
token_type: TokenType::False,
start: self.start,
end: self.start + 5,
line_number: self.line,
};
self.tokens.push(token.clone());
self.current = self.start + 5;
self.start = self.current;
return Ok(token);
}
// Null
if self.source.len() >= self.start + 4
&& self.source[self.start..self.start + 4] == vec!['n', 'u', 'l', 'l']
{
let token = Token {
token_type: TokenType::Null,
start: self.start,
end: self.start + 4,
line_number: self.line,
};
self.tokens.push(token.clone());
self.current = self.start + 4;
self.start = self.current;
return Ok(token);
}
Err(LexerError)
}
pub fn scan_tokens(&mut self) -> Result<()> {
while self.scan_token()?.token_type != TokenType::EOF {}
Ok(())
}
}
#[cfg(test)]
mod lexer_tests {
use super::*;
#[test]
fn lexer_can_scan_symbols() {
let mut lexer = Lexer::new("{}[]:,");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::RightBrace
);
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::LeftBracket
);
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::RightBracket
);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Comma);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
}
#[test]
fn lexer_can_scan_string() {
let mut lexer = Lexer::new("\"foo\"");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
}
#[test]
fn lexer_can_scan_object() {
let mut lexer = Lexer::new("{\"foo\":\"bar\"}");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::RightBrace
);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
}
#[test]
fn lexer_can_skip_whitespace() {
let mut lexer = Lexer::new("{ \"foo\" : \n\"bar\"}\n");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::RightBrace
);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
}
#[test]
fn lexer_returns_correct_positions() {
let mut lexer = Lexer::new("\n{\n \"foo\" : \n\"bar\"\n}\n");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
let token = lexer.scan_token().unwrap();
assert_eq!(token.token_type, TokenType::String);
assert_eq!(token.start, 16);
assert_eq!(token.end, 18);
assert_eq!(token.line_number, 3);
let token = lexer.scan_token().unwrap();
assert_eq!(token.token_type, TokenType::RightBrace);
assert_eq!(token.start, 21);
assert_eq!(token.end, 22);
assert_eq!(token.line_number, 4);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
}
#[test]
fn lexer_can_scan_numbers() {
let mut lexer = Lexer::new("[123]");
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::LeftBracket
);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Number);
assert_eq!(
lexer.scan_token().unwrap().token_type,
TokenType::RightBracket
);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
let mut lexer = Lexer::new("123.456,");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Number);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Comma);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
let mut lexer = Lexer::new("123.");
assert!(lexer.scan_token().is_err());
let mut lexer = Lexer::new(".123");
assert!(lexer.scan_token().is_err());
}
#[test]
fn lexer_can_scan_keywords() {
let mut lexer = Lexer::new("truefalsenull");
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::True);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::False);
assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Null);
let mut lexer = Lexer::new("tru");
assert!(lexer.scan_token().is_err());
}
}
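Here's a quick sketch of how the lexer could be driven from a main function, just to see the token stream (not part of the module above):
fn main() {
    let mut lexer = Lexer::new("{\"nums\": [1, 2.5, true, null]}");
    // scan_tokens fills lexer.tokens until it reaches EOF (or an error).
    match lexer.scan_tokens() {
        Ok(()) => {
            for token in &lexer.tokens {
                println!("{:?}", token.token_type);
            }
        }
        Err(e) => eprintln!("lex error: {:?}", e),
    }
}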
Satisfyingly, the tests in my module all pass, and I'm ready to start working on the actual Parser. That will have to wait for the next post.