Writing a JSON parser (part 2)
In the previous post, I shared how I parsed the "raw" string of JSON into tokens of various types. If I were writing a compiler, transpiler, or interpreter for a programming language, I would then want to parse these tokens to assemble an AST (abstract syntax tree).
To parse JSON, though, my job is significantly easier. I'll still want to end up with some kind of tree-like structure, but I won't have nearly as many token types to worry about. Nor will I need to worry about precedence order or left/right associativity.
I will still use the same basic technique outlined in Crafting Interpreters (section 6.2) called Recursive Descent.
In a nutshell, what this means is that I'll have my `Lexer` produce a stream of `Token`s, and then parse a `JsonObject` value from that stream. My parse library is meant to be used with only a single value at the root, with as many values nested recursively as necessary. The top-level `parse` function is therefore very straightforward:
```rust
struct Parser {
    pub lexer: Lexer,
}

impl Parser {
    fn new(lexer: Lexer) -> Self {
        Parser { lexer }
    }

    // Other methods omitted

    fn parse(&mut self) -> Result<JsonObject> {
        self.lexer.scan_tokens()?;
        let tokens = self.lexer.tokens.clone();
        let tokens = &mut tokens.into_iter().peekable();
        let root = self.parse_value(tokens);
        match tokens.peek() {
            Some(token) if token.token_type == TokenType::EOF => root,
            _ => Err(JsonParseError), // Should only be one value at the root
        }
    }
}
```
There's a little bit of shenaniganry happening here. I'll go over all the little things I intend to revisit.
It's strange that the `Parser` has a `Lexer` in it, which we have to construct before calling `Parser::new`.
I generally feel leery about methods that give off "object-oriented" vibes. It's strange to me to call `my_parser.parse()` rather than "just" calling a function like `Parser::parse(my_source_json)`. I don't need (nor do I want) to hold onto this `Parser` thing.
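To sketch what I mean, here's a hypothetical free-function API. This is not the post's actual code: the names, the error type, and the handful of literals it accepts are all illustrative. The point is only the shape — the lexer and parser would be constructed internally, so the caller passes source text and gets a value (or error) back:

```rust
// Hypothetical convenience API (illustrative names, not the post's code).
#[derive(Debug, PartialEq)]
enum JsonValue {
    True,
    False,
    Null,
}

fn parse(source: &str) -> Result<JsonValue, String> {
    // A stand-in for "build a Lexer, wrap it in a Parser, and parse";
    // this sketch only handles the three keyword literals.
    match source.trim() {
        "true" => Ok(JsonValue::True),
        "false" => Ok(JsonValue::False),
        "null" => Ok(JsonValue::Null),
        other => Err(format!("cannot parse {:?} in this sketch", other)),
    }
}

fn main() {
    assert_eq!(parse(" true "), Ok(JsonValue::True));
    assert!(parse("{}").is_err());
}
```

With this shape, the `Parser` struct becomes an internal detail rather than something the caller has to construct and carry around.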
It's awkward that I need to `.clone()` the `Lexer`'s `.tokens`. I should avoid that.
I consume the (cloned) tokens (with an `into_iter`) because I'm going to be transforming them into `JsonObject` enum values:
```rust
pub enum JsonObject {
    Object(HashMap<String, JsonObject>),
    Array(Box<Vec<JsonObject>>),
    String(String),
    Number(f64),
    True,
    False,
    Null,
}
```
I'll also sometimes want to peek ahead, and it's convenient that I can easily turn an iterator into a `Peekable`.
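The standard library makes this easy: any iterator can be wrapped with `.peekable()`, which adds a `peek` method that borrows the next item without consuming it. A quick illustration (the token strings here are just stand-ins):

```rust
fn main() {
    let tokens = vec!["{", "\"foo\"", ":", "true", "}"];
    let mut stream = tokens.into_iter().peekable();

    // peek() borrows the next item without advancing the iterator...
    assert_eq!(stream.peek(), Some(&"{"));
    assert_eq!(stream.peek(), Some(&"{")); // still there

    // ...while next() consumes it.
    assert_eq!(stream.next(), Some("{"));
    assert_eq!(stream.peek(), Some(&"\"foo\""));
}
```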
`Parser::parse_value` is where the magic happens. Its job is to construct a value at whatever point in the token stream we're at (`&mut Peekable<IntoIter<Token>>`). That might be something as simple as `true`, `false`, or `null`. Or it could be a number or string. The next token in the stream cannot be a colon, a comma, a right bracket, a right brace, or an end-of-file; if that's what we see when we call `parse_value`, then the JSON is invalid. It might have an extra colon, for example:
```
{
    "foo" : true,
    "bar" : : false
}
```
The "recursive descent" part of the parser is also in `Parser::parse_value`. If the token stream gives us a `LeftBracket` or a `LeftBrace`, then it means we need to create an array or an object (respectively). In those cases, we need to parse values (separated by commas) until we get to a `RightBracket` or `RightBrace`. I can store the parsed `JsonObject`s in a `Vec` (or more specifically, a `Box<Vec<JsonObject>>`) for arrays, and in a `HashMap<String, JsonObject>` for objects.
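To make the recursion concrete, here's a heavily simplified, self-contained sketch of that dispatch. The token and value types are cut down to a few variants, only arrays recurse, and the loop is more permissive than real JSON (it would accept a trailing comma) — so treat it as an illustration of the shape, not the post's actual `parse_value`:

```rust
use std::iter::Peekable;
use std::vec::IntoIter;

// Cut-down stand-ins for the post's types.
#[derive(Debug, PartialEq)]
enum TokenType {
    LeftBracket,
    RightBracket,
    Comma,
    True,
    False,
    Null,
}

#[derive(Debug, PartialEq)]
enum JsonObject {
    Array(Box<Vec<JsonObject>>),
    True,
    False,
    Null,
}

// The core dispatch: look at the next token, decide which kind of value
// to build, and recurse for arrays.
fn parse_value(tokens: &mut Peekable<IntoIter<TokenType>>) -> Result<JsonObject, String> {
    match tokens.next() {
        Some(TokenType::True) => Ok(JsonObject::True),
        Some(TokenType::False) => Ok(JsonObject::False),
        Some(TokenType::Null) => Ok(JsonObject::Null),
        Some(TokenType::LeftBracket) => {
            let mut items = Vec::new();
            loop {
                if tokens.peek() == Some(&TokenType::RightBracket) {
                    tokens.next(); // consume the closing bracket
                    return Ok(JsonObject::Array(Box::new(items)));
                }
                items.push(parse_value(tokens)?); // <- the recursive descent
                if tokens.peek() == Some(&TokenType::Comma) {
                    tokens.next(); // consume the separator
                }
            }
        }
        _ => Err("unexpected token".to_string()),
    }
}

fn main() {
    // Token stream for: [true, null]
    let stream = vec![
        TokenType::LeftBracket,
        TokenType::True,
        TokenType::Comma,
        TokenType::Null,
        TokenType::RightBracket,
    ];
    let value = parse_value(&mut stream.into_iter().peekable()).unwrap();
    println!("{:?}", value);
}
```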
Aside: It occurs to me that it's weird to call the enum of all possible JSON values `JsonObject`, when `Object` is one of the variants. Perhaps I should rename this enum to something like `JsonValue`. Originally I didn't do that because the members of `JsonObject::Object`s are stored in `HashMap<String, JsonObject>`s, which have keys and (more importantly) values. I didn't want to confuse the values of an object's members with `JsonValue`s more generally, but in hindsight I chose a worse name. Besides -- the values of an object's members are `JsonValue`s, so this really wasn't a problem to begin with. So, I'll probably change it.
When `Parser::parse_value` encounters a `LeftBrace`, it creates a `HashMap<String, JsonObject>` and then calls `Parser::parse_member` to get key-value pairs until it gets to the matching `RightBrace` (or an error):
```rust
fn parse_member(
    &mut self,
    tokens: &mut Peekable<IntoIter<Token>>,
) -> Result<(String, JsonObject)> {
    let string = self.parse_string(tokens)?;
    match tokens.peek() {
        Some(colon) if colon.token_type == TokenType::Colon => {
            tokens.next();
        }
        _ => return Err(JsonParseError),
    }
    let value = self.parse_value(tokens)?;
    Ok((string, value))
}
```
As you can see, members must:
- Start with a string (the key),
- Separate the key and value with a colon, and
- Have a valid `JsonObject` as the value.
If any of these invariants is violated, then we'll return a `JsonParseError` (with no helpful debugging information just yet).
The third invariant calls `Parser::parse_value` recursively (with the same token stream), which illustrates "recursive descent".
My JSON parser is now functional. I asked ChatGPT to generate some sample JSON that I could use to test all the functionality. I might lean on it a little more to generate more of the test cases in general. I haven't gone down that rabbit hole quite yet (but it's going on the TODO list).
I'm not very happy with where the code is now. There are a few additional things that I've added to my todo list:
- TODO Combine lexer / parser error types into one. Look into relevant crates and idiomatic code.
- TODO Provide useful debugging info to the caller (like column number of errors).
- TODO Catch multiple errors at once, if possible (to prevent whack-a-mole debugging).
- TODO Avoid calls to `clone`. I shouldn't need to duplicate the data that's in the raw json.
- TODO Combine the `Lexer`'s String, Number, True, False, and Null types with the Parser's version of the same thing.
- TODO Handle iteration through comma-separated values better so that I'm not embarrassed to share `Parser::parse_value`.
- TODO Consider more convenient ways (getters?) to fetch parse data.
- TODO Consider more performant ways to store parsed data.
In the meantime, it feels good to see a passing test suite:
```
cargo test --benches --tests --all-features
   Compiling json_parser v0.1.0 (/home/john/src/sim86/json_parser)
    Finished test [unoptimized + debuginfo] target(s) in 0.19s
     Running unittests src/lib.rs (target/debug/deps/json_parser-c5f91c2e47445879)

running 12 tests
test lexer::lexer_tests::lexer_can_scan_numbers ... ok
test lexer::lexer_tests::lexer_can_scan_symbols ... ok
test lexer::lexer_tests::lexer_can_skip_whitespace ... ok
test lexer::lexer_tests::lexer_can_scan_string ... ok
test lexer::lexer_tests::lexer_can_scan_object ... ok
test lexer::lexer_tests::lexer_can_scan_keywords ... ok
test lexer::lexer_tests::lexer_returns_correct_positions ... ok
test parser::tests::parses_empty_object ... ok
test parser::tests::convenient_getters ... ok
test parser::tests::parses_simple_object ... ok
test parser::tests::parses_json ... ok
test parser::tests::parses_json_no_arrays ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

cargo-test finished at Sun Jun 25 19:00:28
```