
Writing a JSON parser (part 2)

In the previous post, I shared how I parsed the "raw" JSON string into tokens of various types. If I were writing a compiler, transpiler, or interpreter for a programming language, I would then want to parse these tokens to assemble an AST (abstract syntax tree).

To parse JSON, though, my job is significantly easier. I'll still want to end up with some kind of tree-like structure, but I won't have nearly as many token types to worry about. Nor will I need to worry about precedence order or left/right associativity.

I will still use the same basic technique outlined in Crafting Interpreters (section 6.2) called Recursive Descent.

In a nutshell, what this means is that I'll have my Lexer produce a stream of Tokens, and then parse a JsonObject value from that stream. My parser is meant to be used with a single value at the root, with as many values nested recursively as necessary. The top-level parse function is therefore very straightforward:

struct Parser {
    pub lexer: Lexer,
}

impl Parser {

    fn new(lexer: Lexer) -> Self {
        Parser { lexer }
    }

    // Other methods omitted

    fn parse(&mut self) -> Result<JsonObject> {
        // Have the lexer turn the raw source into tokens first.
        self.lexer.scan_tokens()?;

        // Cloning the tokens is a wart I'll revisit below.
        let tokens = self.lexer.tokens.clone();
        let tokens = &mut tokens.into_iter().peekable();
        let root = self.parse_value(tokens);

        match tokens.peek() {
            Some(token) if token.token_type == TokenType::EOF => root,
            _ => Err(JsonParseError), // Should only be one value at the root
        }
    }
}
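
For context, using it looks something like this (a sketch: I'm assuming a Lexer::new constructor that takes the raw source string, which isn't shown here):

// Hypothetical usage; `Lexer::new` taking the raw source is an assumption.
let lexer = Lexer::new(r#"{"foo": true}"#.to_string());
let mut parser = Parser::new(lexer);
let root = parser.parse().expect("valid JSON"); // a JsonObject with one member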

There's a little bit of shenaniganry happening here. I'll go over all the little things I intend to revisit.

It's strange that the Parser has a Lexer in it, which we have to construct before calling Parser::new.

I generally feel leery about methods that give off "object-oriented" vibes. It's strange to me to call my_parser.parse() rather than "just" calling a function like Parser::parse(my_source_json). I don't need to (nor do I want to) hold onto this Parser thing.

It's awkward that I need to .clone() the Lexer's .tokens. I should avoid that.

I consume the (cloned) tokens (with an into_iter) because I'm going to be transforming them into JsonObject enum values:

pub enum JsonObject {
    Object(HashMap<String, JsonObject>),
    Array(Box<Vec<JsonObject>>),
    String(String),
    Number(f64),
    True,
    False,
    Null,
}
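
To make that concrete, here's the document {"nums": [1, 2]} built by hand as a JsonObject:

use std::collections::HashMap;

let mut members: HashMap<String, JsonObject> = HashMap::new();
members.insert(
    "nums".to_string(),
    JsonObject::Array(Box::new(vec![
        JsonObject::Number(1.0),
        JsonObject::Number(2.0),
    ])),
);
let root = JsonObject::Object(members);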

I'll also sometimes want to peek ahead, and it's convenient that I can easily turn an iterator into a Peekable.
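
That is, peek borrows the next item without consuming it:

let mut iter = vec![1, 2, 3].into_iter().peekable();
assert_eq!(iter.peek(), Some(&1)); // look ahead; nothing is consumed
assert_eq!(iter.next(), Some(1));  // now actually consume it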

Parser::parse_value is where the magic happens. Its job is to construct a value at whatever part of the token stream we're at (&mut Peekable<IntoIter<Token>>). That might be something as simple as true, false, or null. Or it could be a number or string. The next token in the stream cannot be a colon, a comma, a right bracket, a right brace, or an end-of-file. If that's what we see when we call parse_value, then the JSON is invalid. It might have an extra colon, for example:

{
  "foo" : true,
  "bar" : : false
}

The "recursive descent" part of the parser is also in Parser::parse_value. If the token stream gives us a LeftBracket or a LeftBrace, then it means we need to create an array or an object (respectively). In those cases, we need to parse values (separated by commas) until we get to a RightBracket or RightBracket. I can store the parsed JsonObjects in a Vec (or more specifically, a Box<Vec<JsonObject>>) for arrays, and in a HashMap<String, JsonObject> for objects.

Aside: It occurs to me that it's weird to call the enum of all possible JSON values JsonObject when Object is one of its variants. Perhaps I should rename this enum to something like JsonValue. Originally I didn't do that because the members of a JsonObject::Object are stored in a HashMap<String, JsonObject>, which has keys and (more importantly) values. I didn't want to confuse the values of an object's members with JsonValues more generally, but in hindsight I chose the worse name. Besides, the values of an object's members are JSON values, so this really wasn't a problem to begin with. So, I'll probably change it.

When Parser::parse_value encounters a LeftBrace, it creates a HashMap<String, JsonObject> and then calls Parser::parse_member to get key-value pairs until it gets to the matching RightBrace (or an error):

    fn parse_member(
        &mut self,
        tokens: &mut Peekable<IntoIter<Token>>,
    ) -> Result<(String, JsonObject)> {
        // 1. The member must begin with a string (the key).
        let string = self.parse_string(tokens)?;
        // 2. The key must be followed by a colon.
        match tokens.peek() {
            Some(colon) if colon.token_type == TokenType::Colon => {
                tokens.next();
            }
            _ => return Err(JsonParseError),
        }
        // 3. The colon must be followed by a value.
        let value = self.parse_value(tokens)?;

        Ok((string, value))
    }

As you can see, members must:

1. begin with a string (the key),
2. follow that key with a colon, and
3. follow the colon with a value.

If any of these invariants are violated, then we'll return a JsonParseError (with no helpful debugging information just yet).

The third invariant is checked by calling Parser::parse_value recursively (with the same token stream), which is the "recursive descent" in action.
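
For completeness, here's roughly how the member loop hangs together. This is a sketch, not my actual code: the real logic currently lives inline in Parser::parse_value (hence the comma-handling TODO below), and parse_object is a hypothetical helper name:

    fn parse_object(
        &mut self,
        tokens: &mut Peekable<IntoIter<Token>>,
    ) -> Result<JsonObject> {
        tokens.next(); // consume the LeftBrace that got us here
        let mut members = HashMap::new();
        loop {
            // A RightBrace closes the object (this also handles `{}`).
            if let Some(t) = tokens.peek() {
                if t.token_type == TokenType::RightBrace {
                    tokens.next();
                    return Ok(JsonObject::Object(members));
                }
            } else {
                return Err(JsonParseError); // ran out of tokens mid-object
            }

            let (key, value) = self.parse_member(tokens)?;
            members.insert(key, value);

            // Members are separated by commas. (A stricter version would
            // also reject a trailing comma before the closing brace.)
            match tokens.peek() {
                Some(t) if t.token_type == TokenType::Comma => {
                    tokens.next();
                }
                Some(t) if t.token_type == TokenType::RightBrace => {}
                _ => return Err(JsonParseError),
            }
        }
    }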

My JSON parser is now functional. I asked ChatGPT to generate some sample JSON that I could use to test all the functionality. I might lean on it further to generate more of the test cases in general. I have not gone down that rabbit hole quite yet (but it's going on the TODO list).
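
As a flavor of what the suite checks (reconstructed for this post; the real assertions differ, and again Lexer::new taking the source string is an assumption):

#[test]
fn parses_nested_values() {
    let source = r#"{"name": "json_parser", "tags": ["rust", "toy"], "wip": true}"#;
    let lexer = Lexer::new(source.to_string());
    let mut parser = Parser::new(lexer);

    match parser.parse().expect("valid JSON should parse") {
        JsonObject::Object(members) => {
            assert!(matches!(members["name"], JsonObject::String(_)));
            assert!(matches!(members["tags"], JsonObject::Array(_)));
            assert!(matches!(members["wip"], JsonObject::True));
        }
        _ => panic!("expected an object at the root"),
    }
}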

I'm not very happy with where the code is now. There are a few additional things that I've added to my todo list:

- TODO Combine lexer / parser error types into one. Look into relevant crates and idiomatic code.
- TODO Provide useful debugging info to the caller (like column number of errors).
- TODO Catch multiple errors at once, if possible (to prevent whack-a-mole debugging).
- TODO Avoid calls to `clone`. I shouldn't need to duplicate the data that's in the raw json.
- TODO Combine the `Lexer`'s String, Number, True, False, and Null types with the Parser's version of the same thing.
- TODO Handle iteration through comma-separated values better so that I'm not embarrassed to share `Parser::parse_value`.
- TODO Consider more convenient ways (getters?) to fetch parse data.
- TODO Consider more performant ways to store parsed data.

In the meantime, it feels good to see a passing test suite:

cargo test --benches --tests --all-features
   Compiling json_parser v0.1.0 (/home/john/src/sim86/json_parser)
    Finished test [unoptimized + debuginfo] target(s) in 0.19s
     Running unittests src/lib.rs (target/debug/deps/json_parser-c5f91c2e47445879)

running 12 tests
test lexer::lexer_tests::lexer_can_scan_numbers ... ok
test lexer::lexer_tests::lexer_can_scan_symbols ... ok
test lexer::lexer_tests::lexer_can_skip_whitespace ... ok
test lexer::lexer_tests::lexer_can_scan_string ... ok
test lexer::lexer_tests::lexer_can_scan_object ... ok
test lexer::lexer_tests::lexer_can_scan_keywords ... ok
test lexer::lexer_tests::lexer_returns_correct_positions ... ok
test parser::tests::parses_empty_object ... ok
test parser::tests::convenient_getters ... ok
test parser::tests::parses_simple_object ... ok
test parser::tests::parses_json ... ok
test parser::tests::parses_json_no_arrays ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


cargo-test finished at Sun Jun 25 19:00:28