an image

Writing a JSON parser (part 0)

I'm taking Casey Muratori's Performance Aware Programming course, and my next homework assignment is to write a json parser.

I think the purpose of this homework exercise is to:

  1. Help us become more aware of the cost of each thing we are asking the computer to do. For example, json.parse(my_string) is one line of code, but could be very demanding.

  2. Have a codebase to start with that is totally under our control, so that we can optimize it later. This may end up being especially important if 90% or more of our program's cycles are spent parsing JSON.

The Hubs client also parses a lot of JSON. Each network message from other players (besides voice) is sent as json. That means that as a player moves around the screen, the sending client serializes the motion information as json and then sends it over the wire for the receiver to parse and then update their local simulation. While we've started using a binary format for part of this, most of the updates are still just done with json. I wonder how much time the client is taking on this.

I have never written a JSON parser before, and I have a wonderful-looking book on my desk called Crafting Interpreters by Robert Nystrom. My plan is to spend 30-60 minutes on a "blind attempt" at writing a parser to start thinking about the problem, and then (if I'm not making great progress, or if I just want to take a break as it's the end of the day) crack open this book to the parsing section and seeing if there's some valuable insight from there.

I'm excited to get started!

Thinking through the problem a bit

JSON (JavaScript Object Notation) is a data-interchange "language" or format.

JSON has a grammar, which is expressed on the previous site in McKeeman Form. Information about what a grammar is and how to read JSON's grammar is on Douglas Crockford's webpage, McKeeman Form.

In my rust program, I'll want to expose a public function that accepts an &str and returns something like a Result<Json, JsonParseError>:

pub fn parse(s: &str) -> Result<Json, JsonParseError> { ... }

According to the grammar, a json is an element, which is ws value ws. In other words, json is a value surrounded by ws (which is optional whitespace).

Hmm, I suppose though that I don't actually care to return whitespace information from my parse function. As I'm parsing, I may need to know when whitespace is or is not valid and to skip over it accordingly, but that information is only relevant while parsing.

So I suppose I'll need to have at least two types of data/knowledge in my program: First, I need the rules of the JSON grammar. I need to know which characters are valid at any given point in the string that I'm parsing, and how to combine those characters into chunks. Second, I need the data types and structures that represent my parsed JSON values:

Parsing a string, number, true, false, or null all seem very straightforward. I just peek at the characters wherever I am in the string and look for a closing " (in the case of a string), the optional . and first non-numeric character (in the case of a number), or the exact characters "true" | "false" | "null".

Parsing an array or object is a little trickier, especially because they can be nested:

{
  "one": {
    "two": {
      "three": {
        "four": null
      }
    }
  }
}

or

[
  "this",
  ["is"],
  [["a"]],
  [[["confusing"]], "array"],
  "of",
  ["array like things"]
]

While I'm parsing, perhaps I'll have a stack that keeps track of the element that I am currently in the middle of parsing. Whenever I complete an element (which is a value surrounded by whitespace), I'll convert the value into a JsonValue and pop an item (or a few?) off of the parsing stack. I'm not sure yet where the JsonValue will go.

If I'm parsing an array, that JsonValue will eventually need to end up in the array... I suppose I could create a Vec<JsonValue> whenever I see a [ (assuming it's in a valid spot) and as I finish parsing elements I'll insert the JsonValue in the Vec.

Hmm... What about objects? I'm not very familiar with HashMaps in rust just yet, but I assume that I can do something similar as what I described above for arrays: Create a HashMap<String, JsonValue> whenever I encounter a {, and keep adding to it whenever it is the "active item" on the stack. Perhaps the stack needs to have:

The "object in the grammar" and the "resulting collection" for arrays and hashmaps. I'm not totally sure. But overall it feels like this approach or something like it should work.