
Writing a JSON parser (part 1)

In the previous post, I explained that I was writing a JSON parser. Never having written a parser, I thought about the problem a bit and imagined transforming a string input into some kind of data structure that represented the parsed JSON.

In that initial brainstorm, I thought that I would likely only need to visit each part of the file once, and that I could transform the characters into the final representation in a single pass. While I still think that's correct, it seems that there's a simple way for me to break this problem into two steps: lexing (aka scanning) and parsing.

In Crafting Interpreters (in fact, on its cover, as an image of a mountain), Robert Nystrom breaks the task down into first transforming source code into a stream of tokens via lexing (or scanning), and then transforming the tokens into an AST (abstract syntax tree) via parsing.

Armed with this knowledge, I decided to split my JSON parser into two parts: a lexer and a parser. The lexer's job is to split the source code (a string with some JSON in it) into tokens. The parser's job will be to take those tokens and interpret them as useful JSON values / objects (or throw a JsonParseError if anything goes wrong).
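
To make the division of labor concrete, here's a rough sketch of the kind of value the parser will eventually produce. This is purely illustrative; the names are placeholders and nothing here is final.

// Placeholder sketch of the parser's eventual output: a tree of JSON values.
// An object is just a list of (key, value) pairs here, for simplicity.
pub enum JsonValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<JsonValue>),
    Object(Vec<(String, JsonValue)>),
}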

I now have (at least) two ways my program can fail on its input: the lexer can fail, which should result in a JsonLexError, or the parser can fail, which should result in a JsonParseError. Of course, I could (and probably will/should) make these variants of the same type, so that users of my library (read: me, next week) only need to deal with one error type. I've also learned a bit about idiomatic ways of handling errors, as well as some of the typical crates people reach for to generate boilerplate code for transforming errors of one type into another. I'll do that, but not right away. First things first, I want to get my lexer to recognize and output some Tokens.
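
To make that concrete, here's a minimal sketch of what a combined error type might look like using only the standard library. The JsonLexError and JsonParseError names mirror the ones above, and the From impls are roughly the boilerplate that crates like thiserror can generate; none of this is final.

use std::fmt;

// Placeholder stage-specific errors; details to be filled in later.
#[derive(Debug)]
pub struct JsonLexError;

#[derive(Debug)]
pub struct JsonParseError;

// One public error type wrapping both stages, so callers of the library
// only ever have to match on this.
#[derive(Debug)]
pub enum JsonError {
    Lex(JsonLexError),
    Parse(JsonParseError),
}

// The From impls let `?` convert stage errors into JsonError automatically.
impl From<JsonLexError> for JsonError {
    fn from(e: JsonLexError) -> Self {
        JsonError::Lex(e)
    }
}

impl From<JsonParseError> for JsonError {
    fn from(e: JsonParseError) -> Self {
        JsonError::Parse(e)
    }
}

impl fmt::Display for JsonError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            JsonError::Lex(_) => write!(f, "error while lexing JSON"),
            JsonError::Parse(_) => write!(f, "error while parsing JSON"),
        }
    }
}

impl std::error::Error for JsonError {}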

Writing the Lexer

If I were parsing a programming language, I'd have quite a few tokens. Here is an excerpt from Crafting Interpreters (4.2.1):

enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,

  // One or two character tokens.
  BANG, BANG_EQUAL, EQUAL, EQUAL_EQUAL,
  GREATER, GREATER_EQUAL, LESS, LESS_EQUAL,

  // Literals.
  IDENTIFIER, STRING, NUMBER,

  // Keywords.
  AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
  PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,

  EOF
}

Luckily for me, JSON doesn't need so many tokens. JSON's grammar in McKeeman Form is:

json
   element

value
   object
   array
   string
   number
   "true"
   "false"
   "null"

object
    '{' ws '}'
    '{' members '}'

members
    member
    member ',' members

member
    ws string ws ':' element

array
    '[' ws ']'
    '[' elements ']'

elements
    element
    element ',' elements

element
    ws value ws

string
    '"' characters '"'

characters
    ""
    character characters

character
    '0020' . '10FFFF' - '"' - '\'
    '\' escape

escape
    '"'
    '\'
    '/'
    'b'
    'f'
    'n'
    'r'
    't'
    'u' hex hex hex hex

hex
    digit
    'A' . 'F'
    'a' . 'f'

number
    integer fraction exponent

integer
    digit
    onenine digits
    '-' digit
    '-' onenine digits

digits
    digit
    digit digits

digit
    '0'
    onenine

onenine
    '1' . '9'

fraction
    ""
    '.' digits

exponent
    ""
    'E' sign digits
    'e' sign digits

sign
    ""
    '+'
    '-'

ws
    ""
    '0020' ws
    '000A' ws
    '000D' ws
    '0009' ws

From here, we can start to recognize and pull out the tokens. Left and right curly braces will need to be tokens (marking the beginning and end of an object, which may have members), as will left and right square brackets (marking the beginning and end of arrays).

I tricked myself into thinking that quotation marks (") should also be tokens. After all, they mark the beginning and end of a string. I even thought that, like braces or square brackets, I might want a left quote and a right quote. I went ahead and implemented that, but after thinking about it I realized that this was the only time in my program that I would care about quotation marks. Everything after this (either the parser or the consumer of the parsed values) will only care about the actual string value. So instead of having TokenType::Quote, I have TokenType::String which I recognize by looking for quotation marks during lexing.

In the end, I identified 12 token types, though one of them (EOF) is not strictly necessary. I did not make use of Rust's fancy enums. Perhaps later on, as I move to simplify my program, I will. (In fact, I would not be surprised if I end up combining lexing and parsing into one step, as I had initially imagined.) But, at least for this initial version, my token types are:

pub enum TokenType {
    LeftBrace,
    RightBrace,
    LeftBracket,
    RightBracket,
    Colon,
    Comma,
    String,
    Number,
    True,
    False,
    Null,
    EOF,
}

And my lexer outputs Tokens of the form:

pub struct Token {
    pub token_type: TokenType,
    start: usize,
    end: usize,
    line_number: usize,
}

Here, start and end are indices into the source code. Since my source code is UTF-8, indices into the raw byte buffer do not act as indices into a char array. (The reason is that UTF-8 is a variable-width encoding. If you have no idea what I'm talking about, it's actually an interesting bit of useful information to absorb. I recommend reading about it and maybe even implementing some small programs to manually chunk a byte stream as UTF-8, or inspect each byte of a UTF-8 byte stream. To my surprise, I found this seemingly dull topic pretty engaging.)
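
For example, here's a small throwaway program (not part of the lexer) that makes the mismatch visible: char offsets and byte offsets agree only while the input is pure ASCII.

// Walk a string and show where each char starts in the byte buffer and
// how many bytes it occupies. The euro sign takes three bytes in UTF-8.
fn main() {
    let s = "a€b";
    for (byte_offset, c) in s.char_indices() {
        let width = c.len_utf8();
        println!(
            "char {:?} starts at byte {} and is {} byte(s) wide: {:x?}",
            c,
            byte_offset,
            width,
            &s.as_bytes()[byte_offset..byte_offset + width]
        );
    }
    assert_eq!(s.chars().count(), 3); // three chars...
    assert_eq!(s.len(), 5); // ...but five bytes
}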

Anyway, I store the start and end of the tokens as offsets into the source code's Vec<char>, which feels at least a little bit wrong. What I really want is not chars but grapheme clusters. From the documentation of the standard library method std::str::chars:

    /// Returns an iterator over the [`char`]s of a string slice.
    ///
    /// As a string slice consists of valid UTF-8, we can iterate through a
    /// string slice by [`char`]. This method returns such an iterator.
    ///
    /// It's important to remember that [`char`] represents a Unicode Scalar
    /// Value, and might not match your idea of what a 'character' is. Iteration
    /// over grapheme clusters may be what you actually want. This functionality
    /// is not provided by Rust's standard library, check crates.io instead.
    ///
    /// # Examples
    ///
    /// ...
    ///
    /// Remember, [`char`]s might not match your intuition about characters:
    ///
    /// [`char`]: prim@char
    ///

     let y = "y̆";

     let mut chars = y.chars();

     assert_eq!(Some('y'), chars.next()); // not 'y̆'
     assert_eq!(Some('\u{0306}'), chars.next());

     assert_eq!(None, chars.next());

Still, I don't think this will lead to any issues in my program, because no Unicode scalar value (which is what Rust's 4-byte char type is) will overlap with my lexing of any of the tokens. The only place non-ASCII characters will show up at all (for valid JSON) is when I'm lexing a TokenType::String. At some point, though, if there's an invalid character in the input, I'll want to return line and column numbers to the caller. It seems that Emacs treats "y̆" as two characters: y and ̆. Therefore, perhaps keeping indices to the chars is the right thing after all, rather than indices to grapheme clusters, which might translate to column numbers in a way that's not very convenient for the user.
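
For what it's worth, if I ever do want grapheme clusters, they are available via crates.io as the documentation suggests. A quick contrast, assuming the unicode-segmentation crate as a dependency (the lexer doesn't need any of this):

// "y̆" is one grapheme cluster made of two Unicode scalar values.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let y = "y̆"; // 'y' followed by U+0306 COMBINING BREVE
    assert_eq!(y.chars().count(), 2); // two chars (scalar values)
    assert_eq!(y.graphemes(true).count(), 1); // one grapheme cluster
}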

In any case, as I was writing this, I realized that I have not really handled column numbers anywhere in the program yet, so I'll add that to my TODO list:

- TODO Combine lexer / parser error types into one. Look into relevant crates and idiomatic code.
- TODO Provide useful debugging info to the caller (like column number of errors).
- TODO Catch multiple errors at once, if possible (to prevent whack-a-mole debugging).
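
For the column-number item, the helper I'm imagining is something like the sketch below: walk the chars up to a given offset and count newlines. Nothing like this is wired into the lexer yet, and the 1-based numbering is an arbitrary choice.

// Translate a char offset (like Token::start) into a (line, column) pair.
fn line_and_column(source: &[char], offset: usize) -> (usize, usize) {
    let mut line = 1;
    let mut column = 1;
    for &c in source.iter().take(offset) {
        if c == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    (line, column)
}

fn main() {
    let source: Vec<char> = "{\n  \"foo\": 1\n}".chars().collect();
    // Offset 4 is the opening quote of "foo": line 2, column 3.
    assert_eq!(line_and_column(&source, 4), (2, 3));
}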

Another thing that's in the back of my mind is that I should maybe not be storing indices at all, but rather string slices. Given that storing string slices in my Token structs would require me to introduce explicit lifetime annotations, though, I decided not to deal with those just yet.
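
For reference, the slice-holding version would look roughly like this (reusing the TokenType enum from above). The lifetime 'a tying each token to the source string is exactly the annotation I'm putting off.

// Hypothetical alternative: store the lexeme itself instead of indices.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token<'a> {
    pub token_type: TokenType,
    lexeme: &'a str,
    line_number: usize,
}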

In any case, I've written an initial version of my lexer without thinking too hard about how the code might be made better. I figure I'll save that for when I have a more complete program. Perhaps later on I'll put all of the code I've written as part of the Computer Enhance course on GitHub, or at least my JSON parser.

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TokenType {
    LeftBrace,
    RightBrace,
    LeftBracket,
    RightBracket,
    Colon,
    Comma,
    String,
    Number,
    True,
    False,
    Null,
    EOF,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token {
    pub token_type: TokenType,
    start: usize,
    end: usize,
    line_number: usize,
}

pub struct Lexer {
    pub tokens: Vec<Token>,
    source_string: String, // the original input string
    source: Vec<char>,     // input as chars; token offsets index into this
    start: usize,          // start of the token currently being scanned
    current: usize,        // char currently being examined
    line: usize,           // current line number (starts at 0)
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LexerError;

type Result<T> = std::result::Result<T, LexerError>;

impl Lexer {
    pub fn new(s: &str) -> Self {
        Lexer {
            source_string: s.to_string(),
            source: s.chars().collect(),
            tokens: vec![],
            start: 0,
            current: 0,
            line: 0,
        }
    }

    /// Scan the next token starting at `self.start`, push it onto
    /// `self.tokens`, and return a copy of it.
    fn scan_token(&mut self) -> Result<Token> {
        // Eat whitespace
        while self.start < self.source.len() && self.source[self.start].is_whitespace() {
            if self.source[self.start] == '\n' {
                self.line += 1;
            }
            self.start += 1;
            self.current = self.start;
        }

        // EOF
        if self.start == self.source.len() {
            let token = Token {
                token_type: TokenType::EOF,
                start: self.start,
                end: self.start,
                line_number: self.line,
            };
            self.tokens.push(token.clone());
            return Ok(token);
        }

        // Symbols
        let token_type = match self.source[self.current] {
            '{' => Some(TokenType::LeftBrace),
            '}' => Some(TokenType::RightBrace),
            '[' => Some(TokenType::LeftBracket),
            ']' => Some(TokenType::RightBracket),
            ':' => Some(TokenType::Colon),
            ',' => Some(TokenType::Comma),
            _ => None,
        };
        if let Some(token_type) = token_type {
            self.current += 1;
            let token = Token {
                token_type,
                start: self.start,
                end: self.current,
                line_number: self.line,
            };
            self.tokens.push(token.clone());
            self.start = self.current;

            return Ok(token);
        }

        match self.source[self.current] {
            // String
            '"' => {
                self.current += 1;
                while self.current < self.source.len() {
                    if self.source[self.current] == '"' {
                        // start and end here cover only the contents between
                        // the quotes; the quotes themselves are excluded.
                        let token = Token {
                            token_type: TokenType::String,
                            start: self.start + 1,
                            end: self.current - 1,
                            line_number: self.line,
                        };
                        self.tokens.push(token.clone());
                        self.current += 1;
                        self.start = self.current;
                        return Ok(token);
                    } else {
                        if self.source[self.current] == '\n' {
                            self.line += 1;
                        }
                        self.current += 1;
                    }
                }
            }
            // Number
            c if c.is_numeric() => {
                let mut dot = false;
                let mut after_dot = false;
                self.current += 1;
                while self.current < self.source.len() {
                    match self.source[self.current] {
                        next if next.is_numeric() => {
                            if dot {
                                after_dot = true;
                            }
                            self.current += 1;
                        }
                        '.' => {
                            if !dot {
                                dot = true;
                                self.current += 1;
                            } else {
                                // A second '.' ends this number; the next call
                                // to scan_token will report an error on it.
                                break;
                            }
                        }
                        _next => {
                            break;
                        }
                    }
                }

                // error if saw dot without followup
                if dot && !after_dot {
                    return Err(LexerError);
                } else {
                    let token = Token {
                        token_type: TokenType::Number,
                        start: self.start,
                        end: self.current,
                        line_number: self.line,
                    };
                    self.tokens.push(token.clone());
                    self.start = self.current;
                    return Ok(token);
                }
            }
            _ => {}
        }

        // True
        if self.source.len() >= self.start + 4
            && self.source[self.start..self.start + 4] == ['t', 'r', 'u', 'e']
        {
            let token = Token {
                token_type: TokenType::True,
                start: self.start,
                end: self.start + 4,
                line_number: self.line,
            };
            self.tokens.push(token.clone());
            self.current = self.start + 4;
            self.start = self.current;
            return Ok(token);
        }

        // False
        if self.source.len() >= self.start + 5
            && self.source[self.start..self.start + 5] == ['f', 'a', 'l', 's', 'e']
        {
            let token = Token {
                token_type: TokenType::False,
                start: self.start,
                end: self.start + 5,
                line_number: self.line,
            };
            self.tokens.push(token.clone());
            self.current = self.start + 5;
            self.start = self.current;
            return Ok(token);
        }

        // Null
        if self.source.len() >= self.start + 4
            && self.source[self.start..self.start + 4] == ['n', 'u', 'l', 'l']
        {
            let token = Token {
                token_type: TokenType::Null,
                start: self.start,
                end: self.start + 4,
                line_number: self.line,
            };
            self.tokens.push(token.clone());
            self.current = self.start + 4;
            self.start = self.current;
            return Ok(token);
        }

        Err(LexerError)
    }

    pub fn scan_tokens(&mut self) -> Result<()> {
        while self.scan_token()?.token_type != TokenType::EOF {}

        Ok(())
    }
}

#[cfg(test)]
mod lexer_tests {
    use super::*;

    #[test]
    fn lexer_can_scan_symbols() {
        let mut lexer = Lexer::new("{}[]:,");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::RightBrace
        );
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::LeftBracket
        );
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::RightBracket
        );
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Comma);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
    }

    #[test]
    fn lexer_can_scan_string() {
        let mut lexer = Lexer::new("\"foo\"");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
    }

    #[test]
    fn lexer_can_scan_object() {
        let mut lexer = Lexer::new("{\"foo\":\"bar\"}");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::RightBrace
        );
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
    }

    #[test]
    fn lexer_can_skip_whitespace() {
        let mut lexer = Lexer::new("{  \"foo\"  : \n\"bar\"}\n");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::RightBrace
        );
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
    }

    #[test]
    fn lexer_returns_correct_positions() {
        let mut lexer = Lexer::new("\n{\n  \"foo\"  : \n\"bar\"\n}\n");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::LeftBrace);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::String);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Colon);
        let token = lexer.scan_token().unwrap();
        assert_eq!(token.token_type, TokenType::String);
        assert_eq!(token.start, 16);
        assert_eq!(token.end, 18);
        assert_eq!(token.line_number, 3);
        let token = lexer.scan_token().unwrap();
        assert_eq!(token.token_type, TokenType::RightBrace);
        assert_eq!(token.start, 21);
        assert_eq!(token.end, 22);
        assert_eq!(token.line_number, 4);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);
    }

    #[test]
    fn lexer_can_scan_numbers() {
        let mut lexer = Lexer::new("[123]");
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::LeftBracket
        );
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Number);
        assert_eq!(
            lexer.scan_token().unwrap().token_type,
            TokenType::RightBracket
        );
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);

        let mut lexer = Lexer::new("123.456,");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Number);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Comma);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::EOF);

        let mut lexer = Lexer::new("123.");
        assert!(lexer.scan_token().is_err());

        let mut lexer = Lexer::new(".123");
        assert!(lexer.scan_token().is_err());
    }

    #[test]
    fn lexer_can_scan_keywords() {
        let mut lexer = Lexer::new("truefalsenull");
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::True);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::False);
        assert_eq!(lexer.scan_token().unwrap().token_type, TokenType::Null);

        let mut lexer = Lexer::new("tru");
        assert!(lexer.scan_token().is_err());
    }
}
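
For completeness, this is how the lexer gets driven from the outside at the moment (a throwaway example, not part of the module above):

// Lex a small JSON string and print the token types that come out.
fn main() {
    let mut lexer = Lexer::new("{\"answer\": 42}");
    match lexer.scan_tokens() {
        Ok(()) => {
            for token in &lexer.tokens {
                println!("{:?}", token.token_type);
            }
        }
        Err(e) => eprintln!("lexing failed: {:?}", e),
    }
}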

Satisfyingly, the tests in my module all pass, and I'm ready to start working on the actual Parser. That will have to wait for the next post.

screenshot of successful test run