Part Ten: Starting Again

In Part Nine I described the reasons for switching to Rowan and a lexer over the current lexerless parsing system we’ve developed in crate::utils. Read that before reading on here.

Wiping the project

Sadly, the moment has arrived. It’s time to delete what we’ve created, since almost none of it will be reusable.

$ rm -r crates/{eldiro,eldiro-cli}/src

Let’s add in placeholder files so everything still compiles:

// crates/eldiro/src/lib.rs

#[cfg(test)]
mod tests {
    #[test]
    fn it_works() {
        assert_eq!(2 + 2, 4);
    }
}
// crates/eldiro-cli/src/main.rs

fn main() {
    println!("Hello, world!");
}

A lexer

To write our lexer, we’ll use the Logos crate. This will generate a lexer for us that’s really fast (according to the project README) without much effort on our part. Let’s start by adding Logos as a dependency to eldiro:

# crates/eldiro/Cargo.toml

# snip

[dependencies]
logos = "0.11.4"

Let’s also create a module for our lexer to live in:

// crates/eldiro/src/lib.rs

mod lexer;

We can now open src/lexer.rs and start writing our lexer. The first thing we need to declare is an enum that will represent the types of tokens our lexer can, well, lex. Why don’t we start with something easy – whitespace, maybe?

enum Token {
    Whitespace,
}

What if we encounter a piece of text that doesn’t match any token? We need a variant for that, too:

enum Token {
    Whitespace,
    Error,
}

When I first learnt about this I thought it was incredibly perverse. Why make errors part of your token type when you could instead represent them in a much cleaner way using Result? Remember, though, that our parser has to be lossless, meaning that it must represent the input flawlessly and always return an output, no matter how malformed the input is.
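To make this concrete, here’s a standalone sketch of a lossless lexer over a two-lexeme alphabet (plain Rust, no Logos; the Token and lex names are invented for illustration). Unrecognised text becomes an Error lexeme instead of aborting, so joining the lexeme texts always reproduces the input exactly:

```rust
// A lossless token type: text the lexer doesn't recognise still becomes
// a lexeme, just one tagged as Error.
#[derive(Debug, PartialEq)]
enum Token {
    Whitespace,
    Error,
}

// Lex the whole input; every byte lands in some lexeme, so no input can
// ever make lexing fail.
fn lex(mut input: &str) -> Vec<(Token, &str)> {
    let mut lexemes = Vec::new();

    while !input.is_empty() {
        let is_space = input.starts_with(' ');

        // The current lexeme extends until the 'spaceness' of the text flips.
        let end = input
            .find(|c: char| (c == ' ') != is_space)
            .unwrap_or(input.len());

        let (text, rest) = input.split_at(end);
        let kind = if is_space { Token::Whitespace } else { Token::Error };
        lexemes.push((kind, text));
        input = rest;
    }

    lexemes
}
```

Lexing `"  ab "` yields Whitespace, Error and Whitespace lexemes whose texts concatenate back to the original input.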

A quick aside about Rowan

Rowan manages to represent both an arbitrary input and an arbitrary parser output. How does it do this? Here’s a code sketch slightly modified from the rust-analyzer docs:

#[derive(PartialEq, Eq, Clone, Copy)]
struct SyntaxKind(u16);

#[derive(PartialEq, Eq, Clone)]
struct SyntaxNode {
    kind: SyntaxKind,
    children: Vec<Arc<SyntaxElement>>,
}

#[derive(PartialEq, Eq, Clone)]
enum SyntaxElement {
    Node(SyntaxNode),
    Token(SyntaxToken),
}

#[derive(PartialEq, Eq, Clone)]
struct SyntaxToken {
    kind: SyntaxKind,
    text: String,
}

Confusingly, SyntaxKind here is our own Token cast to a u16, while SyntaxToken is unrelated. To keep ourselves from getting confused when we use Rowan later, we’ll rename our Token to SyntaxKind:

enum SyntaxKind {
    Whitespace,
    Error,
}

One of the key elements of the Rowan tree structure is that identical subtrees are deduplicated, i.e. if you construct two identical subtrees in your parser, only one copy is ever stored in memory, reducing memory usage.
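The concept behind that deduplication can be sketched with a small interner (an illustration of the idea only, not Rowan’s actual implementation; the GreenToken and Interner names are made up):

```rust
use std::collections::HashSet;
use std::sync::Arc;

#[derive(Hash, PartialEq, Eq)]
struct GreenToken {
    kind: u16,
    text: String,
}

// Hands out a shared Arc for structurally identical tokens, so a second
// identical lexeme costs only a pointer instead of a fresh allocation.
#[derive(Default)]
struct Interner {
    cache: HashSet<Arc<GreenToken>>,
}

impl Interner {
    fn intern(&mut self, kind: u16, text: &str) -> Arc<GreenToken> {
        let token = GreenToken {
            kind,
            text: text.to_string(),
        };

        if let Some(existing) = self.cache.get(&token) {
            return Arc::clone(existing);
        }

        let token = Arc::new(token);
        self.cache.insert(Arc::clone(&token));
        token
    }
}
```

Interning the same (kind, text) pair twice returns two Arcs that point at the same allocation.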

Although it isn’t visible here, Rowan gives you access to parents and siblings if you have a SyntaxNode, SyntaxToken or SyntaxElement. This is especially convenient for writing, say, a linter with Rowan.

Back to the lexer

Let’s actually make the lexer lex text. This is where Logos comes into play. Here’s our first test:

pub(crate) enum SyntaxKind {
    Whitespace,
    Error,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lex_spaces() {
        let mut lexer = SyntaxKind::lexer("   ");

        assert_eq!(lexer.next(), Some(SyntaxKind::Whitespace));
        assert_eq!(lexer.slice(), "   ");
    }
}

Here we’re creating our lexer from the input we want to test, and then verify that the lexer has determined that the input starts with whitespace. We make sure that the entirety of the input has been classified as whitespace by checking that the text of the current lexeme (you can think of a lexeme as a (SyntaxKind, &str)) is equal to the original input.

For now, let’s define whitespace to be a series of one or more spaces, which can be represented by the regex / +/. Here’s how we tell this to Logos:

use logos::Logos;

#[derive(Debug, Copy, Clone, PartialEq, Logos)]
pub(crate) enum SyntaxKind {
    #[regex(" +")]
    Whitespace,

    #[error]
    Error,
}
$ cargo t -q
running 1 test
.
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s also lex the fn keyword, used in function definitions. A test first:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn lex_fn_keyword() {
        let mut lexer = SyntaxKind::lexer("fn");

        assert_eq!(lexer.next(), Some(SyntaxKind::FnKw));
        assert_eq!(lexer.slice(), "fn");
    }
}

Let’s now write the needed implementation:

#[derive(Debug, Copy, Clone, PartialEq, Logos)]
pub(crate) enum SyntaxKind {
    #[regex(" +")]
    Whitespace,

    #[token("fn")]
    FnKw,

    #[error]
    Error,
}
$ cargo t -q
running 2 tests
..
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

That test code is getting repetitive though. Let’s extract the common parts into a helper function:

#[cfg(test)]
mod tests {
    use super::*;

    fn check(input: &str, kind: SyntaxKind) {
        let mut lexer = SyntaxKind::lexer(input);

        assert_eq!(lexer.next(), Some(kind));
        assert_eq!(lexer.slice(), input);
    }

    #[test]
    fn lex_spaces() {
        check("   ", SyntaxKind::Whitespace);
    }

    #[test]
    fn lex_fn_keyword() {
        check("fn", SyntaxKind::FnKw);
    }
}
$ cargo t -q
running 2 tests
..
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

We can now zoom through lexing all the remaining lexical constructs. Here’s the file in its entirety:

use logos::Logos;

#[derive(Debug, Copy, Clone, PartialEq, Logos)]
pub(crate) enum SyntaxKind {
    #[regex(" +")]
    Whitespace,

    #[token("fn")]
    FnKw,

    #[token("let")]
    LetKw,

    #[regex("[A-Za-z][A-Za-z0-9]*")]
    Ident,

    #[regex("[0-9]+")]
    Number,

    #[token("+")]
    Plus,

    #[token("-")]
    Minus,

    #[token("*")]
    Star,

    #[token("/")]
    Slash,

    #[token("=")]
    Equals,

    #[token("{")]
    LBrace,

    #[token("}")]
    RBrace,

    #[error]
    Error,
}

#[cfg(test)]
mod tests {
    use super::*;

    fn check(input: &str, kind: SyntaxKind) {
        let mut lexer = SyntaxKind::lexer(input);

        assert_eq!(lexer.next(), Some(kind));
        assert_eq!(lexer.slice(), input);
    }

    #[test]
    fn lex_spaces() {
        check("   ", SyntaxKind::Whitespace);
    }

    #[test]
    fn lex_fn_keyword() {
        check("fn", SyntaxKind::FnKw);
    }

    #[test]
    fn lex_let_keyword() {
        check("let", SyntaxKind::LetKw);
    }

    #[test]
    fn lex_alphabetic_identifier() {
        check("abcd", SyntaxKind::Ident);
    }

    #[test]
    fn lex_alphanumeric_identifier() {
        check("ab123cde456", SyntaxKind::Ident);
    }

    #[test]
    fn lex_mixed_case_identifier() {
        check("ABCdef", SyntaxKind::Ident);
    }

    #[test]
    fn lex_number() {
        check("123456", SyntaxKind::Number);
    }

    #[test]
    fn lex_plus() {
        check("+", SyntaxKind::Plus);
    }

    #[test]
    fn lex_minus() {
        check("-", SyntaxKind::Minus);
    }

    #[test]
    fn lex_star() {
        check("*", SyntaxKind::Star);
    }

    #[test]
    fn lex_slash() {
        check("/", SyntaxKind::Slash);
    }

    #[test]
    fn lex_equals() {
        check("=", SyntaxKind::Equals);
    }

    #[test]
    fn lex_left_brace() {
        check("{", SyntaxKind::LBrace);
    }

    #[test]
    fn lex_right_brace() {
        check("}", SyntaxKind::RBrace);
    }
}
$ cargo t -q
running 14 tests
..............
test result: ok. 14 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

A parser

Let’s create a module for our parser to live in:

// lib.rs

mod lexer;
mod parser;

The first thing our parser needs to hold is a lexer generated by Logos:

// src/parser.rs

use crate::lexer::SyntaxKind;

pub(crate) struct Parser<'a> {
    lexer: logos::Lexer<'a, SyntaxKind>,
}

That lifetime 'a is needed because Logos’ lexers store the input they were created with, and pull from it as tokens are requested. An approach more common than what we’re doing is to lex the entire input upon creating the parser, but we don’t need to do that: we can lex lazily as we go, rather than eagerly all at once. This is really a micro-optimisation, but it’s so cool and easy to do, and there’s no downside (for now).
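To illustrate what lexing lazily means, here’s a standalone sketch: a toy ‘lexer’ that splits on spaces and counts how many times it has been pulled (nothing to do with Logos; all names are invented). Constructing it does no work at all – each lexeme is only found when next is called:

```rust
struct LazyLexer<'a> {
    rest: &'a str,
    pulls: usize,
}

impl<'a> LazyLexer<'a> {
    // Construction just stores the input; no scanning happens here.
    fn new(input: &'a str) -> Self {
        Self { rest: input, pulls: 0 }
    }
}

impl<'a> Iterator for LazyLexer<'a> {
    type Item = &'a str;

    // Each call locates exactly one more 'token' (here, a word).
    fn next(&mut self) -> Option<Self::Item> {
        self.pulls += 1;

        if self.rest.is_empty() {
            return None;
        }

        let rest = self.rest.trim_start();
        let end = rest.find(' ').unwrap_or(rest.len());
        let (token, rest) = rest.split_at(end);
        self.rest = rest;

        Some(token)
    }
}
```

If the caller stops pulling after the first token, the rest of the input is never scanned at all.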

Let’s add a constructor:

use crate::lexer::SyntaxKind;
use logos::Logos;

// snip

impl<'a> Parser<'a> {
    pub(crate) fn new(input: &'a str) -> Self {
        Self {
            lexer: SyntaxKind::lexer(input),
        }
    }
}

How can we construct a syntax tree, though? This is what we’re using Rowan for. Let’s add the relevant code so we can use Rowan first, and I’ll explain afterwards:

# Cargo.toml

# snip

[dependencies]
logos = "0.11.4"
rowan = "0.10.0"
// parser.rs

use crate::lexer::SyntaxKind;
use logos::Logos;
use rowan::GreenNodeBuilder;

pub(crate) struct Parser<'a> {
    lexer: logos::Lexer<'a, SyntaxKind>,
    builder: GreenNodeBuilder<'static>,
}

impl<'a> Parser<'a> {
    pub(crate) fn new(input: &'a str) -> Self {
        Self {
            lexer: SyntaxKind::lexer(input),
            builder: GreenNodeBuilder::new(),
        }
    }
}

Earlier in this post I explained the gist of Rowan’s syntax trees. Here’s how you create one: the builder remembers what branch of the tree you’re on. It exposes a method to create a new child branch with a given SyntaxKind and make it current, a method to add a token to the current branch, and a method to finish the current branch and restore the parent branch as current. Finally, GreenNodeBuilder has a finish method, which outputs the syntax tree it has created.
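As a mental model, such a builder can be sketched with a stack of unfinished branches (an illustration only; Rowan’s real GreenNodeBuilder also does the interning described earlier, and these Element and Builder names are made up):

```rust
// The tree element we build: either an inner node or a token leaf.
#[derive(Debug, PartialEq)]
enum Element {
    Node { kind: u16, children: Vec<Element> },
    Token { kind: u16, text: String },
}

// The stack holds the unfinished branches; the top entry is the
// 'current' branch that tokens get added to.
#[derive(Default)]
struct Builder {
    stack: Vec<(u16, Vec<Element>)>,
    root: Option<Element>,
}

impl Builder {
    // Open a new child branch and make it current.
    fn start_node(&mut self, kind: u16) {
        self.stack.push((kind, Vec::new()));
    }

    // Add a token to the current branch.
    fn token(&mut self, kind: u16, text: &str) {
        let (_, children) = self.stack.last_mut().expect("no current branch");
        children.push(Element::Token { kind, text: text.to_string() });
    }

    // Close the current branch and restore its parent as current.
    fn finish_node(&mut self) {
        let (kind, children) = self.stack.pop().expect("no current branch");
        let node = Element::Node { kind, children };
        match self.stack.last_mut() {
            Some((_, siblings)) => siblings.push(node),
            None => {
                assert!(self.root.is_none(), "more than one root node");
                self.root = Some(node);
            }
        }
    }

    // Return the finished tree; requires exactly one root node.
    fn finish(self) -> Element {
        assert!(self.stack.is_empty(), "unfinished branches remain");
        self.root.expect("no root node was created")
    }
}
```

start_node pushes a branch, token appends to the top of the stack, finish_node pops the branch into its parent, and finish insists that exactly one root node was produced.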

finish requires the syntax tree to contain just one root node; we can fulfil this requirement by creating all our nodes as children of a single Root node. Let’s add this to SyntaxKind:

// lexer.rs

#[derive(Debug, Copy, Clone, PartialEq, Logos)]
pub(crate) enum SyntaxKind {
    // snip

    Root,
}

We can now start defining a parse method on Parser that creates a Root node:

// parser.rs

impl<'a> Parser<'a> {
    // snip

    pub(crate) fn parse(mut self) -> ? {
        self.builder.start_node(SyntaxKind::Root);
        self.builder.finish_node();
    }
}

What do we output from Parser::parse? Let’s define a new structure to hold the output of parsing:

use crate::lexer::SyntaxKind;
use logos::Logos;
use rowan::{GreenNode, GreenNodeBuilder};

// snip

impl<'a> Parser<'a> {
    // snip

    pub(crate) fn parse(mut self) -> Parse {
        self.builder.start_node(SyntaxKind::Root);
        self.builder.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }
}

pub(crate) struct Parse {
    green_node: GreenNode,
}

We now get an error where we create our root node: GreenNodeBuilder::start_node expects a rowan::SyntaxKind, not our own SyntaxKind. To solve this we’ll implement From<SyntaxKind> for rowan::SyntaxKind:

// lexer.rs

impl From<SyntaxKind> for rowan::SyntaxKind {
    fn from(kind: SyntaxKind) -> Self {
        Self(kind as u16)
    }
}

Note how Rowan never directly interacts with our SyntaxKind – instead, it’s cast to a u16 so that Rowan can work with any enum. Now that that’s done, we can add a call to .into() to convert our SyntaxKind into Rowan’s:

// parser.rs

impl<'a> Parser<'a> {
    // snip

    pub(crate) fn parse(mut self) -> Parse {
        self.builder.start_node(SyntaxKind::Root.into());
        self.builder.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }
}

So that we can write tests for our parser, we’ll need a way of displaying its output: the GreenNode inside of Parse. We can only do this by converting it to a SyntaxNode.

Remember how Rowan doesn’t actually store our SyntaxKind, instead storing its own rowan::SyntaxKind with our SyntaxKind cast to a u16? In order for Rowan to be able to give us our SyntaxKind instead of the ‘raw’ Rowan version when we, say, ask Rowan what the kind of a given node is, we need to also have our own version of SyntaxNode that knows how to convert between the two. Eventually we’ll also need our own SyntaxElement and SyntaxToken, but we don’t need those at the moment. The way Rowan automates the conversion is through a trait, namely rowan::Language.

Another rabbit hole

Once again, a new module is in order:

// lib.rs

mod lexer;
mod parser;
mod syntax;
// src/syntax.rs

use crate::lexer::SyntaxKind;

#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
pub(crate) enum EldiroLanguage {}

impl rowan::Language for EldiroLanguage {
    type Kind = SyntaxKind;

    fn kind_from_raw(raw: rowan::SyntaxKind) -> Self::Kind {
        todo!()
    }

    fn kind_to_raw(kind: Self::Kind) -> rowan::SyntaxKind {
        todo!()
    }
}

Converting from our SyntaxKind to Rowan’s is easy – we can re-use the From implementation from earlier:

impl rowan::Language for EldiroLanguage {
    // snip

    fn kind_to_raw(kind: Self::Kind) -> rowan::SyntaxKind {
        kind.into()
    }
}

Going the other way, though, is much more difficult – we have to convert from something with 65,536 possible values (u16) to one with fourteen (SyntaxKind). We could do it the manual way:

impl rowan::Language for EldiroLanguage {
    // snip

    fn kind_from_raw(raw: rowan::SyntaxKind) -> Self::Kind {
        match raw.0 {
            0 => Self::Kind::Whitespace,
            1 => Self::Kind::FnKw,
            2 => Self::Kind::LetKw,
            3 => Self::Kind::Ident,
            4 => Self::Kind::Number,
            // etc.
        }
    }

    // snip
}

This is tedious and error-prone, and makes it annoying to modify SyntaxKind in any way. We could also use some unsafe:

impl rowan::Language for EldiroLanguage {
    // snip

    fn kind_from_raw(raw: rowan::SyntaxKind) -> Self::Kind {
        assert!(raw.0 < 14);
        unsafe { std::mem::transmute(raw) }
    }

    // snip
}

This also works, but means we need to update this seemingly-magical number every time we add or remove a variant from SyntaxKind. rust-analyzer uses a hidden __LAST enum variant to determine how many variants their SyntaxKind has. Regardless, std::mem::transmute is about the most unsafe thing you can do, so I’d like to avoid it.
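Here’s roughly what that rust-analyzer trick looks like, as a standalone sketch with our own variant names – the hidden variant keeps the bound from going stale:

```rust
// repr(u16) pins the discriminants to u16 values, which is what makes
// the transmute below well-defined for in-range inputs.
#[derive(Debug, PartialEq)]
#[repr(u16)]
enum SyntaxKind {
    Whitespace,
    FnKw,
    Error,
    // Hidden last variant: its discriminant always equals the number of
    // real variants above it, so the bounds check can never go stale.
    #[doc(hidden)]
    __Last,
}

fn kind_from_raw(raw: u16) -> SyntaxKind {
    assert!(raw < SyntaxKind::__Last as u16);
    // Sound because of the assert: every u16 below __Last is a valid
    // discriminant of this repr(u16) enum.
    unsafe { std::mem::transmute(raw) }
}
```

Adding or removing a real variant shifts __Last’s discriminant automatically, so no magic number needs updating by hand.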

The approach we’ll use utilises a procedural macro to automate the process for us. Usually I’d hesitate before adding a procedural macro due to the compilation time overhead they incur; however, we’re already using a proc-macro – Logos – so we’re paying the cost of the procedural macro ecosystem whether we use this approach or not.

# Cargo.toml

# snip

[dependencies]
logos = "0.11.4"
num-derive = "0.3.3"
num-traits = "0.2.14"
rowan = "0.10.0"
// lexer.rs

use logos::Logos;
use num_derive::{FromPrimitive, ToPrimitive};

#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
    // snip
}
// syntax.rs

use crate::lexer::SyntaxKind;
use num_traits::{FromPrimitive, ToPrimitive};

// snip

impl rowan::Language for EldiroLanguage {
    type Kind = SyntaxKind;

    fn kind_from_raw(raw: rowan::SyntaxKind) -> Self::Kind {
        Self::Kind::from_u16(raw.0).unwrap()
    }

    fn kind_to_raw(kind: Self::Kind) -> rowan::SyntaxKind {
        rowan::SyntaxKind(kind.to_u16().unwrap())
    }
}

This also means we can remove our custom From<SyntaxKind> for rowan::SyntaxKind impl.

Now we can define our own Eldiro-specific SyntaxNode:

pub(crate) type SyntaxNode = rowan::SyntaxNode<EldiroLanguage>;

Finally, the last error left before we can write tests for our parser is from our usage of .into() in Parser::parse. Let’s replace that with a usage of our EldiroLanguage, plus some helper methods to reduce typing:

use crate::lexer::SyntaxKind;
use crate::syntax::EldiroLanguage;
use logos::Logos;
use rowan::{GreenNode, GreenNodeBuilder, Language};

// snip

impl<'a> Parser<'a> {
    // snip

    pub(crate) fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);
        self.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }

    fn start_node(&mut self, kind: SyntaxKind) {
        self.builder.start_node(EldiroLanguage::kind_to_raw(kind));
    }

    fn finish_node(&mut self) {
        self.builder.finish_node();
    }
}

Testing our parser

And now, finally, we can write the first test for our parser:

#[cfg(test)]
mod tests {
    use super::*;
    use crate::syntax::SyntaxNode;

    #[test]
    fn parse_nothing() {
        let parse = Parser::new("").parse();

        assert_eq!(
            format!("{:#?}", SyntaxNode::new_root(parse.green_node)),
            r#"Root@0..0
"#,
        );
    }
}

Doesn’t look too good, does it? It’s long, tedious to write, and – worst of all – doesn’t provide any sort of diff if the test fails. Instead, cargo t will spit out the syntax tree we expected and the one we got, and it’ll be up to us to Spot The Difference. I’ve tried using Pretty Assertions for this problem in the past, and although it works well for comparing structured data, it isn’t nice to work with when you’re using strings like we are here. Instead of showing a line-by-line diff, Pretty Assertions shows a diff as if the syntax tree were all on one line, with the lines separated by a literal \n.

We’ll use expect-test instead, which can automatically update the expected output of our tests: when a test fails, re-running it with the UPDATE_EXPECT=1 environment variable set rewrites the expect![[...]] literal in place.

From the documentation:

This becomes very useful when you have a lot of tests with verbose and potentially changing expected output.

expect-test sounds perfect for this use case, and that’s no coincidence – it was made by the developers of rust-analyzer.

Let’s convert our lone test to use expect-test:

# Cargo.toml

# snip

[dev-dependencies]
expect-test = "1.0.1"
// parser.rs

#[cfg(test)]
mod tests {
    use super::*;
    use crate::syntax::SyntaxNode;
    use expect_test::{expect, Expect};

    fn check(input: &str, expected_tree: Expect) {
        let parse = Parser::new(input).parse();
        let syntax_node = SyntaxNode::new_root(parse.green_node);

        let actual_tree = format!("{:#?}", syntax_node);

        // We cut off the last byte because formatting the SyntaxNode adds on a newline at the end.
        expected_tree.assert_eq(&actual_tree[0..actual_tree.len() - 1]);
    }

    #[test]
    fn parse_nothing() {
        check("", expect![[r#"Root@0..0"#]]);
    }
}

Running our tests gives us a bunch of warnings, as well as fifteen happy little green dots:

$ cargo t -q
running 15 tests
...............
test result: ok. 15 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Cleaning up warnings

Let’s make a few things that we know will need to be used by the REPL pub now so that some of those ‘dead code’ warnings go away:

// lib.rs

pub mod parser;

mod lexer;
mod syntax;
// parser.rs

// snip

pub struct Parser<'a> {
    // snip
}

// snip

impl<'a> Parser<'a> {
    pub fn new(input: &'a str) -> Self {
        // snip
    }

    pub fn parse(mut self) -> Parse {
        // snip
    }

    // snip
}

pub struct Parse {
    // snip
}

While we’re at it we might as well paste in the REPL code from before the wipe of the project:

// crates/eldiro-cli/src/main.rs

use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut stdout = io::stdout();
    let mut stderr = io::stderr();

    let mut input = String::new();
    let mut env = eldiro::Env::default();

    loop {
        write!(stdout, "→ ")?;
        stdout.flush()?;

        stdin.read_line(&mut input)?;

        match run(input.trim(), &mut env) {
            Ok(Some(val)) => writeln!(stdout, "{}", val)?,
            Ok(None) => {}
            Err(msg) => writeln!(stderr, "{}", msg)?,
        }

        input.clear();
    }
}

fn run(input: &str, env: &mut eldiro::Env) -> Result<Option<eldiro::Val>, String> {
    let parse = eldiro::parse(input).map_err(|msg| format!("Parse error: {}", msg))?;

    let evaluated = parse
        .eval(env)
        .map_err(|msg| format!("Evaluation error: {}", msg))?;

    if evaluated == eldiro::Val::Unit {
        Ok(None)
    } else {
        Ok(Some(evaluated))
    }
}

A lot of that code revolves around evaluating the user’s input, which we haven’t implemented yet. Let’s get rid of it and use our new parser too:

use eldiro::parser::Parser;
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut stdout = io::stdout();

    let mut input = String::new();

    loop {
        write!(stdout, "→ ")?;
        stdout.flush()?;

        stdin.read_line(&mut input)?;

        let parse = Parser::new(&input).parse();

        input.clear();
    }
}

How are we going to display the parser’s output? We can’t access parse.green_node like we can in our check test helper function. Let’s add a method to Parse that provides a debug representation of the parse tree:

// crates/eldiro/src/parser.rs

use crate::lexer::SyntaxKind;
use crate::syntax::{EldiroLanguage, SyntaxNode};
use logos::Logos;
use rowan::{GreenNode, GreenNodeBuilder, Language};

// snip

impl Parse {
    pub fn debug_tree(&self) -> String {
        let syntax_node = SyntaxNode::new_root(self.green_node.clone());
        let formatted = format!("{:#?}", syntax_node);

        // We cut off the last byte because formatting the SyntaxNode adds on a newline at the end.
        formatted[0..formatted.len() - 1].to_string()
    }
}

Let’s update check to use this:

#[cfg(test)]
mod tests {
    use super::*;
    use expect_test::{expect, Expect};

    fn check(input: &str, expected_tree: Expect) {
        let parse = Parser::new(input).parse();
        expected_tree.assert_eq(&parse.debug_tree());
    }

    // snip
}

We can now use Parse::debug_tree in our REPL:

// crates/eldiro-cli/src/main.rs

use eldiro::parser::Parser;
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut stdout = io::stdout();

    let mut input = String::new();

    loop {
        write!(stdout, "→ ")?;
        stdout.flush()?;

        stdin.read_line(&mut input)?;

        let parse = Parser::new(&input).parse();
        println!("{}", parse.debug_tree());

        input.clear();
    }
}

As expected, when we run the Eldiro CLI we get our REPL which, no matter the input, gives us a tree with a single Root node:

$ cargo r
→
Root@0..0
→ test
Root@0..0
→ fe;wjmfeoijfew;oifjes;oad
Root@0..0
→ 1+1
Root@0..0
→ ^C
$

Writing a parser that does things

As we just saw, our parser isn’t exactly interesting. Let’s parse numbers and binding usages first, since those are both straightforward and only involve a single token of input. First, a test for numbers:

// crates/eldiro/src/parser.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_number() {
        check(
            "123",
            expect![[r#"
Root@0..3
  Number@0..3 "123""#]],
        );
    }
}

Here we’re seeing Rowan’s tree format in full – children are successively indented by two spaces, and lexeme text is included after the text range in quotes. Feel free to run cargo t to see what expect-test’s diffs look like. They use colour, so I can’t reproduce them here.

Implementing parsing of numbers shows how Rowan and Logos combine:

impl<'a> Parser<'a> {
    // snip

    pub fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);

        if self.lexer.next() == Some(SyntaxKind::Number) {
            self.builder.token(
                EldiroLanguage::kind_to_raw(SyntaxKind::Number),
                self.lexer.slice().into(),
            );
        }

        self.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }

    // snip
}

That call to Iterator::next isn’t ideal – it’s advancing the lexer, even if we don’t have a Number. This is perfectly fine in this case, but in most other cases we’ll only want to keep eating through the input if we are at a certain token. To be able to look at the next token without consuming it, we’ll use the Peekable iterator adaptor:

use crate::lexer::SyntaxKind;
use crate::syntax::{EldiroLanguage, SyntaxNode};
use logos::Logos;
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::iter::Peekable;

pub struct Parser<'a> {
    lexer: Peekable<logos::Lexer<'a, SyntaxKind>>,
    builder: GreenNodeBuilder<'static>,
}

impl<'a> Parser<'a> {
    pub fn new(input: &'a str) -> Self {
        Self {
            lexer: SyntaxKind::lexer(input).peekable(),
            builder: GreenNodeBuilder::new(),
        }
    }

    // snip
}

This gives us an error, though, when we try to call .slice() in Parser::parse, since that method is defined on the logos::Lexer that Peekable is wrapping, but not on Peekable itself. Unfortunately, the only way to solve this is to either write our own version of Peekable that exposes the underlying lexer’s methods, or wrap the Logos lexer in a lexer of our own whose Iterator implementation yields both the SyntaxKind and the matching slice of text.

We’ll need to do the second one later anyway, so we might as well go down that route.

// lexer.rs

pub(crate) struct Lexer<'a> {
    inner: logos::Lexer<'a, SyntaxKind>,
}

impl<'a> Lexer<'a> {
    pub(crate) fn new(input: &'a str) -> Self {
        Self {
            inner: SyntaxKind::lexer(input),
        }
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = (SyntaxKind, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        let kind = self.inner.next()?;
        let text = self.inner.slice();

        Some((kind, text))
    }
}

To make sure the lexer is working we can modify the check helper function to use our custom lexer wrapper instead of the standard Logos one:

#[cfg(test)]
mod tests {
    // snip

    fn check(input: &str, kind: SyntaxKind) {
        let mut lexer = Lexer::new(input);
        assert_eq!(lexer.next(), Some((kind, input)));
    }

    // snip
}
$ cargo t -q
running 16 tests
................
test result: ok. 16 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

We can now modify Parser to use our new lexer:

// parser.rs

use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::{EldiroLanguage, SyntaxNode};
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::iter::Peekable;

pub struct Parser<'a> {
    lexer: Peekable<Lexer<'a>>,
    builder: GreenNodeBuilder<'static>,
}

impl<'a> Parser<'a> {
    pub fn new(input: &'a str) -> Self {
        Self {
            lexer: Lexer::new(input).peekable(),
            builder: GreenNodeBuilder::new(),
        }
    }

    // snip
}

Let’s introduce a peek method to ‘peek’ at what the next SyntaxKind is:

impl<'a> Parser<'a> {
    // snip

    pub fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);

        if self.peek() == Some(SyntaxKind::Number) {
            let (kind, text) = self.lexer.next().unwrap();
            self.builder
                .token(EldiroLanguage::kind_to_raw(kind), text.into());
        }

        self.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }

    // snip

    fn peek(&mut self) -> Option<SyntaxKind> {
        self.lexer.peek().map(|(kind, _)| *kind)
    }
}

Let’s also introduce a bump method that adds the lexeme the lexer is currently at to the current branch of the parse tree:

impl<'a> Parser<'a> {
    // snip

    pub fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);

        if self.peek() == Some(SyntaxKind::Number) {
            self.bump();
        }

        self.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }

    // snip

    fn bump(&mut self) {
        let (kind, text) = self.lexer.next().unwrap();

        self.builder
            .token(EldiroLanguage::kind_to_raw(kind), text.into());
    }

    // snip
}

Take note of the .unwrap() in Parser::bump; if we try to add the current lexeme to the parse tree when we’re actually at the end of the input, we panic. In my opinion this is reasonable, since it’s an error on the part of the programmer, and cannot be recovered from.

$ cargo t -q
running 16 tests
................
test result: ok. 16 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

After all that refactoring and refinement, it’s trivial to parse binding usages. Here’s the test:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_binding_usage() {
        check(
            "counter",
            expect![[r#"
Root@0..7
  Ident@0..7 "counter""#]],
        );
    }
}

And here’s how we can implement it:

impl<'a> Parser<'a> {
    // snip

    pub fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);

        match self.peek() {
            Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => self.bump(),
            _ => {}
        }

        self.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }

    // snip
}
$ cargo t -q
running 17 tests
.................
test result: ok. 17 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Operator precedence

Now that we’ve gotten to know Rowan a bit better and set up a basic project using it, we can start on the main goal of this post: operator precedence. Up to now, Eldiro has not had operator precedence in any shape or form – we’ve always disallowed nested operations. The algorithm we’ll be using is called Pratt parsing; a big thanks to this fantastic post by matklad, which is the best explanation of it I’ve seen.

If we want to parse a mathematical operation, we’ll need to store it in our Rowan-based syntax tree, which means we’ll also need to create a SyntaxKind variant to differentiate mathematical operations (also known as binary operations due to them having two operands) from all the other node types we’ll have:

// lexer.rs

// snip

#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
    // snip

    Root,
    BinOp,
}

Let’s write a function that parses a single expression. At the moment this will only parse numbers and binding usages:

// parser.rs

mod expr;

use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::{EldiroLanguage, SyntaxNode};
use expr::expr;
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::iter::Peekable;

// snip

impl<'a> Parser<'a> {
    // snip

    pub fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);

        expr(&mut self);

        self.finish_node();

        Parse {
            green_node: self.builder.finish(),
        }
    }

    // snip
}
// expr.rs

use super::Parser;
use crate::lexer::SyntaxKind;

pub(super) fn expr(p: &mut Parser) {
    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }
}

All we’ve done here is move some of the code from Parser::parse into its own function. In my opinion tests should be next to the code they’re testing, so let’s migrate the relevant tests from crate::parser to crate::parser::expr:

// parser.rs

#[cfg(test)]
fn check(input: &str, expected_tree: expect_test::Expect) {
    let parse = Parser::new(input).parse();
    expected_tree.assert_eq(&parse.debug_tree());
}

#[cfg(test)]
mod tests {
    use super::*;
    use expect_test::expect;

    #[test]
    fn parse_nothing() {
        check("", expect![[r#"Root@0..0"#]]);
    }
}
// expr.rs

#[cfg(test)]
mod tests {
    use super::super::check;
    use expect_test::expect;

    #[test]
    fn parse_number() {
        check(
            "123",
            expect![[r#"
Root@0..3
  Number@0..3 "123""#]],
        );
    }

    #[test]
    fn parse_binding_usage() {
        check(
            "counter",
            expect![[r#"
Root@0..7
  Ident@0..7 "counter""#]],
        );
    }
}

Now that we’re parsing single expressions isolated from the rest of the parser, we need to parse operators:

pub(super) fn expr(p: &mut Parser) {
    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }

    match p.peek() {
        Some(SyntaxKind::Plus)
        | Some(SyntaxKind::Minus)
        | Some(SyntaxKind::Star)
        | Some(SyntaxKind::Slash) => p.bump(),
        _ => {}
    }
}

To determine precedence we need some representation of operators, since that lets us ask what the precedence of a given operator is:

enum Op {
    Add,
    Sub,
    Mul,
    Div,
}

The representation of precedence Pratt parsing uses is called binding power – the higher the binding power, the ‘tighter’ the operator holds on to its operands. For example, multiplication has a higher binding power than addition.

Keep in mind that there are left-associative and right-associative operators. Here’s how 1 - 2 - 3 - 4 is parsed if we assume - to be left-associative:

((1 - 2) - 3) - 4

And here’s how it’s parsed if - is right-associative:

1 - (2 - (3 - 4))

All of the standard four binary operators are conventionally left-associative. Examples of conventionally right-associative operators include exponentiation (2^3^4 is parsed as 2^(3^4)) and function composition. To convey associativity, Pratt parsing includes different binding powers for the left and the right side of the operator.

Here’s the conventional binding power table:

impl Op {
    fn binding_power(&self) -> (u8, u8) {
        match self {
            Self::Add | Self::Sub => (1, 2),
            Self::Mul | Self::Div => (3, 4),
        }
    }
}
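As an aside, a right-associative operator gets a left binding power that is *higher* than its right one. Eldiro doesn’t have such an operator yet, but here’s a hypothetical sketch (the Pow variant and its powers are my invention, purely for illustration) of how exponentiation would slot into the same table:

```rust
// Hypothetical sketch only -- Eldiro has no Pow operator; this shows how a
// right-associative operator would be encoded alongside the others.
enum Op {
    Add,
    Sub,
    Mul,
    Div,
    Pow, // hypothetical right-associative exponentiation
}

impl Op {
    fn binding_power(&self) -> (u8, u8) {
        match self {
            Self::Add | Self::Sub => (1, 2),
            Self::Mul | Self::Div => (3, 4),
            // Left power *higher* than right: in a chain like 2^3^4 the
            // right operand grabs the next `^` first, giving 2^(3^4).
            Self::Pow => (6, 5),
        }
    }
}
```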

I was dumbfounded when I first saw this; how could all these operators possibly be left-associative if their right binding powers are all higher than their left binding powers?! Surely, if you want to make something left-associative you increase its left binding power? This is not the case – let me see if I can explain. Here’s a simple example of binary operations:

1   +   2   +   3   +   4

Here’s what happens if we put the relevant binding power next to each operator:

1   +   2   +   3   +   4
  1   2   1   2   1   2

Let’s now move the operators and numbers together that have the strongest connection (in this case the ones that are separated by 2 instead of 1):

1   + 2   + 3   + 4
  1  2  1  2  1  2

Take a look at how the intuitive ‘if the right binding power is higher, then it’s right-associative’ idea doesn’t work; + can’t be right-associative, since we can’t rip that 3 away from its + so that it can join the 4 in a binary operation. The only way to resolve this is to connect the 1 to the + 2, then successively add on the + 3 and the + 4 at increasingly lower levels of precedence.

((1   + 2)   + 3)   + 4
    1  2   1  2   1  2

This is all well and good, but how do we write this in code? The solution is devilishly simple: we consume a single expression (at this point just a number or binding usage), and then start a loop. We parse out an operator, and based on that figure out what the left and right binding powers of the operator are. We then wrap everything we’ve parsed so far (the expression and the operator) in a new node, recurse, and finally finish that node.

Now here comes the really cool part:

This function actually takes a parameter, min_binding_power. Before we recurse, we check if the left binding power of the operator we’ve parsed is less than min_binding_power; if it is, we return. Oh, and also: when we recurse, the value we set min_binding_power to is the operator’s right binding power.

This leads to some very interesting behaviour: the parser keeps recursing as long as the binding powers are getting higher and higher. Once the parser has found an operator with a lower binding power than what we’ve had so far, the function returns, which means the caller finishes their node and continues trying to parse operators with higher binding powers. It takes a while to fully understand how this works (at least it did for me), but I hope my explanation is enough to make it ‘click’ for you.
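To see the dynamic in isolation before we entangle it with Rowan, here’s a self-contained toy Pratt parser (not Eldiro’s actual code; every name in it is my own) that turns single-digit arithmetic into a fully parenthesised string, making the precedence and associativity decisions visible:

```rust
// A toy Pratt parser, separate from the real Eldiro parser, that renders
// single-digit arithmetic as a fully parenthesised string.

use std::iter::Peekable;
use std::str::Chars;

fn parse(input: &str) -> String {
    expr_binding_power(&mut input.chars().peekable(), 0)
}

fn expr_binding_power(chars: &mut Peekable<Chars<'_>>, minimum_binding_power: u8) -> String {
    // In this toy version an atom is just a single digit.
    let mut lhs = chars.next().unwrap().to_string();

    loop {
        let op = match chars.peek().copied() {
            Some(c @ ('+' | '-' | '*' | '/')) => c,
            _ => break,
        };

        let (left_binding_power, right_binding_power) = match op {
            '+' | '-' => (1, 2),
            _ => (3, 4),
        };

        // An operator that binds less tightly than our caller allows belongs
        // to an outer expression; stop and let the caller deal with it.
        if left_binding_power < minimum_binding_power {
            break;
        }

        chars.next(); // eat the operator

        // Recurse with the operator’s *right* binding power as the new minimum.
        let rhs = expr_binding_power(chars, right_binding_power);
        lhs = format!("({}{}{})", lhs, op, rhs);
    }

    lhs
}
```

Feeding it 1-2-3-4 produces (((1-2)-3)-4), and 1+2*3-4 produces ((1+(2*3))-4), exactly the shapes we want our real parser’s syntax trees to have.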

Let’s build up the code bit by bit; we already have a function to parse an expression, and we also have operators and their binding powers set up. First, we’ll determine which particular operator we’re looking at:

pub(super) fn expr(p: &mut Parser) {
    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }

    let op = match p.peek() {
        Some(SyntaxKind::Plus) => Op::Add,
        Some(SyntaxKind::Minus) => Op::Sub,
        Some(SyntaxKind::Star) => Op::Mul,
        Some(SyntaxKind::Slash) => Op::Div,
        _ => return, // we’ll handle errors later.
    };
    p.bump();
}

Other code parsing expressions shouldn’t have to worry about the implementation details of operator precedence, so we should avoid adding the min_binding_power parameter directly to expr. Instead, we’ll rename expr to expr_binding_power and make expr call expr_binding_power with min_binding_power set to 0 (since there isn’t a minimum when you haven’t parsed anything yet):

pub(super) fn expr(p: &mut Parser) {
    expr_binding_power(p, 0);
}

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }

    let op = match p.peek() {
        Some(SyntaxKind::Plus) => Op::Add,
        Some(SyntaxKind::Minus) => Op::Sub,
        Some(SyntaxKind::Star) => Op::Mul,
        Some(SyntaxKind::Slash) => Op::Div,
        _ => return, // we’ll handle errors later.
    };
    p.bump();
}

We can now get the operator’s binding powers and compare them to minimum_binding_power:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }

    let op = match p.peek() {
        Some(SyntaxKind::Plus) => Op::Add,
        Some(SyntaxKind::Minus) => Op::Sub,
        Some(SyntaxKind::Star) => Op::Mul,
        Some(SyntaxKind::Slash) => Op::Div,
        _ => return, // we’ll handle errors later.
    };

    let (left_binding_power, right_binding_power) = op.binding_power();

    if left_binding_power < minimum_binding_power {
        return;
    }

    // Eat the operator’s token.
    p.bump();
}

Let’s add the loop and recursion:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => Op::Add,
            Some(SyntaxKind::Minus) => Op::Sub,
            Some(SyntaxKind::Star) => Op::Mul,
            Some(SyntaxKind::Slash) => Op::Div,
            _ => return, // we’ll handle errors later.
        };

        let (left_binding_power, right_binding_power) = op.binding_power();

        if left_binding_power < minimum_binding_power {
            return;
        }

        // Eat the operator’s token.
        p.bump();

        expr_binding_power(p, right_binding_power);
    }
}

All that’s left now is to add the wrapping of nodes each time expr_binding_power is called. Rowan allows us to do this with ease by using something called a checkpoint. We create a checkpoint, and can then later call start_node_at on our GreenNodeBuilder, passing in the checkpoint as a parameter. Everything we’ve added to the current branch since the creation of the checkpoint will be ‘indented’ under a new node. As usual, we’ll add helper methods to Parser so we don’t have to directly interact with the GreenNodeBuilder:
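If it isn’t obvious how a checkpoint can retroactively wrap already-emitted tokens in a node, here’s a toy model of the idea. This is not Rowan’s actual implementation, just my simplified illustration: the builder records events in a Vec, a checkpoint is an index into that Vec, and start_node_at splices a ‘start node’ event in at the remembered position:

```rust
// A toy event-recording builder illustrating how checkpoints can work.
// NOT Rowan's real implementation -- here a checkpoint is just an index
// into a list of recorded events.

#[derive(Debug, PartialEq)]
enum Event {
    StartNode(&'static str),
    Token(&'static str),
    FinishNode,
}

#[derive(Clone, Copy)]
struct Checkpoint(usize);

struct Builder {
    events: Vec<Event>,
}

impl Builder {
    fn new() -> Self {
        Self { events: Vec::new() }
    }

    fn checkpoint(&self) -> Checkpoint {
        // Remember where we are; nothing is emitted yet.
        Checkpoint(self.events.len())
    }

    fn token(&mut self, text: &'static str) {
        self.events.push(Event::Token(text));
    }

    fn start_node_at(&mut self, checkpoint: Checkpoint, kind: &'static str) {
        // Splice a StartNode event in *before* everything recorded since the
        // checkpoint, retroactively wrapping those events in a new node.
        self.events.insert(checkpoint.0, Event::StartNode(kind));
    }

    fn finish_node(&mut self) {
        self.events.push(Event::FinishNode);
    }
}
```

With this model, taking a checkpoint before parsing the left-hand side costs nothing; only once we know an operator follows do we reach back and wrap everything since the checkpoint in a BinOp node.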

// parser.rs

mod expr;

use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::{EldiroLanguage, SyntaxNode};
use expr::expr;
use rowan::{Checkpoint, GreenNode, GreenNodeBuilder, Language};
use std::iter::Peekable;

// snip

impl<'a> Parser<'a> {
    // snip

    fn start_node_at(&mut self, checkpoint: Checkpoint, kind: SyntaxKind) {
        self.builder
            .start_node_at(checkpoint, EldiroLanguage::kind_to_raw(kind));
    }

    // snip

    fn checkpoint(&self) -> Checkpoint {
        self.builder.checkpoint()
    }

    // snip
}

// expr.rs

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        _ => {}
    }

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => Op::Add,
            Some(SyntaxKind::Minus) => Op::Sub,
            Some(SyntaxKind::Star) => Op::Mul,
            Some(SyntaxKind::Slash) => Op::Div,
            _ => return, // we’ll handle errors later.
        };

        let (left_binding_power, right_binding_power) = op.binding_power();

        if left_binding_power < minimum_binding_power {
            return;
        }

        // Eat the operator’s token.
        p.bump();

        p.start_node_at(checkpoint, SyntaxKind::BinOp);
        expr_binding_power(p, right_binding_power);
        p.finish_node();
    }
}

And just like that, we’ve implemented bare-minimum operator precedence! Let’s write a few tests to make sure it works:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_simple_binary_operation() {
        check(
            "1+2",
            expect![[r#"
Root@0..3
  BinOp@0..3
    Number@0..1 "1"
    Plus@1..2 "+"
    Number@2..3 "2""#]],
        );
    }

    #[test]
    fn parse_left_associative_binary_operation() {
        check(
            "1+2+3+4",
            expect![[r#"
Root@0..7
  BinOp@0..7
    BinOp@0..5
      BinOp@0..3
        Number@0..1 "1"
        Plus@1..2 "+"
        Number@2..3 "2"
      Plus@3..4 "+"
      Number@4..5 "3"
    Plus@5..6 "+"
    Number@6..7 "4""#]],
        );
    }

    #[test]
    fn parse_binary_operation_with_mixed_binding_power() {
        check(
            "1+2*3-4",
            expect![[r#"
Root@0..7
  BinOp@0..7
    BinOp@0..5
      Number@0..1 "1"
      Plus@1..2 "+"
      BinOp@2..5
        Number@2..3 "2"
        Star@3..4 "*"
        Number@4..5 "3"
    Minus@5..6 "-"
    Number@6..7 "4""#]],
        );
    }
}

As you can tell, these tests get quite long and can be difficult to follow at times, so I try to keep them as short as possible. I still recommend thoroughly reading through one or two of them to make sure you understand what the parser is doing.

Congratulations on making it to the end! That’s it for this part; let me know what you think of the length, and if I should have broken this one up into several parts. In the next part we’ll fix two glaring holes in our operator precedence logic: lack of prefix operator support (e.g. -10) and lack of parentheses to control the order of operations.