Make A Language

Part One: A basic parser

September 8, 2020

The most fundamental part of any language is the parser1 – a piece of software whose purpose is to take a flat structure (usually text in some form) and convert it into a tree structure. In this post, we’ll make a parser for mathematical expressions that don’t contain nesting. For example, 1 + 1 is allowed, but 2 * 3 + 4 isn’t (because that’s shorthand for (2 * 3) + 4, and thus contains nesting). I’ve chosen to keep it this simple so that we don’t need to worry about order of operations.2

Let’s think about the problem: the only two elements that are allowed in our expressions are numbers and operators. If we can make a parser for numbers and a parser for operators, we’ll be well on our way to finishing the first incarnation of our parser.

A parser for numbers

As we’re good test-driven developers, we’ll start by writing a test. First, we open src/lib.rs and are greeted by this:

#[cfg(test)]
mod tests {
    #[test]
    fn it_works() {
        assert_eq!(2 + 2, 4);
    }
}

Let’s remove the it_works test, and change it to a test that checks if we can parse numbers:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_number() {
      assert_eq!(Number::new("123"), Number(123));
    }
}

Let’s follow TDD some more: what’s the shortest thing we can write that makes this pass? Well, first, we need to make it compile.

#[derive(Debug, PartialEq)]
pub struct Number(pub i32);

impl Number {
    pub fn new(s: &str) -> Self {}
}

This still doesn’t compile, because new doesn’t return anything. Well, the easiest thing we can do is use the parse method on str:

impl Number {
    pub fn new(s: &str) -> Self {
        Self(s.parse().unwrap())
    }
}

Now it compiles! Let’s see if the tests pass:

$ cargo t
   Compiling eldiro v0.1.0 (/home/me/src/eldiro)
    Finished test [unoptimized + debuginfo] target(s) in 0.14s
     Running /home/me/.cache/cargo-target/debug/deps/eldiro-4b82c3c57a78933f

running 1 test
test tests::parse_number ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

   Doc-tests eldiro

running 0 tests

Hooray, it works!

A parser for operators

We’ll start by writing tests for addition, subtraction, multiplication and division:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_add_op() {
        assert_eq!(Op::new("+"), Op::Add);
    }

    #[test]
    fn parse_sub_op() {
        assert_eq!(Op::new("-"), Op::Sub);
    }

    #[test]
    fn parse_mul_op() {
        assert_eq!(Op::new("*"), Op::Mul);
    }

    #[test]
    fn parse_div_op() {
        assert_eq!(Op::new("/"), Op::Div);
    }
}

Op is undefined, so let’s declare that:

#[derive(Debug, PartialEq)]
pub enum Op {
    Add,
    Sub,
    Mul,
    Div,
}

Now the only remaining compile error is that Op::new doesn’t exist:

impl Op {
    pub fn new(s: &str) -> Self {
        match s {
            "+" => Self::Add,
            "-" => Self::Sub,
            "*" => Self::Mul,
            "/" => Self::Div,
            _ => panic!("bad operator"),
        }
    }
}

Instead of handling the case where we’ve been given a non-existent operator, we just crash the program. Obviously this is far from ideal, but we’re keeping it really simple at the moment.

$ cargo t
// snip
running 5 tests
test tests::parse_add_op ... ok
test tests::parse_div_op ... ok
test tests::parse_mul_op ... ok
test tests::parse_sub_op ... ok
test tests::parse_number ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
// doc-test stuff that we’re not interested in

Nice! We’ve now got the ingredients we need to parse our ultra-simplified mathematical expressions.

Putting it all together

As usual, we’ll write a test:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_one_plus_two() {
        assert_eq!(
            Expr::new("1+2"),
            Expr {
                lhs: Number(1),
                rhs: Number(2),
                op: Op::Add,
            },
        );
    }
}

In case you’re not familiar with the jargon, Expr is a common shorthand for Expression, lhs is short for left-hand side, and, shockingly, rhs stands for right-hand side.

Expr will be the type that holds the structure of what an expression actually is in Eldiro. Since we don’t support nesting, Expr’s definition is straightforward:

#[derive(Debug, PartialEq)]
pub struct Expr {
    pub lhs: Number,
    pub rhs: Number,
    pub op: Op,
}

Let’s tentatively define the new method:

impl Expr {
    pub fn new(s: &str) -> Self {
        let lhs = Number::new(s);
        let rhs = Number::new(s);
        let op = Op::new(s);

        Self { lhs, rhs, op }
    }
}

This won’t work, though, because we’re trying to parse the left-hand side (as well as the right-hand side and the operator) from the entire input, when in reality parsing the left-hand side only needs to look at a small portion at the start of the input. The same goes for the right-hand side and the operator, too. Let’s run the tests to see if we’re correct:

$ cargo t
running 6 tests
test tests::parse_add_op ... ok
test tests::parse_div_op ... ok
test tests::parse_number ... ok
test tests::parse_mul_op ... ok
test tests::parse_sub_op ... ok
test tests::parse_one_plus_two ... FAILED

failures:

---- tests::parse_one_plus_two stdout ----
thread 'tests::parse_one_plus_two' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', src/lib.rs:6:24

And sure enough, the test fails. Hmm …

So, what we need to do is … extract the numbers at the start of the input? Something like this:

assert_eq!(extractor_thing("1+2"), "1");

Well, if we’ve now got the left-hand side as its own &str, we need some way to get the operator. I mean, if we just scanned past the left-hand side, why don’t we continue from where we left off? To do that, extractor_thing will have to return a pair of the part of the input it extracted, as well as the leftovers:

//                       extracted portion ╮
//                          leftover ╮     │
assert_eq!(extractor_thing("1+2"), ("+2", "1"));

Let’s get started on writing this extractor_thing in a new module called utils. Create the file src/utils.rs and add mod utils; to the top of lib.rs.

// utils.rs
pub(crate) fn extract_digits(s: &str) -> (&str, &str) {
    (&s[1..], &s[0..1])
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn extract_one_digit() {
        assert_eq!(extract_digits("1+2"), ("+2", "1"));
    }
}

Well, the tests pass, but we’re not thinking ahead; what if the number is more than one digit long?

pub(crate) fn extract_digits(s: &str) -> (&str, &str) {
    let mut digits_end = 0;

    for (idx, c) in s.char_indices() {
        if c.is_ascii_digit() {
            digits_end = idx + 1;
        } else {
            break;
        }
    }

    let digits = &s[..digits_end];
    let remainder = &s[digits_end..];
    (remainder, digits)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn extract_one_digit() {
        assert_eq!(extract_digits("1+2"), ("+2", "1"));
    }

    #[test]
    fn extract_multiple_digits() {
        assert_eq!(extract_digits("10-20"), ("-20", "10"));
    }
}

This works, but isn’t very idiomatic, with that for loop and mutation. Let’s rewrite it to use iterator adaptors:

pub(crate) fn extract_digits(s: &str) -> (&str, &str) {
    let digits_end = s
        .char_indices()
        .find_map(|(idx, c)| if c.is_ascii_digit() { None } else { Some(idx) })
        .unwrap_or_else(|| s.len());

    let digits = &s[..digits_end];
    let remainder = &s[digits_end..];
    (remainder, digits)
}

What we’re doing here is looking at all the characters (and their respective indices) in s – if the character is an ASCII digit (i.e. anywhere from 0 to 9) we move onto the next one. However, if the character isn’t an ASCII digit, we stop there and return the index of that character. On the off-chance that we never encounter a non-digit character, the index where the digits end is the length of the input (as the entire input consists of digits). We should probably add some tests to verify that:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn do_not_extract_anything_from_empty_input() {
        assert_eq!(extract_digits(""), ("", ""));
    }

    #[test]
    fn extract_digits_with_no_remainder() {
        assert_eq!(extract_digits("100"), ("", "100"));
    }
}

And, indeed, that does work:

$ cargo t
running 9 tests
test tests::parse_add_op ... ok
test tests::parse_number ... ok
test tests::parse_sub_op ... ok
test tests::parse_div_op ... ok
test tests::parse_mul_op ... ok
test utils::tests::do_not_extract_anything_from_empty_input ... ok
test utils::tests::extract_multiple_digits ... ok
test utils::tests::extract_one_digit ... ok
test utils::tests::extract_digits_with_no_remainder ... ok
test tests::parse_one_plus_two ... FAILED

Huh, what’s that failed test there? Oh, right, we were trying to parse "1+2".

Remember this?

// lib.rs
impl Expr {
    pub fn new(s: &str) -> Self {
        let lhs = Number::new(s);
        let rhs = Number::new(s);
        let op = Op::new(s);

        Self { lhs, rhs, op }
    }
}

Let’s convert it to use our new extract_digits function:

impl Expr {
    pub fn new(s: &str) -> Self {
        let (s, lhs) = utils::extract_digits(s);
        let lhs = Number::new(lhs);

        let (s, rhs) = utils::extract_digits(s);
        let rhs = Number::new(rhs);

        // Uhhh ...
        let op = Op::new(s);

        Self { lhs, rhs, op }
    }
}

This won’t do, this won’t do at all! Not only are we parsing the operator after the right-hand side (even though it should come before), but we also haven’t applied the extraction logic we used for numbers to parsing operators. I guess we should head back to utils.rs and write up another function.

pub(crate) fn extract_op(s: &str) -> (&str, &str) {
    match &s[0..1] {
        "+" | "-" | "*" | "/" => {}
        _ => panic!("bad operator"),
    }

    (&s[1..], &s[0..1])
}

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn extract_plus() {
        assert_eq!(extract_op("+2"), ("2", "+"));
    }

    #[test]
    fn extract_minus() {
        assert_eq!(extract_op("-10"), ("10", "-"));
    }

    #[test]
    fn extract_star() {
        assert_eq!(extract_op("*3"), ("3", "*"));
    }

    #[test]
    fn extract_slash() {
        assert_eq!(extract_op("/4"), ("4", "/"));
    }
}

Let’s apply this in Expr::new:

impl Expr {
    pub fn new(s: &str) -> Self {
        let (s, lhs) = utils::extract_digits(s);
        let lhs = Number::new(lhs);

        let (s, op) = utils::extract_op(s);
        let op = Op::new(op);

        let (s, rhs) = utils::extract_digits(s);
        let rhs = Number::new(rhs);

        Self { lhs, rhs, op }
    }
}

Every time we extract something from the input, we declare a new binding with the same name as the initial input s so that everything we do with s afterwards doesn’t include the part we ‘consumed’.

Really, though, shouldn’t Number::new and Op::new handle extracting the relevant text from the input themselves? Let’s update the tests:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_number() {
        assert_eq!(Number::new("123"), ("", Number(123)));
    }

    #[test]
    fn parse_add_op() {
        assert_eq!(Op::new("+"), ("", Op::Add));
    }

    #[test]
    fn parse_sub_op() {
        assert_eq!(Op::new("-"), ("", Op::Sub));
    }

    #[test]
    fn parse_mul_op() {
        assert_eq!(Op::new("*"), ("", Op::Mul));
    }

    #[test]
    fn parse_div_op() {
        assert_eq!(Op::new("/"), ("", Op::Div));
    }
}

Since all our tests completely consume their inputs, there’s never any leftover input. To represent this, we’ve wrapped the expected output of each parser in a tuple where the first item is an empty string literal.

Let’s update all our parsers to independently extract text from the input:

impl Number {
    pub fn new(s: &str) -> (&str, Self) {
        let (s, number) = utils::extract_digits(s);
        (s, Self(number.parse().unwrap()))
    }
}

// snip

impl Op {
    pub fn new(s: &str) -> (&str, Self) {
        let (s, op) = utils::extract_op(s);

        let op = match op {
            "+" => Self::Add,
            "-" => Self::Sub,
            "*" => Self::Mul,
            "/" => Self::Div,
            _ => unreachable!(),
        };

        (s, op)
    }
}

// snip

impl Expr {
    pub fn new(s: &str) -> Self {
        let (s, lhs) = Number::new(s);
        let (s, op) = Op::new(s);
        let (s, rhs) = Number::new(s);

        Self { lhs, rhs, op }
    }
}

Look at how clean Expr::new is now! But we still have one problem: what if Expr::new doesn’t fully consume its input? Let’s make Expr::new return a (&str, Self) like all the other parsers:

impl Expr {
    pub fn new(s: &str) -> (&str, Self) {
        let (s, lhs) = Number::new(s);
        let (s, op) = Op::new(s);
        let (s, rhs) = Number::new(s);

        (s, Self { lhs, rhs, op })
    }
}

// snip

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_one_plus_two() {
        assert_eq!(
            Expr::new("1+2"),
            (
                "",
                Expr {
                    lhs: Number(1),
                    rhs: Number(2),
                    op: Op::Add,
                },
            ),
        );
    }
}

Let’s finish off this instalment of Make A Language by basking in the glory of a passing test suite.

$ cargo t
   Compiling eldiro v0.1.0 (/home/me/src/eldiro)
    Finished test [unoptimized + debuginfo] target(s) in 0.46s
     Running /home/me/.cache/cargo-target/debug/deps/eldiro-4b82c3c57a78933f

running 14 tests
test tests::parse_add_op ... ok
test tests::parse_number ... ok
test tests::parse_div_op ... ok
test tests::parse_mul_op ... ok
test utils::tests::extract_minus ... ok
test utils::tests::extract_multiple_digits ... ok
test tests::parse_sub_op ... ok
test utils::tests::extract_one_digit ... ok
test tests::parse_one_plus_two ... ok
test utils::tests::extract_digits_with_no_remainder ... ok
test utils::tests::do_not_extract_anything_from_empty_input ... ok
test utils::tests::extract_plus ... ok
test utils::tests::extract_slash ... ok
test utils::tests::extract_star ... ok

test result: ok. 14 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

   Doc-tests eldiro

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Beautiful! Next time, we’ll add support for whitespace.


  1. For most languages it’s actually the lexer, but we’ll worry about that when we improve the parser in a future post. ↩︎

  2. If you really want to learn about operator precedence and how you can handle it in a parser, take a look at this great article. Yet again, I will only cover this topic once we’ve rewritten the parser. ↩︎