Part Eleven: Refinements

We’ve got a number of topics to cover this time, so let’s get started.

A bug fix

u/mozjag pointed out on Reddit that the regex we’re using to lex identifiers mandates that they have a length of two, when we could also allow identifiers with a single character. Let’s write a test for lexing a single-letter identifier to verify the bug is actually present:

// lexer.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn lex_single_char_identifier() {
        check("x", SyntaxKind::Ident);
    }

    // snip
}
$ cargo t -q
running 21 tests
..............F......
failures:

---- lexer::tests::lex_single_char_identifier stdout ----
thread 'lexer::tests::lex_single_char_identifier' panicked at 'assertion failed: `(left == right)`
  left: `Some((Error, "x"))`,
 right: `Some((Ident, "x"))`', crates/eldiro/src/lexer.rs:78:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    lexer::tests::lex_single_char_identifier

test result: FAILED. 20 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out

We need to change the + in the Ident regex to be a * instead, so that zero or more repetitions are allowed instead of one or more:

#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
    // snip

    #[regex("[A-Za-z][A-Za-z0-9]*")]
    Ident,

    // snip
}
$ cargo t -q
.....................
test result: ok. 21 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Negation

Up to now, all the operators we support have been binary, meaning that they have two operands. We’d like to have unary operators too, which are operators that have one operand. The operator we’ll add here is the prefix unary minus, as used for negation – the - in -10, for example. To start, we’ll need another node type:1

#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
    // snip

    Root,
    BinOp,
    PrefixExpr,
}

Let’s rename BinOp to BinaryExpr for consistency (after all, we aren’t abbreviating Prefix):

#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
    // snip

    Root,
    BinaryExpr,
    PrefixExpr,
}
// expr.rs

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    // snip

    loop {
        // snip
        p.start_node_at(checkpoint, SyntaxKind::BinaryExpr);
        // snip
    }
}

// snip

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_simple_binary_expression() {
        check(
            "1+2",
            expect![[r#"
Root@0..3
  BinaryExpr@0..3
    Number@0..1 "1"
    Plus@1..2 "+"
    Number@2..3 "2""#]],
        );
    }

    #[test]
    fn parse_left_associative_binary_expression() {
        check(
            "1+2+3+4",
            expect![[r#"
Root@0..7
  BinaryExpr@0..7
    BinaryExpr@0..5
      BinaryExpr@0..3
        Number@0..1 "1"
        Plus@1..2 "+"
        Number@2..3 "2"
      Plus@3..4 "+"
      Number@4..5 "3"
    Plus@5..6 "+"
    Number@6..7 "4""#]],
        );
    }

    #[test]
    fn parse_binary_expression_with_mixed_binding_power() {
        check(
            "1+2*3-4",
            expect![[r#"
Root@0..7
  BinaryExpr@0..7
    BinaryExpr@0..5
      Number@0..1 "1"
      Plus@1..2 "+"
      BinaryExpr@2..5
        Number@2..3 "2"
        Star@3..4 "*"
        Number@4..5 "3"
    Minus@5..6 "-"
    Number@6..7 "4""#]],
        );
    }
}

We can write a test for parsing an expression with the negation operator:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_negation() {
        check(
            "-10",
            expect![[r#"
Root@0..3
  PrefixExpr@0..3
    Minus@0..1 "-"
    Number@1..3 "10""#]],
        );
    }
}

Now that an expression can start with something other than a number or an identifier, we need to modify that match at the start of expr_binding_power:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        Some(SyntaxKind::Minus) => todo!(),
        _ => {}
    }

    // snip
}

First, we should determine the binding power of the prefix operator we’ve found. To do this, we need to define a type that represents all the possible prefix operators we might encounter – PrefixOp sounds like a good name. Also, we should rename Op to InfixOp so that it’s differentiated from PrefixOp:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        Some(SyntaxKind::Minus) => {
            let op = PrefixOp::Neg;
            let ((), right_binding_power) = op.binding_power();
        }
        _ => {}
    }

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => InfixOp::Add,
            Some(SyntaxKind::Minus) => InfixOp::Sub,
            Some(SyntaxKind::Star) => InfixOp::Mul,
            Some(SyntaxKind::Slash) => InfixOp::Div,
            _ => return, // we’ll handle errors later.
        };

        // snip
    }
}

enum InfixOp {
    Add,
    Sub,
    Mul,
    Div,
}

impl InfixOp {
    fn binding_power(&self) -> (u8, u8) {
        // snip
    }
}

enum PrefixOp {
    Neg,
}

impl PrefixOp {
    fn binding_power(&self) -> ((), u8) {
        match self {
            Self::Neg => ((), 5),
        }
    }
}

Note how the binding power of the negation operator is higher than that of any infix operator. This means that -10 + 5 is parsed as (-10) + 5 instead of -(10 + 5).

We now do the same as in the case of an infix operator:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        Some(SyntaxKind::Minus) => {
            let op = PrefixOp::Neg;
            let ((), right_binding_power) = op.binding_power();

            // Eat the operator’s token.
            p.bump();

            p.start_node_at(checkpoint, SyntaxKind::PrefixExpr);
            expr_binding_power(p, right_binding_power);
            p.finish_node();
        }
        _ => {}
    }

    // snip
}
$ cargo t -q
running 22 tests
......................
test result: ok. 22 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s add a test to make sure precedence is working as we would hope:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn negation_has_higher_binding_power_than_infix_operators() {
        check(
            "-20+20",
            expect![[r#"
Root@0..6
  BinaryExpr@0..6
    PrefixExpr@0..3
      Minus@0..1 "-"
      Number@1..3 "20"
    Plus@3..4 "+"
    Number@4..6 "20""#]],
        );
    }
}
$ cargo t -q
running 23 tests
.......................
test result: ok. 23 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Parentheses

Our lexer doesn’t lex parentheses, so we should start off by adding that:

// lexer.rs

#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
    // snip

    #[token("(")]
    LParen,

    #[token(")")]
    RParen,

    // snip
}

// snip

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn lex_left_parenthesis() {
        check("(", SyntaxKind::LParen);
    }

    #[test]
    fn lex_right_parenthesis() {
        check(")", SyntaxKind::RParen);
    }

    // snip
}
$ cargo t -q
running 25 tests
.........................
test result: ok. 25 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s write a test for parsing a bunch of nested parentheses:

// expr.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_nested_parentheses() {
        check(
            "((((((10))))))",
            expect![[r#"
Root@0..14
  LParen@0..1 "("
  LParen@1..2 "("
  LParen@2..3 "("
  LParen@3..4 "("
  LParen@4..5 "("
  LParen@5..6 "("
  Number@6..8 "10"
  RParen@8..9 ")"
  RParen@9..10 ")"
  RParen@10..11 ")"
  RParen@11..12 ")"
  RParen@12..13 ")"
  RParen@13..14 ")""#]],
        );
    }
}

Let’s also add a test to see if adding parentheses allows us to change precedence as we would expect:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parentheses_affect_precedence() {
        check(
            "5*(2+1)",
            expect![[r#"
Root@0..7
  BinaryExpr@0..7
    Number@0..1 "5"
    Star@1..2 "*"
    LParen@2..3 "("
    BinaryExpr@3..6
      Number@3..4 "2"
      Plus@4..5 "+"
      Number@5..6 "1"
    RParen@6..7 ")""#]],
        );
    }
}

Once again we need to add another case to the match statement at the top of expr_binding_power:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        Some(SyntaxKind::Minus) => {
            // snip
        }
        Some(SyntaxKind::LParen) => todo!(),
        _ => {}
    }

    // snip
}

The first thing we should do is bump the left parenthesis onto the current branch:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        Some(SyntaxKind::Minus) => {
            // snip
        }
        Some(SyntaxKind::LParen) => {
            p.bump();
            todo!();
        }
        _ => {}
    }

    // snip
}

If you think about it, we can parse the contents of the parenthesised expression by calling expr_binding_power with minimum_binding_power set to 0, since parenthesising something ‘resets’ precedence inside it. All we need to do after writing that is add another p.bump() to account for the closing parenthesis:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let checkpoint = p.checkpoint();

    match p.peek() {
        Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
        Some(SyntaxKind::Minus) => {
            // snip
        }
        Some(SyntaxKind::LParen) => {
            p.bump();
            expr_binding_power(p, 0);

            assert_eq!(p.peek(), Some(SyntaxKind::RParen));
            p.bump();
        }
        _ => {}
    }

    // snip
}
$ cargo t -q
running 27 tests
...........................
test result: ok. 27 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Next time

In the next part we’ll restructure how the parser works internally so Eldiro can support a seemingly-basic feature: whitespace.


  1. It’s more appropriate to use PrefixExpr instead of UnaryExpr because a unary operator could also be postfix, meaning that you add it after the operand. For example, ? in Rust is a postfix operator. It makes sense to differentiate the two because we might have to treat them differently in future. ↩︎