Part Nineteen: Code Representations

So far we have only used two different data structures to represent code:

An abstract syntax tree is a data structure that represents the bare minimum about a chunk of code. We used ASTs in the early parts of this series prior to the Rowan rewrite. For example, here’s an AST that represents variable definitions:

struct VariableDef {
    name: String,
    value: Expr,
}

enum Expr {
    // snip
}

Note how it doesn’t include the let token or the =, and ignores trivia like whitespace and comments; it’s abstract, and only contains the information necessary to implement, say, a compiler. Although this makes them convenient to work with, it also prevents us from implementing some IDE features on top of them.

Automatic refactorings are a good example. How are meant to implement a feature that relies on knowing and being able to modify source code when we don’t know the exact text of the code being analysed?

Concrete syntax trees, on the other hand, include the text of the source code in full. We’ve used CSTs whenever we’ve used Rowan, as the trees it produces can reproduce the text they were created from. Here’s a CST (in this case from Rowan) that represents a binary expression:

Root@0..5
  InfixExpr@0..5
    Literal@0..2
      Number@0..1 "1"
      Whitespace@1..2 " "
    Plus@2..3 "+"
    Whitespace@3..4 " "
    Literal@4..5
      Number@4..5 "1"

It includes the entire input, whitespace and all. An interesting feature of Rowan’s CSTs is that they are untyped, meaning that nodes of differing kinds (e.g. VariableDef and InfixExpr) have the same type – SyntaxNode. Although this means we can implement features that don’t depend on node kinds more easily than we could otherwise,1 it also means that more complex analysis on Rowan CSTs is painful, since we need to call .kind() whenever we want to know a node’s kind. We wouldn’t have to do this if nodes of different kinds had different types.

Programming language implementations that use Rowan usually include a typed layer on top of the CST for analysis purposes. Confusingly, this layer is called an AST in spite of its knowledge of the full input text, since it uses an unmodified Rowan CST under the hood. Here’s an example of what one for binary expressions might look like:

struct BinaryExpr(SyntaxNode);

impl BinaryExpr {
    fn lhs(&self) -> Option<Expr> {
        // do stuff to extract the left-hand side
    }

    fn rhs(&self) -> Option<Expr> {
        // do stuff to extract the right-hand side
    }

    fn op(&self) -> Option<BinaryOp> {
        // do stuff to extract the operator
    }
}

Note how all these methods return Options – this is because the parser supports error recovery and allows incomplete trees; as such, these methods cannot be sure that the thing they are trying to extract is actually present.

Note also how the API provided by this is similar to the simple struct from earlier, with the distinction that components are accessed through methods that return Option, rather than non-optional fields.

Finally, programming language implementations with Rowan usually include one more, higher-level representation: the high-level intermediate representation, or HIR. These are equivalent to the classical ASTs that we used before the rewrite. For comparison, here’s what a HIR data structure that represents binary expressions could look like:

struct BinaryExpr {
    lhs: Option<Expr>,
    rhs: Option<Expr>,
    op: Option<BinaryOp>,
}

HIRs can make it easier to implement the parts of a programming language that don’t have to worry about original source code text, because this very property – the ability to differ from the original input – allows for the use of lowering.

Lowering is the process of changing syntactic sugar, or shorthand, into its lower-level form. For example, a while loop is syntactic sugar for a loop with an if statement:

while !should_exit() {
    handle_connection();
}

// is sugar for

loop {
    if should_exit() {
        break;
    }

    handle_connection();
}

Imagine you’re implementing a Rust type checker – if you don’t lower while loops, you need to worry about type-checking both loop statements and while loops. However, if you lower while loops into loop statements, you’d only have to implement type-checking for loop. The same principle can be applied to lots of other components of programming language tooling.

Lowering isn’t possible with a Rowan AST, as it has to represent the input text losslessly. It is possible with a HIR, though, since they don’t have to correspond with the input text.

Eldiro will use three different code representations:

Implementing an AST

Although we could stick this in an existing crate, it doesn’t really belong in any of them. Let’s create a new one:

$ cargo new --lib crates/ast
     Created library `crates/ast` package

This crate will need to use SyntaxNode from the syntax crate, so we’ll add that as a dependency:

# crates/ast/Cargo.toml

[dependencies]
syntax = {path = "../syntax"}

Let’s start by writing the skeleton of a variable definition’s AST node:

// lib.rs

use syntax::SyntaxNode;

#[derive(Debug)]
pub struct VariableDef(SyntaxNode);

impl VariableDef {
    pub fn name(&self) -> Option<?> {
        todo!()
    }

    pub fn value(&self) -> Option<?> {
        todo!()
    }
}

Let’s look at a typical VariableDef CST to remind us of its structure:

Root@0..13
  VariableDef@0..13
    LetKw@0..3 "let"
    Whitespace@3..4 " "
    Ident@4..7 "foo"
    Whitespace@7..8 " "
    Equals@8..9 "="
    Whitespace@9..10 " "
    VariableRef@10..13
      Ident@10..13 "bar"

We would want VariableDef::name to return that Ident, a token, not a node, and VariableDef::value should return the VariableRef node. We have a type for nodes from the Rowan syntax tree – SyntaxNode. We don’t have a type for tokens from the Rowan CST though; let’s create that now:

// crates/syntax/src/lib.rs

pub type SyntaxNode = rowan::SyntaxNode<EldiroLanguage>;
pub type SyntaxToken = rowan::SyntaxToken<EldiroLanguage>;

We can now fill in those missing types from the AST:

// crates/ast/src/lib.rs

use syntax::{SyntaxNode, SyntaxToken};

// snip

impl VariableDef {
    pub fn name(&self) -> Option<SyntaxToken> {
        todo!()
    }

    pub fn value(&self) -> Option<SyntaxNode> {
        todo!()
    }
}

To determine a VariableDef’s name, we would want to iterate over its child tokens, stoping at the first one whose kind is Ident. Rowan doesn’t give us a way to iterate over only the child tokens of a node, so we’ll iterate over the tokens and nodes (known to Rowan as SyntaxElements), filtering out the nodes:

use syntax::{SyntaxKind, SyntaxNode, SyntaxToken};

// snip

impl VariableDef {
    pub fn name(&self) -> Option<SyntaxToken> {
        self.0
            .children_with_tokens()
            .filter_map(SyntaxElement::into_token)
            .find(|token| token.kind() == SyntaxKind::Ident)
    }

    // snip
}

We need to define our own version of SyntaxElement:

// crates/syntax/src/lib.rs

pub type SyntaxNode = rowan::SyntaxNode<EldiroLanguage>;
pub type SyntaxToken = rowan::SyntaxToken<EldiroLanguage>;
pub type SyntaxElement = rowan::SyntaxElement<EldiroLanguage>;

And import it from ast:

// crates/ast/src/lib.rs

use syntax::{SyntaxElement, SyntaxKind, SyntaxNode, SyntaxToken};

Let’s define VariableDef::value by finding the first child node that is an expression. For this we can use the non-existent Expr::cast function:

impl VariableDef {
    // snip

    pub fn value(&self) -> Option<Expr> {
        self.0.children().find_map(Expr::cast)
    }
}

To continue we have to define Expr:

#[derive(Debug)]
pub enum Expr {
    BinaryExpr(BinaryExpr),
    Literal(Literal),
    ParenExpr(ParenExpr),
    UnaryExpr(UnaryExpr),
    VariableRef(VariableRef),
}

Let’s define each of its variant types:

#[derive(Debug)]
pub struct BinaryExpr(SyntaxNode);

#[derive(Debug)]
pub struct Literal(SyntaxNode);

#[derive(Debug)]
pub struct ParenExpr(SyntaxNode);

#[derive(Debug)]
pub struct UnaryExpr(SyntaxNode);

#[derive(Debug)]
pub struct VariableRef(SyntaxNode);

The only remaining compile error is Expr::cast, which we can define:

impl Expr {
    pub fn cast(node: SyntaxNode) -> Option<Self> {
        let result = match node.kind() {
            SyntaxKind::InfixExpr => Self::BinaryExpr(BinaryExpr(node)),
            SyntaxKind::Literal => Self::Literal(Literal(node)),
            SyntaxKind::ParenExpr => Self::ParenExpr(ParenExpr(node)),
            SyntaxKind::PrefixExpr => Self::UnaryExpr(UnaryExpr(node)),
            SyntaxKind::VariableRef => Self::VariableRef(VariableRef(node)),
            _ => return None,
        };

        Some(result)
    }
}

We’re still missing an AST node, namely Stmt. Its definition is similar to Expr’s:

#[derive(Debug)]
pub enum Stmt {
    VariableDef(VariableDef),
    Expr(Expr),
}

impl Stmt {
    pub fn cast(node: SyntaxNode) -> Option<Self> {
        let result = match node.kind() {
            SyntaxKind::VariableDef => Self::VariableDef(VariableDef(node)),
            _ => Self::Expr(Expr::cast(node)?),
        };

        Some(result)
    }
}

Next, let’s add methods to BinaryExpr to extract the left-hand side, right-hand side and operator:

impl BinaryExpr {
    pub fn lhs(&self) -> Option<Expr> {
        self.0.children().find_map(Expr::cast)
    }

    pub fn rhs(&self) -> Option<Expr> {
        self.0.children().filter_map(Expr::cast).nth(1)
    }

    pub fn op(&self) -> Option<SyntaxToken> {
        self.0
            .children_with_tokens()
            .filter_map(SyntaxElement::into_token)
            .find(|token| {
                matches!(
                    token.kind(),
                    SyntaxKind::Plus | SyntaxKind::Minus | SyntaxKind::Star | SyntaxKind::Slash,
                )
            })
    }
}

UnaryExpr also needs equivalents of these:

impl UnaryExpr {
    pub fn expr(&self) -> Option<Expr> {
        self.0.children().find_map(Expr::cast)
    }

    pub fn op(&self) -> Option<SyntaxToken> {
        self.0
            .children_with_tokens()
            .filter_map(SyntaxElement::into_token)
            .find(|token| token.kind() == SyntaxKind::Minus)
    }
}

And ParenExpr could use a method that extracts the contained expression:

impl ParenExpr {
    pub fn expr(&self) -> Option<Expr> {
        self.0.children().find_map(Expr::cast)
    }
}

VariableRef and Literal too:

use smol_str::SmolStr;
use syntax::{SyntaxElement, SyntaxKind, SyntaxNode, SyntaxToken};

impl Literal {
    pub fn parse(&self) -> u64 {
        self.0.first_token().unwrap().text().parse().unwrap()
    }
}

// snip

impl VariableRef {
    pub fn name(&self) -> SmolStr {
        self.0.first_token().unwrap().text().clone()
    }
}

Let’s add the missing smol_str dependency:

# Cargo.toml

[dependencies]
smol_str = "0.1.17"
syntax = {path = "../syntax"}

Finally, we need to define the root of the AST:

#[derive(Debug)]
pub struct Root(SyntaxNode);

impl Root {
    pub fn cast(node: SyntaxNode) -> Option<Self> {
        if node.kind() == SyntaxKind::Root {
            Some(Self(node))
        } else {
            None
        }
    }
}

For any other code to be able to interact with the contents of Root, it needs to have methods that make its contents accessible. Since the parser currently consumes multiple statements, so we can make those available:

impl Root {
    // snip

    pub fn stmts(&self) -> impl Iterator<Item = Stmt> {
        self.0.children().filter_map(Stmt::cast)
    }
}

Let’s try this code out by printing the value of any variable definition we might see in the syntax tree when in the REPL:

// crates/parser/src/lib.rs

impl Parse {
    pub fn debug_tree(&self) -> String {
        let mut s = String::new();

        let tree = format!("{:#?}", self.syntax());

        // snip
    }

    pub fn syntax(&self) -> SyntaxNode {
        SyntaxNode::new_root(self.green_node.clone())
    }
}
// crates/eldiro/src/main.rs

fn main() -> io::Result<()> {
    // snip

    loop {
        // snip

        let parse = parse(&input);
        println!("{}", parse.debug_tree());

        let root = ast::Root::cast(parse.syntax()).unwrap();

        dbg!(root
            .stmts()
            .filter_map(|stmt| if let ast::Stmt::VariableDef(var_def) = stmt {
                Some(var_def.value())
            } else {
                None
            })
            .collect::<Vec<_>>());

        // snip
    }
}

The REPL now has to depend on ast:

# Cargo.toml

[dependencies]
ast = {path = "../ast"}
parser = {path = "../parser"}

We can now run the REPL:

$ cargo r -q
→ let one = 1 let two = one + one
# snip
[
    Some(
        Literal(
            Literal(
                Literal@10..12
                  Number@10..11 "1"
                  Whitespace@11..12 " "
                ,
            ),
        ),
    ),
    Some(
        BinaryExpr(
            BinaryExpr(
                InfixExpr@22..32
                  VariableRef@22..26
                    Ident@22..25 "one"
                    Whitespace@25..26 " "
                  Plus@26..27 "+"
                  Whitespace@27..28 " "
                  VariableRef@28..32
                    Ident@28..31 "one"
                    Whitespace@31..32 "\n"
                ,
            ),
        ),
    ),
]

It’s extracted that 1 literal from the first variable, as well as the binary expression from the second variable.

Note how each of the two expressions our REPL is printing out is wrapped in a Some – this is because it’s possible for a variable definition to not have a value. Although this isn’t allowed by the language, it still happens because the parser recovers from the error and marches on. Let’s see if we get a None when we type in a variable definition that’s missing a value:

→ let a = let b = 1 let c =
# snip
[
    None,
    Some(
        Literal(
            Literal(
                Literal@16..18
                  Number@16..17 "1"
                  Whitespace@17..18 " "
                ,
            ),
        ),
    ),
    None,
]

As expected, since the first and last variable definitions are missing values, we get None, Some and None.

Implementing a HIR

Now that we have completed the implementation of a lossless AST, we can move on to implementing a HIR. In case you’ve forgotten, a HIR is a lowered, abstract representation of code that doesn’t include syntactic details and doesn’t have to match the input text exactly. As with the AST, we’ll start by letting Cargo generate a new crate for us:

$ cargo new --lib crates/hir
     Created library `crates/hir` package

Let’s start with statements and variable definitions:

// crates/hir/src/lib.rs

use smol_str::SmolStr;

#[derive(Debug)]
pub enum Stmt {
    VariableDef { name: SmolStr, value: Expr },
    Expr(Expr),
}
# Cargo.toml

[dependencies]
smol_str = "0.1.17"

Next we’ll define Expr:

#[derive(Debug)]
pub enum Expr {
    Binary { op: BinaryOp, lhs: Self, rhs: Self },
    Literal { n: u64 },
    Unary { op: UnaryOp, expr: Self },
    VariableRef { var: SmolStr },
}

#[derive(Debug)]
pub enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

#[derive(Debug)]
pub enum UnaryOp {
    Neg,
}

We now have the same problem as we had in Part Eight; namely, we have a type that contains itself, and thus has an infinite size. We can fix this by wrapping all recursive usages of Expr in Box, which has a fixed size. This eliminates the recursion:

#[derive(Debug)]
pub enum Expr {
    Binary {
        op: BinaryOp,
        lhs: Box<Self>,
        rhs: Box<Self>,
    },
    Literal {
        n: u64,
    },
    Unary {
        op: UnaryOp,
        expr: Box<Self>,
    },
    VariableRef {
        var: SmolStr,
    },
}

This HIR is missing a critical component of our Rowan-based AST: the ability to represent incorrect parser inputs. We could represent the possibility of, say, a binary expression missing its right-hand side by wrapping rhs in an Option. We’d have to do this at every usage of Expr, though. Instead, we’ll add a Missing variant to Expr, so all usages of Expr are effectively optional:

#[derive(Debug)]
pub enum Expr {
    Missing,
    // snip
}

There’s only one other part of the HIR that can be missing from the AST: VariableDef’s name field. A VariableDef node can be missing a name when the user has only typed the let keyword, since let is enough to trigger the creation of a VariableDef. In spite of this, we shouldn’t make name optional, because the HIR is used for features like type inference and IDE indexing that rely on the presence of names. If we made names optional, we’d have to handle the case where they don’t exist again and again.

Now that the HIR’s basic data structures are in place, we can implement lowering:

# Cargo.toml

[dependencies]
ast = {path = "../ast"}
smol_str = "0.1.17"
syntax = {path = "../syntax"}
// lib.rs

use smol_str::SmolStr;
use syntax::SyntaxKind;

// snip

impl Stmt {
    fn lower(ast: ast::Stmt) -> Option<Self> {
        let result = match ast {
            ast::Stmt::VariableDef(ast) => Self::VariableDef {
                name: ast.name()?.text().clone(),
                value: Expr::lower(ast.value()),
            },
            ast::Stmt::Expr(ast) => Self::Expr(Expr::lower(Some(ast))),
        };

        Some(result)
    }
}

// snip

impl Expr {
    fn lower(ast: Option<ast::Expr>) -> Self {
        if let Some(ast) = ast {
            match ast {
                ast::Expr::BinaryExpr(ast) => Self::lower_binary(ast),
                ast::Expr::Literal(ast) => Self::Literal { n: ast.parse() },
                ast::Expr::ParenExpr(ast) => Expr::lower(ast.expr()),
                ast::Expr::UnaryExpr(ast) => Self::lower_unary(ast),
                ast::Expr::VariableRef(ast) => Self::VariableRef { var: ast.name() },
            }
        } else {
            Self::Missing
        }
    }

    fn lower_binary(ast: ast::BinaryExpr) -> Self {
        let op = match ast.op().unwrap().kind() {
            SyntaxKind::Plus => BinaryOp::Add,
            SyntaxKind::Minus => BinaryOp::Sub,
            SyntaxKind::Star => BinaryOp::Mul,
            SyntaxKind::Slash => BinaryOp::Div,
            _ => unreachable!(),
        };

        Self::Binary {
            op,
            lhs: Box::new(Expr::lower(ast.lhs())),
            rhs: Box::new(Expr::lower(ast.rhs())),
        }
    }

    fn lower_unary(ast: ast::UnaryExpr) -> Self {
        let op = match ast.op().unwrap().kind() {
            SyntaxKind::Minus => UnaryOp::Neg,
            _ => unreachable!(),
        };

        Self::Unary {
            op,
            expr: Box::new(Expr::lower(ast.expr())),
        }
    }
}

All that’s left to do for lowering is to write a lower function that takes the root of an AST, and lowers it to a Vec<Stmt>:

pub fn lower(ast: ast::Root) -> impl Iterator<Item = Stmt> {
    ast.stmts().filter_map(Stmt::lower)
}

Let’s try lowering out by making the REPL print all HIR elements it can find in the input:

# crates/eldiro/Cargo.toml

[dependencies]
ast = {path = "../ast"}
hir = {path = "../hir"}
parser = {path = "../parser"}
// crates/eldiro/src/main.rs

fn main() -> io::Result<()> {
    // snip

    loop {
        // snip

        let root = ast::Root::cast(parse.syntax()).unwrap();

        dbg!(/* snip */);

        dbg!(hir::lower(root).collect::<Vec<_>>());

        // snip
    }
}
$ cargo r -q
→ let a = 10
# snip
[
    VariableDef {
        name: "a",
        value: Literal {
            n: 10,
        },
    },
]
→ 5 + a * 100
# snip
[
    Expr(
        Binary {
            op: Add,
            lhs: Literal {
                n: 5,
            },
            rhs: Binary {
                op: Mul,
                lhs: VariableRef {
                    var: "a",
                },
                rhs: Literal {
                    n: 100,
                },
            },
        },
    ),
]

Migrating to an arena

We currently use Box to add indirection where Expr recurses. Let’s take a look at the layout of Expr in memory:

A visualisation of Expr’s layout in memory

Each of those arrows is a pointer to another location in memory, one that could be far away; there is a significant chance that the HIR elements are spread out across the heap. This is bad for performance, because it means the CPU can’t keep them all in the cache at the same time.

We can instead store each Expr in a Vec, and index into this Vec instead of using Boxes to reference the subtrees. Here’s a visualisation of the same Expr’s memory layout with this setup:

A visualisation of Expr’s memory layout when storing subtrees in a Vector

Since Vec’s storage is contiguous, we know that each Expr will be right next to all the others, encouraging caching and increasing performance. A secondary benefit of this approach is that the indices into the Vec don’t necessarily have to be usizes – we could use u32s instead to save memory, since it’s unlikely we’ll have more than 232 – 1 = four billion Expr subtrees.

A Vec gives us too much control over the subtrees; all we want is to allocate another subtree, and get its index. Luckily, there is a data structure that does this job: the arena. There are lots of arena implementations in Rust. Each one that I have seen has more complexity than the basic Vec-like structure I described earlier. In some cases this is to allow the allocation of more than usize::MAX elements; in other cases the complexity allows removing elements one by one without invalidating existing indexes; in others still, it allows customisation of the index type (e.g. usize).

Let’s write our own, as-simple-as-possible arena implementation. We’ll start with a new crate:

$ cargo new --lib crates/arena
     Created library `crates/arena` package
// crates/arena/src/lib.rs

#[derive(Debug)]
pub struct Arena<T> {
    data: Vec<T>,
}

We need methods to create new arenas and to allocate values on the arena:

impl<T> Default for Arena<T> {
    fn default() -> Self {
        Self::new()
    }
}

impl<T> Arena<T> {
    pub fn new() -> Self {
        Self { data: Vec::new() }
    }

    pub fn alloc(&mut self, t: T) -> usize {
        let idx = self.next_idx();
        self.data.push(t);

        idx
    }

    fn next_idx(&self) -> usize {
        self.data.len()
    }
}

We also need to allow indexing into the arena to retrieve values:

use std::ops::Index;

// snip

impl<T> Index<usize> for Arena<T> {
    type Output = T;

    fn index(&self, idx: usize) -> &Self::Output {
        &self.data[idx]
    }
}

Let’s employ that index size optimisation from earlier, where we use u32s for indexes instead of usizes:

impl<T> Arena<T> {
    // snip

    pub fn alloc(&mut self, t: T) -> u32 {
        let idx = self.next_idx();
        self.data.push(t);

        idx
    }

    fn next_idx(&self) -> u32 {
        self.data.len() as u32
    }
}

impl<T> Index<u32> for Arena<T> {
    type Output = T;

    fn index(&self, idx: u32) -> &Self::Output {
        &self.data[idx as usize]
    }
}

We’ve created a new problem for ourselves by doing this, though: when reading code, usizes are often used as indexes, so we can be reasonably sure what their purpose is. However, now that we’re using u32s, it can get confusing – if you see

// This is just an example; don’t write this!

struct BinaryExpr {
    lhs: u32,
    rhs: u32,
    op: BinaryOp,
}

in a codebase, it isn’t clear what the usage of that u32 looks like. We can alleviate this problem by wrapping the u32 in a newtype:

#[derive(Debug)]
pub struct Idx {
    raw: u32,
}

We can replace all usages of u32 with Idx and add a little bit of glue:

impl<T> Arena<T> {
    // snip

    pub fn alloc(&mut self, t: T) -> Idx {
        let idx = self.next_idx();
        self.data.push(t);

        idx
    }

    fn next_idx(&self) -> Idx {
        Idx {
            raw: self.data.len() as u32,
        }
    }
}

impl<T> Index<Idx> for Arena<T> {
    type Output = T;

    fn index(&self, idx: Idx) -> &Self::Output {
        &self.data[idx.raw as usize]
    }
}

Let’s take a look at that example from earlier again with this change:

// This is just an example; don’t write this!

struct BinaryExpr {
    lhs: Idx,
    rhs: Idx,
    op: BinaryOp,
}

It’s better, but it’s still unclear. What type do you get as a result of indexing? Let’s add a type parameter to Idx to improve this further:

#[derive(Debug)]
pub struct Idx<T> {
    raw: u32,
    _phantom: PhantomData<fn() -> T>,
}

PhantomData is a way of telling the Rust compiler that, although this type may not actually use the T type parameter, it still should have the parameter and shouldn’t have an ‘unused type parameter’ error. We don’t use PhantomData<T> in this case, because that communicates to the compiler that Idx owns a T, which can affect drop check. PhantomData<fn() -> T> avoids this.

Let’s update all of Idx’s usages:

impl<T> Arena<T> {
    // snip

    pub fn alloc(&mut self, t: T) -> Idx<T> {
        let idx = self.next_idx();
        self.data.push(t);

        idx
    }

    fn next_idx(&self) -> Idx<T> {
        Idx {
            raw: self.data.len() as u32,
            _phantom: PhantomData,
        }
    }
}

impl<T> Index<Idx<T>> for Arena<T> {
    type Output = T;

    fn index(&self, idx: Idx<T>) -> &Self::Output {
        &self.data[idx.raw as usize]
    }
}

Now that our arena implementation is complete, we can make use of it in the HIR:

# crates/hir/Cargo.toml

[dependencies]
arena = {path = "../arena"}
ast = {path = "../ast"}
smol_str = "0.1.17"
syntax = {path = "../syntax"}

The first thing we need is a type to store the arena:

// lib.rs

mod database;
pub use database::Database;
// crates/hir/src/database.rs

use crate::Expr;
use arena::Arena;

#[derive(Debug, Default)]
pub struct Database {
    exprs: Arena<Expr>,
}

Let’s replace Box<Expr> with Idx<Expr>:

// lib.rs

use arena::Idx;
use smol_str::SmolStr;
use syntax::SyntaxKind;

// snip

#[derive(Debug)]
pub enum Expr {
    Missing,
    Binary {
        op: BinaryOp,
        lhs: Idx<Self>,
        rhs: Idx<Self>,
    },
    Literal {
        n: u64,
    },
    Unary {
        op: UnaryOp,
        expr: Idx<Self>,
    },
    VariableRef {
        var: SmolStr,
    },
}

We can make this a little cleaner by creating a type alias for Idx<Expr>:

type ExprIdx = Idx<Expr>;

// snip

#[derive(Debug)]
pub enum Expr {
    Missing,
    Binary {
        op: BinaryOp,
        lhs: ExprIdx,
        rhs: ExprIdx,
    },
    Literal {
        n: u64,
    },
    Unary {
        op: UnaryOp,
        expr: ExprIdx,
    },
    VariableRef {
        var: SmolStr,
    },
}

All the lowering methods need to take a &mut Database as a parameter:

impl Stmt {
    fn lower(ast: ast::Stmt, db: &mut Database) -> Option<Self> {
        let result = match ast {
            ast::Stmt::VariableDef(ast) => Self::VariableDef {
                name: ast.name()?.text().clone(),
                value: Expr::lower(ast.value(), db),
            },
            ast::Stmt::Expr(ast) => Self::Expr(Expr::lower(Some(ast), db)),
        };

        Some(result)
    }
}

impl Expr {
    fn lower(ast: Option<ast::Expr>, db: &mut Database) -> Self {
        if let Some(ast) = ast {
            match ast {
                ast::Expr::BinaryExpr(ast) => Self::lower_binary(ast, db),
                // snip
                ast::Expr::ParenExpr(ast) => Expr::lower(ast.expr(), db),
                ast::Expr::UnaryExpr(ast) => Self::lower_unary(ast, db),
                // snip
            }
        } else {
            // snip
        }
    }

    fn lower_binary(ast: ast::BinaryExpr, db: &mut Database) -> Self {
        // snip

        Self::Binary {
            op,
            lhs: db.exprs.alloc(Expr::lower(ast.lhs(), db)),
            rhs: db.exprs.alloc(Expr::lower(ast.rhs(), db)),
        }
    }

    fn lower_unary(ast: ast::UnaryExpr, db: &mut Database) -> Self {
        // snip

        Self::Unary {
            op,
            expr: db.exprs.alloc(Expr::lower(ast.expr(), db)),
        }
    }
}

We’ve got a code smell here: we’re threading the same value through each and every method. This is a sign that these methods should be impled on the type of that value, which in this case is &mut Database. Let’s change the lowering code to be methods on Database:

// database.rs

use crate::{BinaryOp, Expr, Stmt, UnaryOp};
use arena::Arena;
use syntax::SyntaxKind;

// snip

impl Database {
    pub(crate) fn lower_stmt(&mut self, ast: ast::Stmt) -> Option<Stmt> {
        let result = match ast {
            ast::Stmt::VariableDef(ast) => Stmt::VariableDef {
                name: ast.name()?.text().clone(),
                value: self.lower_expr(ast.value()),
            },
            ast::Stmt::Expr(ast) => Stmt::Expr(self.lower_expr(Some(ast))),
        };

        Some(result)
    }

    pub(crate) fn lower_expr(&mut self, ast: Option<ast::Expr>) -> Expr {
        if let Some(ast) = ast {
            match ast {
                ast::Expr::BinaryExpr(ast) => self.lower_binary(ast),
                ast::Expr::Literal(ast) => Expr::Literal { n: ast.parse() },
                ast::Expr::ParenExpr(ast) => self.lower_expr(ast.expr()),
                ast::Expr::UnaryExpr(ast) => self.lower_unary(ast),
                ast::Expr::VariableRef(ast) => Expr::VariableRef { var: ast.name() },
            }
        } else {
            Expr::Missing
        }
    }

    fn lower_binary(&mut self, ast: ast::BinaryExpr) -> Expr {
        let op = match ast.op().unwrap().kind() {
            SyntaxKind::Plus => BinaryOp::Add,
            SyntaxKind::Minus => BinaryOp::Sub,
            SyntaxKind::Star => BinaryOp::Mul,
            SyntaxKind::Slash => BinaryOp::Div,
            _ => unreachable!(),
        };

        Expr::Binary {
            op,
            lhs: self.exprs.alloc(self.lower_expr(ast.lhs())),
            rhs: self.exprs.alloc(self.lower_expr(ast.rhs())),
        }
    }

    fn lower_unary(&mut self, ast: ast::UnaryExpr) -> Expr {
        let op = match ast.op().unwrap().kind() {
            SyntaxKind::Minus => UnaryOp::Neg,
            _ => unreachable!(),
        };

        Expr::Unary {
            op,
            expr: self.exprs.alloc(self.lower_expr(ast.expr())),
        }
    }
}

Finally, we need to update hir::lower to call Database::lower_stmt:

// lib.rs

pub fn lower(ast: ast::Root) -> (Database, impl Iterator<Item = Stmt>) {
    let mut db = Database::default();
    (db, ast.stmts().filter_map(|stmt| db.lower_stmt(stmt)))
}

Note how we return the database so that the user of the function can use the otherwise-opaque arena indexes. This function doesn’t compile though, because returning the iterator means that the closure passed to filter_map won’t be run until the caller decides to consume the iterator. Rust can’t verify that db will still be alive when the closure is run, so it gives us an error. We can fix this by collecting the iterator to a Vec:

pub fn lower(ast: ast::Root) -> (Database, Vec<Stmt>) {
    let mut db = Database::default();
    let stmts = ast.stmts().filter_map(|stmt| db.lower_stmt(stmt)).collect();

    (db, stmts)
}

Next, we have several errors caused by borrowing mutably more than once at a time. These can be fixed by extracting the first mutable borrows to temporary variables:

// database.rs

impl Database {
    // snip

    fn lower_binary(&mut self, ast: ast::BinaryExpr) -> Expr {
        // snip

        let lhs = self.lower_expr(ast.lhs());
        let rhs = self.lower_expr(ast.rhs());

        Expr::Binary {
            op,
            lhs: self.exprs.alloc(lhs),
            rhs: self.exprs.alloc(rhs),
        }
    }

    fn lower_unary(&mut self, ast: ast::UnaryExpr) -> Expr {
        // snip

        let expr = self.lower_expr(ast.expr());

        Expr::Unary {
            op,
            expr: self.exprs.alloc(expr),
        }
    }
}

The last remaining error stems from us trying to collect the result of hir::lower in the REPL. Let’s remove this:

// crates/eldiro/src/main.rs

fn main() -> io::Result<()> {
    // snip

    loop {
        // snip

        dbg!(hir::lower(root));

        input.clear();
    }
}

We also have an unused import that we should delete:

// crates/hir/src/lib.rs

mod database;
pub use database::Database;

use arena::Idx;
use smol_str::SmolStr;

The arena we’ve created is very similar to la-arena, the arena used in rust-analyzer, which is published on crates.io. In fact, la-arena’s implementation is essentially the same as ours, with a few more bells and whistles.2 Let’s delete our implementation, and use la-arena instead:

$ rm -r crates/arena
# Cargo.toml

[dependencies]
ast = {path = "../ast"}
la-arena = "0.2.0"
smol_str = "0.1.17"
syntax = {path = "../syntax"}
// lib.rs

mod database;
pub use database::Database;

use la_arena::Idx;
use smol_str::SmolStr;
// database.rs

use crate::{BinaryOp, Expr, Stmt, UnaryOp};
use la_arena::Arena;
use syntax::SyntaxKind;

Conclusion

I hope you’re enjoying the series; the next part will be a shorter one focusing on tests.


  1. An example here is Expand Selection. ↩︎

  2. la-arena has more methods (such as len and iter), allows the creation of arbitrary Idx values, and has a customised Debug implementation that is easier to read. ↩︎