编译器:抽象语法树和解释器

发布于 作者: Ethan

引言

本文是教授maxsnew的课堂笔记的译文。

扩展我们的语言:基本计算

Last time, we considered the following miniscule language:

上次,我们考虑了以下极小的语言:

<expr>: NUMBER

Our Rust representation of this program was simply an n: i64 and our compiler simply placed that integer in the appropriate place in the assembly:

这个程序的 Rust 表示只是一个 n: i64 ,我们的编译器只是将这个整数放在汇编的适当位置:

section .text
global start_here
start_here:
  mov rax, n
  ret

Let’s extend our language so that it does some actual computation, adding increment and decrement operations. Say, that we have access to some kind of functions add1 and sub1. Here are a few examples of programs that we might write in this language, and what answer we expect them to evaluate to:

让我们扩展我们的语言,使其能够进行实际计算,添加自增和自减操作。假设我们有某种函数 add1 和 sub1 。这里是一些我们可能用这种语言编写的程序的例子,以及我们期望它们计算的结果:

Concrete Syntax(具体语法) Answer(回答)
42 42
add1(42) 43
sub1(42) 41
sub1(add1(add1(42))) 43

Now how can we go about implementing this language? The first step is to take in our program as an input string and (1) check that it is a well-formed program and (2) rewrite it into a more convenient representation to use.

那么我们该如何实现这门语言呢?第一步是将程序作为输入字符串接收,并(1)检查它是否是一个结构良好的程序,以及(2)将其重写为一个更方便使用的表示形式。

Concrete Syntax 具体语法

How programs are written as strings is called the concrete syntax of the programming language. We can describe the syntax for our Snake language using a grammar like the following:

程序如何以字符串形式编写被称为编程语言的具体语法。我们可以使用如下语法来描述我们的 Snake 语言的语法:

<expr>: 
           | NUMBER
           | add1 ( <expr> )
           | sub1 ( <expr> )

A grammar like this describes the concrete syntax of a language by specifying how you could generate the terms in the language. In this case we have one form called expr and to create an expr we either pick a number, or we call an add1 or sub1 function on another expr, with the argument wrapped in parentheses. By convention, we ignore whitespace in this description. We’ll keep this all a bit informal for now and make it completely rigorous in the last month of the course.

这种语法通过指定如何生成语言中的术语来描述语言的具象语法。在这种情况下,我们有一个形式称为 expr ,要创建一个 expr ,我们可以选择一个数字,或者对另一个 expr 调用 add1 或 sub1 函数,参数用括号括起来。按照惯例,我们在这种描述中忽略空格。目前我们保持这种描述比较非正式,并在课程的最后一个月使其完全严谨。

Even for a simple language like this, writing a program that checks if a given string conforms to this grammar is fairly tedious. Thankfully there are widely available programs called parser generators that will generate code for a parser from fairly simple descriptions like this. In this course we use one called LALRPOP (pronounced kind of like "lollipop"). A LALRPOP implementation of the grammar for this language is given by:

即使是这种简单的语言,编写一个检查给定字符串是否符合这种语法的程序也相当繁琐。幸运的是,有广泛使用的称为解析器生成器的程序,可以从这种相对简单的描述生成解析器代码。在本课程中,我们使用一个称为 LALRPOP(发音有点像“lollipop”)的程序。这种语言的 LALRPOP 语法实现如下:

pub Expr = {
    <Num>,
    "add1" "(" <Expr> ")",
    "sub1" "(" <Expr> ")",
}

Num = {
  <r"[+-]?[0-9]+">
}

Which is almost exactly what we’ve written above, except that it’s more precise in defining a description of the syntax for numbers. Running the lalrpop tool on this code produces 500 lines of fairly inscrutable Rust code we are happy not to have to write by hand.

这几乎与我们上面所写的内容完全一致,只是它在定义数字的语法描述方面更加精确。运行 lalrpop 工具在这段代码上会产生 500 行相当难理解的 Rust 代码,我们很高兴不需要手动编写这些代码。

Abstract Syntax and Parsing 抽象语法和解析

The lalrpop program above doesn’t really define a parser, but instead a recognizer, i.e., a function that takes in an input string and tell us whether or not the input string matches our language’s grammar. A recognizer alone is not very useful for a programming language implementation. After we have identified the input string is in the language, we are still left with just a string, which is not a very convenient input type to writing an interpreter or a compiler. When we write the remainder of the implementation, we would prefer to use a representation where we can very easily answer some basic questions:

上面提到的 lalrpop 程序并没有真正定义一个解析器,而是一个识别器,即一个函数,它接收一个输入字符串并告诉我们这个输入字符串是否匹配我们的语言的语法。一个识别器本身对于编程语言实现来说并不是很有用。在我们确定输入字符串属于该语言之后,我们仍然只得到一个字符串,这并不是一个很方便的输入类型,用于编写解释器或编译器。在编写剩余的实现部分时,我们更倾向于使用一种能够非常容易地回答一些基本问题的表示方式:

  • Is the expression a number or an operation? 这个表达式是一个数字还是一个操作?
  • If it’s a number, what is that number represented as an i64? 如果是数字,那么它应该以 i64 的形式表示?
  • If it’s an operation, is it an add1 or a sub1 and what is the sub-expression? 如果是运算符,它是 add1 还是 sub1 ,以及它的子表达式是什么?

We could answer these questions using just a string as our representation, but it would be tedious and involve scanning the string, which is what the recognizer just did anyway! For more complex languages, this becomes completely infeasible.

我们可以仅使用字符串作为表示方式来回答这些问题,但这会很繁琐,并且需要扫描字符串,而识别器刚刚已经做了同样的事情!对于更复杂的语言,这种方式完全不可行。

Instead what we’ll do is represent our programs as special kinds of trees called abstract syntax trees. For this language, a tree is either a leaf, which should have a number, or a node which should be labeled as add1 or sub1 and have one child. One of the main reasons we use Rust in this course is that Rust has very good support for defining such tree types:

相反,我们将以抽象语法树这种特殊类型的树来表示我们的程序。对于这种语言,树要么是叶节点,应该包含一个数字,要么是节点,应该标记为 add1 或 sub1 ,并且只有一个子节点。我们在这门课程中使用 Rust 的主要原因之一是 Rust 对定义这种树类型有非常好的支持:

enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

This is a Rust enum type, so-called because it is a type defined by enumerating the possible values. You can read more about enum types in Rust book chapter 6. This definitions matches our abstract description of the grammar: either the input is a number, or an add1 operation performed on another expression, or a sub1 operation performed on another.

这是一个 Rust 的 enum 类型,之所以称为 enum 类型,是因为它是一个通过列举可能值来定义的类型。你可以在 Rust 书籍的第 6 章中了解更多关于 enum 类型的信息。这个定义与我们对语法的抽象描述相匹配:输入要么是一个数字,要么是在另一个表达式上执行的 add1 操作,要么是在另一个表达式上执行的 sub1 操作。

Here are a couple of examples of programs and how to construct their corresponding abstract syntax trees:

这里有一些程序示例以及如何构建它们对应的抽象语法树:

Concrete Syntax(具体语法) Abstract Syntax(抽象语法)
5 Number(5)
sub1(add1(sub1(17))) Sub1(Box::new(Add1(Box::new(Sub1(Box::new(Number(5)))))))

But notice how we’ve abstracted away the details from the concrete syntax. This same abstract syntax could apply to many different concrete syntaxes. For example, we could use an imperative syntax

但我们注意到,我们已经从具体语法中抽象出了细节。同样的抽象语法可以适用于许多不同的具体语法。例如,我们可以使用命令式语法

17;
sub1;
add1;
sub1

or a lisp-style syntax 或是一种 Lisp 风格的语法

(sub1 (add1 (sub1 17)))

but almost none of those details are relevant to the rest of compilation.

但几乎这些细节与编译的其余部分无关。

To extract an abstract syntax tree from our input string, we update our lalrpop file to generate an output expression in addition to specifying the grammar:

要从我们的输入字符串中提取抽象语法树,我们更新我们的 lalrpop 文件,除了指定语法外,还生成一个输出表达式:

pub Expr: Expression = {
    <n: Num> => Expression::Number(n),
    "add1" "(" <e: Expr> ")" => Expression::Add1(Box::new(e)),
    "sub1" "(" <e: Expr> ")" => Expression::Sub1(Box::new(e)),
}

Num: i64 = {
  <s:r"[+-]?[0-9]+"> => s.parse().unwrap()
}

For this language, our type of abstract syntax trees is very simple, but this approach scales up to very complicated languages. The analogous enum in the Rust compiler (which is written in Rust) contains about 40 different cases!

对于这种语言,我们的抽象语法树类型非常简单,但这种方法可以扩展到非常复杂的语言。Rust 编译器(用 Rust 编写)中的类似 enum 结构包含大约 40 种不同的情况!

Semantics 语义

Before we define a compiler for our new language, we should first specify its semantics, i.e., how programs should be evaluated. A simple method for doing this is to define a corresponding interpreter. Like our previous language, the programs here should output a single integer number, but now add1 and sub1 should perform those mathematical operations.

在我们为新的语言定义编译器之前,我们应该首先明确其语义,即程序应该如何被评估。一种简单的方法是定义一个相应的解释器。与我们的前一个语言类似,这里的程序应该输出一个整数,但现在 add1 和 sub1 应该执行那些数学运算。

We can implement this functionality by using Rust’s pattern-matching feature, which allows us to perform a case-analysis of which constructor was used, and get access to the data contained within.

我们可以通过使用 Rust 的模式匹配功能来实现这一功能,该功能允许我们对所使用的构造函数进行情况分析,并获取其中包含的数据。

fn interpret(e: &Expression) -> i64 {
    match e {
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg) + 1,
        Expression::Sub1(arg) => interpret(arg) - 1,
    }
}

Exercise 练习

Why is the input to this interpreter a &Expression rather than a Expression? What difference does this decision make?

为什么这个解释器的输入是一个 &Expression 而不是 Expression ?这个决定有什么不同?

This pattern of programming is very common when we work with abstract syntax trees: we define a function by matching on the tree, and then recursively call the function on the sub-tree(s). 当我们处理抽象语法树时,这种编程模式非常常见:我们通过匹配树来定义一个函数,然后递归地调用该函数在子树(们)上。

Compilation 编译

Next, let’s instead write a compiler, where the additions and subtractions actually happen in assembly code rather than in our Rust implementation.

接下来,让我们编写一个编译器,其中加法和减法实际上在汇编代码中而不是在我们的 Rust 实现中发生。

x86 Basics x86 基础

Recall that we want to compile our program to a function in assembly code that takes no inputs and returns a 64-bit integer. The calling convention we use (System V AMD64) dictates that the result of such a function is stored in a register called rax. Registers like rax are a part of the abstract machine we program against in x86. Each of the general purpose registers rax, rcx, rdx, rbx, rdi, rsi, rsp, rbp, r8-r15 stores a 64-bit value and can be manipulated using a large variety of assembly code instructions. These registers are mostly indistinguishable except by conventions.

回想一下,我们希望将程序编译成一段汇编代码中的函数,该函数不接受输入并返回一个 64 位整数。我们使用的调用约定(System V AMD64)规定,此类函数的结果存储在一个名为 rax 的寄存器中。像 rax 这样的寄存器是我们编程时针对 x86 抽象机的一部分。每个通用寄存器 rax 、 rcx 、 rdx 、 rbx 、 rdi 、 rsi 、 rsp 、 rbp 、 r8 - r15 都存储一个 64 位值,并可通过多种汇编代码指令进行操作。这些寄存器除了遵循约定之外,基本上无法区分。

The only instructions we’ll need today are mov, add/sub and ret. First, the mov instruction

今天我们需要的指令只有 mov 、 add / sub 和 ret 。首先,是 mov 指令

mov dest, src

copies the value from src to a dest. mov is a surprisingly complex operation in x86, encompassing loads, stores, register-to-register copies, constants, and some memory offset calculations. In its full generality it is even Turing complete. For today let’s use it in a very simple form: dest can be a register and src can be a constant or another register, in which case it stores the value of the constant or the current value of src into dest.

将值从 src 复制到 dest 。 mov 在 x86 中是一个令人惊讶的复杂操作,包括加载、存储、寄存器到寄存器的复制、常量以及一些内存偏移计算。在其完全普遍性上,它甚至具有图灵完备性。今天我们使用一个非常简单的形式: dest 可以是一个寄存器, src 可以是一个常量或另一个寄存器,在这种情况下,它将常量的值或 src 的当前值存储到 dest 中。

Next, the two arithmetic instructions

接下来,两条算术指令

add dest, arg
sub dest, arg

continue the pattern with mov in that the first argument is updated by the instruction. The semantics of these instructions is to add (or sub) the value of arg from the current value of dest, and update dest with the result. I.e., you can think of add like a += operation and sub as a -= operation:

继续使用 mov 的模式,其中第一个参数由指令更新。这些指令的语义是将 arg 的值从 dest 的当前值中 add (或 sub ),并用结果更新 dest 。也就是说,你可以将 add 视为一个 += 操作,将 sub 视为一个 -= 操作:

add dest, arg ;;; dest += arg
sub dest, arg ;;; dest -= arg

Finally, we have

最后,我们有了

ret

Which, if the stack is set up accordingly, will return to the caller. We can be sure that ret will work properly to implement a function return as long as we make sure when it is executed that we have not updated rsp, and the value we want to return is in rax.

如果栈设置得当,这将返回到调用者。只要我们确保在执行时没有更新 rsp ,并且我们想要返回的值在 rax 中,我们就可以确信 ret 将正确地实现函数返回。

Compiling to x86 编译到 x86

We already know how to compile a number to a function that returns it. We simply mov the number in rax and then execute the ret. We can generalize this to an Adder expression like add1(sub1(add1(7))) by using the rax register as a place to store our intermediate results:

我们已经知道如何将一个数字编译成一个返回该数字的函数。我们只需将数字放入 rax 中,然后执行 ret 即可。我们可以通过使用 rax 寄存器作为存储中间结果的地方来将这种方法推广到像 add1(sub1(add1(7))) 这样的 Adder 表达式:

mov rax, 7 ;; rax holds 7
add rax, 1 ;; rax holds add1(7)
sub rax, 1 ;; rax holds sub1(add1(7))
add rax, 1 ;; rax holds add1(sub1(add1(7)))
ret        ;; return to the caller with rax holding the correct output

We have commented the above assembly code with the correspondence with the source program, and we see that in a way the assembly code is "reversed" from the original concrete syntax. Now how do we turn this implementation strategy into a compiler for abitrary expressions? Just like the interpreter, we can implement the compiler by recursive descent on the input expression. If we look at the comments above, we see a recursive pattern: after each line is executed, rax holds the result of a larger and larger sub-expression. Then we can develop a recursive strategy for implementation: compile the source program to a sequence of x86 instructions that places the result into rax. For constants this is simply a mov, and for the recursive cases of add1 and sub1, we can append a corresponding assembly operation. Finally, we append a ret instruction at the end.

我们已经用与源程序的对应关系对上述汇编代码进行了注释,我们看到汇编代码在某种程度上是“反转”了原始的具体语法。那么我们如何将这种实现策略变成一个任意表达式的编译器呢?就像解释器一样,我们可以通过递归下降输入表达式来实现编译器。如果我们看上面的注释,我们会看到一个递归模式:每执行一行, rax 中就保存了一个越来越大的子表达式的结果。然后我们可以开发一个递归的实现策略:将源程序编译成一系列将结果放入 rax 的 x86 指令。对于常量,这只是一个 mov ,而对于 add1 和 sub1 的递归情况,我们可以追加相应的汇编操作。最后,我们在最后追加一个 ret 指令。

Here’s an implementation that emits the assembly instructions directly to stdout:

这里是一个直接向 stdout 输出汇编指令的实现:

fn compile(e: &Expression) {
    fn compile_rec(e: &Expression) {
        match e {
            Expression::Number(n) => println!("mov rax, {}", n),
            Expression::Add1(arg) => {
                compile_rec(arg);
                println!("add rax, 1");
            }
            Expression::Sub1(arg) => {
                compile_rec(arg);
                println!("sub rax, 1");
            }
        }
    }
    println!(
        "        section .text
        global start_here
start_here:"
    );
    compile_rec(e);
    println!("ret");
}

Notice here that it would be counter-productive to directly use compile itself on the recursive sub-trees. Instead we define a helper function compile_rec that has a more useful behavior to just produce the instructions that move the value into rax.

请注意,直接在递归子树中使用 compile 本身是徒劳的。相反,我们定义了一个辅助函数 compile_rec ,它具有更实用的行为,只需生成将值移入 rax 的指令。

Having a compiler and an interpreter is also helpful for testing. We can write automated tests that check that our compiler and interpreter have the same behavior on all of our examples..

拥有编译器和解释器也有助于测试。我们可以编写自动化测试,检查我们的编译器和解释器在所有示例上的行为是否一致。

Optimization 优化

For this simple language there is an obvious way to produce an optimized assembly code program, one that uses as few instructions as possible. We can simply run the interpreter and compile to a program that returns the number that is output by the interpreter.

对于这种简单的语言,有一种明显的方法可以生成优化的汇编代码程序,即使用尽可能少的指令。我们可以直接运行解释器,并将其编译成一个返回解释器输出数字的程序。

fn optimized_compile(e: &Expression) {
    println!(
        "        section .text
        global start_here
start_here:
mov rax, {}
ret",
        interpret(&e)
    );
}

Here we have improved the efficiency of the compiled program by doing more work at compile-time. Usually this is a good tradeoff, as programs are run many more times than they are compiled.

在这里,我们通过在编译时做更多的工作来提高了编译程序的效率。通常这是一个很好的权衡,因为程序运行的次数远多于编译的次数。

扩展语言:动态确定的输入

Our programs were very easy to optimize because all the information we needed to determine the result was available at compile-time. Obviously this isn’t typical: useful programs interact with the external world, e.g., by making network requests, inspecting files or reading command-line arguments. Let’s extend our language to take an input from the command line.

我们的程序非常容易优化,因为我们确定结果所需的所有信息在编译时都是可用的。显然这不是典型情况:有用的程序会与外部世界交互,例如通过发起网络请求、检查文件或读取命令行参数。让我们扩展我们的语言以从命令行获取输入。

<prog>: def main ( x ) : <expr>
<expr>: 
           | NUMBER
           | x
           | add1 ( <expr> )
           | sub1 ( <expr> )
<prog>: def main ( x ) : <expr>
<expr>: 
           | NUMBER
| x
           | add1 ( <expr> )
           | sub1 ( <expr> )

Now our program consists not of a single expression, but a main "function" that takes in an argument named x. For today, let’s say that the parameter is always called x, we’ll talk about a more robust treatment of variables next time.

现在我们的程序不再是一个单独的表达式,而是一个名为 main "函数",它接受一个名为 x 的参数。今天,我们假设参数始终称为 x ,下次我们将讨论更健壮的变量处理方法。

What is the impact on our abstract syntax? We just need to add a new kind of leaf node to our abstract syntax trees that for when the program uses the input variable.

对我们的抽象语法有什么影响?我们只需要在我们的抽象语法树中添加一种新的叶节点,用于当程序使用输入变量时。

pub enum Expression {
    Variable(),
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

And now our interpreter takes an additional argument, which corresponds to the input:

现在我们的解释器增加了一个额外的参数,它对应于输入:

fn interpret(e: &Expression, x: i64) -> i64 {
    match e {
        Expression::Variable() => x,
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg, x) + 1,
        Expression::Sub1(arg) => interpret(arg, x) - 1,
    }
}

We correspondingly need to change our Rust stub.rs wrapper to provide an input argument:

我们需要相应地更改我们的 Rust stub.rs 包装器以提供输入参数:

#[link(name = "compiled_code", kind = "static")]
extern "sysv64" {
    #[link_name = "\x01entry"]
    fn entry(param: i64) -> i64;
}

fn main() {
    let args = std::env::args();
    if args.len() != 1 {
        eprintln!("usage: {} number");
    }
    let arg = args[1]
        .parse::<i64>()
        .expect("input must be a 64-bit integer");
    let output = unsafe { entry(arg) };
    println!("{}", output);
}

Now the external function entry is defined to take an integer in as an argument. If we consult the System V AMD64 calling convention, we find that the first input argument is placed in the register rdi. Then we can compile the Variable case quite similarly to Number, but moving from rdi rather than a constant:

现在外部函数 entry 被定义为接收一个整数作为参数。如果我们查阅 System V AMD64 调用约定,会发现第一个输入参数被放置在寄存器 rdi 中。然后我们可以非常相似地编译 Variable 情况,但移动的是 rdi 而不是一个常数:

fn compile(e: &Expression) {
    fn compile_rec(e: &Expression) {
        match e {
            Expression::Variable() => println!("mov rax, rdi"),
            Expression::Number(n) => println!("mov rax, {}", n),
            Expression::Add1(arg) => {
                compile_rec(arg);
                println!("add rax, 1");
            }
            Expression::Sub1(arg) => {
                compile_rec(arg);
                println!("sub rax, 1");
            }
        }
    }
    println!(
        "        section .text
            global start_here
    start_here:"
    );
    compile_rec(e);
    println!("ret");
}

Exercise 练习

How would you write an optimized version of this compiler, such that the output program always uses of at most 3 instructions? 如何编写这个编译器的优化版本,使得输出程序始终使用不超过 3 条指令?