Structure of the Compiler

SICP > Computing with Register Machines > Compilation > Structure of the Compiler

In section we modified our original metacircular interpreter to separate analysis from execution. We analyzed each expression to produce an execution procedure that took an environment as argument and performed the required operations. In our compiler, we will do essentially the same analysis. Instead of producing execution procedures, however, we will generate sequences of instructions to be run by our register machine.
The procedure compile is the top-level dispatch in the compiler. It corresponds to the eval procedure of section , the analyze procedure of section , and the eval-dispatch entry point of the explicit-control-evaluator in section . The compiler, like the interpreters, uses the expression-syntax procedures defined in section . Compile performs a case analysis on the syntactic type of the expression to be compiled. For each type of expression, it dispatches to a specialized code generator:
(define (compile exp target linkage)
  (cond ((self-evaluating? exp)
         (compile-self-evaluating exp target linkage))
        ((quoted? exp) (compile-quoted exp target linkage))
        ((variable? exp)
         (compile-variable exp target linkage))
        ((assignment? exp)
         (compile-assignment exp target linkage))
        ((definition? exp)
         (compile-definition exp target linkage))
        ((if? exp) (compile-if exp target linkage))
        ((lambda? exp) (compile-lambda exp target linkage))
        ((begin? exp)
         (compile-sequence (begin-actions exp)
                           target
                           linkage))
        ((cond? exp) (compile (cond->if exp) target linkage))
        ((application? exp)
         (compile-application exp target linkage))
        (else
         (error "Unknown expression type - COMPILE" exp))))
Targets and linkages
Compile and the code generators that it calls take two arguments in addition to the expression to compile. There is a target, which specifies the register in which the compiled code is to return the value of the expression. There is also a linkage descriptor, which describes how the code resulting from the compilation of the expression should proceed when it has finished its execution. The linkage descriptor can require that the code do one of the following three things:

continue at the next instruction in sequence (this is specified by the linkage descriptor next),

return from the procedure being compiled (this is specified by the linkage descriptor return), or

jump to a named entry point (this is specified by using the designated label as the linkage descriptor).

For example, compiling the expression 5 (which is self-evaluating) with a target of the val register and a linkage of next should produce the instruction
(assign val (const 5))
Compiling the same expression with a linkage of return should produce the instructions
(assign val (const 5))
(goto (reg continue))
In the first case, execution will continue with the next instruction in the sequence. In the second case, we will return from a procedure call. In both cases, the value of the expression will be placed into the target val register.
Instruction sequences and stack usage
Each code generator returns an instruction sequence containing the object code it has generated for the expression. Code generation for a compound expression is accomplished by combining the output from simpler code generators for component expressions, just as evaluation of a compound expression is accomplished by evaluating the component expressions.
The simplest method for combining instruction sequences is a procedure called append-instruction-sequences. It takes as arguments any number of instruction sequences that are to be executed sequentially; it appends them and returns the combined sequence. That is, if seq₁ and seq₂ are sequences of instructions, then evaluating
(append-instruction-sequences seq₁ seq₂)
produces the sequence
seq₁
seq₂
Whenever registers might need to be saved, the compiler's code generators use preserving, which is a more subtle method for combining instruction sequences. Preserving takes three arguments: a set of registers and two instruction sequences that are to be executed sequentially. It appends the sequences in such a way that the contents of each register in the set is preserved over the execution of the first sequence, if this is needed for the execution of the second sequence. That is, if the first sequence modifies the register and the second sequence actually needs the register's original contents, then preserving wraps a save and a restore of the register around the first sequence before appending the sequences. Otherwise, preserving simply returns the appended instruction sequences. Thus, for example,
(preserving (list reg₁ reg₂) seq₁ seq₂)
produces one of the following four sequences of instructions, depending on how seq₁ and seq₂ use reg₁ and reg₂:

$\begin{Stabular}{c\vert c\vert c\vert c} \parbox{.4in} {\an{$seq_1$ }\hfill \\ ... ...\ {\tt (restore\ \an{$reg_2$ })}\hfill \\ \an{$seq_2$ } \hfill} \end{Stabular}$

By using preserving to combine instruction sequences the compiler avoids unnecessary stack operations. This also isolates the details of whether or not to generate save and restore instructions within the preserving procedure, separating them from the concerns that arise in writing each of the individual code generators. In fact no save or restore instructions are explicitly produced by the code generators.
In principle, we could represent an instruction sequence simply as a list of instructions. Append-instruction-sequences could then combine instruction sequences by performing an ordinary list append. However, preserving would then be a complex operation, because it would have to analyze each instruction sequence to determine how the sequence uses its registers. Preserving would be inefficient as well as complex, because it would have to analyze each of its instruction sequence arguments, even though these sequences might themselves have been constructed by calls to preserving, in which case their parts would have already been analyzed. To avoid such repetitious analysis we will associate with each instruction sequence some information about its register use. When we construct a basic instruction sequence we will provide this information explicitly, and the procedures that combine instruction sequences will derive register-use information for the combined sequence from the information associated with the component sequences.
An instruction sequence will contain three pieces of information:

the set of registers that must be initialized before the instructions in the sequence are executed (these registers are said to be needed by the sequence),

the set of registers whose values are modified by the instructions in the sequence, and

the actual instructions (also called statements) in the sequence.

We will represent an instruction sequence as a list of its three parts. The constructor for instruction sequences is thus
(define (make-instruction-sequence needs modifies statements)
  (list needs modifies statements))
For example, the two-instruction sequence that looks up the value of the variable x in the current environment, assigns the result to val, and then returns, requires registers env and continue to have been initialized, and modifies register val. This sequence would therefore be constructed as
(make-instruction-sequence '(env continue) '(val)
 '((assign val
           (op lookup-variable-value) (const x) (reg env))
   (goto (reg continue))))
We sometimes need to construct an instruction sequence with no statements:
(define (empty-instruction-sequence)
  (make-instruction-sequence '() '() '()))
The procedures for combining instruction sequences are shown in section .
Exercise. In evaluating a procedure application, the explicit-control evaluator always saves and restores the env register around the evaluation of the operator, saves and restores env around the evaluation of each operand (except the final one), saves and restores argl around the evaluation of each operand, and saves and restores proc around the evaluation of the operand sequence. For each of the following combinations, say which of these save and restore operations are superfluous and thus could be eliminated by the compiler's preserving mechanism:
(f 'x 'y)


((f) 'x 'y)


(f (g 'x) y)


(f (g 'x) 'y)
Exercise. Using the preserving mechanism, the compiler will avoid saving and restoring env around the evaluation of the operator of a combination in the case where the operator is a symbol. We could also build such optimizations into the evaluator. Indeed, the explicit-control evaluator of section already performs a similar optimization, by treating combinations with no operands as a special case.

a. Extend the explicit-control evaluator to recognize as a separate class of expressions combinations whose operator is a symbol, and to take advantage of this fact in evaluating such expressions.

b. Alyssa P. Hacker suggests that by extending the evaluator to recognize more and more special cases we could incorporate all the compiler's optimizations, and that this would eliminate the advantage of compilation altogether. What do you think of this idea?

Previous: Compilation

Next: Compiling Expressions

webmaster@arsdigita.org