Week 3: Compiler Register Allocator · CSA-201 · Virtus Cyber Academy Classroom

Your CSA-101 compiler emitted one stack slot per live variable. Lab 7.4 documented the bloat. This week you pay it back.

Reading

Required. Petzold, CODE, Ch 17 ("Automation") and Ch 24 ("Languages High and Low"). Ch 17 establishes what a register is and why it matters: Petzold's relay computer has a handful of fixed-purpose accumulators. Every value that does not fit in a register must go somewhere else. Ch 24 traces the arc from machine code to assemblers to high-level languages, and implicitly asks: if a program can emit machine code, can it be smarter than a human about register use? The register allocator answers yes.

Required. Patterson and Hennessy, Computer Organization and Design: RISC-V Edition, Appendix A (assemblers and linkers) sections on register conventions. The RISC-V calling convention (ABI) divides the 32 registers into caller-saved temporaries (t0-t6), callee-saved saved registers (s0-s11), argument/return registers (a0-a7), and reserved special-purpose registers (ra, sp, gp, tp, fp). A register allocator that understands the ABI can use the temporary registers aggressively without generating save/restore code at function boundaries.

Recommended. Cooper and Torczon, Engineering a Compiler, 2nd ed, Chapter 13 (register allocation; Belady's algorithm, graph coloring, linear scan). This is the reference text for production register allocation; you do not need to read it cover to cover. Skim Ch 13.1 (the register allocation problem) and 13.4 (linear scan as a practical alternative to graph coloring).

Lecture: Register Allocation

The Lab 7.4 forward-promise

In CSA-101 Lab 7.4, you measured the code size of a VM-translated program. The VM translator emitted one RISC-V instruction sequence per VM opcode, and every intermediate value went through the stack. A CSA-101 add VM opcode generated: lw t0, 0(sp); lw t1, 4(sp); add t0, t0, t1; addi sp, sp, 4; sw t0, 0(sp) -- five instructions to add two values that could stay in registers. The Lab 7.4 comment said: "3-5x bloat vs what a register allocator would produce. We will pay this back in CSA-201."

This week you pay it back.

What a register allocator does

The compiler's job is to translate high-level operations (x = a + b) into machine instructions (add x0, x1, x2). The register allocator decides which machine registers hold which values at each program point. This is constrained by:

The number of available registers. RV32I has 32 registers; the ABI reserves several; the allocator has about 20 caller-saved temporaries plus 12 callee-saved saved registers to work with.
Live ranges. A variable is live from its first definition until its last use. Two variables with overlapping live ranges cannot share a register.
Spills. When more values are live simultaneously than there are registers available, some values must be spilled to the stack. A spill inserts a sw (store) before a value is evicted and an lw (load) before the next use.

Graph coloring (the classic approach)

Construct an interference graph: each variable is a node; two nodes are connected by an edge if their live ranges overlap. Assign colors (registers) to nodes such that no two adjacent nodes share the same color. This is equivalent to graph coloring, which is NP-complete in general.

For practical compilers, two simplifications make graph coloring tractable: (a) programs have far fewer simultaneously-live values than worst-case graph coloring inputs, and (b) greedy heuristics with spill-and-retry work well in practice.

Linear scan (the practical approach)

For a compiler extension that integrates with the structure you already built in CSA-101, linear scan is the right choice. Linear scan processes live intervals in order of start position and assigns registers greedily. When no register is free, it spills the interval with the furthest end point (Belady's OPT heuristic approximation).

Algorithm sketch:

sort intervals by start point
active = {}                        # intervals currently in registers
for each interval i in order:
    expire_old_intervals(active)   # free registers from expired intervals
    if |active| == num_registers:
        # spill the interval with the furthest end point
        spill = argmax(active, key=end_point)
        if spill.end > i.end:
            i.register = spill.register
            spill.stack_slot = allocate_stack_slot()
            active.remove(spill)
            active.add(i)
        else:
            i.stack_slot = allocate_stack_slot()
    else:
        i.register = pick_free_register()
        active.add(i)

Where to insert the allocator pass

Your CSA-101 compiler has this pipeline: Jack source -> tokenizer -> parser -> symbol table + codegen -> VM bytecode -> VM translator -> assembly. The register allocator operates on the VM translator's output: instead of naively spilling every intermediate to the stack, the translator's code-emitting stage calls the allocator to decide whether a value should go to a register or a stack slot.

Concretely, you will modify the VM translator (not the compiler frontend) because that is where the per-function instruction sequence is emitted. The translator already has a local temporary pool; you replace the "always spill" policy with the linear-scan assignment.

Architecture Comparison Sidebar: Register file design trade-offs

The register allocator is only useful if there are enough registers to allocate into. This is why the number of general-purpose registers matters architecturally.

6502 (3 registers). The 6502's three registers (A, X, Y) are the extreme case of register poverty. Every value that cannot fit in A, X, or Y must go in zero-page memory (addresses 0x00-0xFF, which the 6502 accesses in 2 cycles vs 4 cycles for other addresses). "Register allocation" on the 6502 mostly means deciding which variables deserve zero-page slots.

RV32I-Lite (8 registers in CSA-101). With 8 registers, the allocator had limited room. The scarcity made spilling frequent; Lab 7.4's 3-5x bloat was partly architectural.

Full RV32I (32 registers in CSA-201). 29 usable registers (x1-x31, minus sp and gp) give the allocator substantial room. A moderately complex function with 10-15 live variables fits comfortably without spilling. The standard ABI's caller-saved / callee-saved split means the allocator can use t0-t6 freely within a single function without generating save/restore code.

x86_64 (16 GPRs). The 8-register legacy constraint (eax/ebx/ecx/edx/esi/edi/ebp/esp in IA-32) created significant register pressure that AMD64 relieved by adding r8-r15. LLVM's x86_64 register allocator still spills more than its RV32I backend on the same source because 16 registers is half of 32.

Lab exercises

See labs/lab-3-register-allocator.md for the full specification.

Lab 3.1: Add the linear-scan allocator pass. You will add a linear-scan register allocator to your VM translator. The pass operates on the output of a single-function translation unit before final instruction selection.

Part 1: Compute liveness. Extend the translator to compute, for each local variable and temporary, the first instruction that defines it and the last instruction that uses it. These are the live intervals.

Part 2: Implement linear scan. Allocate registers for the function's local variables using the algorithm above. Use t0-t6 (7 caller-saved registers) as your register pool. Anything that does not fit gets a stack slot.

Part 3: Measure. Compile a benchmark program (the provided benchmark-allocator.jack source, which includes a function with 12 local variables and 3 loops) with the old always-spill translator and with your new allocator. Count the emitted instruction lines. Record the reduction ratio in your Toolchain Diary. Target: 3-5x reduction per the Lab 7.4 forward-promise.

Part 4 (optional): Extend the register pool to include s1-s11 (callee-saved). This requires generating save/restore code at function entry/exit. Measure the additional reduction vs the cost of the save/restore overhead.

Independent practice

Re-read the Lab 7.4 measurement you made in CSA-101. Find the function that had the worst bloat ratio. Trace through the new allocator's assignment for that function: which variables get registers? Which spill? Write a paragraph in your Toolchain Diary explaining why the allocation reduces code size.
The linear-scan algorithm approximates Belady's OPT by spilling the interval with the furthest end point. Construct a small example (3 registers, 6 live intervals) where this heuristic makes a suboptimal choice. What would the optimal allocation look like? How many extra spills does the heuristic produce?
Toolchain Diary entry: riscv32-unknown-elf-objdump. Use objdump -d on a compiled binary from your allocator-enabled translator. Count the ratio of lw/sw instructions to add/sub instructions. Compare with the same ratio from your CSA-101 translator output. The ratio of memory instructions to arithmetic instructions is a proxy for register pressure.
Read about the RISC-V calling convention (ABI): github.com/riscv-non-isa/riscv-elf-psabi-doc. Identify which registers are argument registers (a0-a7), which are saved registers (s0-s11), and which are temporaries (t0-t6). Why does the distinction between caller-saved and callee-saved registers matter for the register allocator?

Reflection prompts

The VM translator you built in CSA-101 could not know whether a variable would be used in the future when it emitted a push to the stack. A register allocator requires knowing all uses of a variable before allocating it. What property of the code must hold for this look-ahead to be possible? (Hint: what would a register allocator do if the program had a loop?)
Linear scan operates in a single forward pass over the live intervals. Graph coloring operates on the interference graph globally. For which kinds of programs does linear scan produce significantly worse allocations than graph coloring? For which programs does it produce near-identical results?
Your allocator uses the register pool t0-t6 by default. After you implement the callee-saved extension (s1-s11), you have up to 18 registers available. The function save/restore prologue/epilogue adds overhead. At what function size (number of local variables and temporaries) does the callee-saved extension pay off?

What's next

Module 4 adds a peephole optimization pass. Where the register allocator reduced code size by eliminating stack round-trips for values that stay in registers, the peephole pass reduces code size by finding and replacing inefficient instruction sequences in a small local window. Lab 4.1 measures the reduction against the Lab 11.4 forward-promise.