Week 4: Compiler Peephole Optimisation · CSA-201 · Virtus Cyber Academy Classroom

Lab 11.4 listed five categories of compiler-output bloat and promised them to CSA-201. This week you collect on four of them.

Reading

Required. Petzold, CODE, Ch 24 ("Languages High and Low"). The section where Petzold describes the evolution from assemblers to compilers is the context for this week. Petzold notes that early compilers produced correct but inefficient code; programmers spent years writing entire programs in assembly because compilers were "too wasteful." Peephole optimization was one of the first techniques that made compiler-generated code competitive with hand-written assembly. Read Ch 24 with that history in mind.

Required. Patterson and Hennessy, Computer Organization and Design: RISC-V Edition, Appendix A (assemblers). Section A.9 covers optimizations that assemblers and linkers can apply. These include some of the same idioms the peephole pass catches.

Recommended. Allen, "A Catalogue of Optimizing Transformations" (1971, reprinted in many compilers texts). The original taxonomy of peephole patterns. Even 50 years later, the five categories the paper identifies (redundant loads/stores, algebraic identities, strength reduction, constant folding, dead-code elimination) appear in every practical peephole pass.

Lecture: Peephole Optimisation

The Lab 11.4 forward-promise

In CSA-101 Lab 11.4, you measured the code size of your compiler's output on a representative Jack program and found 30-80x bloat compared to hand-written assembly for the same computation. The lab listed five categories of waste:

Redundant push/pop pairs. The VM translator pushed a value and immediately popped it into the next instruction's source. Example: push constant 5; pop local 0 emits li t0, 5; sw t0, 0(sp); lw t1, 0(sp); sw t1, <local0>. The intermediate sw/lw pair is useless if the value went straight from register to register.
Load-after-store to same address. sw t0, N(sp); lw t1, N(sp) can be replaced with mv t1, t0 because the store and the load target the same address and no intervening instruction can modify it.
Redundant zero-initialization. Variables initialized to zero via push constant 0; pop local N emit a li t0, 0; sw t0, N(sp) pair that could be sw x0, N(sp) (the x0 hardwired zero register replaces the li).
Algebraic identities. add x, 0 generates add t0, t0, x0 which is a no-op. mul x, 1 generates mul t0, t0, <one> which could be deleted. shl x, 0 is a no-op.
Strength reduction. mul x, 2 is better as sll x, x, 1. mul x, 4 as sll x, x, 2. Division by power-of-two as arithmetic right shift.

The peephole pass works in a sliding window over the instruction stream. A window of 2-4 instructions is enough to catch all five categories above.

Window-based pattern matching

The standard implementation maintains a circular buffer of N instructions (N=4 covers the patterns above with margin). As each new instruction is added, the pass checks the buffer against a table of rewrite rules. If a rule matches, the matching instructions are replaced with the optimized form and the window is rescanned.

Rewrite rules for the five categories (RISC-V assembly notation):

# Category 1: push/pop elimination
# Pattern:  sw t0, 0(sp); addi sp, sp, -4; lw t1, 4(sp); addi sp, sp, 4
# Replace: mv t1, t0  (if t1 is only used after this sequence)

# Category 2: load-after-store
# Pattern:  sw Rx, N(Rb); lw Ry, N(Rb)
# Replace:  sw Rx, N(Rb); mv Ry, Rx
# Condition: Rb and N do not change between sw and lw

# Category 3: redundant zero-init
# Pattern:  li Rx, 0; sw Rx, N(Rb)
# Replace:  sw x0, N(Rb)

# Category 4: algebraic identities
# Pattern:  add Rd, Rs, x0     -> delete (if Rd == Rs)
# Pattern:  mul Rd, Rs, <reg holding 1> -> mv Rd, Rs  (if we can prove reg holds 1)
# Pattern:  sll Rd, Rs, x0     -> delete (if Rd == Rs)

# Category 5: strength reduction
# Pattern:  li Rt, 2; mul Rd, Rs, Rt  ->  sll Rd, Rs, 1
# Pattern:  li Rt, 4; mul Rd, Rs, Rt  ->  sll Rd, Rs, 2
# Pattern:  li Rt, 8; mul Rd, Rs, Rt  ->  sll Rd, Rs, 3

Where the peephole pass runs

The peephole pass runs as a final pass over the complete assembly output, after the register allocator. It is a pure textual transformation: it reads RISC-V assembly lines and writes an optimized version. This makes it easy to unit-test: you write input assembly, apply the pass, and compare against expected output.

Architecture Comparison Sidebar: Peephole scope in production compilers

LLVM MachineInstr window. LLVM's peephole optimizer runs on MachineInstr objects (the machine-level IR after instruction selection and register allocation). The window is typically 2-4 instructions. LLVM's pass also includes a larger "two-address instruction optimizer" that handles patterns RISC-V's three-register encoding makes rare but x86_64's two-address form creates constantly.

GCC RTL peephole. GCC's peephole optimizer runs on RTL (Register Transfer Language), the lowest IR before assembly emission. RTL peephole patterns are machine-specific and are defined in the target's .md (machine description) file. The RISC-V backend's .md file includes patterns like: replace lui + addi with a combined large-immediate load when the assembler's two-instruction expansion would produce the same result.

Hand-written RISC-V idioms. Experienced RISC-V assembly programmers use idioms that a peephole pass can generate: mv rd, rs (= addi rd, rs, 0) instead of a 3-operand add; li rd, 0 (= addi rd, x0, 0) instead of loading from memory; j label (= jal x0, label) for a non-returning jump. Your peephole pass should emit these idioms.

Lab exercises

See labs/lab-4-peephole.md for the full specification.

Lab 4.1: Peephole pass. You will implement a peephole optimizer for your compiler's assembly output. The pass reads the .asm file your translator produces and emits an optimized .asm file.

Part 1: Implement patterns for categories 2, 3, 5 (the mechanically straightforward ones). Write unit tests for each pattern: an input snippet and the expected optimized output. Run the unit tests.

Part 2: Implement categories 1 and 4. Category 1 requires tracking when a push/pop pair is a no-op; this is harder because it requires tracking stack depth across the window. Category 4 requires proving that a register holds a known constant, which requires a simple constant-propagation pass over the window.

Part 3: Measure. Compile the provided benchmark-peephole.jack source with your CSA-101 compiler (no peephole), your CSA-201 compiler with register allocator only, and your CSA-201 compiler with allocator + peephole. Count instruction lines at each stage. Record all three measurements and the two reduction ratios in your Toolchain Diary. The allocator + peephole combined should approach the Lab 11.4 target of ~1.5-2x reduction.

Independent practice

Take a function from your CSA-101 OS source (Math.multiply or Memory.alloc are good candidates) and hand-trace what the peephole pass does to its compiled output. List every pattern that fires and the instruction(s) it replaces. How many instructions does the function lose?
Write a peephole rule for a pattern not in the five categories above: sequences that load a value from memory and immediately store it to a different address without using it. Is this safe to eliminate? Under what conditions would eliminating it produce incorrect code? (Think about memory-mapped I/O.)
Toolchain Diary entry: riscv32-unknown-elf-gcc with -O1. Compile a representative C function at -O1 and examine objdump -d output. Identify three peephole-level optimizations that gcc applies. Are any of them in your pass?
The peephole pass described here is a post-allocation pass (it runs after the register allocator). Some compilers run a peephole pass before allocation instead. What patterns does a pre-allocation pass catch that a post-allocation pass misses? What patterns does a post-allocation pass catch that a pre-allocation pass misses?

Reflection prompts

Category 2 (load-after-store) requires that no intervening instruction modify the base register Rb or write to address N(Rb). In a single-threaded program on a single-core CPU, is it safe to assume no instruction modifies a stack-relative address between a sw and its matching lw? When would that assumption fail?
Strength reduction replaces mul x, 2 with sll x, 1. On a CPU with a fast multiplier (like the M extension you added in Module 1), is strength reduction still a win? Does the answer change if you consider code size vs latency separately?
Your peephole pass has a window of 4 instructions. Some optimizations require looking further back, such as constant propagation. At what window size do the diminishing returns kick in? What makes a pattern too expensive to include in a practical peephole pass?

What's next

Module 5 adds inlining and constant folding. Inlining eliminates the call overhead for small functions by substituting the function body at the call site. Constant folding evaluates constant expressions at compile time so they do not run at runtime. Together, inlining and constant folding close the five-categories baseline; Module 6 (Compiler Explorer) then compares the result against gcc at production optimization levels.