Lab 4.1: Compiler Peephole Optimisation · CSA-201 · Virtus Cyber Academy Classroom

Total points: 25
Estimated time: 3.5 hours
Prerequisites: Lab 3.1 complete; allocator-enabled compiler output available; CSA-101 Lab 11.4 baseline measurements

Overview

You will implement a peephole optimizer as a post-pass over your compiler's assembly output. The pass slides a window of 4 instructions over the instruction stream and applies rewrite rules from the five categories identified in Lab 11.4. Lab 11.4 measured 30-80x bloat from the CSA-101 compiler; the allocator from Lab 3.1 closed some of that gap; this lab closes more.

Part A: Unit tests first (5 pts)

Before implementing any patterns, write the unit test harness. The harness reads an input assembly snippet and an expected output snippet and verifies that the pass transforms the former into the latter. Write one test per pattern before implementing each pattern.

Unit tests live in tests/peephole/. Each test is a pair of files: test-N-input.asm and test-N-expected.asm. The test runner runs the peephole pass on the input and diffs against the expected output.

Write test cases for at minimum:

test-01: load-after-store to same address (category 2)
test-02: redundant zero-init via li (category 3)
test-03: mul by constant power-of-two strength reduction (category 5)
test-04: add rd, rs, x0 identity elimination (category 4)
test-05: push/pop pair elimination (category 1; may span 3-4 instructions)

All unit tests must pass before submission.

Part B: Pattern implementation (12 pts)

B1: Category 2 -- load-after-store (3 pts)

Pattern: sw Rx, N(Rb) followed immediately by lw Ry, N(Rb) where Rb and N are identical and no instruction between them modifies Rb.

Replace: sw Rx, N(Rb); mv Ry, Rx (the load becomes a register move; the store stays because the write is still needed).

Then: if Rx == Ry (the store and load use the same destination), the mv becomes a no-op and can be deleted.

Add the constraint: the pattern fires only if no intervening instruction stores to memory or modifies Rb. For a window of 2 instructions with no instruction between sw and lw, this is trivially satisfied.

B2: Category 3 -- redundant zero-init (2 pts)

Pattern: li Rx, 0; sw Rx, N(Rb).

Replace: sw x0, N(Rb).

The li Rx, 0 can be deleted if Rx is not used after the store (dead after replacement).

B3: Category 5 -- strength reduction (3 pts)

Patterns:

li Rt, 2; mul Rd, Rs, Rt → sll Rd, Rs, 1 (delete the li Rt, 2)
li Rt, 4; mul Rd, Rs, Rt → sll Rd, Rs, 2
li Rt, 8; mul Rd, Rs, Rt → sll Rd, Rs, 3
li Rt, 16; mul Rd, Rs, Rt → sll Rd, Rs, 4
li Rt, N; div Rd, Rs, Rt where N is power of 2 → srai Rd, Rs, log2(N) (signed division by positive power-of-two is arithmetic right shift)

Note: the li instruction must immediately precede the mul (or be provably the last write to Rt before the mul; simple 2-instruction window is sufficient for the basic patterns).

B4: Category 4 -- algebraic identities (2 pts)

Patterns:

add Rd, Rs, x0 where Rd == Rs → delete
add Rd, x0, Rs where Rd == Rs → delete
sll Rd, Rs, x0 (shift by zero) where Rd == Rs → delete
or Rd, Rs, x0 where Rd == Rs → delete
and Rd, Rs, Rs (AND of register with itself) → mv Rd, Rs (then if Rd == Rs, delete)

B5: Category 1 -- push/pop pair elimination (2 pts)

This is the hardest category because the push (typically addi sp, sp, -4; sw Rx, 0(sp)) and pop (typically lw Ry, 0(sp); addi sp, sp, 4) may be separated by up to 4 instructions in the window.

Simplified version: detect the pattern sw Rx, 0(sp); addi sp, sp, -4; lw Ry, 4(sp); addi sp, sp, 4 (or the equivalent with offsets matching your translator's stack layout). Replace with mv Ry, Rx.

Full credit for detecting the simplified version; bonus for detecting the full stack-round-trip across the actual translator's push/pop sequences.

Part C: Measurement and analysis (8 pts)

C1: Compile the benchmark (5 pts)

Compile benchmark-peephole.jack (provided in labs/fixtures/) with four pipeline configurations:

CSA-101 translator (no allocator, no peephole)
CSA-201 translator with allocator only
CSA-201 translator with peephole only (no allocator)
CSA-201 translator with allocator + peephole

For each configuration count: total instruction lines, lw count, sw count, mul count. Fill in this table:

Configuration	Total instructions	lw	sw	mul	vs baseline
1. CSA-101 (baseline)					1.0x
2. Allocator only
3. Peephole only
4. Allocator + peephole

C2: Analysis (3 pts)

Answer the following in 200-400 words:

Does the peephole pass improve the allocator's output more, or does the allocator improve the peephole's output more? Explain why the order matters.
Find one instruction sequence in the benchmark where the peephole pass fires but should not (false positive). What additional analysis would prevent it?
Which Lab 11.4 category accounts for the largest reduction in your measurement? Does this match what you expected?

Add a Toolchain Diary entry for the peephole pass. Record the window size you used and the number of patterns you implemented.

Grading

Part	Criteria	Points
A	Unit test harness written; at least 5 test cases present; all pass	5
B1	Load-after-store pattern fires on test-01; no false positives on 3-test corpus	3
B2	Zero-init pattern fires on test-02	2
B3	At least 3 strength-reduction variants implemented; test-03 passes	3
B4	Identity patterns implemented; test-04 passes	2
B5	Push/pop elimination implemented (simplified version is full credit)	2
C1	Complete measurement table; all four configurations compiled	5
C2	Analysis addresses all three questions with evidence	3
Total		25