Classroom Glossary Public page

Week 6: SSA-IR and Compiler Explorer

1,517 words

Three modules of optimization work, and now a question: how close are you to a production compiler? This week you find out.


Reading

Required. Petzold, CODE, Ch 24 ("Languages High and Low"). Read the full chapter now if you skimmed it in Module 3. Petzold traces the arc from machine code to assemblers to compilers to the optimizing compilers of the 1980s. The chapter ends with the observation that a compiler can now produce better machine code than most humans. This week you test that claim against your own compiler.

Required. The LLVM Language Reference Manual, sections 1-2 (overview and the SSA form) and the first example in section 3 (the canonical phi-node example). Available at llvm.org/docs/LangRef.html. Skim (not deep read); the goal is to understand what phi-nodes are and why SSA makes data-flow analysis tractable.

Recommended. Cytron et al., "Efficiently Computing Static Single Assignment Form and the Control Dependence Graph" (1991; freely available). The original SSA paper. The algorithm in section 3 (inserting phi-functions at dominance frontiers) is the standard construction. Read the abstract and section 1 only; the full algorithm is a later-course topic.


Lecture: SSA-IR and Production Compiler Comparison

What SSA is

Static Single Assignment (SSA) form is an IR (intermediate representation) property: every variable is assigned exactly once. Where multiple definitions of a variable converge (at a join point in the control flow graph), SSA inserts a phi-function that selects among the incoming values.

Before SSA:

x = 1
if cond:
    x = 2
y = x + 1   # which x? depends on the branch

After SSA:

x1 = 1
if cond:
    x2 = 2
x3 = phi(x1, x2)   # phi selects x1 or x2 based on which edge was taken
y1 = x3 + 1

The phi-function makes the data flow explicit. Every use of a variable can be traced to exactly one definition. This property makes the following analyses trivial that were hard on non-SSA IR:

  • Constant propagation. If x3's phi-node shows both incoming values are the same constant, x3 is that constant.
  • Dead code elimination. If x1 is never used after its definition, it is dead.
  • Register allocation. Because each SSA variable has a single definition point and a contiguous live range, liveness analysis is precise; the interference graph for register allocation is chordal (a property that makes coloring polynomial).

Why your compiler does not need SSA yet

Your CSA-201 compiler targets Virtus OS programs, which are small enough that the simpler analyses from Modules 3-5 (linear-scan allocator, peephole, inliner, constant folding) cover almost all the wins. SSA is the right foundation for a production compiler that needs to optimize loops, hoist invariants, and reason about aliasing across function boundaries. You are building an educational compiler; SSA is the architecture you would build if you wanted to compete with gcc.

This module is about understanding SSA rather than implementing it. The Compiler Explorer lab shows you what SSA-backed optimization (gcc, Clang/LLVM) does to code that you also compile, and lets you reason about the gap.

The Compiler Explorer (godbolt.org)

godbolt.org is a browser-based tool that compiles C/C++/Rust/many other languages to assembly and shows the output. It uses real compiler binaries (gcc 14.1, clang 18.x, MSVC, and dozens of others) with full optimization flag support. It color-codes source lines to corresponding assembly instructions.

For CSA-201:

  • Target: RISC-V rv32gc gcc 14.1 (matches your architecture; the gc profile includes G = IMAFDZicsr_Zifencei + C = Compressed)
  • Flags: -march=rv32im -O0 (no optimization), -march=rv32im -O2 (standard optimization), -march=rv32im -O3 (aggressive)

The comparison is meaningful because gcc 14.1 compiling for RV32IM produces assembly on the same instruction set your compiler targets. You can compare instruction sequences line-for-line.

What to expect

For simple functions (3-10 lines of C with no loops), your compiler after Modules 3-5 should produce output close to gcc -O0, and noticeably better than gcc -O0 on functions with constant arguments that your inliner folds.

For functions with loops, your compiler will produce significantly more instructions than gcc -O1/-O2 because you have not implemented:

  • Loop-invariant code motion (LICM): moving computations that produce the same result on every iteration out of the loop
  • Strength reduction for loop induction variables: replacing i * 4 with an incrementing pointer
  • Vectorization: the auto-vectorizer in gcc -O3 emits multiple-value operations for loop bodies that are obviously independent

The goal of Lab 6.1 is not to match gcc -O2; it is to understand the gap and to identify which missing analyses explain it.

Architecture Comparison Sidebar: SSA in production compilers

LLVM IR. LLVM uses SSA-form IR throughout its middle-end. Every optimization pass (constant propagation, dead code elimination, global value numbering, loop optimizations) operates on SSA IR with phi-nodes. LLVM's register allocator operates on a virtual-register SSA IR and then destroys SSA form during instruction selection to produce the physical-register machine IR.

GCC GIMPLE. GCC's first SSA-form IR is called GIMPLE. GCC converts C source to GIMPLE, then to SSA-GIMPLE (with phi-nodes inserted), then optimizes, then lowers to RTL (Register Transfer Language) for the target backend. GCC also has a VRP (value range propagation) pass that proves ranges for SSA variables and eliminates unreachable branches.

WebAssembly. WebAssembly's structured-control-flow model (blocks, loops, ifs with explicit labels) is designed to be easily convertible to SSA IR by a JIT. V8's Turbofan JIT and SpiderMonkey's IonMonkey JIT both compile Wasm to SSA IR internally.

Your compiler's IR. Your VM bytecode (push/pop/arithmetic/call) is essentially a stack-based IR that is not SSA. Stack VMs and register VMs are duals; converting between them is straightforward. The path from your VM IR to SSA would require: (a) convert stack operations to three-address code (as your translator already does for some patterns), (b) compute dominance frontiers, (c) insert phi-functions, (d) rename variables. This is a well-defined 4-step algorithm (Cytron et al.) and is the foundation of a hypothetical CSA-301.


Lab exercises

See labs/lab-6-compiler-explorer.md for the full specification.

Lab 6.1: Compiler Explorer comparison. You will compile four benchmark functions using your CSA-201 compiler and using gcc on godbolt.org, then compare the output.

Part 1: Simple constant function. Compile int square_sum() { return 3*3 + 4*4; }. Your compiler (with inliner + folding) should produce a single li a0, 25; ret. Check whether gcc -O0, -O2, and -O3 also produce li a0, 25; ret or expand the computation.

Part 2: Short arithmetic function. Compile int clamp(int x, int lo, int hi) { if (x < lo) return lo; if (x > hi) return hi; return x; }. Compare your compiler's output against gcc -O0 and -O2. Count instructions. Record the ratio.

Part 3: Loop function. Compile int sum_array(int* a, int n) { int s = 0; for (int i = 0; i < n; i++) s += a[i]; return s; }. Your compiler will not handle raw C pointer syntax; translate to equivalent Virtus OS style and compile. Then compile the C version on godbolt.org at -O0 and -O2. How many more instructions does your output have? Identify which missing optimization explains the largest share of the difference.

Part 4: Identify the gap. Write a 1-page analysis. For each benchmark, identify the specific optimization that explains the largest remaining gap between your compiler and gcc -O2. Name the optimization technique (not just "gcc is better"). Record in your Toolchain Diary.


Independent practice

  1. Open godbolt.org and compile a 5-line function in C targeting RISC-V rv32im gcc 14.1 at -O0 and -O2. Turn on the "diff" view between -O0 and -O2. Identify three transformations that -O2 applies that -O0 does not. Look up the name of each transformation in a compilers reference.

  2. Write pseudocode for the SSA construction algorithm (Cytron et al.): compute dominance frontiers, insert phi-functions at frontiers of variable definitions, rename variables. Do not implement it; the goal is to understand the data structures it requires. What graph representation of the control flow does it need? What is a dominance frontier?

  3. Toolchain Diary entry: godbolt.org. Record the URL format for a permalink to a godbolt session. Record which compiler versions are available for RISC-V 32-bit targets. Note whether RISC-V 64-bit (rv64gc) is also available; what changes in the output when you switch from rv32 to rv64?

  4. SSA phi-functions do not correspond to any hardware instruction. How does a compiler eliminate phi-functions before emitting machine code? (This is called "SSA destruction" or "out-of-SSA translation.") What is the classic algorithm, and what correctness issues does the naive version have?


Reflection prompts

  1. Your constant-folding pass from Module 5 is an instance of "sparse conditional constant propagation" (SCCP) operating on a linear instruction stream rather than an SSA control-flow graph. How does the SSA-based SCCP (Wegman and Zadeck, 1991) handle branches that a linear-scan constant folder cannot? Construct an example where the SSA-based version folds a value that your linear pass misses.

  2. The Compiler Explorer shows you the compiled output of your exact source. What can you not learn from the Compiler Explorer that you could learn from a profiler? Which tool would you reach for first when optimizing a real program?

  3. After three modules of optimization work (register allocator, peephole, inlining + folding), describe the single biggest architectural gap between your compiler and gcc -O2. Is the gap mostly in IR quality (SSA vs stack), analysis quality (global vs local), or something else?


What's next

The compiler arc of CSA-201 is complete. Modules 7 and 8 return to hardware: you will add the Sv32 virtual memory system (two-level page tables, a TLB, and a page-fault handler) and then PMP (physical memory protection with W^X enforcement). These are the hardware features that make the security modules (9-10) and the OS features (11-13) possible. The improved compiler you ship at the end of this module is the compiler your OS kernel will be built with.