Week 5: Compiler Inlining and Constant Folding · CSA-201

Every call to Math.abs or Math.max in CSA-101 generated a function prologue, an argument push, a JAL, a callee frame, a JALR return. This week you eliminate most of that.

Reading

Required. Petzold, CODE, Ch 22 ("The Operating System") and Ch 24 ("Languages High and Low"). This week's Petzold weave is about procedure calls and their costs. In Ch 22, Petzold describes the call stack; in Ch 24, he traces how high-level languages generate those call sequences. The function call overhead you are eliminating this week is the cost of the abstraction boundary Petzold describes. Inlining trades that boundary for code size: the function body appears at each call site rather than once in a shared location.

Required. Patterson and Hennessy, Computer Organization and Design: RISC-V Edition, Chapter 2, sections 2.8 (supporting procedures) and 2.12 (translating and starting a program). The calling convention discussion in 2.8 describes exactly the sequence your inliner eliminates: save caller-saved registers, move arguments into a0-a7, jal, move results from a0, restore.

Recommended. Appel, Modern Compiler Implementation in Java (or C or ML editions), Chapter 15 (functional languages) and Chapter 16 (garbage collection). Chapter 15 covers inlining and closure conversion; Chapter 16 previews Module 10's tracing GC. Skim 15.1-15.3 now as background.

Lecture: Inlining and Constant Folding

The call overhead problem

A function call in your CSA-201 compiler (with the allocator and peephole from Modules 3-4) still has fixed overhead: save ra, move arguments, JAL to the callee, execute the callee's prologue/epilogue, JALR back. For a function like Math.abs whose body is 3 instructions (bge a0, x0, done; neg a0, a0; done: ret), the call overhead is larger than the body.

In your CSA-101 Virtus OS, Math.abs, Math.max, Math.min, Math.signum, and several String methods are all small enough that call overhead dominates. Lab 5.1 will quantify this.

What inlining does

The inliner substitutes the function body at each call site. The caller does not save ra, push arguments, or branch; the callee body is pasted in-place. The register allocator runs again over the merged code to eliminate the redundant argument moves.

Inlining is profitable when the callee's body is small relative to the call overhead. It is unprofitable when the callee is large (code-size explosion) or called from many sites (per-call expansion).

A practical inlining policy for your compiler:

Inline unconditionally if the callee has fewer than 10 instructions (this covers all Virtus OS Math.* utility functions).
Do not inline if the callee is recursive (would produce infinite expansion).
Do not inline if the callee is called more than 20 times and has more than 5 instructions (code-size concern).

Where the inliner runs

The inliner runs before code generation (in the compiler frontend, on the parse tree or IR) or after code generation (as a post-processing pass on assembly). Both positions work; the assembly-level version is simpler to implement because you already have a flat instruction stream.

For an assembly-level inliner: for each call site call Math.abs, replace the call instruction with the body of Math.abs after alpha-renaming its registers (substituting the argument registers at the call site for the parameter registers in the callee body). Then run the peephole pass again to clean up any new redundancies.

Constant folding

Constant folding evaluates constant expressions at compile time. If your compiler emits li t0, 3; li t1, 4; add t2, t0, t1, the result 7 is known at compile time; the three instructions can be replaced with li t2, 7.

Constant folding runs as a simple abstract interpretation pass over the instruction stream: maintain a table mapping registers to their known constant values (if any). When you see li Rd, imm, add Rd -> imm to the table. When you see an arithmetic instruction where all source registers have known values, fold the result and add Rd -> result to the table. When you see a store, branch, or any instruction that writes through a pointer, conservatively clear entries that might have been invalidated.

The combination of inlining + constant folding is powerful: inlining exposes the inlined function's constants to the caller's constant-folding pass. A call like Math.abs(-3) inlines to bge a0, x0, done; neg a0, a0; done: ret with a0 = -3; constant folding then evaluates the branch (taken, since -3 < 0) and folds the negation to produce li a0, 3 with no branch at all. This is the "super-optimization" effect that makes inlining + folding together produce much better results than either alone.

The five Lab 11.4 categories revisited

After Modules 3, 4, and 5:

Category 1 (push/pop): eliminated by register allocator
Category 2 (load-after-store): eliminated by peephole
Category 3 (redundant zero-init): eliminated by peephole
Category 4 (algebraic identities): partially by peephole, fully by constant folding
Category 5 (strength reduction): eliminated by peephole

Lab 5.1 measures the combined effect on a benchmark that exercises all five categories. The target is approaching the Lab 11.4 promise of ~1.5-2x additional improvement beyond the allocator pass.

Architecture Comparison Sidebar: Inlining policy in production compilers

GCC/Clang (-O2 inlining). GCC uses a weighted instruction count heuristic called "insns" to decide whether to inline. The default threshold at -O2 is --param max-inline-insns-single=400. Functions smaller than this threshold are inlined unconditionally on the first call-site (one copy), then conditionally on later sites based on estimated growth. Clang uses a similar cost model with an additional "call site hotness" factor from profile-guided optimization.

JVM JIT inlining. The JVM's C2 JIT inlines based on bytecode count (default 35 bytecodes), call frequency (hot methods are inlined more aggressively), and a recursion guard. The JVM's ability to de-optimize and re-inline (via uncommon traps) lets it inline speculatively and fall back if the assumption proves wrong. This is not available to ahead-of-time compilers like yours.

Your compiler's naïve model. The 10-instruction threshold you implement this week is a simplified version of GCC's cost model. It is correct for Virtus OS's small utility functions. It would fail for a program with large callee functions called from many sites; Module 6 (Compiler Explorer) will show you what that failure looks like when you compare against gcc -O2 on pathological cases.

Lab exercises

See labs/lab-5-inlining-constant-folding.md for the full specification.

Lab 5.1: Inliner + constant folding pass. You will add inlining and constant folding to your compiler pipeline.

Part 1: Implement the assembly-level inliner. Handle the case where the callee body uses registers that conflict with the caller's register allocation; add register renaming. Test with Math.abs, Math.max, and Math.signum inlined at their call sites.

Part 2: Implement constant folding. Implement the abstract interpretation pass over the instruction stream. Test with expressions like push constant 3; push constant 4; call Math.multiply (should fold to push constant 12 after inlining Math.multiply when both arguments are constants).

Part 3: Measure. Compile benchmark-inlining.jack with (a) allocator only, (b) allocator + peephole, (c) allocator + peephole + inliner + folding. Record instruction counts and reduction ratios at each stage. Record in Toolchain Diary. Target: combined allocator + peephole + inliner + folding should approach 2x additional reduction over allocator-only.

Part 4 (optional): Implement inlining at the parse-tree level instead of assembly level. Compare the code quality: does tree-level inlining produce better or worse output than assembly-level inlining? Why?

Independent practice

Find a function in your CSA-101 Virtus OS that is called many times from many sites and is large enough that inlining it would be harmful. Argue the case for not inlining it using the code-size metric. Then find a function that is called frequently and is small enough that inlining is clearly beneficial. Verify by hand that the inliner produces the correct body substitution.
The combination of inlining + constant folding can eliminate dead code. If a function is inlined and all its branches fold to known values, the non-taken branch becomes dead code. Write a simple dead-code elimination pass that removes instructions that can never be reached. What analysis does it require?
Toolchain Diary entry: riscv32-unknown-elf-gcc with -O2 -finline-functions. Compile a C function that calls a small helper with constant arguments. Use objdump -d to verify that the helper was inlined and the constant arguments were folded. Annotate the disassembly.
Read about the RISC-V ABI's frame pointer convention. When your inliner substitutes a function body at a call site, what happens to the callee's frame pointer setup (if any)? Does your Virtus OS compiler emit frame pointers? If not, what is lost?

Reflection prompts

Inlining is described as trading code size for call overhead. What happens to instruction-cache efficiency as you inline more aggressively? At what program scale does the instruction-cache pressure from inlining outweigh the eliminated call overhead? (Think in terms of working set size vs cache capacity.)
Constant folding folds 3 + 4 to 7 at compile time. Can your folding pass fold a + 0 to a? What does this require that folding 3 + 4 does not? (This is the boundary between constant folding and algebraic simplification.)
A function that is recursive cannot be inlined without bounding the recursion at compile time. How does gcc handle __attribute__((always_inline)) on a recursive function? What does LLVM do? What should your compiler do?

What's next

Module 6 is the Compiler Explorer module: you will compare your compiler's output against production compilers (gcc -O0, -O2, -O3) on identical C source using godbolt.org. After three modules of optimization work, you will see how close your pipeline comes to production quality on simple functions -- and where it falls short. This module closes the compiler arc of CSA-201. Module 7 returns to hardware with the Sv32 MMU.