Week 10: Tracing Garbage Collection · CSA-201 · Virtus Cyber Academy Classroom

Virtus OS v1's Memory service could allocate objects. It could not free them. This week you add the freeing.

Reading

Required. Bryant and O'Hallaron, CSAPP, Chapter 9, section 9.9 ("Garbage Collection"). The most concise useful treatment of mark-and-sweep. Read 9.9.1 (mark phase), 9.9.2 (sweep phase), and 9.9.3 (conservative collection). The CSA-201 GC is a non-conservative mark-and-sweep (the Virtus OS type system knows which fields are pointers); 9.9.3 is context, not implementation spec.

Required. Petzold, CODE, Ch 22 ("The Operating System"). Return to Petzold's description of OS memory management. Petzold frames the allocator as the OS's most important shared service after scheduling: "allocating memory is the operating system's way of pretending there's more of it than there is." The GC is the mechanism that makes that pretense sustainable over time.

Recommended. Appel, Modern Compiler Implementation, Ch 13 ("Garbage Collection"). Covers mark-and-sweep, stop-and-copy, and generational GC. The stop-and-copy section is worth reading for context on why modern production GCs moved away from mark-and-sweep.

Lecture: Tracing Garbage Collection

The Ch 12 §12.5.4 omission

Virtus OS v1's Memory.alloc allocates objects from a heap and returns a reference (an integer handle). There is no Memory.dealloc. In a long-running OS with many object allocations (the screen service allocates pixel buffers; the math library allocates temporaries), the heap fills and the process crashes. Virtus OS v2 adds Memory.gc: a stop-the-world mark-and-sweep collector that reclaims unreachable objects.

Object layout and the reference graph

For the GC to trace references, it needs to know which words in each object are pointer fields. Virtus OS v2 uses a simple type-header discipline: each allocated object is preceded by a one-word header encoding the object's size (in words) and a pointer-field bitmask:

heap layout:
+----------+  <- object base
| type_hdr |     bits 15:0  = object size (words, not counting header)
|          |     bits 31:16 = pointer bitmask (bit i = 1 means word i is a reference)
+----------+
| field 0  |  <- word 0 (pointer if bit 16 of type_hdr is set)
| field 1  |  <- word 1 (pointer if bit 17 of type_hdr is set)
| ...      |
+----------+

The compiler emits type_hdr values for every class it compiles. The max pointer-field count is 16 (limited by the 16-bit bitmask). This is sufficient for all Virtus OS standard-library classes and most student programs.

Mark phase

The mark phase traces all live objects reachable from the GC roots. GC roots are: all stack frames' local variables and temporaries (the live values on every process's stack), and global variables.

The OS GC driver:

Stops all running user processes (stop-the-world; Module 11 scheduler integration).
Scans each stack frame to collect root references. The compiler emits a "stack map" alongside each function: a bitmask of which stack slots are live references at each safe point.
Pushes all roots onto a worklist.
Pops each reference from the worklist; if the object's mark bit is clear, sets it and pushes all pointer fields onto the worklist.
Continues until the worklist is empty.

The mark bit lives in the object header (bit 15 of the size field; size is limited to 32767 words which is far larger than any Virtus OS object).

Sweep phase

The sweep phase walks the entire heap linearly and reclaims unmarked objects:

ptr = heap_base
while ptr < heap_top:
    hdr = Memory.heap[ptr]
    size = hdr & 0x7FFF
    marked = (hdr >> 15) & 1
    if marked:
        hdr &= ~(1 << 15)     // clear mark bit for next GC cycle
        ptr += size + 1       // advance past header + object
    else:
        // reclaim this object -- add to free list
        free_list_add(ptr, size)
        ptr += size + 1

The free list is a singly-linked list of free blocks sorted by address. Allocation uses first-fit by default (scan from the start of the free list; return the first block large enough). Coalescing adjacent free blocks is done lazily during the sweep phase.

Stop-the-world and the scheduler

The GC must see a consistent snapshot of the heap. If user processes continue allocating and writing references during the GC, the mark phase can miss live objects (a live reference is written after the stack scan). The simplest solution: stop all user processes at safe points (points where the compiler guarantees the stack map is valid), run the full mark-and-sweep, then resume.

For Virtus OS v2, all ECALL points are safe points. The GC runs in M-mode and can pause the scheduler (prevent timer interrupts from firing context switches) while the collection runs.

GC cost model

The cost of one GC cycle:

Mark phase: proportional to the number of live objects (every live object is traced once).
Sweep phase: proportional to the total heap size (the entire heap is walked once).

For a small heap (< 64 KiB, typical for CSA-201 programs), a single GC cycle takes a few thousand cycles. Lab 10.1 measures this and compares against the per-allocation cost of the manual allocator.

Architecture Comparison Sidebar: GC strategies

Strategy	Stop-the-world?	Compacts heap?	Production use
Mark-and-sweep (CSA-201)	Yes	No (free list; fragmentation)	Java 1.0 (historical); Go runtime (partial)
Stop-and-copy	Yes	Yes (live objects copied to new half-space)	JVM young generation (survivor spaces); early Scheme
Generational (minor + major GC)	Minor: yes; major: incremental possible	Young generation compacted; old generation swept	JVM G1/ZGC; .NET CLR; Go
Reference counting	No (instant reclaim of acyclic garbage)	No	Python (CPython); Swift ARC; C++ shared_ptr
Tracing incremental	Incremental (short pauses)	Optional	JVM ZGC/Shenandoah; .NET NativeAOT

Mark-and-sweep's weakness is fragmentation: after many allocate/free cycles, the heap becomes a checkerboard of small free blocks that cannot serve large allocation requests even though total free space is sufficient. Stop-and-copy solves this (all live objects are compacted into contiguous space) at the cost of requiring 2x the heap space. Virtus OS v2's heap is small enough that fragmentation is not a concern over a lab-length session.

Python's CPython uses reference counting as its primary GC, with a cycle detector (tracing GC) to handle reference cycles (objects that refer to each other but are unreachable from any root). Reference counting cannot collect cycles alone.

Lab exercises

See labs/lab-10-tracing-gc.md for the full specification.

Lab 10.1: Tracing GC running on Memory.lib; cycle cost measured. You will add the type-header discipline to Memory.alloc, implement the mark and sweep phases, and measure the cycle cost of one GC cycle on a benchmark heap.

Part A: Update Memory.alloc to prepend a type_hdr word; update the existing allocator to account for the extra word in size calculations.

Part B: Implement Memory.gc: the mark phase (stack scan + worklist traversal) and the sweep phase (linear heap walk + free list rebuild).

Part C: Measure. Create a benchmark that: allocates 100 objects with 2-3 pointer fields each; drops references to half of them (making them garbage); runs Memory.gc; and reads mcycle before and after the GC call. Record the cycle cost. Also verify correctness: objects reachable from the root set must still be accessible after GC; their field values must be unchanged.

Independent practice

The type-header bitmask limits pointer fields to 16. Virtus OS v2's most pointer-rich class (imagine a Node with a left child, right child, parent, and value field -- 3 pointer fields + 1 integer) fits comfortably. What is the largest Virtus OS class you can think of? How many pointer fields does it have?
A mark-and-sweep GC can reclaim objects in cycles (A points to B, B points to A, both unreachable from roots). Draw a heap diagram with such a cycle. Walk through the mark phase and show that both A and B remain unmarked (because neither is reachable from the root set). Show that the sweep phase reclaims both.
Toolchain Diary entry: valgrind. Run valgrind on the host-side (Linux x86_64) version of your allocator code (if you have a host-side harness). Record what valgrind reports about heap usage, and compare against your GC's claim about unreachable objects.
The free list after a sweep phase is sorted by address (because the sweep walks the heap linearly). What is the time complexity of first-fit allocation on a sorted free list of N blocks? What is the time complexity of best-fit? Why do production allocators avoid best-fit?

Reflection prompts

Stop-the-world GC pauses all user processes. In a real-time system (a motor controller, a pacemaker firmware), this pause is unacceptable. What property must a GC have to be suitable for real-time systems? Name one GC algorithm that targets real-time use (hint: incremental tracing, or Erlang's per-process GC).
The stack map (which stack slots are live references) is compiler-emitted. What happens if the compiler generates an incorrect stack map (one that misses a live reference)? The GC will not mark the referenced object; the sweep will reclaim it. The object's memory is then reused for a new allocation. Describe the resulting bug from the perspective of the user-mode process.
Memory.gc is a function the user program calls explicitly. Production GCs are triggered automatically when the heap fills. What information does the allocator need to track to know when to trigger a GC automatically?

What's next

Module 11 adds preemption and a round-robin scheduler. The stop-the-world discipline from this week (all processes paused during GC) depends on the scheduler being pausable. Module 11 implements the timer interrupt, the context-switch mechanism, and the scheduler that decides which process runs next.