CSA-201 Instructor Guide

Skeleton. Modules 7-14 are the primary expansion item for the next build round. Each week section below is a placeholder except where marked "full" (Modules 1-6).


Course philosophy

CSA-201 is a cost-recovery course. Every module is designed to pay back a specific cost that CSA-101 deliberately deferred. Instructors should open each module by naming the cost (usually a Lab forward-promise from CSA-101) before explaining the solution. Students who remember their CSA-101 measurements will have a visceral connection to the improvement.

The Petzold weaves continue here. Do not skip them. The historical narrative is part of the course's voice; students who completed CSA-101 expect it. Read the relevant Petzold sections before teaching each module even if you know the technical content cold. The weave is not a summary of Petzold; it is a connection between Petzold's story and the specific hardware or software the student is building.


Prerequisites check (before week 1)

Before the course starts, verify that every enrolled student can demonstrate:

  1. CSA-101 bitstream flashing on Tang Primer 25K (5 minutes in lab)
  2. The three commands: riscv32-unknown-elf-gcc --version, verilator --version, openFPGALoader --help
  3. Access to their CSA-101 repo with the working assembler, VM translator, and compiler

Students missing any of these should complete the CSA-101 prerequisites before proceeding. A student who cannot flash their Tang board will be blocked by week 1.


Module 1 (full): Full RV32I + M Extension

Opening hook. Pull up the Lab 7.4 measurement from the student's CSA-101 repo. "This is the multiplication cost we paid. Today we eliminate it with one instruction." Do not explain the M extension before showing the cost.

Timing. This is the heaviest lab of the compiler-arc modules (though not the heaviest overall). The register-file widening (Part A1) is mechanical but error-prone because students must update every testbench that references register addresses. Budget 30 minutes of lab time for A1 alone.

Common issue: J-type bit scrambling. The J-type encoding scrambles the immediate bits (imm[20|10:1|11|19:12]) to maximize hardware sharing with B-type. Students who copy the I-type immediate extraction verbatim will get wrong jump offsets. Show the extraction pattern once on the board before releasing to lab.
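
The extraction pattern can be sketched in Python before students write the Verilog. This is a minimal model, not the lab's reference code; the names sign_extend, j_imm, and i_imm are illustrative only.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def j_imm(instr: int) -> int:
    """Reassemble the J-type immediate from a 32-bit instruction word."""
    imm20 = (instr >> 31) & 0x1       # instr[31]    -> imm[20]
    imm10_1 = (instr >> 21) & 0x3FF   # instr[30:21] -> imm[10:1]
    imm11 = (instr >> 20) & 0x1       # instr[20]    -> imm[11]
    imm19_12 = (instr >> 12) & 0xFF   # instr[19:12] -> imm[19:12]
    raw = (imm20 << 20) | (imm19_12 << 12) | (imm11 << 11) | (imm10_1 << 1)
    return sign_extend(raw, 21)       # imm[0] is always zero


def i_imm(instr: int) -> int:
    """The I-type extraction that students tend to copy by mistake."""
    return sign_extend(instr >> 20, 12)


print(j_imm(0xFFDFF06F), i_imm(0xFFDFF06F))  # jal x0, -4
print(j_imm(0x0000106F), i_imm(0x0000106F))  # jal x0, 4096
```

Small forward jumps (even offsets below 2048) decode identically under both extractions, so a test suite with only short forward jumps will not expose the bug. Backward jumps and jumps of 2048 bytes or more will.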

Common issue: funct7 for SRAI. Students often miss that SRAI and SRLI share the same opcode/funct3 and differ only in funct7[5]. This is the first time they see funct7 as more than a reserved field. Walk through the instruction encoding table before Part A3.
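
A small Python model of the two shifts can make the single-bit difference concrete. This is a sketch under the standard RV32I encoding (opcode 0x13, funct3 0b101); the helper names encode_shift_imm and exec_shift_imm are made up for illustration.

```python
def encode_shift_imm(arithmetic: bool, shamt: int, rs1: int, rd: int) -> int:
    """Build an SRLI (arithmetic=False) or SRAI (arithmetic=True) word."""
    funct7 = 0b0100000 if arithmetic else 0b0000000
    return ((funct7 << 25) | ((shamt & 0x1F) << 20) | (rs1 << 15)
            | (0b101 << 12) | (rd << 7) | 0x13)


def exec_shift_imm(instr: int, rs1_value: int) -> int:
    """Return the 32-bit result of SRLI or SRAI on rs1_value."""
    shamt = (instr >> 20) & 0x1F
    arithmetic = (instr >> 30) & 1      # funct7[5] is instruction bit 30
    value = rs1_value & 0xFFFFFFFF
    if arithmetic and value & 0x80000000:
        value -= 1 << 32                # reinterpret as a negative number
    return (value >> shamt) & 0xFFFFFFFF  # Python's >> keeps the sign


srli = encode_shift_imm(False, 4, rs1=2, rd=1)
srai = encode_shift_imm(True, 4, rs1=2, rd=1)
print(hex(srli), hex(exec_shift_imm(srli, 0x80000000)))
print(hex(srai), hex(exec_shift_imm(srai, 0x80000000)))
```

The two instruction words differ only in bit 30, yet a negative operand produces different results. A decoder that ignores funct7 will pass every SRLI test and fail SRAI.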

riscv-tests gate. The gate is non-negotiable. Students should not proceed to Part B until all 38 rv32ui-p-* tests pass. Students who get 35/38 should debug the failing 3 before moving on; the usual suspects are: missing SRAI implementation (confused with SRLI), signed vs unsigned branch semantics (BLT vs BLTU), and byte-lane masking in SB/SH (writing only the low byte without corrupting adjacent bytes).
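
For the third suspect, the intended byte-lane behavior can be modelled as a read-modify-write on one 32-bit word. This is a hedged sketch, assuming a little-endian word-wide memory; store_update is a hypothetical name, and the real design would drive per-byte write enables rather than compute a new word.

```python
def store_update(old_word: int, rs2_value: int, addr: int, width: int) -> int:
    """New memory word after SB (width=1), SH (width=2) or SW (width=4)."""
    offset = addr & 0x3
    if offset + width > 4 or offset % width != 0:
        raise ValueError("misaligned store")
    lane_mask = ((1 << (8 * width)) - 1) << (8 * offset)
    data = (rs2_value << (8 * offset)) & lane_mask   # only the low bytes of rs2
    return (old_word & ~lane_mask & 0xFFFFFFFF) | data


print(hex(store_update(0x11223344, 0x000000AB, addr=1, width=1)))
print(hex(store_update(0x11223344, 0x0000BEEF, addr=2, width=2)))
```

Students can compare their simulated memory word against this model after each SB or SH: any byte outside the lane mask must be unchanged.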

Synthesis timing. Synthesizing a new Tang bitstream takes 5-10 minutes with Gowin IDE. Students should start the synthesis while doing independent practice reading (Petzold Ch 12-13) so they are not idle during place-and-route.

Petzold weave placement. The weave works best as a lecture-opener before the architecture comparison sidebar. Read aloud the relevant paragraph from Petzold Ch 12 (the relay-based adder); then transition directly to: "The M-extension multiplier is what Petzold's relay grid becomes when you replace relays with parallel partial-product adders."


Module 2 (full): Privileged ISA + ECALL Trap

Opening hook. Show the Lab 2.1 target: a waveform where the priv bit flips from 00 to 11 at the ECALL instruction. "This is the moment a user program stops being in charge. This is the hardware foundation of every OS you will ever run."

Timing. The hardware changes (Parts A1-A3) are smaller than Module 1 but correctness is harder to verify without the waveform. Budget 45 minutes for the trap handler in assembly (Part B1); students underestimate how many registers need saving.

Common issue: mscratch / sp swap. The csrrw sp, mscratch, sp idiom at the top of the trap handler confuses students who have not seen atomic register swaps before. Explain it before lab: "after this instruction, sp points to the kernel stack, and mscratch holds the user sp. You need both; this is the only instruction that gives you access to both simultaneously."

Common issue: mepc advancement. Students who forget to add 4 to mepc before MRET get an infinite ECALL loop (the CPU returns to the ECALL, traps again, returns again). This is a good debugging exercise; let students discover it.

rv32mi-p-* tests. The csr test (tests read/write semantics of all CSRs) and the ecall test (tests ECALL raises mcause=11 from M-mode) are the most relevant. Students who fail the illegal test (tests that privileged instructions in U-mode raise mcause=2) may not have implemented the privilege check in their instruction decoder.

Petzold weave placement. Open with Petzold Ch 22 on timesharing mainframes. The relevant passage is the transition from the IBM 7094 single-user batch processing (each job owns the machine) to CTSS at MIT where the first supervisor mode was used to multiplex the machine among multiple users. "The priv bit in your CPU is a direct descendant of the IBM 7090 hardware mode bit described in the CTSS design document."

Connecting to Module 11. Mention now (even though Module 11 is weeks away) that the cycle count measured in Lab 2.1 is the trap overhead. When the scheduler does a context switch, it pays this overhead twice. Students who measure it precisely now will have a meaningful number to compare against in Module 11.


Module 3 (full): Compiler Register Allocator

Opening hook. Pull up the Lab 7.4 objdump -d output on the projector. Count lw/sw instructions together with the class. "Every one of these loads and stores goes to memory. Memory is slower than registers. We are going to make most of them disappear."

Timing. This is the most algorithm-heavy module. Linear scan is not difficult but requires care. Budget 60 minutes for Part B1 (the linear-scan implementation itself). Students who struggle with the expire_old function should be pointed toward the invariant: after expire_old, all intervals in active end after current_start.

Common issue: loop handling in liveness. Students who implement liveness only for straight-line code will produce incorrect intervals for functions with loops. The conservative fix (extend any interval that crosses a loop header to the loop footer) is simple but worth showing explicitly. Without it, the allocator will assign the same register to two different variables that are live simultaneously inside the loop.

Common issue: spill slot numbering. Students must assign distinct stack offsets for each spilled variable. A common bug is re-using offset 0 for every spill, which causes spilled values to overwrite each other. Show the allocate_stack_slot() function pattern before lab.
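
The allocator loop, the expire_old invariant, and distinct spill slots fit in one short Python sketch. It follows the classic linear-scan structure and is not the lab's reference code: the Interval class, the register names, and the 4-byte slot size are assumptions, and treating an interval that ends exactly at current_start as still live is a convention chosen here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    name: str
    start: int
    end: int
    reg: Optional[str] = None
    slot: Optional[int] = None   # stack offset in bytes if spilled


def linear_scan(intervals: List[Interval], registers: List[str], word: int = 4):
    free = list(registers)
    active: List[Interval] = []  # kept sorted by increasing end
    next_slot = 0

    def allocate_stack_slot() -> int:
        nonlocal next_slot
        offset, next_slot = next_slot, next_slot + word
        return offset            # every spill gets a fresh offset

    def expire_old(current_start: int) -> None:
        # On return, every interval left in active ends at or after current_start.
        while active and active[0].end < current_start:
            free.append(active.pop(0).reg)

    for cur in sorted(intervals, key=lambda iv: iv.start):
        expire_old(cur.start)
        if free:
            cur.reg = free.pop(0)
            active.append(cur)
        elif active and active[-1].end > cur.end:
            victim = active.pop()            # spill the interval that ends last
            cur.reg, victim.reg = victim.reg, None
            victim.slot = allocate_stack_slot()
            active.append(cur)
        else:
            cur.slot = allocate_stack_slot()
        active.sort(key=lambda iv: iv.end)
    return {iv.name: iv for iv in intervals}


result = linear_scan(
    [Interval("a", 0, 10), Interval("b", 1, 3), Interval("c", 2, 8),
     Interval("d", 4, 6), Interval("e", 5, 7)],
    registers=["t0", "t1"],
)
for iv in result.values():
    print(iv.name, iv.reg, iv.slot)
```

With two registers, a and c are spilled and land at different offsets. Replacing allocate_stack_slot with a function that always returns 0 reproduces the overwrite bug described above.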

Measurement calibration. The 3-5x reduction target from Lab 7.4 assumes the benchmark has functions with 10+ locals. If students test with trivial functions (2-3 locals), they will see less improvement and conclude the allocator does not work. Ensure they use the provided benchmark-allocator.jack which has the right characteristics.


Module 4 (full): Compiler Peephole

Opening hook. Show two assembly sequences side-by-side: the raw translator output for a simple push-add-pop and the peephole-optimized version. Count instructions. "Five instructions became two. The peephole saw the pattern; now it fires automatically on every function you compile."

Timing. The unit-test-first structure in Part A is important; enforce it. Students who write all patterns first and then write tests will discover edge cases late and fix them badly. The test harness is simple (diff two files) but the discipline of writing tests before patterns is the pedagogical point.

Common issue: category 1 (push/pop) is tricky. The stack-round-trip pattern depends on knowing the exact stack offsets your translator uses for push and pop. Students whose translator uses a different stack layout than the lab assumes will need to adjust the pattern. Allow 20 extra minutes for students who discover this mismatch.
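
A stack-round-trip pattern can be sketched as a four-line window match. The push and pop sequences below (addi sp, sp, -4 / sw REG, 0(sp) and lw REG, 0(sp) / addi sp, sp, 4) are an assumed layout for illustration, not necessarily the one the lab or a given student translator uses; adjusting the constants is exactly the fix for the mismatch described above.

```python
import re

PUSH_ADJUST = "addi sp, sp, -4"
POP_ADJUST = "addi sp, sp, 4"
STORE = re.compile(r"sw (\w+), 0\(sp\)")
LOAD = re.compile(r"lw (\w+), 0\(sp\)")


def match_push_pop(window):
    """Return (src, dst) if window is a push of src followed by a pop into dst."""
    if len(window) != 4 or window[0] != PUSH_ADJUST or window[3] != POP_ADJUST:
        return None
    store, load = STORE.fullmatch(window[1]), LOAD.fullmatch(window[2])
    if not (store and load):
        return None
    return store.group(1), load.group(1)


def peephole(lines):
    """Replace each push/pop round trip with a single register move."""
    out, i = [], 0
    while i < len(lines):
        regs = match_push_pop([line.strip() for line in lines[i:i + 4]])
        if regs:
            src, dst = regs
            if src != dst:
                out.append(f"mv {dst}, {src}")
            i += 4
        else:
            out.append(lines[i].strip())
            i += 1
    return out


raw = ["li a0, 7", "addi sp, sp, -4", "sw a0, 0(sp)",
       "lw a1, 0(sp)", "addi sp, sp, 4", "add a2, a1, a1"]
print(peephole(raw))
```

Because the match is on exact text, a translator that uses a different offset or a different stack-pointer adjustment simply never fires the pattern, which is the symptom students will report.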

Run peephole on both allocator-enabled and non-allocator output. Students should measure both combinations (peephole only, allocator + peephole) in Part C. Some patterns (especially load-after-store) fire much less when the allocator has already eliminated most stack accesses; this is worth observing and discussing.


Module 5 (full): Inlining and Constant Folding

Opening hook. Write on the board: int x = Math.abs(-3). "How many instructions does your compiler emit for this?" Then work through what the inliner + folder produces: no call and no run-time arithmetic (the constant -3 folds inside the inlined Math.abs body to the constant 3, and the result is just li a0, 3). "That is the super-optimization effect."
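
The folding step can be walked through on a toy expression tree. This sketch assumes a tuple-based IR and an inlined Math.abs body of the form "if x < 0 then -x else x"; both are invented for illustration and are not the course compiler's IR. It also ignores 32-bit wraparound.

```python
def fold(expr, env):
    """Constant-fold expr, substituting known argument values from env."""
    kind = expr[0]
    if kind == "const":
        return expr
    if kind == "var":
        return ("const", env[expr[1]]) if expr[1] in env else expr
    if kind == "neg":
        a = fold(expr[1], env)
        return ("const", -a[1]) if a[0] == "const" else ("neg", a)
    if kind == "lt":
        a, b = fold(expr[1], env), fold(expr[2], env)
        if a[0] == "const" and b[0] == "const":
            return ("const", int(a[1] < b[1]))
        return ("lt", a, b)
    if kind == "if":
        cond = fold(expr[1], env)
        if cond[0] == "const":            # branch decided at compile time
            return fold(expr[2] if cond[1] else expr[3], env)
        return ("if", cond, fold(expr[2], env), fold(expr[3], env))
    raise ValueError(f"unknown node {kind!r}")


# Hypothetical inlined body of Math.abs(x).
ABS_BODY = ("if", ("lt", ("var", "x"), ("const", 0)),
            ("neg", ("var", "x")), ("var", "x"))

folded = fold(ABS_BODY, {"x": -3})
print(folded)
if folded[0] == "const":
    print(f"li a0, {folded[1]}")
```

Without a known argument (an empty env) the tree comes back unchanged, which shows why folding only pays off after inlining has exposed the constant.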

Timing. The alpha-renaming in Part A2 is subtle. Budget 30 minutes for students to get the register renaming right. The common mistake: students rename callee registers to match caller argument registers (correct) but forget to rename the callee's local temporaries (which may collide with caller temporaries after inlining). Show a concrete collision example before lab.

The --force-inline flag. This flag is a pedagogical tool for Part A3. After students implement it, have them try force-inlining Math.multiply (15 instructions; above the threshold). They will observe code size blowup in the benchmark. The lesson: inlining thresholds exist for a reason.


Module 6 (full): SSA-IR and Compiler Explorer

Opening hook. Open godbolt.org on the projector. Compile a 5-line function at -O0 and -O2 and show the diff. "This is what 50 years of compiler research looks like. Today we find out where our compiler fits in this landscape."

Timing. This module is lighter on implementation (no new code, just measurement and analysis). Use the saved time for discussion of the SSA lecture material, which is conceptually harder than the lab.

godbolt.org reliability. Have a backup plan if godbolt.org is unavailable (a locally installed gcc cross-compiler with the same flags). The lab works either way; godbolt is just more convenient for quick iteration.

Managing expectations for Part C (loop function). The loop gap is large. Students who have done rigorous work in Labs 3-5 will still be significantly behind gcc -O2 on loops because they have not implemented LICM, induction variable optimization, or vectorization. Frame this as: "you now know exactly what it would take to get there."


Module 7 (full): Sv32 Paged Virtual Memory + MMU

Opening hook. Ask the class: "What address does the first instruction of your CSA-101 user program run at?" Answer: physical address 0x00000000 -- the kernel and user program live in the same flat space, trusting each other not to stomp on each other's memory. "Today that ends. Every user program gets its own address space, and the hardware enforces the boundary."

Timing. This is the longest lab in CSA-201. The two-level page walk (Part C2) involves pointer-chasing through physical memory while executing in virtual address space. Budget 90 minutes for Part A (wiring mmu.v) and another 90 minutes for Part C (the page-fault handler). Part B (setting up the page table) is ~30 minutes of mechanical but error-prone work.

Common issue: bare-mode regression. When mmu.v is first wired in with satp.MODE=0, all prior tests should still pass (paddr = vaddr in bare mode). Students who fail the rv32ui-p regression after integrating mmu.v almost always have a wiring error on the satp input. Verify that satp reads as zero at reset (MODE=0 means bare mode).

Common issue: SFENCE.VMA forgotten. After writing a new PTE, the TLB still caches the old translation (if it had been loaded). Students who remap a VA and then immediately access it without executing SFENCE.VMA will get the old PA. The remap test in Part A3 is specifically designed to expose this.

Common issue: A/D bit traps. The mmu.v reference implementation traps if A=0 (access bit not set) or if a store has D=0 (dirty bit not set). When students initialize their page tables in assembly, they often forget to set A and D in the PTE. The resulting fault (mcause=15 on a perfectly valid mapped store) is confusing. Remind students to include PTE_A | PTE_D in all writable PTEs.

Page-fault handler walk-through. Before releasing Part C to lab, walk through the handler algorithm on the board with a concrete example: VA=0x00003000, VPN[1]=0, VPN[0]=3. Show the arithmetic: L1_addr = (satp.PPN << 12) | (VPN[1] << 2), L0_addr = (L1_PTE.PPN << 12) | (VPN[0] << 2). Students who understand the indexing arithmetic find the code straightforward. Students who do not will write the code incorrectly.
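
The board example can be checked with a short Python model of the two-level walk. This is a sketch under stated assumptions: the page-table addresses (root at 0x1000, second level at 0x2000, target page at 0x5000) are invented for the example, superpages and R/W/X/U permission checks are omitted, and the A/D policy mirrors the trapping behavior described above.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PTE_A, PTE_D = 1 << 6, 1 << 7


class PageFault(Exception):
    def __init__(self, mcause: int):
        super().__init__(f"page fault, mcause={mcause}")
        self.mcause = mcause


def sv32_translate(satp: int, vaddr: int, read_word, is_store: bool = False) -> int:
    """Walk a two-level Sv32 table; read_word(paddr) returns a 32-bit PTE."""
    fault = PageFault(15 if is_store else 13)
    vpn1 = (vaddr >> 22) & 0x3FF
    vpn0 = (vaddr >> 12) & 0x3FF
    l1_addr = ((satp & 0x3FFFFF) << 12) | (vpn1 << 2)
    l1_pte = read_word(l1_addr)
    if not l1_pte & PTE_V:
        raise fault
    l0_addr = ((l1_pte >> 10) << 12) | (vpn0 << 2)
    l0_pte = read_word(l0_addr)
    if not l0_pte & PTE_V:
        raise fault
    if not l0_pte & PTE_A or (is_store and not l0_pte & PTE_D):
        raise fault                       # A/D trap policy
    return ((l0_pte >> 10) << 12) | (vaddr & 0xFFF)


# Hypothetical physical memory holding the two page-table entries.
memory = {
    0x1000: (2 << 10) | PTE_V,                                   # L1[0] -> table at 0x2000
    0x200C: (5 << 10) | PTE_V | PTE_R | PTE_W | PTE_A | PTE_D,   # L0[3] -> page at 0x5000
}
reads = []


def read_word(paddr: int) -> int:
    reads.append(paddr)
    return memory.get(paddr, 0)


satp = (1 << 31) | 1    # MODE=1 (Sv32), root PPN=1
print(hex(sv32_translate(satp, 0x00003004, read_word)), [hex(a) for a in reads])
```

For VA 0x00003004 the walker reads 0x1000 and then 0x200C, matching the L1_addr and L0_addr arithmetic above with VPN[1]=0 and VPN[0]=3. Clearing PTE_D in the leaf entry and retrying as a store raises the mcause=15 fault that students find confusing.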

Petzold weave placement. Petzold Ch 16 covers the 8086 segment registers as an early hardware mechanism for dividing up memory. The weave: "The 8086 had four segment registers: code, data, stack, extra. Each held only a base address, with no limit check and no protection. The 286 added descriptors with a base and a limit, and the 386 added paging on top. Sv32 is the RISC-V form of that last step: every process gets its own complete address space, not just separate segments. The two-level page table is the mechanism that makes this scale." Read the relevant passage before the architecture comparison sidebar.

Architecture comparison sidebar notes. The four-column table in week-7 (6502 / x86-16 / x86-32 / RISC-V Sv32) is the most important in the course. x86-32's PAE extension (36-bit physical addresses through a three-level table) and x86-64's four-level walk are natural forward pointers for students continuing to VCA-X86-201.

Reference implementation: vca-csa-201/hdl/core/mmu.v (F7 dispatch; sim-certified; 1-entry iTLB + 1-entry dTLB; walker FSM S_IDLE/S_L1/S_L0/S_DONE; A/D bit trap policy).


Module 8 (full): PMP + W^X Enforcement

Opening hook. Reference the Ch 12 §12.11 exploit from the course outline: "write shellcode to the stack, branch to it." Without PMP, this works because the stack is both writable and executable. With PMP W^X, either the write fails (store-access fault) or the branch fails (instruction-access fault). The exploit requires both to succeed; W^X ensures at least one fails. "This is the hardware foundation of every modern OS's exploit mitigation."

Timing. Part A (wiring three PMP unit instances) is ~45 minutes for most students. The wiring is mechanical but the number of ports is large; use a cheat-sheet with the exact port names from pmp.v. Part B (writing the setup routine and running four test cases) is ~60 minutes. Part C (synthesizing with SignalTap and demonstrating on DE10-Nano) is the longest -- budget 2 hours for Quartus synthesis plus hardware testing.

Common issue: L bit misuse. Students sometimes set L=1 (Lock) on all PMP entries to "be safe." L=1 prevents the M-mode kernel from modifying the entry (even in M-mode with MIE=0). If a student locks an entry with the wrong base or permissions, they cannot fix it without a hardware reset. Advise: leave L=0 for the lab; only set L=1 if you want the entry to survive a privilege escalation attack (not needed for these labs).

Common issue: priority confusion. PMP entry 0 has the highest priority. Students who put a wide permissive entry at index 0 and a restrictive entry at index 1 will find the restrictive entry never fires (the permissive entry matches first). The W^X configuration must put the code segment entry (R+X, no W) and the data segment entry (R+W, no X) as the first two entries.
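
The first-match rule is easy to demonstrate with a toy model. This sketch represents each entry as a half-open address range with a permission string, which is a simplification of the real TOR/NA4/NAPOT encodings; the addresses and the deny-on-no-match default (the U-mode behavior) are assumptions for the example.

```python
def pmp_check(entries, addr: int, access: str) -> bool:
    """Return True if `access` ('r', 'w' or 'x') is allowed at addr.

    entries is a list of (lo, hi, perms); the lowest-numbered match wins.
    """
    for lo, hi, perms in entries:
        if lo <= addr < hi:
            return access in perms      # later entries are never consulted
    return False                        # assumption: no match means deny


CODE = (0x0000, 0x4000, "rx")           # R+X, no W
DATA = (0x4000, 0x8000, "rw")           # R+W, no X
WIDE = (0x0000, 0x8000, "rwx")          # permissive catch-all

good = [CODE, DATA]
bad = [WIDE, CODE, DATA]                # permissive entry placed at index 0

print(pmp_check(good, 0x4100, "x"), pmp_check(bad, 0x4100, "x"))
```

With the permissive entry first, an instruction fetch from the data segment is allowed and W^X is silently off, even though the restrictive entries are present in the table.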

Common issue: fault cause vs PMP access type. The three pmp_unit instances have different fault causes: u_pmp_fetch fires instruction-access fault (mcause=1), u_pmp_load fires load-access fault (mcause=5), u_pmp_store fires store-access fault (mcause=7). Students who wire all three to the same fault cause will fail the Verilator test cases.

SignalTap setup notes. When synthesizing with SignalTap, the tap must be on the post-PMP signal (store_pmp_ok, fetch_pmp_ok) rather than the memory bus directly. The memory bus is wide and would exceed the SignalTap sample depth. Triggering on store_pmp_ok == 0 captures exactly the moment a write is blocked.

Petzold weave placement. Petzold Ch 16 continues here. The 8086 itself had no memory protection; the 286 introduced read-only segments via the segment descriptor table, along with four privilege rings (ring 0/1/2/3), and the 386 carried both forward. PMP is the RISC-V equivalent: a hardware table that the OS programs at boot to partition physical memory. The key difference from the segment descriptor table: PMP entries apply to physical addresses (after MMU translation), making them composable with the Sv32 MMU rather than a replacement for it.

Reference implementation: vca-csa-201/hdl/core/pmp.v (F5 dispatch; sim-certified; 8-entry cascade; OFF/TOR/NA4/NAPOT addressing modes; three instances in cpu.v).


Module 9 (full): Stack Canaries and CFI Shadow Stack

Opening hook. Show the classic buffer-overflow diagram: local buffer at the bottom of the stack frame, saved ra above it. "A 33-byte write into a 32-byte buffer overwrites the saved return address. The CPU does not know the difference between a legitimate ra and a forged one. The canary is the CPU's lie detector for this specific case."

Timing. Part A (compiler modification for canary prologue/epilogue) is ~60 minutes. Students who have a clean compiler architecture from Labs 3-5 can add the canary emission in one function. Students with tangled code may need to refactor before adding the canary. Part B (PMP-protected shadow stack) is ~45 minutes. Part C (integration test and measurement) is ~30 minutes.

Common issue: CANARY_OFFSET arithmetic. The canary sits below the saved ra in the stack frame. If the compiler places locals above the saved ra (growing upward), the canary does not protect against a buffer overflow that overwrites ra from below. Verify the stack frame layout before the canary offset calculation: [saved fp] [saved ra] [canary] [locals...], growing downward.

Common issue: canary check timing. The epilogue check must happen before the ret instruction. Students who add the check after ret (a logical impossibility, but students sometimes write the assembly epilogue in the wrong order) will not catch overflows. Verify the assembly output with objdump -d before testing.

Shadow stack system call overhead. SYS_SHADOW_PUSH and SYS_SHADOW_POP each require an ECALL. With the Lab 2.1 trap overhead measured (~20-40 cycles per ECALL), the shadow stack adds two ECALLs, or roughly 40-80 cycles, per function call. For hot inner loops, this is significant. The overhead measurement in Part C quantifies this; use it to discuss when CFI is worth the cost.

The ROP demonstration. Lab 9.1 Part B3 asks students to construct a "ROP gadget redirect" by overwriting ra before a function returns. This is the closest thing in the CSA-201 lab sequence to an actual exploit technique. Brief students on the pedagogical purpose: we demonstrate the attack to verify the defense. The shadow stack should catch the forged ra and terminate the process.

Petzold weave placement. Petzold Ch 22 covers the history of operating system security. The relevant passage is on the Morris Worm (1988), which used a buffer overflow in fingerd to overwrite ra and redirect execution. "The Morris Worm did in 1988 exactly what the demo program in Part B3 does. The stack canary did not exist yet. The CFI shadow stack did not exist yet. We are building defenses that would have stopped it." Frame the lab as building what the 1988 internet lacked.


Module 10 (full): Tracing Garbage Collection

Opening hook. Open the CSA-101 Memory.lib source. "Every Memory.alloc returns a block. Every block that is no longer referenced is leaked. Over time, the heap fills up and new allocations fail." Show a simple linked-list program that creates 1000 nodes and discards the last 500. "Half the heap is garbage. The GC reclaims it automatically."

Timing. Part A (type-header instrumentation) is ~45 minutes. The pointer arithmetic (returned pointer is one word past the header) must be consistent: alloc returns ref, header is at ref-1, size is at (ref-1)[14:0]. Part B (mark and sweep) is the longest at ~90 minutes. Part C (measurement) is ~30 minutes.

Common issue: mark bit placement. The mark bit is bit 15 of the header word (the boundary between the size field and the ptr_bitmask). Students who put the mark bit at bit 0 of the header corrupt the size field. Students who put it at bit 16 corrupt the lowest ptr_bitmask bit. The reference implementation uses bit 15 specifically because the size is constrained to 15 bits (max 32,767 words) and the bitmask is 16 bits. Verify the bit field layout before starting the mark phase.
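
The bit field layout can be pinned down with a few helper functions before students touch the mark phase. This is a sketch of the layout described above (size in bits 14:0, mark in bit 15, ptr_bitmask in bits 31:16); the helper names are illustrative, not those of the reference implementation.

```python
SIZE_MASK = 0x7FFF          # bits 14:0, at most 32,767 words
MARK_BIT = 1 << 15          # bit 15
BITMASK_SHIFT = 16          # ptr_bitmask occupies bits 31:16


def make_header(size_words: int, ptr_bitmask: int) -> int:
    if not 0 <= size_words <= SIZE_MASK:
        raise ValueError("size does not fit in 15 bits")
    if not 0 <= ptr_bitmask <= 0xFFFF:
        raise ValueError("ptr_bitmask does not fit in 16 bits")
    return (ptr_bitmask << BITMASK_SHIFT) | size_words


def size_of(header: int) -> int:
    return header & SIZE_MASK


def ptr_bitmask_of(header: int) -> int:
    return (header >> BITMASK_SHIFT) & 0xFFFF


def is_marked(header: int) -> bool:
    return bool(header & MARK_BIT)


def set_mark(header: int) -> int:
    return header | MARK_BIT


def clear_mark(header: int) -> int:
    return header & ~MARK_BIT & 0xFFFFFFFF


header = make_header(size_words=5, ptr_bitmask=0b101)
marked = set_mark(header)
print(hex(header), hex(marked), size_of(marked), bin(ptr_bitmask_of(marked)))
```

Setting and clearing the mark leaves both the size and the bitmask intact. Moving MARK_BIT to bit 0 or bit 16 in this model makes the corresponding field change, which is a quick way to show each of the two bugs.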

Common issue: worklist overflow. The worklist in gc_mark() has a fixed size (MAX_ROOTS). If the live set contains more objects than MAX_ROOTS, the GC will miss reachable objects and prematurely free live data. For the lab, MAX_ROOTS=64 is sufficient (the benchmark uses at most 50 live nodes). Production GCs use a dynamic worklist (or iterative deepening to bound stack depth). Mention this as a known limitation.
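
The overflow failure can be reproduced with a toy mark phase. This is a sketch, not the lab's gc_mark(): the heap is a plain dictionary of object names to child lists, and in this model the limit applies to the number of pending objects rather than to the total live set.

```python
def gc_mark(children, roots, max_worklist: int):
    """Mark everything reachable from roots using a bounded worklist.

    Returns (marked, dropped); any dropped object stays unmarked and
    would be freed by the sweep even though it is reachable.
    """
    marked, worklist, dropped = set(), [], 0

    def push(obj):
        nonlocal dropped
        if obj in marked:
            return
        if len(worklist) >= max_worklist:
            dropped += 1                # silently lost: this is the bug
            return
        marked.add(obj)                 # mark on push, so each object enters once
        worklist.append(obj)

    for root in roots:
        push(root)
    while worklist:
        for child in children.get(worklist.pop(), ()):
            push(child)
    return marked, dropped


heap = {"root": ["a", "b", "c"], "a": [], "b": [], "c": []}
print(gc_mark(heap, ["root"], max_worklist=8))
print(gc_mark(heap, ["root"], max_worklist=2))
```

With a limit of 2, object c is reachable but never marked. A lab implementation should at least detect the overflow and halt rather than continue to the sweep.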

Conservative scan discussion. Lab 10.1 Part A2 offers a conservative stack scan as an alternative to emitting precise stack maps. Walk students through why conservative scanning produces false positives: an integer that happens to equal a valid heap address will be treated as a live reference, keeping the object alive longer than necessary. This is a space leak, not a correctness bug (no live object is freed). Ask: "Is a space leak acceptable in a security context?" (Answer: it depends on whether the leaked memory can be inferred by an attacker to learn about heap layout.)

Stop-the-world and the scheduler. The GC must disable the timer interrupt before gc_mark() begins. If a context switch fires mid-mark, the new process may allocate an object the marker never visits, or store a pointer into an object the marker has already scanned; the sweep then frees data that is still live. Walk students through the race condition before lab. The fix is short (clear MTIE, bit 7 of mie: li t0, 0x80 then csrc mie, t0) but the underlying concurrency issue is worth explaining.

Petzold weave placement. Petzold Ch 22 discusses Lisp and its heritage. The weave: "John McCarthy's Lisp (1958) was the first language with automatic memory management. The garbage collector was not an afterthought; it was the mechanism that made Lisp's recursive list processing practical. The mark-and-sweep algorithm we implement in Lab 10.1 is recognizably similar to McCarthy's original 1960 description."


Module 11 (full): Preemption and Round-Robin Scheduler

Opening hook. Run the Lab 11.1 two-task demo on DE10-Nano (or simulate it in Verilator if hardware is not ready). Both OLED lines advance simultaneously. "Before today, your Virtus OS ran one program at a time. A single infinite loop would freeze the machine. Now the timer fires every 10 ms, saves the running program's state, and runs the next one. The CPU gives the illusion of simultaneity."

Timing. Part A (CLINT wiring) is ~60 minutes. The CLINT is a separate hardware module from the main CPU; students must wire irq_mtip to the CPU's MIP.MTIP register and ensure the interrupt fires the trap handler. Part B (context switch) is the most challenging: ~90 minutes. The register save/restore must be complete (all 32 registers) and must use a consistent ABI with the trap frame. Part C (two-task demo and measurement) is ~45 minutes.

Common issue: mtimecmp atomicity. The CLINT mtimecmp register is 64-bit, but your CPU writes 32-bit words. If you write the low word first, the intermediate value (new low word, old high word) may briefly be less than or equal to mtime, triggering a spurious interrupt. The correct sequence: write lo=0xFFFFFFFF (push deadline far future), write hi (if needed), then write the actual lo. This is documented in the clint.v source; remind students to read the reference implementation header before coding.
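
The two write orders can be compared in a small simulation. This is a sketch that treats mtime as frozen during the update and models only the comparison mtime >= mtimecmp; the function name and the example values are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def write_mtimecmp(old: int, new: int, mtime: int, safe: bool):
    """Apply 32-bit writes to a 64-bit mtimecmp.

    Returns (final_value, spurious), where spurious is True if the
    interrupt condition held after any write except the last one.
    """
    lo, hi = old & MASK32, old >> 32
    new_lo, new_hi = new & MASK32, new >> 32
    if safe:
        steps = [("lo", MASK32), ("hi", new_hi), ("lo", new_lo)]
    else:
        steps = [("lo", new_lo), ("hi", new_hi)]
    spurious = False
    for index, (half, value) in enumerate(steps):
        if half == "lo":
            lo = value
        else:
            hi = value
        is_last = index == len(steps) - 1
        if not is_last and mtime >= ((hi << 32) | lo):
            spurious = True
    return (hi << 32) | lo, spurious


old = 0x0000_0000_FFFF_FF00     # current deadline
new = 0x0000_0001_0000_0100     # next deadline crosses the 32-bit boundary
now = 0x0000_0000_FFFF_F000     # mtime, still before the old deadline

print(write_mtimecmp(old, new, now, safe=False))
print(write_mtimecmp(old, new, now, safe=True))
```

Both orders end with the same mtimecmp, but only the low-word-first order passes through a value below mtime. The case to test is a deadline that carries into the high word, since most short test runs never reach it.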

Common issue: all 32 registers. Students who save only the callee-saved registers (x8-x9, x18-x27) will corrupt the caller-saved registers (x1, x5-x7, x10-x17, x28-x31) of the interrupted process. The context switch must save all 32 registers. The easiest debugging check: add a canary pattern (0xDEAD0000 | reg_num) to all 32 registers before a context switch, verify they are restored correctly after the switch.

Common issue: mepc vs mepc+4. Unlike ECALL (where mepc must be advanced by 4 before MRET), a timer interrupt saves the interrupted instruction's PC as mepc. MRET must return to the interrupted instruction, not mepc+4. Students who add 4 to mepc in the timer interrupt handler will skip one instruction per preemption -- a subtle bug that manifests as incorrect loop termination.
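
The rule for the return address depends only on mcause, and can be written as one small function. This is a sketch of the decision, assuming RV32 (interrupt flag in mcause bit 31); resume_pc is a hypothetical name, and a real handler would do this in assembly before MRET.

```python
INTERRUPT_BIT = 1 << 31
ECALL_FROM_U, ECALL_FROM_M = 8, 11


def resume_pc(mcause: int, mepc: int) -> int:
    """Value to leave in mepc before executing MRET."""
    if mcause & INTERRUPT_BIT:
        return mepc             # interrupt: the instruction at mepc has not run yet
    if mcause in (ECALL_FROM_U, ECALL_FROM_M):
        return mepc + 4         # ECALL: step past it or it traps forever
    return mepc                 # other exceptions: retry after the handler fixes the cause


print(hex(resume_pc(0x80000007, 0x100)))   # machine timer interrupt
print(hex(resume_pc(8, 0x100)))            # ECALL from U-mode
```

Handlers that share one exit path for ECALL and the timer interrupt are the usual source of the skipped-instruction bug, because the unconditional add of 4 was correct when the handler only served ECALL.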

Context-switch cost target. The expected incremental cost (over the Lab 2.1 trap overhead) is ~70-100 cycles. This comes from: 32 SW instructions (save) + 32 LW instructions (restore) = 64 memory operations at 1-2 cycles each = 64-128 cycles. If students measure significantly more, look for unnecessary memory accesses in the scheduler path. If they measure significantly less, verify that all 32 registers are actually being saved.

Petzold weave placement. Petzold Ch 22 opens with timesharing mainframes. The weave: "The IBM 7094 had no timer interrupt. A job ran until it called for the next one. CTSS at MIT added a clock interrupt to the 7090 hardware specifically to enable preemption. The CLINT module in your CPU is a direct descendant of that MIT hardware modification. The timer fires; the trap handler saves the current job's state; the scheduler picks the next. Sixty years later, the algorithm is unchanged."

Reference implementation: vca-csa-201/hdl/core/clint.v (F13 multi-hart CLINT; irq_mtip level-triggered from mtime >= mtimecmp; MMIO base 0x02000000; sub-word write atomicity discipline).


Module 12 (full): Driver-Writing Track

Opening hook. Hold up the SSD1306 OLED module. "This costs $3. Your CPU has no driver for it. The datasheet is 60 pages. By the end of lab, you will have a working driver that can display any character on this screen -- and you will know exactly why every byte in the initialization sequence matters."

Timing. Part A (I2C bit-bang) is ~60 minutes. Part B (SSD1306 init) is ~60 minutes; debugging an incorrect initialization sequence on physical hardware without an oscilloscope is tedious, so strongly encourage students to instrument with a logic analyzer or SignalTap before starting. Part C (character rendering) is ~45 minutes.

Common issue: I2C START/STOP timing. The SDA-falls-while-SCL-high (START) and SDA-rises-while-SCL-high (STOP) conditions must be generated by the software, not the hardware. Students who drive SCL low before SDA do not generate a START at all: SDA then falls while SCL is low, which the slave treats as an ordinary data change, so it never sees the start of the transaction and never ACKs the address byte. Show the waveform pattern for START on the board before lab.
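
The ordering rule can be demonstrated with a toy bus model that reports a condition whenever SDA changes while SCL is high. This is a sketch only: the Bus class and function names are invented, and the delays a real bit-bang driver needs between edges are left out.

```python
class Bus:
    """Two-wire bus that records START/STOP conditions as a slave would see them."""

    def __init__(self):
        self.scl, self.sda, self.events = 1, 1, []   # idle: both lines high

    def set_scl(self, level: int) -> None:
        self.scl = level

    def set_sda(self, level: int) -> None:
        if level != self.sda and self.scl == 1:
            self.events.append("START" if level == 0 else "STOP")
        self.sda = level


def i2c_start(bus: Bus) -> None:
    bus.set_sda(1)
    bus.set_scl(1)
    bus.set_sda(0)      # SDA falls while SCL is high: START
    bus.set_scl(0)      # now safe to begin clocking bits


def i2c_stop(bus: Bus) -> None:
    bus.set_scl(0)
    bus.set_sda(0)      # prepare SDA low while SCL is low
    bus.set_scl(1)
    bus.set_sda(1)      # SDA rises while SCL is high: STOP


def broken_start(bus: Bus) -> None:
    bus.set_scl(0)      # wrong order: clock lowered first
    bus.set_sda(0)      # SDA falls while SCL is low: no START


good, bad = Bus(), Bus()
i2c_start(good)
i2c_stop(good)
broken_start(bad)
print(good.events, bad.events)
```

The correct order yields one START and one STOP, while the reversed order yields no event at all. The same check on a logic-analyzer trace is: find the SDA edge, then read the SCL level at that instant.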

Common issue: initialization sequence order. The SSD1306 init sequence is order-sensitive. Some commands take effect immediately; others configure internal state that is applied later. Students who reorder the init sequence (to "skip the ones they don't understand") end up with a display that shows nothing, or shows only pixels from a previous init. Advise: send all init commands in the order shown in the lecture; do not skip any.

Logic analyzer first policy. Before students look at the OLED, they should verify the I2C waveform with a logic analyzer or SignalTap. An incorrect waveform will not show an ACK; a missing pull-up resistor will show SDA stuck low (or floating) after the CPU releases it, because on an open-drain bus only the resistor can pull the line high. Students who skip this step spend hours debugging at the OLED level when the problem is at the I2C bit level.

ENC28J60 note. The ENC28J60 SPI driver (mentioned in the overview) is an optional extension in Lab 12.1. Students who want to continue to VCA-NET-201 should implement it. The key debugging challenge is SPI clock polarity: CPOL=0, CPHA=0 for the ENC28J60 (data sampled on rising edge). A wrong polarity setting corrupts all register reads silently -- the read returns 0xFF instead of the register value.

Toolchain Diary entries. Module 12 is explicitly a Toolchain Diary emphasis module. Every new tool -- sigrok/pulseview for the logic analyzer, SignalTap for on-chip probing, a scope for analog verification -- should get a Diary entry. Grade these specifically in Module 12 to reinforce the habit before the capstone.

Petzold weave placement. Petzold Ch 16+18 cover device I/O. The weave: "Petzold's telegraph relay (Chapter 1) was the first device driver: a human operator who knew the protocol for encoding dots and dashes. The SSD1306 datasheet is the protocol. Your driver is the automated operator. The difference between a hardware engineer who reads datasheets and writes drivers, and a software engineer who uses pre-built libraries, is that the first one can build what does not yet exist."


Module 13 (full): External DRAM and FAT16 Filesystem

Opening hook. Print out the BRAM utilization report from the current synthesis. "We have 56 block RAMs. Each is 18 Kbits = 2.25 KiB. Total: 126 KiB. Your stack alone is 4 KiB. Your kernel is 8 KiB. You have 114 KiB left for the heap. A linked list of 1000 32-byte nodes uses roughly 31 KiB. Three such lists take roughly 94 KiB; a fourth does not fit." Then reveal: "The DE10-Nano has 1 GB of DDR3. Today we connect to it."

Timing. Part A (DDR3 address map and heap extension) is ~60 minutes. The memory decoder changes are mechanical but must be tested carefully. Part B (FAT16 walker) is the longest at ~90 minutes. Part C (config file integration) is ~30 minutes.

Common issue: AXI bridge latency. The HPS F2H_SDRAM bridge introduces variable latency (~10-100 cycles per access depending on DRAM row state). Students who write tight timing loops that assume 1-cycle memory access (as in BRAM) will fail. The DDR3 latency measurement in Part A3 should be done before any latency-sensitive code is written.

Common issue: FAT16 endianness. FAT16 structures use little-endian byte order (same as RISC-V). Students who wrote their parser on a big-endian machine (or used ntohl/htonl) may byte-swap inadvertently. Verify the boot sector magic bytes (0x55, 0xAA at offsets 510-511) as a sanity check before parsing BPB fields.

Common issue: cluster 2 is the first data cluster. FAT16 reserves cluster values 0 (free) and 1 (reserved) in the FAT table. The first data cluster is cluster 2. Students who compute data_start correctly but then use cluster 0 or 1 to index into the data area will read the FAT table or reserved sectors instead of file data. Show the formula data_lba = data_start + (cluster - 2) * sectors_per_cluster explicitly.
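
The sanity check and the cluster formula can be tried on a synthetic boot sector. This is a sketch that assumes the standard BIOS Parameter Block field offsets (bytes 11-18 and 22-23) and a volume that starts at LBA 0; the function names and the example geometry are invented for illustration.

```python
import struct


def parse_boot_sector(sector: bytes) -> dict:
    """Extract the fields needed to locate the FAT16 data area."""
    if len(sector) < 512 or sector[510] != 0x55 or sector[511] != 0xAA:
        raise ValueError("missing 0x55 0xAA boot signature")
    # "<" selects little-endian with no padding, matching the on-disk layout.
    bytes_per_sector, sectors_per_cluster, reserved, num_fats, root_entries = \
        struct.unpack_from("<HBHBH", sector, 11)
    (sectors_per_fat,) = struct.unpack_from("<H", sector, 22)
    root_dir_sectors = (root_entries * 32 + bytes_per_sector - 1) // bytes_per_sector
    data_start = reserved + num_fats * sectors_per_fat + root_dir_sectors
    return {"sectors_per_cluster": sectors_per_cluster, "data_start": data_start}


def cluster_to_lba(data_start: int, cluster: int, sectors_per_cluster: int) -> int:
    if cluster < 2:
        raise ValueError("clusters 0 and 1 do not map to the data area")
    return data_start + (cluster - 2) * sectors_per_cluster


# Synthetic boot sector: 512-byte sectors, 4 sectors per cluster,
# 1 reserved sector, 2 FATs of 32 sectors, 512 root directory entries.
sector = bytearray(512)
struct.pack_into("<HBHBH", sector, 11, 512, 4, 1, 2, 512)
struct.pack_into("<H", sector, 22, 32)
sector[510], sector[511] = 0x55, 0xAA

info = parse_boot_sector(bytes(sector))
print(info, cluster_to_lba(info["data_start"], 2, info["sectors_per_cluster"]))
```

Cluster 2 maps to data_start itself and each following cluster advances by sectors_per_cluster. Rejecting cluster values below 2 in the helper turns the indexing bug into an immediate error instead of a read from the wrong region.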

FAT16 scope limitation. Lab 13.1 implements a read-only FAT16 walker. Write support (creating or modifying files) requires updating both the FAT table and the directory entry, with careful handling of cluster allocation. Read-only is sufficient for the capstone (reading BOOT.CFG); write support is offered as an advanced extension for VCA-EMB-201-bound students.

DDR3 as heap vs filesystem buffer. Students sometimes confuse the two DDR3 uses: (1) the heap extension (DDR3 as object allocation space, managed by the GC) and (2) the filesystem sector buffer (DDR3 as a large I/O buffer for FAT16 sector reads). These are separate uses of the same physical address space. Ensure the heap allocation region (0x10000000 and up) does not overlap the filesystem buffer (which should be a fixed address in the lower DDR3 range, or allocated from the heap itself).

Petzold weave placement. Petzold Ch 14 covers magnetic storage (disk drives and tape). The weave: "Petzold's storage chapter ends with the hard disk -- a spinning platter with a head that reads and writes sectors. The SD card is the same abstraction: sectors of 512 bytes, addressed by LBA. The FAT16 filesystem on top of it is the same structure that DOS 3.0 introduced for hard disks in 1984. The physical medium changed from spinning iron to flash; the logical structure is unchanged."


Module 14 (full): Capstone Integration Week

Instructor role. Module 14 is not a lecture module. The instructor's role is: (1) hold daily office hours during the integration sprint, (2) triage gate failures and point students to the correct lab section, (3) verify that Tier 1 gate demonstrations are live on hardware (not Verilator), and (4) evaluate Tier 2 write-ups.

Gate triage approach. When a student reports a gate failure, start with the gating lab rather than the gate itself. Gate 3 (page-fault handler) fails most often because Lab 7.1 Part C2 was not fully debugged. Gate 5 (scheduler) fails most often because the context switch (Lab 11.1 Part B2) has a register save/restore bug. Gate 6 (OLED) fails most often because the SSD1306 initialization (Lab 12.1 Part B2) is incomplete.

Common integration failure: SFENCE.VMA after context switch. When the scheduler restores a different process's satp, the TLB may still hold stale translations from the previous process's address space. The scheduler must execute SFENCE.VMA after writing satp. Students who omit this will see random page faults (the TLB returns a stale PA for the new process's VA).
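
The required ordering can be shown in a few lines. This is a sketch under stated assumptions, not the Lab 11.1 code: the names write_satp, sfence_vma_all, and switch_address_space are hypothetical, the RISC-V branch assumes the code runs at a privilege level allowed to write satp, and the non-RISC-V branch is a host stand-in that only records what happened so the ordering can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__riscv)
/* On target: write the satp CSR, then flush all TLB entries. */
static inline void write_satp(uintptr_t v)
{
    __asm__ volatile("csrw satp, %0" : : "r"(v) : "memory");
}
static inline void sfence_vma_all(void)
{
    __asm__ volatile("sfence.vma zero, zero" : : : "memory");
}
#else
/* Host stand-ins (hypothetical): record the write and count fences. */
uintptr_t fake_satp;
unsigned fence_count;
static inline void write_satp(uintptr_t v) { fake_satp = v; }
static inline void sfence_vma_all(void) { fence_count++; }
#endif

/* Switch to the next process's address space. The fence comes after the
 * satp write so stale translations from the old process are discarded. */
void switch_address_space(uintptr_t next_satp)
{
    write_satp(next_satp);
    sfence_vma_all();
}
```

A quick triage question for a student with random page faults after a context switch is whether anything equivalent to sfence_vma_all() follows every satp write in their scheduler.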

Common integration failure: GC during a context switch. If the timer interrupt fires while a process is mid-allocation and Memory_gc() then runs inside the interrupt handler, a collector that does not save and restore the process state around the sweep can leave the process's registers holding references to freed memory. The stop-the-world discipline (disable the timer during GC) must be strictly enforced.
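
The stop-the-world discipline amounts to a save, disable, collect, restore wrapper around the collector. The sketch below models it on the host: timer_irq_enabled is a hypothetical stand-in for the hardware timer-interrupt enable bit, and gc_stop_the_world and probe_collect are illustrative names. On the board, the two timer helpers would manipulate the relevant interrupt-enable CSR bit instead of a variable, and the collector passed in would be the student's Memory_gc().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the timer-interrupt enable bit. */
bool timer_irq_enabled = true;

/* Set by probe_collect: was the timer still enabled inside the GC? */
bool timer_seen_enabled_in_gc;

/* Disable the timer interrupt and return its previous state. */
bool timer_irq_save_and_disable(void)
{
    bool was_enabled = timer_irq_enabled;
    timer_irq_enabled = false;
    return was_enabled;
}

/* Put the timer interrupt back to the state it had before the GC. */
void timer_irq_restore(bool was_enabled)
{
    timer_irq_enabled = was_enabled;
}

/* Run the collector with the timer interrupt masked for its duration. */
void gc_stop_the_world(void (*collect)(void))
{
    bool was_enabled = timer_irq_save_and_disable();
    collect();
    timer_irq_restore(was_enabled);
}

/* Test collector: only records whether the timer was enabled. */
void probe_collect(void)
{
    timer_seen_enabled_in_gc = timer_irq_enabled;
}
```

Restoring the saved state, rather than unconditionally re-enabling, keeps the wrapper safe when it is called from a path that already had the timer masked.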

Tier 2 write-up evaluation. Section 4 (what does not work) is where students earn credibility. A write-up that claims everything works perfectly but has a Gate 4 failure in the video will score poorly. A write-up that honestly documents "PMP covers only two regions; DMA bypasses it; the ENC28J60 driver writes to registers but has not been tested with actual packets" demonstrates engineering rigor. Grade section 4 generously if the limitations are specific and accurate.

Connecting to subsequent courses. Close Module 14 with a brief map of what Virtus OS v2 unlocks. ARM-201 takes the same OS architecture and ports it to AArch64. EMB-201 extends the driver track to a full embedded system with sensors and actuators. NET-201 uses the ENC28J60 driver as the starting point for a full TCP/IP stack. X86-201 studies how the x86_64 ISA implements the same concepts (paging, a PMP analogue in SMEP/SMAP/PKU, context switch via swapgs). The six Tier 1 gates that students have just demonstrated are the foundation for all four paths.

See CAPSTONE.md for the full rubric. The instructor's primary reference during Module 14 is CAPSTONE.md, not this guide.


Toolchain Diary grading (ongoing)

Toolchain Diary entries are collected at the end of each module. Grading criteria:

  • Entry is present for every tool listed in SETUP.md that the module uses: 1 pt per tool
  • Entry includes version, command line, and observation (not just the tool name): full credit
  • Entry includes a connection to the module's content (why this tool matters for what you built): bonus 0.5 pt per entry

The Toolchain Diary is not graded for quality of writing; it is graded for completeness and specificity.


Differentiation for advanced students

Students who finish module labs early should extend in these directions:

  • Module 1: Implement pipelining (2-stage or 5-stage pipeline for the RV32I core); measure the IPC improvement on the speedup benchmark.
  • Module 3: Implement graph-coloring allocation (instead of linear scan) for a benchmark function; compare code quality.
  • Module 7: Implement a 64-entry TLB with random eviction; measure the TLB miss rate on a benchmark that uses a large working set.
  • Module 12: Add ENC28J60 full packet send/receive; implement a minimal ARP + ICMP responder; demonstrate ping response.
  • Module 13: Implement FAT16 write support; write a file from the OS to the SD card.

These extensions are not required for any grade; they are offered as forward-stretch exercises for students who will continue to VCA-EMB-201 or VCA-NET-201.