CSA-101 closed at the system line: a Tang Primer 25K running an OS the student wrote, on a CPU the student synthesized. Every layer the chapter omitted was named. CSA-201 pays those debts.
Course mission and audience
CSA-201 is the academy's Part-II anchor course. Students arrive with a sim-certified and silicon-certified RV32I-Lite CPU, a working compiler toolchain, and Virtus OS v1 running on Tang Primer 25K. Every layer CSA-201 touches is one the student already owns in a simplified form. The course is a systematic expansion: full RV32I plus the M extension, the privileged ISA, compiler improvements, virtual memory, memory protection, and the OS services that depend on all of the above.
The audience is CSA-101 graduates. Entry requirements: RV32I-Lite CPU synthesized and running; assembler, VM translator, and compiler working end-to-end; Virtus OS v1 booting and running at least three standard-library services on silicon.
Position in the pipeline. Belt 5/5. Part-II anchor. CSA-201 is the prerequisite for all six named Part-II electives: VCA-ARM-201, VCA-NET-201, VCA-EMB-201, VCA-NET-301, VCA-X86-201, VCA-MIPS-201. Cross-track: the XD-strand capstones (XD1 stack/shellcode, XD2 mitigations/ROP, XD3 heap/format strings) all attack the Virtus OS v2 built here.
What you will know at the end
- Remember. Recite the full RV32I instruction set (40 base instructions in the ratified v2.1 spec, or 47 when FENCE.I and the six Zicsr instructions are counted as in the older v2.0 base; four core formats R/I/S/U plus the B and J immediate variants). Name the six CSR instructions. State the RISC-V privilege levels (M, S, U) and the trap mechanism (ECALL, mtvec, mepc, mcause, MRET). Name the Sv32 two-level page table structure. Name the PMP register layout (pmpcfg, pmpaddr, modes TOR/NA4/NAPOT).
- Understand (hardware). Explain how Zicsr augments the datapath with a separate CSR address space; how trap delivery works cycle-by-cycle; how an MMU translates virtual to physical addresses with a two-level page walk; how PMP blocks accesses to protected regions before they reach the memory bus.
- Understand (compiler). Explain what a register allocator does and why naively spilling to the stack costs code size; what a peephole pass finds in a local window; why inlining trades code size for call overhead; what SSA form enables that straight assignment cannot.
- Apply (hardware). Extend your CSA-101 CPU to full RV32I: widen the register file to 32 entries, add the M-extension multiplier/divider, add Zicsr, add privilege modes with trap delivery, add Sv32 MMU with a TLB, add PMP. Synthesize each step; verify with riscv-tests.
- Apply (compiler + OS). Add a register allocator pass to your CSA-101 compiler; measure emit reduction. Add peephole and inlining passes. Run your compiler against the same source on godbolt.org to compare with gcc at -O2. Build Virtus OS v2: U/S split, page tables, PMP, round-robin scheduler, SSD1306 + SD-card + ENC28J60 drivers.
- Analyze (cross-layer). Trace a page fault from a user-mode memory access: the MMU raises the fault, the supervisor trap handler services it, the process resumes. Then trace a stack-smash attempt: the PMP W^X policy intercepts the write to the code page before it reaches the memory bus, and the offending store traps instead of completing.
- Create (capstone). Deliver Virtus OS v2 running on DE10-Nano: U/S privilege transition demonstrated; page-fault handler running; PMP W^X enforced; round-robin scheduler context-switching two tasks; SSD1306 OLED showing output; SD-card filesystem reading a file. ~4,000 lines total across kernel + drivers, vs CSA-101's ~1,500.
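The two-level Sv32 page walk named in the objectives above can be sketched in software before it is built in HDL. The following is a minimal Python model, not the course's reference implementation: `sv32_translate`, `PageFault`, the dictionary-backed `mem`, and the demo addresses are all hypothetical names chosen for illustration. It follows the field layout in the Privileged Architecture manual (10-bit VPN[1] and VPN[0], 12-bit offset, 4-byte PTEs) and deliberately ignores the U, A, and D bits, SUM/MXR, and the TLB.

```python
PAGE_SIZE = 4096
PTE_SIZE = 4
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


class PageFault(Exception):
    """Raised where real hardware would deliver a page-fault trap."""


def sv32_translate(mem, root_ppn, va, access):
    """Translate a 32-bit virtual address; access is PTE_R, PTE_W, or PTE_X.

    mem maps physical word addresses to 32-bit values (missing = 0).
    """
    vpn = [(va >> 12) & 0x3FF, (va >> 22) & 0x3FF]  # vpn[0], vpn[1]
    offset = va & 0xFFF
    table = root_ppn * PAGE_SIZE
    for level in (1, 0):
        pte = mem.get(table + vpn[level] * PTE_SIZE, 0)
        # Invalid entry, or the reserved W-without-R encoding.
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            raise PageFault(hex(va))
        if pte & (PTE_R | PTE_X):  # leaf PTE
            if not (pte & access):
                raise PageFault(hex(va))
            ppn = pte >> 10
            if level == 1:
                if ppn & 0x3FF:  # misaligned 4 MiB superpage
                    raise PageFault(hex(va))
                return (ppn << 12) | (vpn[0] << 12) | offset
            return (ppn << 12) | offset
        table = (pte >> 10) * PAGE_SIZE  # pointer to the next-level table
    raise PageFault(hex(va))  # level-0 entry was still a pointer


# Demo: root table at PPN 1, second-level table at PPN 2, data page at PPN 5.
demo_mem = {
    0x1000 + 1 * PTE_SIZE: (2 << 10) | PTE_V,                  # pointer PTE
    0x2000 + 0 * PTE_SIZE: (5 << 10) | PTE_V | PTE_R | PTE_W,  # RW leaf PTE
}
demo_pa = sv32_translate(demo_mem, 1, 0x00400ABC, PTE_R)
```

In the demo, virtual address `0x00400ABC` splits into VPN[1] = 1, VPN[0] = 0, offset `0xABC`, so a read resolves to physical `0x5ABC`, while an instruction fetch from the same page raises `PageFault` because the leaf has no X bit.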
Course shape (14 weeks)
| Week | Theme | Lab | Petzold weave anchor | Architecture comparison sidebar |
|---|---|---|---|---|
| 1 | Full RV32I + M extension | Lab 1.1: mul vs Math.multiply speedup measured | Petzold Ch 12 + Ch 13 (binary multiplication and the cost of iteration) | M-extension: ATmega software-mul (2 registers, 16 cycles) vs RV32IM mul (single instruction) vs MIPS mult (HI/LO registers) |
| 2 | Privileged ISA + ecall trap | Lab 2.1: first user-to-supervisor transition; cycle cost measured | Petzold Ch 22 (the OS chapter: supervisor mode emerges from timesharing mainframes) | Trap delivery: x86_64 SYSCALL/SYSRET vs RISC-V ECALL/MRET vs ARM SVC/ERET |
| 3 | Compiler register allocator | Lab 3.1: allocator pass added; emit reduction observed | Petzold Ch 24 + Ch 17 (from machine code to language and back; what a register is worth) | Register files: RV32I 32-GPR vs x86_64 16-GPR + 16 XMM vs AArch64 31-GPR; the RISC philosophy |
| 4 | Compiler peephole optimisation | Lab 4.1: peephole pass; ~30% smaller assembly per §11.9 5-categories | Petzold Ch 24 (high-level language compilation; local windows and the assembler's view) | Peephole scope: LLVM MachineInstr window vs GCC RTL peephole vs hand-written RISC-V idioms |
| 5 | Compiler inlining + constant folding | Lab 5.1: inliner pass; library-call overhead measured before/after | Petzold Ch 22 + Ch 24 (procedure calls and their costs) | Inlining policy: GCC/Clang heuristics vs JVM JIT threshold vs your compiler's naïve model |
| 6 | SSA-IR + Compiler Explorer (godbolt.org) | Lab 6.1: compare your compiler output vs gcc -O0/-O2/-O3 on identical source | Petzold Ch 24 (the long arc from machine code to optimizing compilers) | SSA-IR: LLVM IR vs GCC GIMPLE vs WebAssembly; why static analysis needs single-assignment form |
| 7 | Sv32 paged virtual memory + MMU | Lab 7.1: Sv32 paged VM running; page-fault handler demonstrated | Petzold Ch 16 + Ch 14 (memory hierarchy; segment registers and why hardware abstracts addresses) | Virtual memory: x86_64 CR3 + 4-level paging vs Sv32 2-level vs AArch64 TTBRn; TLB shootdown on SMP |
| 8 | PMP + W^X enforcement | Lab 8.1: PMP-defended stack-smash; same exploit from Ch 12 §12.11 now traps cleanly | Petzold Ch 16 (memory protection; why hardware rings exist above software) | Privilege rings: Linux S/U/M three-layer vs Windows ring 3/0/HV vs bare-metal M-only |
| 9 | Stack canaries + CFI | Lab 9.1: stack canary detects return-address overwrite; CFI shadow stack catches ROP | Petzold Ch 22 (what the OS protects and what it cannot) | CFI mechanisms: x86_64 CET (SHSTK + IBT) vs AArch64 PAC+BTI vs RISC-V Zicfilp/Zicfiss |
| 10 | Tracing garbage collection | Lab 10.1: tracing GC running on Memory.lib; cycle cost measured | Petzold Ch 22 (the OS manages memory so programs don't have to; GC as the logical extreme) | GC strategies: mark-and-sweep vs copying vs generational; JVM G1 vs Python reference-counting vs Go tricolor |
| 11 | Preemption + scheduler | Lab 11.1: round-robin scheduler; two demo tasks; context-switch cost measured | Petzold Ch 22 (timesharing: the original reason for supervisor mode) | Schedulers: Linux CFS vs Windows dispatcher vs RTOS fixed-priority vs your round-robin |
| 12 | Driver-writing track | Lab 12.1: SSD1306 OLED driver from datasheet; output verified | Petzold Ch 16 + Ch 18 (buses; peripherals; I2C and SPI as descendants of serial ideas Petzold traces) | I2C vs SPI vs UART: protocol overhead; SSD1306 command-byte protocol vs ENC28J60 SPI frame |
| 13 | External DRAM + filesystem | Lab 13.1: SD-card filesystem walker reads FAT16 partition | Petzold Ch 14 + Ch 16 (memory hierarchy; DRAM timing; why flash storage is not RAM) | Storage stacks: BRAM on Tang (< 1 MiB ceiling) vs SD-card FAT16 vs NVMe + ext4; endianness in FAT |
| 14 | Capstone, Virtus OS v2 on DE10-Nano | Full capstone (see CAPSTONE.md) | Closing reflection on the full ladder from CSA-101 through CSA-201 | The bridge talk: where to take Virtus OS v2 next (ARM-201, EMB-201, NET-201, CON-201) |
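To make the week 4 idea of a "local window" concrete, here is a small Python sketch of a peephole pass with a two-instruction window. It is illustrative only: the tuple representation of instructions, the function name `peephole`, and the two rules are assumptions, not the course compiler's actual IR or its five categories. It also assumes straight-line code inside one basic block (no labels between the instructions) and ordinary memory (no memory-mapped I/O, where a reload could be meaningful).

```python
def peephole(instrs):
    """One pass over a list of (op, *operands) tuples.

    Loads and stores are written as (op, reg, offset, base).
    """
    out = []
    for ins in instrs:
        # Rule 1: "addi rd, rd, 0" leaves rd unchanged, so drop it.
        if ins[0] == "addi" and ins[1] == ins[2] and ins[3] == 0:
            continue
        # Rule 2: "lw r, off(base)" immediately after "sw r, off(base)"
        # reloads a value that is already in r.
        if out and ins[0] == "lw" and out[-1] == ("sw",) + ins[1:]:
            continue
        out.append(ins)
    return out


before = [
    ("addi", "sp", "sp", -16),
    ("sw", "t0", 0, "sp"),
    ("lw", "t0", 0, "sp"),     # redundant reload
    ("addi", "t1", "t1", 0),   # no-op
    ("add", "a0", "t0", "t1"),
]
after = peephole(before)
```

Here `after` keeps three of the five instructions. Measuring line counts before and after on real compiler output, as Lab 4.1 does, is the same comparison at larger scale.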
Anchor readings
Primary (continued from CSA-101 at advanced depth).
Patterson and Hennessy, Computer Organization and Design: RISC-V Edition (Morgan Kaufmann). The chapter coverage for CSA-201: Appendix B (RISC-V ISA reference, full RV32I + M); Chapter 4 (pipelining; optional but rewarding for Module 1 context); Chapter 5 (memory hierarchy; pairs with Modules 7 and 13); Chapter 2 (instructions; pairs with Modules 1 and 2). Use as a reference, not a reading schedule; the lab exercises are self-contained.
Petzold, CODE: The Hidden Language of Computer Hardware and Software, 1st edition (1999). CSA-201 adds ~25 new weaves across its chapters. Priority chapters: Ch 12 and Ch 13 (binary multiplication; the M extension pays back Math.multiply's iteration cost in a single instruction); Ch 14 and Ch 16 (memory addressing; the MMU chapter traces directly to Petzold's treatment of how hardware abstracts addresses); Ch 22 (the operating system; supervisor mode emerges from timesharing mainframes; the privilege chapter opens with this). The 2nd edition (2022) covers the same material; locate passages by section title.
Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd edition (Pearson). Use chapters: 7 (linking; pairs with the register allocator discussion), 9 (virtual memory; the definitive reference for the Sv32 module), 3 (machine-level representation; pairs with Modules 3-6 compiler work), 8 (exceptional control flow; the syscall and trap mechanism pairs with Module 2).
Secondary. Waterman and Asanovic, The RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture (open access). Volume II: Privileged Architecture (open access). These are the authoritative ISA specs; use them to verify instruction encodings and CSR definitions during labs.
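As an example of checking an encoding against Volume I, the R-type layout (funct7, rs2, rs1, funct3, rd, opcode) can be assembled by hand in a few lines. This is a sketch for cross-checking lab output, and the helper name `encode_r_type` is hypothetical. The field values used for `add` and `mul` are the ones listed in the ISA manual's instruction tables (opcode OP = 0110011; funct7 = 0000001 selects the M extension).

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into a 32-bit instruction word."""
    for reg in (rs2, rs1, rd):
        if not 0 <= reg < 32:
            raise ValueError(f"register index out of range: {reg}")
    return (
        (funct7 << 25) | (rs2 << 20) | (rs1 << 15)
        | (funct3 << 12) | (rd << 7) | opcode
    )


OP = 0b0110011  # major opcode shared by add and mul

add_word = encode_r_type(0b0000000, 7, 6, 0b000, 5, OP)  # add x5, x6, x7
mul_word = encode_r_type(0b0000001, 7, 6, 0b000, 5, OP)  # mul x5, x6, x7

print(f"add x5, x6, x7 = 0x{add_word:08x}")
print(f"mul x5, x6, x7 = 0x{mul_word:08x}")
```

The two words differ only in funct7, which is why the M-extension decoder change in week 1 is small compared with the multiplier itself.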
Per-week time budget
| Activity | Hours per week | Hours over 14 weeks |
|---|---|---|
| Lecture | ~2.5 hr | ~35 hr |
| Lab (hands-on) | ~4 hr | ~56 hr |
| Independent practice (reading + repo work + Toolchain Diary) | ~6 hr | ~84 hr |
| Capstone integration weeks | -- | +5 hr (weeks 13-14) |
| Total | ~12.5 hr/week | ~180 hr |
The heaviest weeks are 1 (M-extension hardware integration), 7 (Sv32 MMU page-table walker), and 14 (capstone integration). Budget an extra 2-3 hours in each. The compiler weeks (3-6) run lighter on hardware and heavier on repo editing.
Lab arc summary
Each lab measures a specific cost that CSA-101 deliberately paid and then recovers it.
| Lab | What it measures | What it recovers |
|---|---|---|
| Lab 1.1 | Math.multiply ~1,000-cycle vs mul single-cycle speedup | CSA-101's software-multiply dependency |
| Lab 2.1 | ecall trap round-trip cycle cost | CSA-101's M-mode-only baseline |
| Lab 3.1 | Compiler emit size before/after register allocator | Lab 7.4's translator-bloat forward-promise |
| Lab 4.1 | Assembly line count before/after peephole pass | Lab 11.4's 5-categories forward-promise |
| Lab 5.1 | Call overhead before/after inliner | Ch 11 library-call overhead |
| Lab 6.1 | godbolt.org -O0/-O2/-O3 comparison on identical source | Production-grade compiler reference |
| Lab 7.1 | Page-fault handler under Sv32 | Flat physical address space |
| Lab 8.1 | PMP-defended stack smash that traps cleanly | Ch 12 §12.11 W^X absence |
| Lab 9.1 | Stack canary + ROP detection | CSA-201's newly-enabled W^X and CFI primitives |
| Lab 10.1 | Tracing-GC cycle cost on Memory.lib | Ch 12 §12.5.4 manual-only allocator |
| Lab 11.1 | Context-switch cost, two-task round-robin | Ch 12 §12.1 single-task baseline |
| Lab 12.1 | SSD1306 driver written from datasheet | IP-Pack black box for OLED |
| Lab 13.1 | FAT16 partition walk on SD card | Tang BRAM ceiling; no persistent storage |
| Lab 14 | Full Virtus OS v2 capstone on DE10-Nano | Everything above, integrated |
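The cost that Lab 1.1 recovers comes from iteration: a software multiply loops over the bits of one operand, while `mul` produces the same low 32 bits in one instruction. The sketch below is a generic shift-and-add multiply in Python that counts loop iterations as a rough proxy for cost. It is an assumption about the general shape of such a routine, not the actual Math.multiply source, and iteration counts are not cycle counts; the cycle figure in the table depends on the real implementation and must be measured on the board.

```python
MASK32 = 0xFFFFFFFF


def software_multiply(a, b):
    """Shift-and-add multiply on 32-bit values.

    Returns (low 32 bits of the product, loop iterations executed).
    """
    a &= MASK32
    b &= MASK32
    product = 0
    iterations = 0
    while b:
        if b & 1:  # add the shifted multiplicand for each set bit
            product = (product + a) & MASK32
        a = (a << 1) & MASK32
        b >>= 1
        iterations += 1
    return product, iterations


small = software_multiply(6, 7)
worst = software_multiply(0xFFFFFFFF, 0xFFFFFFFF)
```

For `6 * 7` the loop runs 3 times; for two all-ones operands it runs the full 32 times. Each iteration costs several instructions plus branches, which is where the gap against a single `mul` comes from.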
Hardware requirements
Required. Tang Primer 25K (carried over from CSA-101; your existing bitstream and board). DE10-Nano Cyclone V FPGA development board (~$130, student-purchased; Terasic). Pi 4 station kit (~$80; USB-C power, HDMI cable, microSD 16 GB minimum, USB keyboard). Quartus Prime Lite 23.1 or later (free, Intel/Altera; installs on Linux x86_64 or Windows). riscv32-unknown-elf toolchain (prebuilt binaries at riscv.org or build from source).
Optional but recommended. Logic analyzer (Saleae Logic 8 or open-source equivalent) for driver debugging in Modules 12-13. External DRAM module compatible with DE10-Nano I/O board for Module 13 extended exercises.
YoWASP browser path. The academy workbench Tab 3 supports Tang Primer 25K and Tang Nano 20K bitstream synthesis in-browser (Yosys + nextpnr-himbaechel + apicula). Use this for pre-Quartus sanity checks on RISC-V HDL changes before the full DE10-Nano Quartus build.
See SETUP.md for installation instructions.
Assessment overview
Tier 1 (pass/fail gate). Virtus OS v2 boots on DE10-Nano. Demonstrates U/S privilege transition. Page-fault handler handles at least one synthetic fault. PMP W^X policy intercepts a write to a code page. Round-robin scheduler switches between two running tasks. SSD1306 OLED shows live output. All six gates must pass for a Tier 2 score to count.
Tier 2 (40/30/30). 40% mitigation depth (does your OS actually enforce security properties, or just simulate them?); 30% measurement quality of speedups and cost recovery vs CSA-101 baseline; 30% demo and 6-8 page write-up. B- minimum on Tier 2 for the VCA-CSA-201 Certificate of Completion.
See CAPSTONE.md for the full rubric.
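Before attempting the PMP W^X gate on hardware, it can help to model the check in software. The sketch below is a simplified Python model under stated assumptions: it handles NAPOT entries only, single-byte S/U-mode accesses, and ignores the L bit and M-mode behaviour. The names `napot_encode`, `napot_decode`, `pmp_allows`, and the demo addresses are hypothetical. The bit positions (R/W/X in bits 0 to 2, the A field in bits 4:3) and the NAPOT encoding of `pmpaddr` follow the Privileged Architecture manual.

```python
PMP_R, PMP_W, PMP_X = 1 << 0, 1 << 1, 1 << 2
PMP_A_MASK = 3 << 3
PMP_A_NAPOT = 3 << 3


def napot_encode(base, size):
    """Build a pmpaddr value for a naturally aligned power-of-two region."""
    if size < 8 or size & (size - 1) or base % size:
        raise ValueError("size must be a power of two >= 8 and base aligned")
    return (base >> 2) | ((size >> 3) - 1)


def napot_decode(pmpaddr):
    """Recover (base, size) from a NAPOT pmpaddr value."""
    k = 0
    while (pmpaddr >> k) & 1:  # count trailing ones
        k += 1
    size = 1 << (k + 3)
    base = (pmpaddr & ~((1 << (k + 1)) - 1)) << 2
    return base, size


def pmp_allows(entries, addr, access):
    """entries is a list of (pmpcfg byte, pmpaddr); lowest index wins."""
    for cfg, pmpaddr in entries:
        if cfg & PMP_A_MASK != PMP_A_NAPOT:
            continue  # this sketch models NAPOT entries only
        base, size = napot_decode(pmpaddr)
        if base <= addr < base + size:
            return bool(cfg & access)
    return False  # no matching entry: an S/U-mode access is denied


# W^X demo: code page is read+execute, stack page is read+write.
wx_entries = [
    (PMP_A_NAPOT | PMP_R | PMP_X, napot_encode(0x80000000, 4096)),
    (PMP_A_NAPOT | PMP_R | PMP_W, napot_encode(0x80001000, 4096)),
]
code_write_ok = pmp_allows(wx_entries, 0x80000010, PMP_W)
stack_exec_ok = pmp_allows(wx_entries, 0x80001020, PMP_X)
```

Both `code_write_ok` and `stack_exec_ok` come out false in this model: no region is writable and executable at once, which is the property the Tier 1 gate asks the hardware to demonstrate.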
Continuation note
This build round covers Modules 1-6 (weeks 1-6) in full, plus the CAPSTONE rubric and INSTRUCTOR-GUIDE skeleton. Modules 7-14 (weeks 7-14) are outlined above and in CONTINUATION.md; their content files are the primary deliverable for the next build round.
Weeks remaining for next round: week-7-sv32-mmu.md, week-8-pmp-wx.md, week-9-stack-canaries-cfi.md, week-10-tracing-gc.md, week-11-preemption-scheduler.md, week-12-driver-writing.md, week-13-external-dram-filesystem.md, week-14-capstone-delivery.md, plus labs/lab-7 through labs/lab-14. The INSTRUCTOR-GUIDE should also be expanded per-week by the continuation round.