CSA-201: Computer Systems Architecture II, Course Outline

CSA-101 closed at the system line: a Tang Primer 25K running an OS the student wrote, on a CPU the student synthesized. Every layer the chapter omitted was named. CSA-201 pays those debts.

Course mission and audience

CSA-201 is the academy's Part-II anchor course. Students arrive with a sim-certified and silicon-certified RV32I-Lite CPU, a working compiler toolchain, and Virtus OS v1 running on Tang Primer 25K. Every layer CSA-201 touches is one the student already owns in a simplified form. The course is a systematic expansion: full RV32I plus the M extension, the privileged ISA, compiler improvements, virtual memory, memory protection, and the OS services that depend on all of the above.

The audience is CSA-101 graduates. Entry requirements: RV32I-Lite CPU synthesized and running; assembler, VM translator, and compiler working end-to-end; Virtus OS v1 booting and running at least three standard-library services on silicon.

Position in the pipeline. Belt 5/5. Part-II anchor. CSA-201 is the prerequisite for all six named Part-II electives: VCA-ARM-201, VCA-NET-201, VCA-EMB-201, VCA-NET-301, VCA-X86-201, VCA-MIPS-201. Cross-track: the XD-strand capstones (XD1 stack/shellcode, XD2 mitigations/ROP, XD3 heap/format strings) all attack Virtus OS v2 built here.

What you will know at the end

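Remember. Recite the full RV32I instruction set (47 instructions in the original v2.0 count; 40 in the ratified base once the six CSR instructions and FENCE.I moved to Zicsr and Zifencei; 4 core formats R/I/S/U plus the B and J immediate variants). Name the six CSR instructions. State the RISC-V privilege levels (M, S, U) and the trap mechanism (ECALL, mtvec, mepc, mcause, MRET). Name the Sv32 two-level page table structure. Name the PMP register layout (pmpcfg, pmpaddr, modes TOR/NA4/NAPOT).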
Understand (hardware). Explain how Zicsr augments the datapath with a separate CSR address space; how trap delivery works cycle-by-cycle; how an MMU translates virtual to physical addresses with a two-level page walk; how PMP blocks regions before they reach the memory bus.
Understand (compiler). Explain what a register allocator does and why naively spilling to the stack costs code size; what a peephole pass finds in a local window; why inlining trades code size for call overhead; what SSA form enables that straight assignment cannot.
Apply (hardware). Extend your CSA-101 CPU to full RV32I: widen the register file to 32 entries, add the M-extension multiplier/divider, add Zicsr, add privilege modes with trap delivery, add Sv32 MMU with a TLB, add PMP. Synthesize each step; verify with riscv-tests.
Apply (compiler + OS). Add a register allocator pass to your CSA-101 compiler; measure emit reduction. Add peephole and inlining passes. Run your compiler against the same source on godbolt.org to compare with gcc at -O2. Build Virtus OS v2: U/S split, page tables, PMP, round-robin scheduler, SSD1306 + SD-card + ENC28J60 drivers.
Analyze (cross-layer). Trace a page fault from a user-mode memory access: the MMU raises the fault, the supervisor trap handler services it, the process resumes. Then trace a stack-smash attempt: the PMP W^X policy intercepts the write to the code page before it reaches the memory bus.
Create (capstone). Deliver Virtus OS v2 running on DE10-Nano: U/S privilege transition demonstrated; page-fault handler running; PMP W^X enforced; round-robin scheduler context-switching two tasks; SSD1306 OLED showing output; SD-card filesystem reading a file. ~4,000 lines total across kernel + drivers, vs CSA-101's ~1,500.
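
The two-level Sv32 walk named in the outcomes above can be modeled in software before you commit it to HDL. The following is a minimal sketch, not the course's reference walker: `sv32_translate`, `PageFault`, and the dict-based `mem` are hypothetical names for illustration, and the sketch deliberately skips the U/R/W/X permission checks and the A/D bit updates that a real MMU performs.

```python
PAGE_SIZE = 4096          # Sv32 page size in bytes
PTE_SIZE = 4              # Sv32 page-table entries are 32 bits
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


class PageFault(Exception):
    """Raised where hardware would deliver a page-fault trap."""


def sv32_translate(mem, root_ppn, vaddr):
    """Translate a 32-bit virtual address with a two-level Sv32 walk.

    mem      -- dict: physical byte address of a PTE -> 32-bit PTE value
                (an assumption of this sketch; missing entries read as 0)
    root_ppn -- physical page number of the root table (satp.PPN)
    """
    vpn = [(vaddr >> 12) & 0x3FF, (vaddr >> 22) & 0x3FF]  # VPN[0], VPN[1]
    offset = vaddr & 0xFFF
    table = root_ppn * PAGE_SIZE
    for level in (1, 0):
        pte = mem.get(table + vpn[level] * PTE_SIZE, 0)
        # Invalid entry, or the reserved encoding W=1 with R=0.
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            raise PageFault(hex(vaddr))
        ppn = pte >> 10
        if pte & (PTE_R | PTE_X):          # leaf PTE
            if level == 1:                 # 4 MiB superpage
                if ppn & 0x3FF:            # PPN[0] must be zero
                    raise PageFault("misaligned superpage")
                return ((ppn >> 10) << 22) | (vpn[0] << 12) | offset
            return (ppn << 12) | offset    # ordinary 4 KiB page
        table = ppn * PAGE_SIZE            # pointer to the next level
    raise PageFault(hex(vaddr))            # non-leaf PTE at level 0
```

For example, with a root table at physical page `0x80000`, a level-1 pointer at index 1, and a level-0 leaf at index 1 mapping to page `0x80042`, the virtual address `0x00401234` translates to `0x80042234`. An address with no valid level-1 entry raises `PageFault`, which is the event your Week 7 handler services.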

Course shape (14 weeks)

Week	Theme	Lab	Petzold weave anchor	Architecture comparison sidebar
1	Full RV32I + M extension	Lab 1.1: `mul` vs Math.multiply speedup measured	Petzold Ch 12 + Ch 13 (binary multiplication and the cost of iteration)	M-extension: ATmega software-mul (2 registers, 16 cycles) vs RV32IM `mul` (single instruction) vs MIPS `mult` (HI/LO registers)
2	Privileged ISA + ecall trap	Lab 2.1: first user-to-supervisor transition; cycle cost measured	Petzold Ch 22 (the OS chapter: supervisor mode emerges from timesharing mainframes)	Trap delivery: x86_64 `SYSCALL`/`SYSRET` vs RISC-V ECALL/MRET vs ARM `SVC`/`ERET`
3	Compiler register allocator	Lab 3.1: allocator pass added; emit reduction observed	Petzold Ch 24 + Ch 17 (from machine code to language and back; what a register is worth)	Register files: RV32I 32-GPR vs x86_64 16-GPR + 16 XMM vs AArch64 31-GPR; the RISC philosophy
4	Compiler peephole optimisation	Lab 4.1: peephole pass; ~30% smaller assembly per §11.9 5-categories	Petzold Ch 24 (high-level language compilation; local windows and the assembler's view)	Peephole scope: LLVM MachineInstr window vs GCC RTL peephole vs hand-written RISC-V idioms
5	Compiler inlining + constant folding	Lab 5.1: inliner pass; library-call overhead measured before/after	Petzold Ch 22 + Ch 24 (procedure calls and their costs)	Inlining policy: GCC/Clang heuristics vs JVM JIT threshold vs your compiler's naïve model
6	SSA-IR + Compiler Explorer (godbolt.org)	Lab 6.1: compare your compiler output vs gcc -O0/-O2/-O3 on identical source	Petzold Ch 24 (the long arc from machine code to optimizing compilers)	SSA-IR: LLVM IR vs GCC GIMPLE vs WebAssembly; why static analysis needs single-assignment form
7	Sv32 paged virtual memory + MMU	Lab 7.1: Sv32 paged VM running; page-fault handler demonstrated	Petzold Ch 16 + Ch 14 (memory hierarchy; segment registers and why hardware abstracts addresses)	Virtual memory: x86_64 CR3 + 4-level paging vs Sv32 2-level vs AArch64 TTBRn; TLB shootdown on SMP
8	PMP + W^X enforcement	Lab 8.1: PMP-defended stack-smash; same exploit from Ch 12 §12.11 now traps cleanly	Petzold Ch 16 (memory protection; why hardware rings exist above software)	Privilege rings: Linux S/U/M three-layer vs Windows ring 3/0/HV vs bare-metal M-only
9	Stack canaries + CFI	Lab 9.1: stack canary detects return-address overwrite; CFI shadow stack catches ROP	Petzold Ch 22 (what the OS protects and what it cannot)	CFI mechanisms: x86_64 CET (SHSTK + IBT) vs AArch64 PAC+BTI vs RISC-V Zicfilp/Zicfiss
10	Tracing garbage collection	Lab 10.1: tracing GC running on Memory.lib; cycle cost measured	Petzold Ch 22 (the OS manages memory so programs don't have to; GC as the logical extreme)	GC strategies: mark-and-sweep vs copying vs generational; JVM G1 vs Python reference-counting vs Go tricolor
11	Preemption + scheduler	Lab 11.1: round-robin scheduler; two demo tasks; context-switch cost measured	Petzold Ch 22 (timesharing: the original reason for supervisor mode)	Schedulers: Linux CFS vs Windows dispatcher vs RTOS fixed-priority vs your round-robin
12	Driver-writing track	Lab 12.1: SSD1306 OLED driver from datasheet; output verified	Petzold Ch 16 + Ch 18 (buses; peripherals; I2C and SPI as descendants of serial ideas Petzold traces)	I2C vs SPI vs UART: protocol overhead; SSD1306 command-byte protocol vs ENC28J60 SPI frame
13	External DRAM + filesystem	Lab 13.1: SD-card filesystem walker reads FAT16 partition	Petzold Ch 14 + Ch 16 (memory hierarchy; DRAM timing; why flash storage is not RAM)	Storage stacks: BRAM on Tang (< 1 MiB ceiling) vs SD-card FAT16 vs NVMe + ext4; endianness in FAT
14	Capstone, Virtus OS v2 on DE10-Nano	Full capstone (see CAPSTONE.md)	Closing reflection on the full ladder from CSA-101 through CSA-201	The bridge talk: where to take Virtus OS v2 next (ARM-201, EMB-201, NET-201, CON-201)
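
To make the Week 4 theme concrete, here is a minimal sketch of a peephole pass over a two-instruction window. The tuple representation of instructions and the function name `peephole` are assumptions made for illustration; your compiler's IR will differ. The sketch also assumes no branch target falls between the instructions it examines and that the addresses involved are ordinary memory, not memory-mapped I/O.

```python
def peephole(instrs):
    """Remove three locally redundant patterns from a list of instructions.

    Each instruction is a tuple (hypothetical format for this sketch):
      ("mv", rd, rs), ("addi", rd, rs, imm),
      ("sw", reg, offset, base), ("lw", reg, offset, base), ...
    """
    out = []
    for ins in instrs:
        op = ins[0]
        # mv rd, rs where rd == rs does nothing.
        if op == "mv" and ins[1] == ins[2]:
            continue
        # addi rd, rd, 0 does nothing.
        if op == "addi" and ins[1] == ins[2] and ins[3] == 0:
            continue
        # lw r, off(base) right after sw r, off(base): r already holds the value.
        if op == "lw" and out and out[-1] == ("sw",) + ins[1:]:
            continue
        out.append(ins)
    return out


before = [
    ("mv", "a0", "a0"),
    ("sw", "t0", 8, "sp"),
    ("lw", "t0", 8, "sp"),
    ("addi", "sp", "sp", 0),
    ("add", "a0", "a0", "t0"),
]
after = peephole(before)
print(len(before), "->", len(after))  # prints "5 -> 2"
```

A naive code generator emits patterns like these constantly, which is why a window of only one or two instructions already recovers measurable size. Lab 4.1 is where you count the reduction on your own compiler's output.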

Anchor readings

Primary (continued from CSA-101 at advanced depth).

Patterson and Hennessy, Computer Organization and Design: RISC-V Edition (Morgan Kaufmann). The chapter coverage for CSA-201: Appendix B (RISC-V ISA reference, full RV32I + M); Chapter 4 (pipelining; optional but rewarding for Module 1 context); Chapter 5 (memory hierarchy; pairs with Modules 7 and 13); Chapter 2 (instructions; pairs with Modules 1 and 2). Use as a reference, not a reading schedule; the lab exercises are self-contained.

Petzold, CODE: The Hidden Language of Computer Hardware and Software, 1st edition (1999). CSA-201 adds ~25 new weaves across its chapters. Priority chapters: Ch 12 and Ch 13 (binary multiplication; the M extension pays back Math.multiply's iteration cost in a single instruction); Ch 14 and Ch 16 (memory addressing; the MMU chapter traces directly to Petzold's treatment of how hardware abstracts addresses); Ch 22 (the operating system; supervisor mode emerges from timesharing mainframes; the privilege chapter opens with this). The 2nd edition (2022) covers the same material; locate passages by section title.

Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd edition (Pearson). Use chapters: 7 (linking; pairs with the register allocator discussion), 9 (virtual memory; the definitive reference for the Sv32 module), 3 (machine-level representation; pairs with Modules 3-6 compiler work), 8 (exceptional control flow; the syscall and trap mechanism pairs with Module 2).

Secondary. Waterman and Asanovic, The RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture (open access). Volume II: Privileged Architecture (open access). These are the authoritative ISA specs; use them to verify instruction encodings and CSR definitions during labs.
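
When you check an encoding against Volume I, it helps to have a small encoder to compare with your assembler's output. The sketch below packs and unpacks the R-type format and uses `mul` as the example; the function names `encode_r` and `decode_r` are illustrative, not part of any toolchain.

```python
OPCODE_OP = 0b0110011     # R-type register-register opcode
FUNCT7_MULDIV = 0b0000001  # selects the M extension within OP
FUNCT3_MUL = 0b000


def encode_r(funct7, rs2, rs1, funct3, rd, opcode=OPCODE_OP):
    """Pack the six R-type fields into a 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def decode_r(word):
    """Unpack a 32-bit word into R-type fields."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


# mul a0, a1, a2  (a0 = x10, a1 = x11, a2 = x12)
word = encode_r(FUNCT7_MULDIV, rs2=12, rs1=11, funct3=FUNCT3_MUL, rd=10)
print(hex(word))  # prints 0x2c58533
```

Changing `funct7` to 0 yields `add a0, a1, a2` (`0x00c58533`), which shows how little decode logic separates the base ALU operations from the M extension.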

Per-week time budget

Activity	Hours per week	Hours over 14 weeks
Lecture	~2.5 hr	~35 hr
Lab (hands-on)	~4 hr	~56 hr
Independent practice (reading + repo work + Toolchain Diary)	~6 hr	~84 hr
Capstone integration weeks	--	+5 hr (weeks 13-14)
Total	~12.5 hr/week	~180 hr

The heaviest weeks are 1 (M-extension hardware integration), 7 (Sv32 MMU page-table walker), and 14 (capstone integration). Budget an extra 2-3 hours in each. The compiler weeks (3-6) run lighter on hardware and heavier on repo editing.
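
The totals in the table follow directly from the weekly figures. As a quick check you can rerun with your own numbers (the variable names are illustrative only):

```python
WEEKS = 14
weekly_hours = {"lecture": 2.5, "lab": 4.0, "independent": 6.0}
capstone_extra = 5.0  # one-time addition across weeks 13-14

per_week = sum(weekly_hours.values())
total = per_week * WEEKS + capstone_extra

print(per_week)  # prints 12.5
print(total)     # prints 180.0
```

The extra 2-3 hours suggested for weeks 1, 7, and 14 are not included in this total.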

Lab arc summary

Each lab measures a specific cost that CSA-101 deliberately paid, then recovers it.

Lab	What it measures	What it recovers
Lab 1.1	Math.multiply ~1,000-cycle vs `mul` single-cycle speedup	CSA-101's software-multiply dependency
Lab 2.1	ecall trap round-trip cycle cost	CSA-101's M-mode-only baseline
Lab 3.1	Compiler emit size before/after register allocator	Lab 7.4's translator-bloat forward-promise
Lab 4.1	Assembly line count before/after peephole pass	Lab 11.4's 5-categories forward-promise
Lab 5.1	Call overhead before/after inliner	Ch 11 library-call overhead
Lab 6.1	godbolt.org -O0/-O2/-O3 comparison on identical source	Production-grade compiler reference
Lab 7.1	Page-fault handler under Sv32	Flat physical address space
Lab 8.1	PMP-defended stack smash that traps cleanly	Ch 12 §12.11 W^X absence
Lab 9.1	Stack canary + ROP detection	CSA-201's newly-enabled W^X and CFI primitives
Lab 10.1	Tracing-GC cycle cost on Memory.lib	Ch 12 §12.5.4 manual-only allocator
Lab 11.1	Context-switch cost, two-task round-robin	Ch 12 §12.1 single-task baseline
Lab 12.1	SSD1306 driver written from datasheet	IP-Pack black box for OLED
Lab 13.1	FAT16 partition walk on SD card	Tang BRAM ceiling; no persistent storage
Lab 14	Full Virtus OS v2 capstone on DE10-Nano	Everything above, integrated
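
The cost Lab 1.1 recovers comes from iteration: a software multiply loops once per bit of the multiplier, and every pass costs several instructions. The sketch below is a generic shift-and-add multiply, not the actual Math.multiply source, and it counts loop passes rather than cycles; real cycle counts depend on your CPU and are what the lab measures.

```python
MASK32 = 0xFFFFFFFF


def shift_add_multiply(a, b):
    """Return (low 32 bits of a*b, number of loop iterations)."""
    a &= MASK32
    b &= MASK32
    product = 0
    iterations = 0
    while b:
        if b & 1:                      # add the shifted multiplicand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32          # next bit position
        b >>= 1
        iterations += 1
    return product, iterations


print(shift_add_multiply(6, 7))                    # prints (42, 3)
print(shift_add_multiply(0xFFFFFFFF, 0xFFFFFFFF))  # prints (1, 32)
```

The second call is the worst case: 32 passes, each with a test, a conditional add, two shifts, and a branch. A hardware `mul` replaces the whole loop with one instruction.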

Hardware requirements

Required. Tang Primer 25K (carried over from CSA-101; your existing bitstream and board). DE10-Nano Cyclone V FPGA development board (~$130, student-purchased; Terasic). Pi 4 station kit (~$80; USB-C power, HDMI cable, microSD 16 GB minimum, USB keyboard). Quartus Prime Lite 23.1 or later (free, Intel/Altera; installs on Linux x86_64 or Windows). riscv32-unknown-elf toolchain (prebuilt binaries at riscv.org or build from source).

Optional but recommended. Logic analyzer (Saleae Logic 8 or open-source equivalent) for driver debugging in Modules 12-13. External DRAM module compatible with DE10-Nano I/O board for Module 13 extended exercises.

YoWASP browser path. The academy workbench Tab 3 supports Tang Primer 25K and Tang Nano 20K bitstream synthesis in-browser (Yosys + nextpnr-himbaechel + apicula). Use this for pre-Quartus sanity checks on RISC-V HDL changes before the full DE10-Nano Quartus build.

See SETUP.md for installation instructions.

Assessment overview

Tier 1 (pass/fail gate). Virtus OS v2 boots on DE10-Nano. Demonstrates U/S privilege transition. Page-fault handler handles at least one synthetic fault. PMP W^X policy intercepts a write to a code page. Round-robin scheduler switches between two running tasks. SSD1306 OLED shows live output. All six gates must pass for a Tier 2 score to count.
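
The PMP W^X gate can be reasoned about with a small model before you test it on hardware. The sketch below encodes NAPOT regions and checks one address against a list of entries. It is a simplification under stated assumptions: it models NAPOT entries only, checks a single address rather than every byte of an access, ignores the lock bit, and models an S/U-mode access (where no match means deny). The function names are illustrative.

```python
# pmpcfg permission bits and the NAPOT value of the A field (bits 4:3 = 3).
PMP_R, PMP_W, PMP_X = 0x01, 0x02, 0x04
PMP_A_MASK, PMP_A_NAPOT = 0x18, 0x18


def napot_encode(base, size):
    """pmpaddr value for a naturally aligned power-of-two region, size >= 8."""
    if size < 8 or size & (size - 1) or base % size:
        raise ValueError("region must be a naturally aligned power of two")
    return (base >> 2) | ((size >> 3) - 1)


def napot_range(pmpaddr):
    """Recover (base, size) from a NAPOT pmpaddr value."""
    trailing_ones = 0
    while (pmpaddr >> trailing_ones) & 1:
        trailing_ones += 1
    size = 1 << (trailing_ones + 3)
    base = (pmpaddr & ~((1 << (trailing_ones + 1)) - 1)) << 2
    return base, size


def pmp_allows(entries, addr, access):
    """entries: list of (cfg, pmpaddr), lowest-numbered entry first."""
    for cfg, pmpaddr in entries:
        if cfg & PMP_A_MASK != PMP_A_NAPOT:
            continue                      # sketch handles NAPOT only
        base, size = napot_range(pmpaddr)
        if base <= addr < base + size:    # first match decides
            return bool(cfg & access)
    return False                          # no match: S/U access is denied


# Hypothetical layout: a 4 KiB code page (R+X) and a 4 KiB stack page (R+W).
entries = [
    (PMP_A_NAPOT | PMP_R | PMP_X, napot_encode(0x80000000, 0x1000)),
    (PMP_A_NAPOT | PMP_R | PMP_W, napot_encode(0x80010000, 0x1000)),
]
print(pmp_allows(entries, 0x80000010, PMP_W))  # prints False
print(pmp_allows(entries, 0x80010010, PMP_X))  # prints False
```

W^X holds when no entry sets both `PMP_W` and `PMP_X`: the code page rejects writes and the stack page rejects instruction fetches.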

Tier 2 (40/30/30). 40% mitigation depth (does your OS actually enforce security properties, or just simulate them?); 30% measurement quality of speedups and cost recovery vs the CSA-101 baseline; 30% demo and 6-8 page write-up. A B- minimum on Tier 2 is required for the VCA-CSA-201 Certificate of Completion.
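
The way the two tiers combine can be written out as a short calculation. This is a sketch of the weighting only: the component names and the 0-100 scale are assumptions, and the mapping from a percentage to a letter grade such as B- is defined in CAPSTONE.md, not here.

```python
# Tier 2 weights in percent, as stated above.
WEIGHTS = {"mitigation_depth": 40, "measurement_quality": 30, "demo_and_writeup": 30}


def tier2_score(tier1_gates, components):
    """Return the weighted Tier 2 score, or None if any Tier 1 gate failed.

    tier1_gates -- iterable of six booleans, one per pass/fail gate
    components  -- dict of component name -> score on a 0-100 scale (assumed)
    """
    if not all(tier1_gates):
        return None  # Tier 2 does not count until every gate passes
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


scores = {"mitigation_depth": 90, "measurement_quality": 80, "demo_and_writeup": 70}
print(tier2_score([True] * 6, scores))            # prints 81.0
print(tier2_score([True] * 5 + [False], scores))  # prints None
```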

See CAPSTONE.md for the full rubric.

Continuation note

The v0.1 build round covered Modules 1-6 (weeks 1-6) in full, plus the CAPSTONE rubric and INSTRUCTOR-GUIDE skeleton. Modules 7-14 (weeks 7-14), labs 7-14, and the per-week INSTRUCTOR-GUIDE expansion shipped in the continuation round and are live in this classroom.