Week 7: Sv32 Paged Virtual Memory and the MMU · CSA-201

CSA-101 ran in flat physical memory: every address your program used was a real address on the bus. That meant every program could read every other program's memory. This week you fix that.

Reading

Required. Petzold, CODE, Ch 16 ("An Assemblage of Memory") and Ch 14 ("Feedback and Flip-Flops"). Ch 14 establishes that memory is state-holding hardware; Ch 16 traces the evolution from single-bank memory to banked memory to the segment registers of the 8086 -- the direct predecessor of virtual memory. Petzold describes the 8086 segment register as "a way to extend addressing beyond 16 bits without widening the instruction set." Virtual memory is the answer to the same problem at OS scale: extend the address space each process sees beyond what physical RAM can hold, while giving each process the illusion of owning the entire space.

Required. Bryant and O'Hallaron, CSAPP, Chapter 9 ("Virtual Memory"), sections 9.1-9.6. The most thorough accessible treatment of virtual memory, TLBs, and page tables. Section 9.6 (address translation) maps directly to the Sv32 walk you will implement this week. Read through section 9.6 before the lab.

Required. Waterman and Asanovic, RISC-V ISA Manual Volume II: Privileged Architecture, section 4.3 (Sv32 page-based 32-bit virtual-memory system) and 4.2 (SFENCE.VMA). These are the authoritative implementation spec for your lab.

Lecture: Sv32 Paged Virtual Memory

Why virtual memory exists

A process in U-mode on your CSA-201 CPU can currently read or write any physical address. If two processes run concurrently (as they will in Module 11), either can corrupt the other's stack, heap, or code. Virtual memory solves this by interposing a hardware translator between every memory access and the bus: the program issues a virtual address; the MMU translates it to a physical address; only the physical address reaches DRAM. Each process has a separate page table; the OS controls which physical pages are mapped into which process's virtual space.

Virtual memory also enables three OS features that Module 11 and Module 13 depend on: demand paging (physical pages are allocated only when first touched, not at process creation), memory-mapped files (a file appears as a range of virtual addresses; reads and writes go to the file via the page-fault handler), and shared memory (two processes can map the same physical page at different virtual addresses).

Sv32 address structure

Sv32 uses 4 KiB pages (2^12 bytes). A 32-bit virtual address is split:

31        22 21        12 11          0
[ VPN[1]   ][ VPN[0]   ][  offset   ]
  10 bits     10 bits     12 bits

VPN[1] and VPN[0] are the two-level page table indices. The offset is the byte offset within the page.
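As a quick check on the field boundaries, the split can be sketched in a few lines of Python. The function name split_va is illustrative and does not come from the reference implementation.

```python
def split_va(vaddr: int) -> tuple:
    """Split a 32-bit Sv32 virtual address into (VPN[1], VPN[0], offset)."""
    vaddr &= 0xFFFF_FFFF              # keep only the low 32 bits
    vpn1 = (vaddr >> 22) & 0x3FF      # bits 31:22, index into the root table
    vpn0 = (vaddr >> 12) & 0x3FF      # bits 21:12, index into the second-level table
    offset = vaddr & 0xFFF            # bits 11:0, byte offset within the 4 KiB page
    return vpn1, vpn0, offset


vpn1, vpn0, offset = split_va(0xC0402ABC)
print(hex(vpn1), hex(vpn0), hex(offset))   # 0x301 0x2 0xabc
```

Reassembling the three fields as (vpn1 << 22) | (vpn0 << 12) | offset returns the original address, which makes a handy self-check when debugging a walker.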

The physical address space is 34 bits wide (Sv32 uses 22-bit physical page numbers plus 12-bit offset). Since the DE10-Nano's Cyclone V has 1 GB DDR3 (30-bit physical address), all physical addresses fit in 30 bits; the upper bits are zero.

The two-level page walk

The page-table base register satp (Supervisor Address Translation and Protection) holds:

Bit 31: MODE (0 = bare/no translation; 1 = Sv32)
Bits 30:22: ASID (9-bit address space identifier; used for TLB tagging)
Bits 21:0: PPN (22-bit physical page number of the root page table)
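The field layout above can be exercised with a small pack/unpack sketch. The helper names make_satp and decode_satp, and the example ASID and root PPN, are made up for illustration.

```python
def make_satp(mode: int, asid: int, root_ppn: int) -> int:
    """Pack the three Sv32 satp fields into one 32-bit value."""
    return ((mode & 0x1) << 31) | ((asid & 0x1FF) << 22) | (root_ppn & 0x3FFFFF)


def decode_satp(satp: int) -> dict:
    """Unpack a 32-bit satp value into its MODE, ASID and PPN fields."""
    return {
        "mode": (satp >> 31) & 0x1,     # bit 31
        "asid": (satp >> 22) & 0x1FF,   # bits 30:22
        "ppn": satp & 0x3FFFFF,         # bits 21:0
    }


# Hypothetical values: Sv32 enabled, ASID 5, root table in physical page 0x80010.
satp = make_satp(mode=1, asid=5, root_ppn=0x80010)
fields = decode_satp(satp)
root_table_base = fields["ppn"] << 12   # physical byte address of the root table

print(hex(satp))              # 0x81480010
print(hex(root_table_base))   # 0x80010000
```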

On every memory access in Sv32 mode, the MMU performs the two-level walk:

Step 1 (first-level PTE). Compute the physical address of the first-level PTE:

L1_pte_addr = (satp.PPN << 12) | (VPN[1] << 2)

Read the 32-bit PTE at that address.

Step 2 (PTE validity check). Each PTE is 32 bits:

31         10 9  8   7   6   5   4   3   2   1   0
[ PPN[21:0] ][ RSW ][D] [A] [G] [U] [X] [W] [R] [V]

V (valid): if 0, page fault.
R/W/X: read/write/execute permission bits. The combination W=1 with R=0 is reserved and causes a page fault.
If V=1 and R=0, W=0, X=0: this is a pointer to the next level (non-leaf PTE).
U (user): if 0, user-mode access to this page causes a fault.
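A (accessed): set when the page is first accessed. An implementation may either raise a page fault when A=0 and let the OS set the bit, or set the bit in hardware. The reference implementation traps on A=0 (simpler hardware).
D (dirty): set when the page is first written. The reference implementation uses the same policy: a store to a page with D=0 traps.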
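The bit definitions above translate into a small classifier. This is a sketch only: the names classify_pte, make_pte and pte_ppn are illustrative, and the example PPN values are arbitrary.

```python
# Flag bits 0..7 of an Sv32 PTE, in bit order: V R W X U G A D.
PTE_V, PTE_R, PTE_W, PTE_X, PTE_U, PTE_G, PTE_A, PTE_D = (1 << i for i in range(8))


def make_pte(ppn: int, flags: int) -> int:
    """Build a 32-bit PTE from a 22-bit PPN and the low flag bits."""
    return ((ppn & 0x3FFFFF) << 10) | (flags & 0x3FF)


def pte_ppn(pte: int) -> int:
    """Extract the 22-bit physical page number (PTE bits 31:10)."""
    return (pte >> 10) & 0x3FFFFF


def classify_pte(pte: int) -> str:
    """Return 'invalid', 'pointer' (non-leaf) or 'leaf'."""
    if not (pte & PTE_V):
        return "invalid"                  # V=0: page fault
    if (pte & PTE_W) and not (pte & PTE_R):
        return "invalid"                  # W=1, R=0 is a reserved encoding
    if pte & (PTE_R | PTE_X):
        return "leaf"                     # any of R/X set: maps a page
    return "pointer"                      # V=1, R=W=X=0: next-level table


leaf = make_pte(0x12345, PTE_V | PTE_R | PTE_W | PTE_U | PTE_A | PTE_D)
print(hex(leaf), classify_pte(leaf))                  # 0x48d14d7 leaf
print(classify_pte(make_pte(0x80020, PTE_V)))         # pointer
print(classify_pte(0))                                # invalid
```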

Step 3 (second-level PTE). If step 2 found a non-leaf PTE:

L0_pte_addr = (L1_PTE.PPN << 12) | (VPN[0] << 2)

Read the 32-bit PTE at that address.

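Step 4 (physical address assembly). For a 4 KiB leaf PTE found in step 3:

paddr = (leaf_PTE.PPN << 12) | vaddr.offset

If step 2 already found a leaf (a 4 MiB megapage), PPN[0] of that PTE must be zero, and VPN[0] supplies those bits instead: paddr = (leaf_PTE.PPN[1] << 22) | (VPN[0] << 12) | vaddr.offset.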

Page fault causes. The type of the faulting access selects one of three page-fault exception codes: 12 = instruction page fault, 13 = load page fault, 15 = store/AMO page fault.
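Steps 1 through 4 and the fault codes can be put together in a software model of the walk. This is a minimal sketch, not the mmu.v logic: physical memory is modelled as a Python dict from word address to 32-bit value, the names translate and PageFault are illustrative, and the example table addresses are made up. It assumes the trap-on-A=0/D=0 policy described above and handles the megapage case using the alignment rule from the privileged spec.

```python
PTE_V, PTE_R, PTE_W, PTE_X, PTE_U = 1, 2, 4, 8, 16
PTE_A, PTE_D = 64, 128
FAULT_CODE = {"fetch": 12, "load": 13, "store": 15}
NEEDED_BIT = {"fetch": PTE_X, "load": PTE_R, "store": PTE_W}


class PageFault(Exception):
    def __init__(self, access, vaddr):
        super().__init__(f"{access} page fault at {vaddr:#010x}")
        self.cause = FAULT_CODE[access]   # value the trap handler sees in the cause CSR
        self.vaddr = vaddr


def translate(mem, satp, vaddr, access="load", user=True):
    """Walk the Sv32 tables in `mem` and return the physical address."""
    if not ((satp >> 31) & 1):
        return vaddr                                  # MODE=0: bare, no translation
    vpn = [(vaddr >> 12) & 0x3FF, (vaddr >> 22) & 0x3FF]   # vpn[0], vpn[1]
    table_ppn = satp & 0x3FFFFF
    for level in (1, 0):
        pte = mem.get((table_ppn << 12) | (vpn[level] << 2), 0)
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            raise PageFault(access, vaddr)            # invalid or reserved encoding
        if pte & (PTE_R | PTE_X):
            break                                     # leaf found at this level
        table_ppn = pte >> 10                         # non-leaf: descend
    else:
        raise PageFault(access, vaddr)                # pointer at the last level
    if not (pte & NEEDED_BIT[access]):
        raise PageFault(access, vaddr)                # permission missing
    if user and not (pte & PTE_U):
        raise PageFault(access, vaddr)                # U=0 page touched from U-mode
    if not (pte & PTE_A) or (access == "store" and not (pte & PTE_D)):
        raise PageFault(access, vaddr)                # assumed policy: trap on A=0 / D=0
    ppn = pte >> 10
    if level == 1:                                    # 4 MiB megapage
        if ppn & 0x3FF:
            raise PageFault(access, vaddr)            # misaligned megapage
        ppn |= vpn[0]                                 # VPN[0] fills the PPN[0] bits
    return (ppn << 12) | (vaddr & 0xFFF)


# Hypothetical layout: root table in page 0x80010, one L0 table in page 0x80011.
satp = (1 << 31) | 0x80010
mem = {
    0x80010004: (0x80011 << 10) | PTE_V,                                    # root[1] -> L0 table
    0x8001100C: (0x80055 << 10) | PTE_V | PTE_R | PTE_W | PTE_U | PTE_A | PTE_D,  # L0[3] -> page
}
print(hex(translate(mem, satp, 0x00403ABC)))          # 0x80055abc
try:
    translate(mem, satp, 0x00404000)                  # L0[4] is not mapped
except PageFault as fault:
    print(fault.cause)                                # 13
```

Because the model raises the same three cause codes as the hardware, it can serve as a reference when checking what a simulated access should do before debugging the RTL.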

The TLB

Without caching, every access in Sv32 mode would need two extra sequential memory reads (the two PTEs) before the access itself, roughly tripling the effective memory latency. The TLB (Translation Lookaside Buffer) caches recent VA-to-PA translations. On a TLB hit, the physical address is served immediately; the two-level walk is skipped.

The reference implementation (mmu.v) uses a 1-entry iTLB and a 1-entry dTLB, the minimal configuration: each holds a single translation, so an access to any other page misses and forces the full two-level walk. Production cores typically use much larger set-associative TLBs, with 64 entries being a common size (the DE10-Nano's HPS uses a 128-entry TLB per Cortex-A9 core).

SFENCE.VMA is the instruction that invalidates TLB entries. After the OS changes a page table entry (mapping, unmapping, or changing permissions), it must execute SFENCE.VMA to ensure the TLB does not serve stale translations. The reference implementation does a full TLB flush on every SFENCE.VMA (ignoring the optional rs1/rs2 selectivity).
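To see why the fence matters, a 1-entry TLB can be modelled in software. This is a sketch under simplifying assumptions: the class OneEntryTLB and its method names are illustrative, permissions and ASIDs are ignored, and sfence_vma models only the full-flush behaviour described above.

```python
class OneEntryTLB:
    """Software model of a single-entry TLB (translation only, no permissions)."""

    def __init__(self):
        self.valid = False
        self.vpn = 0
        self.ppn = 0
        self.hits = 0
        self.misses = 0

    def lookup(self, vaddr):
        """Return the cached physical address, or None on a miss."""
        if self.valid and self.vpn == (vaddr >> 12):
            self.hits += 1
            return (self.ppn << 12) | (vaddr & 0xFFF)
        self.misses += 1
        return None

    def fill(self, vaddr, paddr):
        """Install a translation after a successful page walk."""
        self.valid, self.vpn, self.ppn = True, vaddr >> 12, paddr >> 12

    def sfence_vma(self):
        """Full flush, ignoring rs1/rs2 selectivity."""
        self.valid = False


tlb = OneEntryTLB()
cold = tlb.lookup(0x00403ABC)            # None: nothing cached yet
tlb.fill(0x00403ABC, 0x80055ABC)         # walker result for this page
warm = tlb.lookup(0x00403123)            # hit: same virtual page

# Suppose the OS now remaps the page in the page table but forgets the fence.
stale = tlb.lookup(0x00403000)           # still the OLD physical page
tlb.sfence_vma()
after = tlb.lookup(0x00403000)           # None: the next access must re-walk

print(cold, hex(warm), hex(stale), after)   # None 0x80055123 0x80055000 None
```

The stale lookup is the failure mode the fence exists to prevent: the page table in memory has changed, but the cached entry keeps steering accesses to the old physical page.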

Walker FSM

The reference implementation's walker has four states:

S_IDLE. Check the TLB. On a hit, emit the physical address immediately. On a miss, compute the L1 PTE address and transition to S_L1.
S_L1. Drive the walker address to the L1 PTE; read combinationally (the data memory interface is combinational-read in the sim model). Classify the PTE: invalid → page fault; non-leaf → transition to S_L0; megapage leaf → emit physical address.
S_L0. Drive the walker address to the L0 PTE; read combinationally. Classify: invalid → page fault; leaf → fill TLB, transition to S_DONE.
S_DONE. Assert resp_valid for one cycle; return to S_IDLE.
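The state sequence can be traced with a small model. This is a sketch, not a transcription of mmu.v: the description above does not say which state follows a page fault or a megapage leaf, so the model assumes both pass through S_DONE before returning to S_IDLE. The function name walker_trace and the "invalid" / "pointer" / "leaf" labels are illustrative.

```python
def walker_trace(tlb_hit, l1_kind=None, l0_kind=None):
    """Return (list of states visited, outcome) for one translation request.

    l1_kind / l0_kind describe the PTE read at each level:
    "invalid", "pointer" (non-leaf) or "leaf".
    """
    trace = ["S_IDLE"]
    if tlb_hit:
        return trace, "paddr"                 # served without leaving S_IDLE
    trace.append("S_L1")
    if l1_kind == "invalid":
        outcome = "page_fault"
    elif l1_kind == "leaf":
        outcome = "paddr"                     # megapage leaf
    else:
        trace.append("S_L0")
        # Anything other than a leaf at the last level is treated as a fault.
        outcome = "paddr" if l0_kind == "leaf" else "page_fault"
    # Assumption: every miss path ends with S_DONE, then returns to S_IDLE.
    trace += ["S_DONE", "S_IDLE"]
    return trace, outcome


print(walker_trace(True))
print(walker_trace(False, "pointer", "leaf"))
print(walker_trace(False, "invalid"))
```

Comparing traces like these against a waveform dump is one way to confirm that the simulated walker visits the states you expect for each test address.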

satp setup and kernel/user address space split

The OS sets satp before scheduling a user process. The kernel itself runs with satp set to a kernel page table (or, for the minimal CSA-201 implementation, with satp.MODE=0 when in M-mode and satp.MODE=1 when entering S-mode). The canonical split for Sv32: user virtual addresses occupy 0x00000000 to 0xBFFFFFFF (3 GiB); kernel virtual addresses occupy 0xC0000000 to 0xFFFFFFFF (1 GiB high kernel map).
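The 3 GiB / 1 GiB split lines up with root page table slots, since each root entry covers 4 MiB. The arithmetic can be sketched as follows; KERNEL_BASE and is_kernel_va are illustrative names.

```python
KERNEL_BASE = 0xC000_0000        # first kernel virtual address in the canonical split
ROOT_ENTRIES = 1024              # entries in the root page table
MIB = 1 << 20


def is_kernel_va(vaddr: int) -> bool:
    """True if the address lies in the high 1 GiB kernel region."""
    return (vaddr & 0xFFFF_FFFF) >= KERNEL_BASE


first_kernel_slot = KERNEL_BASE >> 22                 # VPN[1] of the first kernel address
kernel_slots = ROOT_ENTRIES - first_kernel_slot       # root entries 768..1023
user_slots = first_kernel_slot                        # root entries 0..767

print(first_kernel_slot, user_slots, kernel_slots)    # 768 768 256
print(user_slots * 4 * MIB == 3 * 1024 * MIB)         # True: 768 slots x 4 MiB = 3 GiB
print(is_kernel_va(0xBFFF_FFFF), is_kernel_va(0xC000_0000))   # False True
```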

Architecture Comparison Sidebar: Virtual memory implementations

Architecture	Page table depth	Page size (min)	VA width	PA width	TLB size (typical)
Sv32 (CSA-201)	2-level	4 KiB	32-bit	34-bit	1-entry (lab); 64-entry (production)
x86_64 4-level paging	4-level	4 KiB	48-bit (57-bit with LA57)	52-bit	64-entry L1 dTLB + 512-entry L2 STLB
AArch64 4KB granule	4-level	4 KiB	48-bit (52-bit with LPA2)	48-bit (52-bit with LPA2)	48-entry L1 + 1024-entry L2 (Cortex-A77)
MIPS32 (software TLB)	Software-managed	4 KiB	32-bit	36-bit	16-64 entry software-filled wired TLB
Tang Primer 25K (flat)	None (CSA-101)	N/A	32-bit (physical)	32-bit	N/A

The key difference: x86_64 and AArch64 use hardware page-table walkers (the hardware reads PTEs from DRAM automatically on TLB miss). MIPS uses a software-managed TLB (a TLB miss fires a trap; the OS trap handler reads the PTE and writes it into the TLB). RISC-V is hardware-walker, like x86_64.

Sv32's page table is shallower than x86_64's 4-level table because the virtual address space is smaller (32-bit vs 48-bit). Two levels of 1024-entry tables (1024 * 1024 = 1,048,576 pages of 4 KiB each) cover the entire 4 GiB space.
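The coverage arithmetic is easy to verify. The constant names below are illustrative; the numbers follow from the 4 KiB page size, the 10-bit VPN fields, and the 32-bit PTE.

```python
PAGE_SIZE = 1 << 12          # 4 KiB page
ENTRIES = 1 << 10            # 1024 PTEs per table (10-bit VPN field)
PTE_BYTES = 4                # each Sv32 PTE is 32 bits

pages = ENTRIES * ENTRIES                 # leaf pages reachable through two levels
covered = pages * PAGE_SIZE               # bytes of virtual address space covered
table_bytes = ENTRIES * PTE_BYTES         # size of one page table
megapage = ENTRIES * PAGE_SIZE            # span of one root-level entry

print(pages)                              # 1048576
print(covered == 1 << 32)                 # True: exactly the 4 GiB space
print(table_bytes == PAGE_SIZE)           # True: one table fills exactly one page
print(megapage // (1 << 20))              # 4 (MiB per root entry)
```

That one table fills exactly one page is convenient for the OS: page tables can be allocated from the same 4 KiB page allocator as everything else.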

Lab exercises

See labs/lab-7-sv32-mmu.md for the full specification.

Lab 7.1: Sv32 paged VM running with page-fault handler. You will instantiate mmu.v from the reference implementation into your CPU, write a minimal kernel page table, and demonstrate a page-fault handler that maps a new page on demand.

The lab has three parts: (A) integrate mmu.v and verify bare-mode (MODE=0) still passes all prior tests; (B) set up a two-entry kernel page table and verify that a user-mode access to a mapped address succeeds and that a load from an unmapped address triggers mcause=13 (load page fault); (C) write a demand-paging handler that maps the faulting page and re-runs the faulting instruction.

Independent practice

Read CSAPP section 9.6 (address translation) and draw the Sv32 VA-to-PA translation diagram annotated with the field widths (VPN[1] = 10 bits, VPN[0] = 10 bits, offset = 12 bits, PPN[1] = 12 bits, PPN[0] = 10 bits). Verify that satp.PPN[21:0] shifted left 12 gives the root page table's physical base address.
A process has a 1-page code segment at virtual address 0x00001000 and a 1-page stack at 0x00002000. The OS maps code to physical page 0x80001 and stack to physical page 0x80002. Show the complete two-entry L0 page table (each PTE as a 32-bit hex value) and the single L1 PTE that points to it.
Toolchain Diary entry: SFENCE.VMA. Write what this instruction does, why it is needed after any page table modification, and what happens if the OS modifies a PTE without executing SFENCE.VMA.
The reference implementation's TLB has 1 entry. A tight loop that alternates between two different virtual pages will always miss in the TLB (each access evicts the entry that the next access needs). Estimate the cycle overhead of a 1-entry TLB for a loop with a 2-page working set vs a 64-entry TLB with the same loop. What does this suggest about production TLB sizing?

Reflection prompts

In S_L1, the walker classifies a PTE with R=0, W=0, X=0 as a non-leaf (pointer to the next level). Per the RISC-V spec, a PTE with R=0, W=0, X=0 could be either a valid non-leaf or an invalid entry. What additional bit distinguishes the two? (Check the spec.)
The A (accessed) and D (dirty) bits exist in each PTE. The reference implementation traps when A=0 rather than setting it in hardware. What does the OS trap handler do with the trapped access? Write the 3-line handler pseudocode.
Why does the 1-entry TLB in the reference implementation have separate entries for instruction fetch (iTLB) and data access (dTLB) rather than a single unified TLB? What access pattern would miss in a unified TLB that would hit in separate iTLB/dTLB?

What's next

Module 8 adds PMP (Physical Memory Protection) on top of the virtual memory system from this week. PMP is a second hardware enforcement layer below the MMU: it enforces per-region R/W/X permissions at the physical address level before the memory bus. The classic W^X policy -- code pages are executable but not writable; data pages are writable but not executable -- is enforced at the PMP layer.