Week 1: Full RV32I + M Extension · CSA-201 · Virtus Cyber Academy Classroom

Math.multiply costs about 1,000 cycles. After this week, it costs one.

Reading

Required. Patterson and Hennessy, Computer Organization and Design: RISC-V Edition, Appendix B (RISC-V ISA summary; skim for the instructions your CSA-101 CPU did not implement). Read enough to name the gap: 47 base instructions vs the 11 your RV32I-Lite handles.

Required. Petzold, CODE, Ch 12 ("A Binary Adding Machine") and Ch 13 ("But What About Subtraction?"). These chapters trace binary arithmetic from the simplest adding circuit through subtraction. The M extension pays back the cost Math.multiply paid by iterating that circuit in software: a single mul instruction replaces the ~1,000-cycle software loop with dedicated hardware that finishes in one cycle. Read Ch 12-13 now as the hardware complement to what your compiler emitted.

Optional. Waterman and Asanovic, RISC-V ISA Manual Volume I, sections 2.4 (U-type instructions: LUI, AUIPC) and 2.5 (J-type: JAL). Skim the encoding tables.

Lecture: Closing the RV32I Gap

What CSA-101 froze and why

Your CSA-101 CPU implements 11 instructions: ADD, SUB, AND, OR, NOT (pseudo), LW, SW, BEQ, BNE, JMP (pseudo), CALL (pseudo). The full RV32I base integer ISA has 47 instructions. CSA-101 froze at 11 because a full instruction set would have made the hardware and the compiler too large for a 14-week course. Every frozen instruction was listed in the week-5 "What CSA-101 deliberately omits" sidebar. This week closes that list.

The four instruction-type categories your CPU does not yet handle:

U-type (LUI, AUIPC). Load Upper Immediate places a 20-bit immediate into bits 31:12 of a destination register, zeroing bits 11:0. AUIPC adds the PC to that shifted immediate. These instructions are load-time essentials: the linker uses them to synthesize 32-bit absolute and PC-relative addresses. Without LUI, you cannot load any value larger than 12 bits in a single instruction.
JAL (J-type). Jump and link stores the return address (PC+4) into a destination register and jumps to PC + a 21-bit sign-extended offset. This is the standard function-call instruction. Your CSA-101 assembler emitted CALL as a pseudo-instruction that synthesized a jump via a fixed-offset BEQ; JAL is the proper encoding.
Shifts (SLLI, SRLI, SRAI) and comparisons (SLT, SLTU). The shift instructions reuse the I-type encoding with a funct7 bit distinguishing arithmetic right shift (SRAI, sign-extends) from logical (SRLI, zero-fills). Set-less-than produces a 0 or 1 result used heavily by C code that compiles boolean conditions.
Byte and halfword loads/stores (LB, LH, LBU, LHU, SB, SH). Your CSA-101 CPU accessed memory in 32-bit words. Full RV32I adds signed byte/halfword loads (LB, LH, which sign-extend to 32 bits) and their unsigned counterparts (LBU, LHU, which zero-extend). SB and SH write the low byte or halfword of a source register. Without these, the compiler cannot handle char, int8_t, or int16_t types.

The M extension: four multiply and four divide instructions

RV32IM adds eight instructions in a single subextension: MUL, MULH, MULHSU, MULHU (multiply variants), DIV, DIVU, REM, REMU (divide variants).

MUL multiplies two 32-bit registers and returns the low 32 bits of the 64-bit product. This is the instruction that replaces Math.multiply's software loop. Your CSA-101 OS compiled calls to Math.multiply into ~1,000 cycles of iterative addition and shifting. After Module 1, mul completes in one clock cycle.

MULH, MULHSU, MULHU return the high 32 bits of the product for signed-signed, signed-unsigned, and unsigned-unsigned multiply respectively. These are used for 64-bit arithmetic, fixed-point scaling, and overflow detection.

DIV and DIVU perform signed and unsigned 32-bit division. REM and REMU return the remainder. Division is typically handled by an iterative restoring-division circuit or a non-restoring variant. Even with hardware support, a divider takes several cycles (typically 33 for a 32-bit iterative implementation). The lab will measure this.

The register file expansion

CSA-101 used 8 registers (x0-x7). Full RV32I uses 32 (x0-x31). Widening the register file is the first hardware change in Module 1. The Verilog change is mechanical: reg [31:0] rf [0:31] instead of rf [0:7]. The consequence is significant: the register allocator you add in Module 3 now has 29 usable general-purpose registers (x0 is hardwired zero; x1-x31 are available, with the standard ABI reserving ra, sp, gp, tp, and the temporaries per the calling convention).

Architecture Comparison Sidebar: The cost of multiplication

Every hardware architecture has made a different decision about where multiplication lives.

ATmega8515 (AVR, 8-bit). The ATmega family has no hardware multiplier in many variants. An 8-bit software-multiply loop (ldi r16, 8; loop: lsl r0; brcc no_add; add r1, r2; no_add: dec r16; brne loop) costs 8 iterations times 3-5 cycles each, roughly 30-40 cycles for an 8-bit result. 32-bit multiplication by long-multiplication decomposition costs several hundred cycles. Math.multiply in CSA-101 used a similar approach scaled to 32 bits.

RV32IM mul. A single instruction, one cycle latency (or a few cycles if the multiplier is pipelined). The physical implementation is a Wallace-tree multiplier or a Booth-encoded multiplier: an array of partial-product adders that evaluates all 32-bit partial products simultaneously rather than iterating. The Tang Primer 25K Gowin GW5A-25 has four 18x18 DSP blocks that the synthesis tool uses for the M-extension multiplier.

MIPS mult / mflo. MIPS separates multiplication into a two-step operation: mult rs, rt writes the result into HI:LO special registers, and mflo rd moves the low result into a general register. This design choice meant MIPS could start a multiplication speculatively and retrieve the result later. MIPS32 later added mul rd, rs, rt (single-step) to eliminate the dependency on HI/LO.

x86_64 imul. x86_64 provides imul reg, reg/mem (two-operand, returns low 32 bits) and imul reg (one-operand, returns EDX:EAX). Modern out-of-order x86 processors execute imul with 3-cycle latency and 1-cycle throughput on all execution ports that support integer multiplication.

The lesson: every architecture eventually adds hardware multiply because software loops are 30x-1000x slower. The only question is how many years it takes.

Lab exercises

See labs/lab-1-m-extension-speedup.md for the full specification.

Lab 1.1: M extension speedup measurement. You will extend your CSA-101 CPU reference with the M extension, synthesize to Tang Primer 25K, and measure the speedup from mul vs the Math.multiply software loop. Expected result: ~1,000x. Record your exact measurement in your Toolchain Diary.

The lab has three parts:

Implement the base RV32I gap-fill (32-register file, LUI/AUIPC/JAL, shifts, byte/half loads/stores, BLT/BGE/BLTU/BGEU). Run the rv32ui-p-* riscv-tests suite against your Verilator simulation; all tests must pass before you proceed.
Implement the M extension (muldiv.v, iterative multiplier and divider). Run rv32um-p-* riscv-tests. All tests pass.
Write a benchmark: a C program that calls a 32x32 software multiply loop 1,000 times and reads csrr mcycle before and after, then does the same with mul. Compile with riscv32-unknown-elf-gcc -march=rv32im -O0. Run on Tang Primer 25K. Record the cycle counts and compute the speedup ratio.

Independent practice

Read Petzold Ch 12-13 as assigned above. While reading, note where Petzold's adding machine maps onto your own ALU implementation from CSA-101 week 2. The binary adder Petzold builds with relays is the same architecture as your alu.v; the M extension's multiplier is what Petzold's hand-cranked multiplication would become if you replaced the relay grid with parallel partial-product adders.
Diff vca-csa-101/hdl/core/regfile.v against vca-csa-201/hdl/core/regfile.v. Write a one-paragraph Toolchain Diary entry describing what changed and why: what new functionality do the additional 24 registers enable?
Read the RISC-V ISA Manual Volume I section 2.2 (base integer instruction formats). Draw the bit-field layout for R-type, I-type, S-type, B-type, U-type, and J-type instructions. The B-type encoding (branches) and J-type encoding (JAL) are unusual: their immediate bits are scrambled to keep the sign bit at bit 31 and to share the encoding machinery with other types. Annotate your drawing to show where the sign bit lives in each format.
Toolchain Diary entry: riscv32-unknown-elf-gcc. Record the version, the flags you used (-march=rv32im -O0), and the output of riscv32-unknown-elf-objdump -d on a small C function that uses multiplication. Count the number of mul instructions generated. This is the baseline for Module 3 (register allocator) and Module 4 (peephole).

Architecture comparison sidebar: Register file design

Architecture	General-purpose registers	Notes
RV32I-Lite (CSA-101)	8 (x0-x7)	CSA-101 simplification; x0 hardwired zero
Full RV32I (CSA-201)	32 (x0-x31)	x0 hardwired; 31 usable; standard ABI reserves ra, sp, gp, tp
x86_64	16 GPR (rax-r15) + 16 XMM + 8 YMM	Legacy 8 (eax-edi) + 8 added in AMD64; XMM for SIMD
AArch64	31 GPR (x0-x30) + x31 (zero/SP)	Separate zero register and stack pointer sharing encoding x31
MIPS32	32 (r0-r31)	r0 hardwired zero; HI/LO for multiply/divide results
6502 (CSA-101 historical)	3 (A, X, Y) + SP + PC	Accumulator-based; the register poverty that drove RISC philosophy

The RISC philosophy driving the 32-register design: more registers mean the compiler can keep values in registers instead of spilling to the stack (where a load/store costs multiple cycles). The x86_64 register file's eight legacy registers (the original IA-32 set) forced compilers to spill frequently; this is one reason x86_64 code has more memory traffic than equivalent RV32I code at the same optimization level. The register allocator in Module 3 exists precisely because compilers need to decide which values live in registers and which live on the stack.

Reflection prompts

CSA-101's Math.multiply implementation iterates through 32 shifts and conditional adds. Sketch what a Wallace-tree multiplier does differently. Why can it evaluate all partial products in parallel while CSA-101's loop cannot?
LUI places a 20-bit immediate into bits 31:12 of a register. Why does the linker need this instruction? What would a compiler do to load the address 0xDEAD_BEEF into a register without LUI?
The M extension is optional in the RISC-V spec: a chip is compliant without it. Name one real product where omitting hardware multiplication is deliberate (hint: think microcontrollers). What does that product sacrifice, and what does it gain?

What's next

Module 2 adds the privileged ISA: supervisor mode, machine mode, the ECALL trap mechanism, and the first kernel code for Virtus OS v2. The hardware changes (privilege.v, trap_ctrl.v, csrfile.v with the full trap CSR set) are smaller than this week's register file and multiplier work. The software changes are larger: you will write your first trap handler in assembly and transition a user-mode program to supervisor mode.