Lab 1.1: M Extension Speedup Measurement · CSA-201

Total points: 30
Estimated time: 4 hours
Prerequisites: CSA-101 Tang Primer 25K development environment; riscv32-unknown-elf toolchain; Verilator or iverilog; riscv-tests suite installed

Overview

This lab has three parts. In Part A you extend your CSA-101 CPU to the full RV32I base ISA and verify it with riscv-tests. In Part B you add the M extension and verify it. In Part C you write a benchmark program that quantifies the speedup of hardware multiply over your Math.multiply software loop. The expected speedup is approximately 1,000x.

Part A: Full RV32I base (12 pts)

A1: Widen the register file (2 pts)

In hdl/core/regfile.v, change the register file from 8 entries to 32 entries:

// before:
reg [31:0] rf [0:7];
// after:
reg [31:0] rf [0:31];

Update the read and write port address widths from 3 bits to 5 bits throughout the file and in any testbenches that reference it. Verify: your existing CSA-101 test programs still run correctly (x0-x7 behavior is unchanged).

A2: Add U-type and J-type instructions (4 pts)

Add decoder support and execution logic for:

LUI rd, imm20: rd = imm20 << 12
AUIPC rd, imm20: rd = PC + (imm20 << 12)
JAL rd, imm21: rd = PC + 4; PC = PC + sign_extend(imm21) (note: imm21 bits are scrambled in the J-type encoding; implement the bit reassembly carefully)

Write three hand-encoded test vectors (one per instruction) and verify the results in simulation.
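The J-type bit reassembly can be cross-checked on your host machine before you commit it to Verilog. The following is a minimal C sketch, not part of the lab infrastructure; the helper names jal_offset and encode_jal are hypothetical. It follows the J-type field layout from the RISC-V spec and can be used to produce or check the hand-encoded JAL test vector.

```c
#include <stdint.h>

/* Reassemble the JAL offset from a 32-bit instruction word.
 * J-type layout: inst[31]    = imm[20]
 *                inst[30:21] = imm[10:1]
 *                inst[20]    = imm[11]
 *                inst[19:12] = imm[19:12]
 * imm[0] is always 0, so the offset is a multiple of 2. */
static int32_t jal_offset(uint32_t inst)
{
    uint32_t imm = 0;
    imm |= ((inst >> 31) & 0x1u)   << 20;
    imm |= ((inst >> 21) & 0x3FFu) << 1;
    imm |= ((inst >> 20) & 0x1u)   << 11;
    imm |= ((inst >> 12) & 0xFFu)  << 12;
    /* Sign-extend from bit 20 to 32 bits. */
    if (imm & (1u << 20))
        imm |= 0xFFE00000u;
    return (int32_t)imm;
}

/* Inverse helper for hand-encoding test vectors: build a JAL word
 * from a destination register number and a byte offset. */
static uint32_t encode_jal(uint32_t rd, int32_t offset)
{
    uint32_t imm = (uint32_t)offset;
    return (((imm >> 20) & 0x1u)   << 31) |
           (((imm >> 1)  & 0x3FFu) << 21) |
           (((imm >> 11) & 0x1u)   << 20) |
           (((imm >> 12) & 0xFFu)  << 12) |
           ((rd & 0x1Fu) << 7) |
           0x6Fu;                          /* JAL opcode */
}
```

For example, encode_jal(1, 8) yields 0x008000EF (jal x1, 8), and jal_offset(0xFFDFF06F) yields -4. A negative offset is a useful test vector because it exercises the sign-extension path.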

A3: Add shifts, comparisons, byte/halfword loads and stores, branches, and system stubs (6 pts)

Add decoder and execute-stage support for:

SLLI rd, rs1, shamt: logical left shift
SRLI rd, rs1, shamt: logical right shift (zero-fill)
SRAI rd, rs1, shamt: arithmetic right shift (sign-extend; distinguished by funct7[5]=1)
SLT rd, rs1, rs2: signed less-than comparison; result is 0 or 1
SLTU rd, rs1, rs2: unsigned less-than comparison
LB rd, imm12(rs1), LH rd, imm12(rs1): signed byte/halfword load (sign-extend to 32 bits)
LBU rd, imm12(rs1), LHU rd, imm12(rs1): unsigned byte/halfword load (zero-extend)
SB rs2, imm12(rs1), SH rs2, imm12(rs1): byte/halfword store (write low 8/16 bits only)
BLT rs1, rs2, imm13: signed branch-if-less-than
BGE rs1, rs2, imm13: signed branch-if-greater-or-equal
BLTU rs1, rs2, imm13: unsigned BLT
BGEU rs1, rs2, imm13: unsigned BGE
FENCE: treat as NOP for single-core simulation
ECALL, EBREAK: treat as halt/trap stub (full implementation in Module 2)
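Several pairs in the list above differ only in how they treat the sign bit: LB vs LBU, LH vs LHU, SRAI vs SRLI, and SLT vs SLTU. The C sketch below is a host-side reference model you could compare your simulation results against. The function names are hypothetical, and the model covers only the result computation, not address alignment or decode.

```c
#include <stdint.h>

/* LB vs LBU: the same byte from memory, extended differently to 32 bits.
 * (x ^ 0x80) - 0x80 sign-extends an 8-bit value without relying on
 * implementation-defined signed conversions. */
static uint32_t load_byte(uint8_t raw, int is_unsigned)
{
    uint32_t value = raw;
    return is_unsigned ? value : ((value ^ 0x80u) - 0x80u);
}

/* LH vs LHU: same idea with a 16-bit value. */
static uint32_t load_half(uint16_t raw, int is_unsigned)
{
    uint32_t value = raw;
    return is_unsigned ? value : ((value ^ 0x8000u) - 0x8000u);
}

/* SRLI vs SRAI: the logical shift fills with zeros, the arithmetic
 * shift fills with copies of the original sign bit. */
static uint32_t shift_right(uint32_t value, uint32_t shamt, int arithmetic)
{
    shamt &= 31u;                               /* only 5 shamt bits exist */
    uint32_t result = value >> shamt;
    if (arithmetic && (value & 0x80000000u))
        result |= ~(0xFFFFFFFFu >> shamt);      /* set the vacated top bits */
    return result;
}

/* SLT vs SLTU: flipping the sign bit of both operands turns a signed
 * comparison into an unsigned one. */
static uint32_t set_less_than(uint32_t a, uint32_t b, int is_unsigned)
{
    if (is_unsigned)
        return a < b;
    return (a ^ 0x80000000u) < (b ^ 0x80000000u);
}
```

The same sign-bit-flip trick used in set_less_than also applies to BLT/BGE versus BLTU/BGEU, since the branches share the comparison with SLT and SLTU.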

Verification gate. Run the rv32ui-p-* riscv-tests suite against your Verilator simulation. All 38 tests must report PASS before you proceed to Part B.

cd sim/
make rv32ui   # runs all rv32ui-p-* tests; expects 38/38 PASS

Record: which test took the most cycles? Which was fastest? Note in Toolchain Diary.

Part B: M extension (8 pts)

B1: Implement the multiplier (4 pts)

Create hdl/core/muldiv.v containing a multi-cycle multiplier and divider. Choose one of two multiplier implementations:

Option 1 (simpler, slower): Iterative shift-and-add. 32 cycles to produce the 64-bit product.

Option 2 (faster, more LUT-efficient): Use a single DSP block. The Gowin GW5A-25 has four 18x18 signed DSP blocks. Instantiate MULTADDALU18X18 from the Gowin IP library to compute the partial products.
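To make the Option 1 data path concrete, here is a small C model of iterative shift-and-add multiplication. It is a sketch under the assumption that your hardware handles one multiplier bit per clock cycle, so each loop iteration corresponds to one cycle; the function name is hypothetical and the model treats both operands as unsigned.

```c
#include <stdint.h>

/* Unsigned 32x32 -> 64-bit multiply, one multiplier bit per step.
 * Each loop iteration models one clock cycle of the iterative unit. */
static uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier)
{
    uint64_t product = 0;
    uint64_t addend = multiplicand;    /* multiplicand, shifted left each step */

    for (int step = 0; step < 32; step++) {
        if (multiplier & 1u)           /* current multiplier bit set? */
            product += addend;         /* accumulate the partial product */
        addend <<= 1;
        multiplier >>= 1;
    }
    return product;
}
```

Signed variants are not modelled here; one common approach is to multiply magnitudes and fix up the sign afterwards, but how you handle signs in your unit is a design choice.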

Implement all four multiply instructions: MUL (low 32 bits of the product), MULH (signed rs1 x signed rs2, high 32 bits), MULHSU (signed rs1 x unsigned rs2, high 32 bits), MULHU (unsigned rs1 x unsigned rs2, high 32 bits).
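The four variants differ only in how each operand is extended to 64 bits before multiplying. The following C reference model is a sketch for checking simulation results on the host; the ref_ function names are hypothetical, and the signed casts assume a two's-complement compiler such as GCC or Clang.

```c
#include <stdint.h>

/* MUL: low 32 bits; identical for signed and unsigned operands. */
static uint32_t ref_mul(uint32_t rs1, uint32_t rs2)
{
    return rs1 * rs2;
}

/* MULH: both operands sign-extended, high 32 bits of the product. */
static uint32_t ref_mulh(uint32_t rs1, uint32_t rs2)
{
    int64_t product = (int64_t)(int32_t)rs1 * (int64_t)(int32_t)rs2;
    return (uint32_t)((uint64_t)product >> 32);
}

/* MULHSU: rs1 sign-extended, rs2 zero-extended. */
static uint32_t ref_mulhsu(uint32_t rs1, uint32_t rs2)
{
    int64_t product = (int64_t)(int32_t)rs1 * (int64_t)rs2;
    return (uint32_t)((uint64_t)product >> 32);
}

/* MULHU: both operands zero-extended. */
static uint32_t ref_mulhu(uint32_t rs1, uint32_t rs2)
{
    uint64_t product = (uint64_t)rs1 * (uint64_t)rs2;
    return (uint32_t)(product >> 32);
}
```

Operands of 0xFFFFFFFF make a good test vector because all three high-half variants disagree: ref_mulh gives 0x00000000, ref_mulhsu gives 0xFFFFFFFF, and ref_mulhu gives 0xFFFFFFFE.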

B2: Implement the divider (4 pts)

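Implement the iterative restoring-division algorithm for DIV (signed), DIVU (unsigned), REM (signed remainder), REMU (unsigned remainder). Handle the two edge cases defined by the RISC-V spec. Division by zero: the quotient is all ones (0xFFFFFFFF for DIVU, -1 for DIV) and the remainder equals the dividend (REM and REMU). Signed overflow (-2^31 divided by -1, DIV and REM only): the quotient is -2^31 (0x80000000) and the remainder is 0.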
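The division semantics can be sketched in C as a reference model. This is an illustration under stated assumptions, not the required hardware structure: the type and function names are hypothetical, the unsigned core uses the compare-then-subtract form of restoring division (equivalent to a trial subtraction followed by a restore), and the signed wrapper divides magnitudes and fixes up the signs afterwards.

```c
#include <stdint.h>

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} divrem_t;

/* DIVU/REMU: one quotient bit per iteration, most significant bit first. */
static divrem_t divu_restoring(uint32_t dividend, uint32_t divisor)
{
    divrem_t out;
    if (divisor == 0) {                 /* division by zero */
        out.quotient = 0xFFFFFFFFu;
        out.remainder = dividend;
        return out;
    }
    uint64_t rem = 0;                   /* needs 33 bits during the shift */
    uint32_t quo = 0;
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        quo <<= 1;
        if (rem >= divisor) {           /* trial subtraction would not go negative */
            rem -= divisor;
            quo |= 1u;
        }                               /* else: remainder is kept (restored) */
    }
    out.quotient = quo;
    out.remainder = (uint32_t)rem;
    return out;
}

/* DIV/REM: operands are two's-complement bit patterns held in uint32_t. */
static divrem_t div_signed(uint32_t dividend, uint32_t divisor)
{
    divrem_t out;
    if (divisor == 0) {                 /* quotient -1, remainder = dividend */
        out.quotient = 0xFFFFFFFFu;
        out.remainder = dividend;
        return out;
    }
    if (dividend == 0x80000000u && divisor == 0xFFFFFFFFu) {
        out.quotient = 0x80000000u;     /* signed overflow: -2^31 / -1 */
        out.remainder = 0;
        return out;
    }
    int neg_dividend = (int)(dividend >> 31);
    int neg_divisor = (int)(divisor >> 31);
    uint32_t mag_dividend = neg_dividend ? (0u - dividend) : dividend;
    uint32_t mag_divisor = neg_divisor ? (0u - divisor) : divisor;

    out = divu_restoring(mag_dividend, mag_divisor);
    if (neg_dividend != neg_divisor)    /* quotient is negative if signs differ */
        out.quotient = 0u - out.quotient;
    if (neg_dividend)                   /* remainder takes the dividend's sign */
        out.remainder = 0u - out.remainder;
    return out;
}
```

For example, div_signed applied to -7 and 2 returns quotient -3 and remainder -1, matching the spec's rule that signed division rounds toward zero.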

Verification gate. Run the rv32um-p-* riscv-tests suite. All tests must pass.

make rv32um   # expects PASS for mul, mulh, mulhsu, mulhu, div, divu, rem, remu

Record: how many cycles does a 32-bit div take? How many cycles does a 32-bit mul (iterative) take? Note in Toolchain Diary.

Part C: Speedup measurement (10 pts)

C1: Instrument with cycle counter (2 pts)

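Add a 64-bit cycle counter that increments every clock cycle and expose it through two CSRs: mcycle (CSR address 0xB00, lower 32 bits, read with csrr rd, mcycle) and mcycleh (CSR address 0xB80, upper 32 bits, read with csrr rd, mcycleh). Addresses 0xC00 and 0xC80 belong to cycle and cycleh, the unprivileged read-only shadows of these counters; you may decode them as aliases of the same counter. This is a Zicsr stub; the full CSR implementation is Module 2. The spec defines mcycle and mcycleh as read/write, but for now implement them as read-only counters.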
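Because the counter is 64 bits wide and RV32 reads it in two halves, the lower half can roll over between the two reads. A common remedy is to read the upper half, then the lower half, then the upper half again, and retry if the two upper readings differ. The C sketch below illustrates this; the function names are hypothetical, and the non-RISC-V branch is a host-only stand-in counter so the logic can be exercised off target. Newer GNU toolchains may require zicsr in the -march string (for example rv32im_zicsr) to assemble csrr.

```c
#include <stdint.h>

#if defined(__riscv) && (__riscv_xlen == 32)
/* On target: read the two CSR halves directly. */
static inline uint32_t read_mcycle_lo(void)
{
    uint32_t value;
    __asm__ volatile ("csrr %0, mcycle" : "=r"(value));
    return value;
}
static inline uint32_t read_mcycle_hi(void)
{
    uint32_t value;
    __asm__ volatile ("csrr %0, mcycleh" : "=r"(value));
    return value;
}
#else
/* Host-only stand-in: a fake counter that advances on every read and
 * starts just below a 32-bit rollover to exercise the retry path. */
static uint64_t fake_counter = 0xFFFFFFFEull;
static uint32_t read_mcycle_lo(void)
{
    fake_counter += 1;
    return (uint32_t)fake_counter;
}
static uint32_t read_mcycle_hi(void)
{
    fake_counter += 1;
    return (uint32_t)(fake_counter >> 32);
}
#endif

/* Read a consistent 64-bit value: retry if the upper half changed
 * while the lower half was being read. */
static uint64_t read_mcycle64(void)
{
    uint32_t hi, lo, hi_again;
    do {
        hi = read_mcycle_hi();
        lo = read_mcycle_lo();
        hi_again = read_mcycle_hi();
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}
```

For a benchmark that runs for far fewer than 2^32 cycles, the lower half alone is usually sufficient, since unsigned 32-bit subtraction of end minus start still gives the right elapsed count across a single rollover.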

C2: Write the benchmark (5 pts)

Write a benchmark program in RISC-V assembly or C that:

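Reads mcycle into a register pair (lower:upper 32 bits). In assembly, prefer callee-saved registers such as s0:s1 over a0:a1, because a0 and a1 also carry the multiply operands in the steps below.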
Calls your CSA-101 Math.multiply software routine 1,000 times with arguments (12345, 67890).
Reads mcycle again; computes elapsed cycles.
Writes the software-multiply cycle count to output address 0xFFFF0000 (your simulator output port).
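Takes a new start reading of mcycle (the counter is read-only, so it cannot be reset), then executes mul a2, a0, a1 (M-extension hardware multiply) 1,000 times with a0 = 12345 and a1 = 67890.
Reads mcycle again; writes the hardware-multiply cycle count (this reading minus the new start reading) to output address 0xFFFF0004.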
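The steps above can be sketched as a small C timing harness. Everything here is illustrative: time_calls, soft_multiply, hw_multiply, and host_now are hypothetical names, soft_multiply is a shift-and-add stand-in for your CSA-101 Math.multiply routine (substitute your own), and host_now is a host-only fake counter so the harness can be checked off target. On the Tang Primer or in simulation you would pass a function that reads mcycle/mcycleh instead.

```c
#include <stdint.h>

typedef uint64_t (*cycle_reader_fn)(void);            /* returns the cycle count */
typedef uint32_t (*multiply_fn)(uint32_t, uint32_t);  /* routine under test */

/* Stand-in for the software routine (assumed shift-and-add). */
static uint32_t soft_multiply(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u)
            product += a;
        a <<= 1;
        b >>= 1;
    }
    return product;
}

/* With -march=rv32im the compiler is expected to emit a mul here. */
static uint32_t hw_multiply(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Time `calls` invocations of `multiply`; returns elapsed cycles.
 * The counter is never reset: elapsed = end stamp - start stamp. */
static uint64_t time_calls(cycle_reader_fn now, multiply_fn multiply,
                           uint32_t a, uint32_t b, uint32_t calls,
                           uint32_t *checksum)
{
    uint32_t sum = 0;
    uint64_t start = now();
    for (uint32_t i = 0; i < calls; i++)
        sum += multiply(a, b);
    uint64_t end = now();
    *checksum = sum;        /* keep the results observable */
    return end - start;
}

/* Host-only stand-in counter: advances 100 ticks per read. */
static uint64_t host_ticks;
static uint64_t host_now(void)
{
    host_ticks += 100;
    return host_ticks;
}
```

On target, the two elapsed values would be written through a volatile uint32_t pointer to 0xFFFF0000 and 0xFFFF0004 respectively. An optimizing compiler may hoist or fold a multiply with constant operands, so inspect the disassembly to confirm that the loop really executes 1,000 mul instructions, or write the timed loops in assembly. Comparing the two checksums is a cheap way to confirm that both methods compute the same product.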

C3: Run and record (3 pts)

Run the benchmark on:

Tang Primer 25K with your bitstream (full silicon timing; requires synthesis + flash).
Verilator simulation (functional correctness without silicon timing; faster to iterate).

Record the results in this table:

Method	Total cycles (1,000 calls)	Cycles per call	Ratio
Math.multiply (software loop)			baseline
mul (hardware, iterative)			
mul (hardware, DSP block)			(if you implemented Option 2)

Expected: hardware mul should be approximately 1,000x faster per call than the software loop.
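The "Cycles per call" and "Ratio" columns follow directly from the measured totals. The C sketch below shows the arithmetic; the function names are hypothetical, and the loop_overhead parameter is an optional refinement (the cycle count of the same loop with an empty body) that is an assumption of this sketch rather than a lab requirement. Pass 0 to ignore it.

```c
#include <stdint.h>

/* Average cycles per call, optionally subtracting measured loop overhead.
 * Returns 0.0 for inputs that cannot give a meaningful average. */
static double cycles_per_call(uint64_t total_cycles, uint64_t loop_overhead,
                              uint32_t calls)
{
    if (calls == 0 || loop_overhead > total_cycles)
        return 0.0;
    return (double)(total_cycles - loop_overhead) / (double)calls;
}

/* Speedup of hardware over software: software cycles per call divided
 * by hardware cycles per call. */
static double speedup_ratio(double software_per_call, double hardware_per_call)
{
    if (hardware_per_call <= 0.0)
        return 0.0;
    return software_per_call / hardware_per_call;
}
```

Whether or not you subtract loop overhead, state your choice in the explanation that accompanies the table, since it changes the ratio noticeably when the hardware multiply takes only a few cycles.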

Submit: the completed table, your benchmark source, and a screenshot of the post-synthesis resource utilization report for your Tang Primer 25K bitstream showing the LUT4 and DSP counts.

Grading

Part	Criteria	Points
A1	Register file widened; existing tests still pass	2
A2	LUI/AUIPC/JAL correct per test vectors	4
A3	All new instructions correct; rv32ui-p-* 38/38 PASS	6
B1	All mul variants correct; rv32um-p-mul* PASS	4
B2	All div/rem variants correct; rv32um-p-div* and rv32um-p-rem* PASS; edge cases handled	4
C1	mcycle/mcycleh readable; increments each cycle	2
C2	Benchmark measures both methods correctly	5
C3	Completed table; plausible speedup ratio with explanation	3
Total		30