Total points: 30
Estimated time: 4 hours
Prerequisites: CSA-101 Tang Primer 25K development environment; riscv32-unknown-elf toolchain; Verilator or iverilog; riscv-tests suite installed
Overview
This lab has three parts. Part A extends your CSA-101 CPU to the full RV32I base ISA and verifies it with riscv-tests. Part B adds the M extension and verifies it the same way. In Part C you write a benchmark program that quantifies the speedup of hardware multiply over your Math.multiply software loop. The expected speedup is approximately 1,000x.
Part A: Full RV32I base (12 pts)
A1: Widen the register file (2 pts)
In hdl/core/regfile.v, change the register file from 8 entries to 32 entries:
// before:
reg [31:0] rf [0:7];
// after:
reg [31:0] rf [0:31];
Update the read and write port address widths from 3 bits to 5 bits throughout the file and in any testbenches that reference it. Verify: your existing CSA-101 test programs still run correctly (x0-x7 behavior is unchanged).
A2: Add U-type and J-type instructions (4 pts)
Add decoder support and execution logic for:
- LUI rd, imm20: rd = imm20 << 12
- AUIPC rd, imm20: rd = PC + (imm20 << 12)
- JAL rd, imm21: rd = PC + 4; PC = PC + sign_extend(imm21) (note: the imm21 bits are scrambled in the J-type encoding; implement the bit reassembly carefully)
Write three hand-encoded test vectors (one per instruction) and verify the results in simulation.
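The hand encoding and the J-type bit reassembly can be cross-checked on the host with a small C model. This is a minimal sketch, not part of the lab's provided files: the function names `encode_u`, `encode_jal`, and `decode_jal_imm` are hypothetical, and the field positions follow the standard RV32I U-type and J-type layouts.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a U-type instruction: LUI uses opcode 0x37, AUIPC uses 0x17. */
static uint32_t encode_u(uint32_t opcode, uint32_t rd, uint32_t imm20)
{
    assert(rd < 32 && imm20 < (1u << 20));
    return (imm20 << 12) | (rd << 7) | opcode;
}

/* Encode JAL rd, offset. offset is a byte offset: even, in [-2^20, 2^20). */
static uint32_t encode_jal(uint32_t rd, int32_t offset)
{
    assert(rd < 32 && (offset & 1) == 0);
    assert(offset >= -(1 << 20) && offset < (1 << 20));
    uint32_t imm = (uint32_t)offset;
    return (((imm >> 20) & 0x1u)   << 31)   /* imm[20]    -> inst[31]    */
         | (((imm >> 1)  & 0x3FFu) << 21)   /* imm[10:1]  -> inst[30:21] */
         | (((imm >> 11) & 0x1u)   << 20)   /* imm[11]    -> inst[20]    */
         | (((imm >> 12) & 0xFFu)  << 12)   /* imm[19:12] -> inst[19:12] */
         | (rd << 7) | 0x6Fu;               /* JAL opcode                */
}

/* Reassemble and sign-extend the J-type immediate, as the decoder must. */
static int32_t decode_jal_imm(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1u)   << 20)
                 | (((inst >> 12) & 0xFFu)  << 12)
                 | (((inst >> 20) & 0x1u)   << 11)
                 | (((inst >> 21) & 0x3FFu) << 1);
    /* Sign-extend from bit 20 using only unsigned arithmetic. */
    return (int32_t)((imm ^ 0x100000u) - 0x100000u);
}
```

For example, `encode_u(0x37, 5, 0x12345)` yields 0x123452B7 (LUI x5, 0x12345) and `encode_jal(1, 8)` yields 0x008000EF (JAL x1, +8). Feeding the encoded word back through `decode_jal_imm` should return the original offset, which is the same round trip your Verilog decoder has to get right.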
A3: Add shifts, SLT, byte/half loads and stores (6 pts)
Add decoder and execute-stage support for:
- SLLI rd, rs1, shamt: logical left shift
- SRLI rd, rs1, shamt: logical right shift (zero-fill)
- SRAI rd, rs1, shamt: arithmetic right shift (sign-extend; distinguished by funct7[5]=1)
- SLT rd, rs1, rs2: signed less-than comparison; result is 0 or 1
- SLTU rd, rs1, rs2: unsigned less-than comparison
- LB rd, imm12(rs1), LH rd, imm12(rs1): signed byte/halfword load (sign-extend to 32 bits)
- LBU rd, imm12(rs1), LHU rd, imm12(rs1): unsigned byte/halfword load (zero-extend)
- SB rs2, imm12(rs1), SH rs2, imm12(rs1): byte/halfword store (write low 8/16 bits only)
- BLT rs1, rs2, imm13: signed branch-if-less-than
- BGE rs1, rs2, imm13: signed branch-if-greater-or-equal
- BLTU rs1, rs2, imm13: unsigned BLT
- BGEU rs1, rs2, imm13: unsigned BGE
- FENCE: treat as NOP for single-core simulation
- ECALL, EBREAK: treat as halt/trap stub (full implementation in Module 2)
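The signed/unsigned distinctions in this list are where most mistakes happen, so a host-side reference model in C can help when a riscv-test fails. The sketch below is an illustration, not the lab's reference solution. It assumes a little-endian, word-wide data memory with byte lanes selected by the low address bits, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SRLI: logical right shift, zero-fill. */
static uint32_t srl(uint32_t x, uint32_t shamt)
{
    assert(shamt < 32);
    return x >> shamt;
}

/* SRAI: arithmetic right shift, built from unsigned operations only. */
static uint32_t sra(uint32_t x, uint32_t shamt)
{
    assert(shamt < 32);
    /* If the sign bit is set, fill the vacated upper bits with ones. */
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> shamt) : 0u;
    return (x >> shamt) | fill;
}

/* SLT compares as signed, SLTU as unsigned; both return 0 or 1. */
static uint32_t slt(uint32_t a, uint32_t b)  { return (int32_t)a < (int32_t)b; }
static uint32_t sltu(uint32_t a, uint32_t b) { return a < b; }

/* LB / LBU: pick one byte lane of the fetched word, then extend. */
static uint32_t load_byte(uint32_t word, uint32_t addr, int is_unsigned)
{
    uint32_t b = (word >> (8u * (addr & 3u))) & 0xFFu;
    if (!is_unsigned && (b & 0x80u))
        b |= 0xFFFFFF00u;               /* sign-extend bit 7 */
    return b;
}

/* LH / LHU: pick the lower or upper halfword, then extend. */
static uint32_t load_half(uint32_t word, uint32_t addr, int is_unsigned)
{
    uint32_t h = (word >> (8u * (addr & 2u))) & 0xFFFFu;
    if (!is_unsigned && (h & 0x8000u))
        h |= 0xFFFF0000u;               /* sign-extend bit 15 */
    return h;
}

/* SB: replace one byte lane of the old word with the low 8 bits of rs2. */
static uint32_t store_byte(uint32_t old_word, uint32_t addr, uint32_t rs2)
{
    uint32_t sh = 8u * (addr & 3u);
    return (old_word & ~(0xFFu << sh)) | ((rs2 & 0xFFu) << sh);
}
```

With the word 0x11223380 in memory, `load_byte(word, 0, 0)` gives 0xFFFFFF80 (LB) while `load_byte(word, 0, 1)` gives 0x00000080 (LBU). Likewise `slt(0xFFFFFFFF, 1)` is 1 because the first operand is -1 when signed, whereas `sltu(0xFFFFFFFF, 1)` is 0.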
Verification gate. Run the rv32ui-p-* riscv-tests suite against your Verilator simulation. All 38 tests must report PASS before you proceed to Part B.
cd sim/
make rv32ui # runs all rv32ui-p-* tests; expects 38/38 PASS
Record: which test took the most cycles? Which was fastest? Note in Toolchain Diary.
Part B: M extension (8 pts)
B1: Implement the multiplier (4 pts)
Create hdl/core/muldiv.v containing a multi-cycle multiplier and divider. The multiplier can be:
Option 1 (simpler, slower): Iterative shift-and-add. 32 cycles to produce the 64-bit product.
Option 2 (faster, more LUT-efficient): Use a single DSP block. The Gowin GW5A-25 has four 18x18 signed DSP blocks. Instantiate MULTADDALU18X18 from the Gowin IP library to compute the partial products.
Implement all four multiply instructions: MUL (low 32 bits), MULH (signed-signed high 32 bits), MULHSU (signed-unsigned high 32), MULHU (unsigned-unsigned high 32).
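One way to reason about the four variants is to compute a single unsigned 64-bit product and then correct the high word for signed operands. The C sketch below models Option 1 (one loop iteration per clock cycle) and derives MULH and MULHSU from MULHU. It is a host-side reference under those assumptions, not a description of how your muldiv.v must be organised, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32x32 -> 64 shift-and-add: 32 iterations, one per cycle. */
static uint64_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    uint64_t addend = a;          /* multiplicand, shifted left each step */
    uint32_t bits = b;            /* multiplier, consumed LSB first       */
    for (int i = 0; i < 32; i++) {
        if (bits & 1u)
            acc += addend;
        addend <<= 1;
        bits >>= 1;
    }
    assert(acc == (uint64_t)a * b);   /* host-side cross-check only */
    return acc;
}

/* MUL: low 32 bits are the same for signed and unsigned operands. */
static uint32_t mul(uint32_t a, uint32_t b)
{
    return (uint32_t)shift_add_mul(a, b);
}

/* MULHU: high 32 bits of the unsigned product. */
static uint32_t mulhu(uint32_t a, uint32_t b)
{
    return (uint32_t)(shift_add_mul(a, b) >> 32);
}

/* MULHSU: rs1 signed, rs2 unsigned. A negative rs1 equals its unsigned
 * reading minus 2^32, so subtract rs2 from the high word. */
static uint32_t mulhsu(uint32_t a, uint32_t b)
{
    uint32_t hi = mulhu(a, b);
    if (a & 0x80000000u) hi -= b;
    return hi;
}

/* MULH: both signed. Apply the same correction for each negative operand. */
static uint32_t mulh(uint32_t a, uint32_t b)
{
    uint32_t hi = mulhu(a, b);
    if (a & 0x80000000u) hi -= b;
    if (b & 0x80000000u) hi -= a;
    return hi;
}
```

As a spot check, with both operands 0xFFFFFFFF: MULHU returns 0xFFFFFFFE, MULH returns 0 (since -1 times -1 is 1), and MULHSU returns 0xFFFFFFFF. In hardware the same correction can be applied either after the loop or by sign-extending the operands before multiplying; which is cheaper depends on your datapath.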
B2: Implement the divider (4 pts)
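Implement the iterative restoring-division algorithm for DIV (signed), DIVU (unsigned), REM (signed remainder), and REMU (unsigned remainder). Handle division by zero as the RISC-V spec requires: the quotient is all ones (0xFFFFFFFF for DIVU, which reads as -1 for DIV) and the remainder equals the dividend for both REM and REMU. Also handle signed overflow: DIV of -2^31 by -1 returns -2^31, and the matching REM returns 0.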
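The algorithm and its edge cases can be modelled in C before committing to Verilog. The sketch below is one possible formulation, assuming one quotient bit per iteration and signed division done by dividing magnitudes and fixing signs afterwards. The type `divres_t` and the function names are hypothetical. The compare-then-subtract step is equivalent to a trial subtraction followed by a restore when the result would be negative.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t q, r; } divres_t;

/* DIVU / REMU: unsigned restoring division, 32 iterations. */
static divres_t divu_restoring(uint32_t dividend, uint32_t divisor)
{
    divres_t out;
    if (divisor == 0) {                 /* RISC-V: q = all ones, r = dividend */
        out.q = 0xFFFFFFFFu;
        out.r = dividend;
        return out;
    }
    uint64_t rem = 0;                   /* needs 33 bits during the shift */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1u);  /* bring down next bit */
        if (rem >= divisor) {           /* trial subtraction succeeds */
            rem -= divisor;
            q |= 1u << i;
        }                               /* otherwise rem is left "restored" */
    }
    assert(rem < divisor);              /* invariant of a finished division */
    out.q = q;
    out.r = (uint32_t)rem;
    return out;
}

/* DIV / REM: divide magnitudes, then fix the signs. */
static divres_t div_signed(uint32_t a, uint32_t b)
{
    divres_t out;
    if (b == 0) {                       /* q = -1, r = dividend */
        out.q = 0xFFFFFFFFu;
        out.r = a;
        return out;
    }
    if (a == 0x80000000u && b == 0xFFFFFFFFu) {   /* -2^31 / -1 overflows */
        out.q = 0x80000000u;
        out.r = 0;
        return out;
    }
    uint32_t a_neg = a >> 31, b_neg = b >> 31;
    uint32_t ua = a_neg ? (0u - a) : a; /* two's-complement negate */
    uint32_t ub = b_neg ? (0u - b) : b;
    divres_t u = divu_restoring(ua, ub);
    out.q = (a_neg ^ b_neg) ? (0u - u.q) : u.q;   /* sign = xor of operand signs */
    out.r = a_neg ? (0u - u.r) : u.r;             /* remainder follows dividend  */
    return out;
}
```

For example, dividing -7 by 2 gives quotient -3 and remainder -1, since RISC-V division rounds toward zero and the remainder takes the sign of the dividend.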
Verification gate. Run the rv32um-p-* riscv-tests suite. All tests must pass.
make rv32um # expects PASS for mul, mulh, mulhsu, mulhu, div, divu, rem, remu
Record: how many cycles does a 32-bit div take? How many cycles does a 32-bit mul (iterative) take? Note in Toolchain Diary.
Part C: Speedup measurement (10 pts)
C1: Instrument with cycle counter (2 pts)
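Add a read-only 64-bit cycle counter to your core that increments every clock cycle, with the lower 32 bits readable as mcycle and the upper 32 bits as mcycleh. Check the CSR addresses against the RISC-V spec: 0xC00 and 0xC80 are the unprivileged read-only cycle and cycleh CSRs (what rdcycle and rdcycleh read), while the machine-mode mcycle and mcycleh live at 0xB00 and 0xB80. An assembler therefore encodes csrr rd, mcycle with address 0xB00, so decode whichever address pair your benchmark actually uses, or map both pairs onto the same counter. This is a Zicsr stub; the full CSR implementation is Module 2.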
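Because RV32 reads the 64-bit counter in two halves, the low word can wrap between the two reads. The usual remedy is to read high, low, high and retry if the two high reads differ. The C sketch below shows that loop with the CSR reads abstracted behind function pointers so it can run on a host; the hook type `csr_read_fn` and the mock functions are hypothetical stand-ins for the two csrr instructions on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks: on the target each would wrap a single csrr. */
typedef uint32_t (*csr_read_fn)(void);

/* Read a 64-bit counter from two 32-bit halves without tearing. */
static uint64_t read_cycles64(csr_read_fn read_lo, csr_read_fn read_hi)
{
    assert(read_lo && read_hi);
    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi();
        lo  = read_lo();
        hi2 = read_hi();
    } while (hi != hi2);        /* low word wrapped mid-read: try again */
    return ((uint64_t)hi << 32) | lo;
}

/* Host-side mock: a counter that advances by one on every read,
 * started two counts below a low-word wrap to force one retry. */
static uint64_t mock_counter = 0x00000001FFFFFFFEull;

static uint32_t mock_read_lo(void)
{
    mock_counter += 1;
    return (uint32_t)mock_counter;
}

static uint32_t mock_read_hi(void)
{
    mock_counter += 1;
    return (uint32_t)(mock_counter >> 32);
}
```

With the mock starting at 0x1FFFFFFFE, the first pass sees a high word of 1, a low word of 0, and then a high word of 2, so it retries and returns 0x200000003. A naive high-then-low read would have combined the stale high word with the wrapped low word and been off by 2^32. For a 1,000-call benchmark the low word alone is normally enough, but the loop costs little and removes the corner case.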
C2: Write the benchmark (5 pts)
Write a benchmark program in RISC-V assembly or C that:
- Reads mcycle into a0:a1 (lower:upper 32 bits).
- Calls your CSA-101 Math.multiply software routine 1,000 times with arguments (12345, 67890).
- Reads mcycle again; computes elapsed cycles.
- Writes the software-multiply cycle count to output address 0xFFFF0000 (your simulator output port).
- Resets the measurement by taking a fresh mcycle reading (the counter itself is read-only), reloads a0 and a1 with the operands, then executes mul a2, a0, a1 (M-extension hardware multiply) 1,000 times.
- Reads mcycle again; writes the hardware-multiply cycle count to output address 0xFFFF0004.
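If you write the benchmark in C, the timing logic might be structured as in the sketch below. It is a host-runnable illustration, not the required implementation: the `now_fn` and `mul_fn` hook types are hypothetical, and the two mock multiply functions advance a fake clock by arbitrary amounts (50 and 2) purely so the harness can be exercised without hardware. Those numbers are not measurements and say nothing about the ratio you should expect.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*now_fn)(void);                 /* returns current cycle count */
typedef uint32_t (*mul_fn)(uint32_t, uint32_t);   /* routine under test          */

/* Time n back-to-back calls of f(a, b); returns elapsed cycles.
 * Loop overhead is included, and is the same for every routine timed. */
static uint64_t time_calls(mul_fn f, now_fn now, uint32_t a, uint32_t b, uint32_t n)
{
    assert(f && now && n > 0);
    volatile uint32_t sink = 0;     /* keeps the compiler from deleting the calls */
    uint64_t start = now();
    for (uint32_t i = 0; i < n; i++)
        sink = f(a, b);
    uint64_t end = now();
    (void)sink;
    return end - start;             /* unsigned subtraction: no reset needed */
}

/* Host-side mocks with ARBITRARY per-call costs, for testing the harness. */
static uint64_t fake_clock;

static uint64_t fake_now(void) { return fake_clock; }

static uint32_t fake_slow_mul(uint32_t a, uint32_t b)
{
    fake_clock += 50;               /* arbitrary stand-in cost */
    return a * b;
}

static uint32_t fake_fast_mul(uint32_t a, uint32_t b)
{
    fake_clock += 2;                /* arbitrary stand-in cost */
    return a * b;
}
```

On the target, `now` would read the cycle counter, the two `mul_fn` arguments would be your Math.multiply routine and a wrapper around the mul instruction, and each returned count would be written to the corresponding output address. Computing elapsed time as a difference of two readings is what makes a read-only counter sufficient.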
C3: Run and record (3 pts)
Run the benchmark on:
- Tang Primer 25K with your bitstream (full silicon timing; requires synthesis + flash).
- Verilator simulation (functional correctness without silicon timing; faster to iterate).
Record the results in this table:
| Method | Total cycles (1,000 calls) | Cycles per call | Ratio |
|---|---|---|---|
| Math.multiply (software loop) | | | baseline |
| mul (hardware, iterative) | | | |
| mul (hardware, DSP block) (if you implemented option 2) | | | |
Expected: hardware mul should be approximately 1,000x faster per call than the software loop.
Submit: the completed table + your benchmark source + a screenshot of the Tang Primer 25K synthesized bitstream utilization report showing LUT4 and DSP counts.
Grading
| Part | Criteria | Points |
|---|---|---|
| A1 | Register file widened; existing tests still pass | 2 |
| A2 | LUI/AUIPC/JAL correct per test vectors | 4 |
| A3 | All new instructions correct; rv32ui-p-* 38/38 PASS | 6 |
| B1 | All mul variants correct; rv32um-p-mul* PASS | 4 |
| B2 | All div/rem variants correct; rv32um-p-div* and rv32um-p-rem* PASS; edge cases handled | 4 |
| C1 | mcycle/mcycleh readable; increments each cycle | 2 |
| C2 | Benchmark measures both methods correctly | 5 |
| C3 | Completed table; plausible speedup ratio with explanation | 3 |
| Total | | 30 |