Week 5: Computer Architecture — CPU Integration and First Boot · CSA-110

Everything you have built since Week 1 connects here. The ALU, the register file, the instruction decoder, and the memory module become a single cpu.v module. You synthesize it to a bitstream, flash it to the Tang Primer 25K, and run the sum-to-N program from Week 4 on real hardware. This is the CSA-110 milestone week.

Reading

Petzold weave anchors. Ch 17 (Automation, p. 239), second visit: this time read for the description of the control unit that sequences instruction fetch, decode, and execute. Ch 18 (From Abaci to Chips, p. 253) traces the progression from sequential relay logic to synchronous digital circuits. ~30 pages.
Cross-chapter handout. handouts/cross-chapter-rv32i-lite-encoding-card.md: keep this open for the instruction-decoder lab.

Lecture

3 hours. Key arc:

The fetch-decode-execute cycle. At each clock edge, the CPU:

Fetches the instruction at mem[PC]
Decodes it: extracts opcode, rd, rs1, rs2, funct3, funct7, immediate
Reads the register file: retrieves the values held in rs1 and rs2
Executes: passes the rs1 value and either the rs2 value or the immediate to the ALU, which computes a result or a memory address
Writes back: stores the ALU result or memory read into rd; updates PC

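Your RV32I-Lite CPU is a single-cycle design: all five steps happen in one clock cycle. The clock period must therefore cover the total delay of the longest path through all five steps, not just the slowest one (the 32-bit ripple-carry adder from Week 2 is typically the largest single contributor). CSA-201 pipelining separates the steps into five stages with registers between them, so the clock is limited only by the slowest stage and can run faster.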
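The five steps can be modeled in software before any Verilog is wired up. The following is a minimal Python sketch of a reference model, not the course's cpu.v. It assumes RV32I-Lite keeps the standard RV32I field positions and the standard ADD and ADDI opcodes, and that x0 reads as zero; check the encoding card before relying on either. The names step, run, and PROGRAM are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    sign = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign) - sign

def step(pc, regs, imem):
    """Model one clock cycle of a single-cycle CPU; return the next PC."""
    instr = imem[pc // 4]                    # 1. fetch mem[PC] (word-addressed list)
    opcode = instr & 0x7F                    # 2. decode the fields
    rd = (instr >> 7) & 0x1F
    rs1 = (instr >> 15) & 0x1F
    rs2 = (instr >> 20) & 0x1F
    imm_i = sign_extend(instr >> 20, 12)
    a, b = regs[rs1], regs[rs2]              # 3. read the register file
    if opcode == 0b0110011:                  # 4. execute: R-type, ADD only here
        result = (a + b) & MASK32
    elif opcode == 0b0010011:                #    I-type, ADDI
        result = (a + imm_i) & MASK32
    else:
        raise ValueError(f"unsupported opcode {opcode:07b}")
    if rd != 0:                              # 5. write back (assumes x0 stays zero)
        regs[rd] = result
    return pc + 4                            #    and update the PC

def run(program):
    """Execute a straight-line program and return the final register values."""
    regs = [0] * 32   # sized to the 5-bit field; RV32I-Lite uses only 8 registers
    pc = 0
    while pc // 4 < len(program):
        pc = step(pc, regs, program)
    return regs

# addi x1, x0, 5 ; addi x2, x0, 7 ; add x3, x1, x2
PROGRAM = [0x00500093, 0x00700113, 0x002081B3]
```

Calling run(PROGRAM) leaves 12 in register 3. A model like this gives you expected register values to compare against your simulation waveform.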

The instruction decoder. The decoder reads the 32-bit instruction word and asserts control signals:

module decoder (
    input  wire [31:0] instr,
    output reg         reg_we,      // write enable for register file
    output reg         mem_we,      // write enable for memory
    output reg [2:0]   alu_op,      // which ALU operation
    output reg         alu_src_b,   // 0=register, 1=immediate
    output reg         branch,      // is this a branch?
    output reg         jump,        // is this a JAL/JALR?
    output reg  [1:0]  mem_to_reg,  // 0=ALU result, 1=mem read, 2=PC+4
    ...
);
    wire [6:0] opcode = instr[6:0];
    wire [2:0] funct3 = instr[14:12];
    wire [6:0] funct7 = instr[31:25];
    // ...
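    always @(*) begin
        // Defaults first: every output is driven on every path, so no latches are inferred
        reg_we = 0; mem_we = 0; alu_src_b = 0;
        alu_op = 3'b000; branch = 0; jump = 0; mem_to_reg = 2'd0;
        case (opcode)
            7'b0110011: begin // R-type (ADD, SUB, AND, OR)
                reg_we = 1; mem_we = 0; alu_src_b = 0;
                // funct3 + funct7 determine which ALU op
            end
            7'b0010011: begin // I-type (ADDI)
                reg_we = 1; mem_we = 0; alu_src_b = 1;
                alu_op = 3'b000; // add
            end
            // ...
            default: ; // unrecognized opcode: keep the safe defaults (no writes)
        endcase
    end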
endmodule

Wiring the data path. The complete cpu.v module instantiates and connects: pc_reg (the program counter, a 32-bit DFF), imem (instruction memory), decoder, regfile, immgen (immediate generator that sign-extends and recombines the split immediate fields), alu, dmem (data memory), and writeback mux.
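The "recombines the split immediate fields" step in immgen is the one most people get wrong, so a software model helps when checking testbench vectors. The following Python sketch assumes RV32I-Lite uses the standard RV32I immediate layouts for the I, S, B, and J formats; the encoding card is the authority if it differs. The helper names bits and immgen are illustrative and do not come from the lab files.

```python
def bits(word, hi, lo):
    """Return word[hi:lo] as an unsigned integer, like a Verilog part-select."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret a `width`-bit unsigned value as two's complement."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def immgen(instr, fmt):
    """Extract the sign-extended immediate for format 'I', 'S', 'B', or 'J'."""
    if fmt == "I":
        # imm[11:0] = instr[31:20]
        return sign_extend(bits(instr, 31, 20), 12)
    if fmt == "S":
        # imm[11:5] = instr[31:25], imm[4:0] = instr[11:7]
        return sign_extend(bits(instr, 31, 25) << 5 | bits(instr, 11, 7), 12)
    if fmt == "B":
        # imm[12|10:5] = instr[31|30:25], imm[4:1|11] = instr[11:8|7], imm[0] = 0
        raw = (bits(instr, 31, 31) << 12 | bits(instr, 7, 7) << 11
               | bits(instr, 30, 25) << 5 | bits(instr, 11, 8) << 1)
        return sign_extend(raw, 13)
    if fmt == "J":
        # imm[20|10:1|11|19:12] = instr[31|30:21|20|19:12], imm[0] = 0
        raw = (bits(instr, 31, 31) << 20 | bits(instr, 19, 12) << 12
               | bits(instr, 20, 20) << 11 | bits(instr, 30, 21) << 1)
        return sign_extend(raw, 21)
    raise ValueError(f"unknown format {fmt!r}")
```

For example, immgen(0xFE000EE3, "B") returns -4 (a BEQ that branches back one instruction), and immgen(0x008000EF, "J") returns 8. Each line of the model maps onto one concatenation in the Verilog.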

The 6502 comparison. Arlet's cpu.v in CSA-101 is a state-machine CPU: it uses a multi-cycle approach in which instructions take 2-7 clock cycles, because the 6502 moves one byte per cycle over its 8-bit bus and needs separate cycles for the opcode, the operand bytes, and any data access. Your RV32I-Lite single-cycle design is architecturally simpler to understand because every instruction takes exactly one cycle. The cost is that the clock must be slow enough for the longest instruction path (typically the load instruction: fetch → decode → read rs1 → add immediate → read memory → write rd).

The Virtus OS forward pointer. The CPU you build this week runs machine code. Weeks 6-12 build the toolchain that generates machine code from higher-level representations. Week 13 uses that toolchain to build Virtus OS, which then runs on this CPU.

Lab exercises

Five labs in labs/lab-5.md. Plan for ~6 hours. This is the most rewarding lab week.

Lab 5.1. Write decoder.v. Run lab5_decoder_tb.v, which feeds the decoder all 11 RV32I-Lite opcode variants and checks the control signal outputs.
Lab 5.2. Write immgen.v (immediate generator). It extracts and sign-extends the immediate from any instruction format. Run the testbench for all four formats (I, S, B, J). Pay attention to B-type and J-type where the immediate bits are reordered.
Lab 5.3. Wire cpu.v: instantiate and connect all modules from Weeks 2-5. The instruction memory (imem) is initialized from a hex file. Load the sum-to-N hex from Lab 4.4 and verify the result register holds 55 (sum of 1..10) in simulation.
Lab 5.4. Synthesize to a Tang Primer 25K bitstream. Check the synthesis report for LUT count, Fmax, and BRAM utilization. Flash the bitstream and observe the UART output showing the sum-to-N result.
Lab 5.5 (seeded failure). The testbench for Lab 5.3 includes one deliberately broken instruction (a BEQ with an off-by-one branch offset). Find it, fix it in the hex file, and verify the corrected program produces the right answer. Record the debugging session in your Toolchain Diary.
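One technique for checking a suspect branch in a hex file is to re-encode the instruction you intended and compare it with the word that is actually there. The following Python sketch assumes the standard RV32I B-type layout and BEQ opcode. The function name encode_beq is illustrative, and the example words do not reproduce the lab's seeded bug.

```python
def encode_beq(rs1, rs2, offset):
    """Encode BEQ rs1, rs2, offset, where offset is in bytes relative to the branch's own PC."""
    if offset % 2 != 0 or not -4096 <= offset < 4096:
        raise ValueError("B-type offset must be even and fit in 13 signed bits")
    imm = offset & 0x1FFF                      # 13-bit two's-complement view
    return (((imm >> 12) & 0x1) << 31          # imm[12]   -> instr[31]
            | ((imm >> 5) & 0x3F) << 25        # imm[10:5] -> instr[30:25]
            | rs2 << 20
            | rs1 << 15
            | 0b000 << 12                      # funct3 for BEQ
            | ((imm >> 1) & 0xF) << 8          # imm[4:1]  -> instr[11:8]
            | ((imm >> 11) & 0x1) << 7         # imm[11]   -> instr[7]
            | 0b1100011)                       # branch opcode

def matches(word, rs1, rs2, offset):
    """True if a word from the hex file equals the intended BEQ encoding."""
    return word == encode_beq(rs1, rs2, offset)
```

For example, encode_beq(0, 0, -4) gives 0xFE000EE3 and encode_beq(0, 0, -8) gives 0xFE000CE3. A branch that is off by one instruction differs in a single hex digit, which is why it is easy to miss when reading the file by eye.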

Independent practice

Compare your cpu.v line count against Arlet's cpu.v from CSA-101. Note the difference in complexity and explain it in one paragraph in your Toolchain Diary. What does Arlet's 6502 core do that your single-cycle RV32I-Lite core does not need to do?
Look at the synthesis Fmax report from Lab 5.4. Which signal path is the critical path (the one limiting Fmax)? Is it in the ALU, the decoder, or the memory path?
Read the "What CSA-110 deliberately omits" sidebar in the Week 5 chapter prose. These are the same omissions the original CSA-101 made; CSA-201 closes them one by one.

What CSA-110 deliberately omits (named here so students can find them in CSA-201)

Feature	Why omitted	Where it lands
Pipelining	Adds hazard detection and forwarding; not needed to run correct programs	CSA-201 lecture (not a graded lab)
Full 32-register file	8 registers is enough to teach the concept; 32 is the real count	CSA-201 Module 1
Full RV32I instruction set (47 ops)	11 is enough to implement a complete toolchain	CSA-201 Module 1
CSR instructions + privileged ISA	Needed for OS trap handling and timers	CSA-201 Module 2
Hardware multiplier	Math.multiply works in software at the cost of ~1,000 cycles	CSA-201 Module 1
Branch predictor	Nothing to predict: a single-cycle CPU resolves each branch in the same cycle it is fetched	CSA-201 lecture
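To see why the missing hardware multiplier costs so many cycles, here is a minimal Python sketch of a shift-and-add multiply that uses only addition, AND, and a loop. It illustrates the general technique; the course's actual Math.multiply may be written differently. Doubling by addition is an assumption, made so that the sketch does not depend on a shift instruction.

```python
MASK32 = 0xFFFFFFFF

def multiply(x, y):
    """Return the low 32 bits of x * y using only add, AND, and a 32-step loop."""
    product = 0
    shifted_x = x & MASK32     # x * 2**i at step i
    mask = 1                   # selects bit i of y
    for _ in range(32):
        if y & mask:
            product = (product + shifted_x) & MASK32
        shifted_x = (shifted_x + shifted_x) & MASK32   # double by adding to itself
        mask = (mask + mask) & MASK32
    return product
```

The loop always runs 32 iterations, each of which is several instructions on the CPU, so a software multiply costs far more than the single cycle a hardware multiplier would need.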

Architecture comparison sidebar

Single-cycle CPU vs 6502 state machine vs ARMv8 out-of-order.

Your RV32I-Lite single-cycle CPU executes one instruction per clock cycle. The Fmax is limited by the longest combinational path across one cycle (typically the load-word path: PC fetch → decode → register read → ALU → data memory → register write). A 4 MHz clock with a 250ns period is achievable for a ripple-carry adder design on the Tang Primer 25K; a carry-lookahead adder (CSA-201) can push this to 20-50 MHz.
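The relationship between path delay and Fmax is simple arithmetic, sketched below in Python. The per-step delays are invented placeholders chosen to add up to the 250 ns period mentioned above; they are not measurements from the Tang Primer 25K, and the pipelined estimate ignores register setup and clock-to-output overhead.

```python
# Hypothetical delays in nanoseconds for each step of the load-word path.
STEP_DELAYS_NS = {
    "fetch": 40.0,
    "decode": 20.0,
    "register_read": 20.0,
    "alu_ripple_carry": 130.0,
    "data_memory": 40.0,
}

def period_ns(freq_mhz):
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / freq_mhz

def single_cycle_fmax_mhz(delays_ns):
    """Single-cycle: the whole path must settle in one period, so delays add."""
    return 1000.0 / sum(delays_ns)

def pipelined_fmax_mhz(delays_ns):
    """Ideal pipeline: only the slowest stage must settle in one period."""
    return 1000.0 / max(delays_ns)
```

With these placeholder numbers, single_cycle_fmax_mhz(STEP_DELAYS_NS.values()) is 4.0 MHz, while the ideal pipelined figure is set by the 130 ns adder alone. Your synthesis report gives the real path delays; substitute them to see which step dominates.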

The original 6502 ran at 1-2 MHz. It was NOT single-cycle: every instruction took 2-7 clock cycles, because the 8-bit bus transfers one byte per cycle and each instruction needs several transfers (opcode, operand bytes, data) plus internal steps. Arlet's Verilog 6502 assumes synchronous memory, so with BRAM it can be clocked much faster than the original part, but the multi-cycle state machine is preserved for behavioral compatibility. Your RV32I-Lite assumes fast BRAM and exploits it.

ARMv8 out-of-order cores, such as the performance cores in Apple's M2, execute several instructions per clock cycle and, with pipelines 12 or more stages deep, keep many instructions in flight simultaneously. The large speed difference between your CSA-110 CPU and an M2 core comes from pipelining (which shortens the path that must settle in each cycle and so permits a far higher clock), the out-of-order engine, and instruction-level parallelism.

The lesson: single-cycle is correct and understandable; pipelining is the engineering path from "correct" to "fast." CSA-201 starts down that path.

Reflection prompts

Your CPU has one clock domain, one fetch-decode-execute cycle per clock, and runs from BRAM. What would you have to add to run from external DRAM (which has a 10-100ns access latency vs the BRAM's 1-cycle latency)?
Lab 5.5 introduced a seeded bug in the branch-offset encoding. What made it hard to find? What tool or technique did you use to isolate it?
You have now built both a 6502 (by reading and modifying Arlet's Verilog) and an RV32I-Lite (by writing Verilog from scratch). Which required more engineering judgment? Which required more domain knowledge? Are those different things?

What's next

The hardware is done. Weeks 6-12 build the software toolchain. Week 6 builds the assembler: a Python program that takes RV32I-Lite assembly source and emits a VOF v1 object file. The assembler is the first step in converting human-readable instructions into the encoded 32-bit words you have been writing by hand.
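As a preview of what "emitting encoded 32-bit words" means, here is a minimal Python sketch that assembles a single ADDI line. It is not the Week 6 assembler and does not model the VOF v1 format. It assumes the standard RV32I I-type layout, and the function name assemble_addi is illustrative.

```python
import re

ADDI_PATTERN = re.compile(r"\s*addi\s+x(\d+)\s*,\s*x(\d+)\s*,\s*(-?\d+)\s*")

def assemble_addi(line):
    """Assemble 'addi xD, xS, imm' into a 32-bit instruction word."""
    match = ADDI_PATTERN.fullmatch(line)
    if match is None:
        raise ValueError(f"cannot parse: {line!r}")
    rd, rs1, imm = (int(group) for group in match.groups())
    # The register fields are 5 bits wide; RV32I-Lite itself uses only 8 registers.
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register index out of range")
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate does not fit in 12 signed bits")
    return ((imm & 0xFFF) << 20    # imm[11:0] -> instr[31:20]
            | rs1 << 15
            | 0b000 << 12          # funct3 for ADDI
            | rd << 7
            | 0b0010011)           # I-type ALU opcode
```

For example, assemble_addi("addi x1, x0, 10") returns 0x00A00093, the same word you would produce by hand from the encoding card.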