Week 3: Memory — Registers and RAM

Sequential logic. A flip-flop holds one bit from one clock edge to the next; eight flip-flops side by side hold a byte; thirty-two of them hold a 32-bit register. CSA-101 built an 8-bit register. CSA-110 builds a 32-bit one and then assembles a register file of eight such registers (x0 is hardwired to zero) with two read ports and one write port.


Reading

  • Petzold weave anchors. Ch 14 (Feedback and Flip-Flops, p. 196), Ch 16 (Assembling Memory, p. 225). Ch 14 establishes the SR latch and the D flip-flop from relay logic; Ch 16 traces how flip-flops become addressable memory. ~35 pages. The flip-flop concept is the same as in CSA-101 Week 4; reread the chapters and note how the move to 32 bits changes the register file but not the underlying logic.

Lecture

3 hours. Key arc:

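The SR latch and metastability. The SR latch is two NAND gates cross-coupled so that each output feeds the other gate's input; with NAND gates, S and R are active-low. If S and R are both asserted, both outputs are forced high, and if the two inputs are then released together the latch can hover between states before settling to an unpredictable value. The clocked D flip-flop eliminates this: S and R are derived from a single d input and its complement, so they can never be asserted together, and the clock gating means data is captured only on the rising edge of CLK. CSA-101 students built this in the drag-drop simulator; here it appears in Verilog.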

The D flip-flop in Verilog.

module dff (
    input  wire       clk,
    input  wire       d,
    output reg        q
);
    always @(posedge clk)
        q <= d;
endmodule

The always @(posedge clk) block is the key idiom for all sequential logic in this course. Note the <= (non-blocking assignment): the right-hand side is evaluated at the clock edge, but the update to q is deferred until every always block triggered in the same time step has been evaluated, so all registers see each other's old values. Every register in your CPU uses this idiom.
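
To see why the non-blocking form matters, here is a minimal sketch of a two-stage shift register. The module name shift2 and its ports are hypothetical and not part of the course files; the point is only that q2 picks up the old value of q1 on each edge.

```verilog
// Hypothetical two-stage shift register (illustration only).
module shift2 (
    input  wire clk,
    input  wire d,
    output reg  q1,
    output reg  q2
);
    // Both right-hand sides are sampled before either register updates,
    // so q2 receives the OLD q1: d reaches q2 after two clock edges.
    always @(posedge clk) begin
        q1 <= d;
        q2 <= q1;
    end
    // With blocking assignments (q1 = d; q2 = q1;) q2 would take the new
    // q1 on the same edge, collapsing the two stages into one.
endmodule
```

If you swap in the blocking form and re-run a waveform, q1 and q2 should move together, which is the symptom to look for when a pipeline stage seems to vanish.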

The 32-bit register. Wire 32 DFFs in parallel: same clk, same reset, 32-bit d input, 32-bit q output. In Verilog: reg [31:0] q; always @(posedge clk) if (reset) q <= 0; else if (we) q <= d; The reset here is synchronous (it takes effect only on a rising clock edge) and takes priority over a write. The we (write enable) signal is the equivalent of the 6502's register-load control signal.
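
The one-line description above can be expanded into a complete module. This is a sketch: the module name reg32 and the exact port list are assumptions, while the behavior (synchronous reset, then write enable) follows the inline version.

```verilog
// Hypothetical module name and port list; behavior follows the lecture text.
module reg32 (
    input  wire        clk,
    input  wire        reset,  // synchronous, active-high
    input  wire        we,     // write enable
    input  wire [31:0] d,
    output reg  [31:0] q
);
    always @(posedge clk) begin
        if (reset)
            q <= 32'b0;        // reset wins over a simultaneous write
        else if (we)
            q <= d;            // load the new value
        // otherwise q holds its previous value
    end
endmodule
```

Because there is no final else branch, q simply keeps its value when we is low, which is exactly the "hold" behavior a register needs.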

The RV32I-Lite register file. Eight 32-bit registers (x0 through x7). Register x0 is hardwired zero: any write to x0 is ignored, and any read from x0 returns 0. Two asynchronous (combinational) read ports (rs1, rs2), whose data outputs follow the register addresses without waiting for a clock edge; one synchronous write port (rd). The two-read-port design is what enables three-operand instructions: ADD rd, rs1, rs2 reads rs1 and rs2 simultaneously, computes the sum, and writes it to rd at the next clock edge.

module regfile (
    input  wire        clk,
    input  wire        we,
    input  wire [2:0]  rs1, rs2, rd,
    input  wire [31:0] wdata,
    output wire [31:0] rdata1, rdata2
);
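    // Storage for x1..x7 only; x0 has no flip-flops.
    reg [31:0] rf [1:7];

    // Combinational reads: address 0 always returns zero.
    assign rdata1 = (rs1 == 3'd0) ? 32'b0 : rf[rs1];
    assign rdata2 = (rs2 == 3'd0) ? 32'b0 : rf[rs2];

    // Synchronous write; writes to x0 are silently dropped.
    always @(posedge clk)
        if (we && (rd != 3'd0)) rf[rd] <= wdata;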
endmodule

The byte-addressable RAM module. Virtus OS expects byte-addressable memory even though the CPU's data path is 32 bits wide. The RAM module accepts a 32-bit address and a 2-bit size control (byte, halfword, word) and uses them to select the addressed bytes within a 32-bit word. You implement word-only access now; Lab 3.3 extends it to byte and halfword access.
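
As a starting point, a word-only RAM might look like the sketch below. Everything about the interface is an assumption for illustration: the module name ram_word, the port names, the 4 KiB depth, and the combinational read are not taken from the lab handout, and the size input is left out because only word access is handled at this stage.

```verilog
// Hypothetical word-only RAM; name, ports, depth, and read style are assumptions.
module ram_word #(
    parameter AW = 10                    // assumed depth: 2^10 words = 4 KiB
) (
    input  wire        clk,
    input  wire        we,
    input  wire [31:0] addr,             // byte address
    input  wire [31:0] wdata,
    output wire [31:0] rdata
);
    reg [31:0] mem [0:(1 << AW) - 1];

    // Byte address -> word index: drop the two low bits.
    // Address bits above AW+1 are ignored, so the memory aliases.
    wire [AW-1:0] index = addr[AW+1:2];

    // Combinational read, matching the register file's read style.
    assign rdata = mem[index];

    // Synchronous word write.
    always @(posedge clk)
        if (we) mem[index] <= wdata;
endmodule
```

Note that byte addresses 8, 9, 10, and 11 all map to word index 2 here; the byte-level extension has to use those two dropped bits to pick a lane within the word.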

The 6502 comparison. The 6502 has 3 general-purpose 8-bit registers (A, X, Y), a 16-bit PC, and an 8-bit SP. Its RAM is byte-addressable (the natural unit for 8-bit data). CSA-110's RV32I-Lite has 8 general-purpose 32-bit registers (CSA-201 expands to 32). The extra registers let the compiler keep more values live without spilling to memory, a key advantage of RISC that the register-allocator module in CSA-201 quantifies.


Lab exercises

Four labs in labs/lab-3.md. Plan for ~5 hours.

  • Lab 3.1. Write dff.v. Simulate with lab3_dff_tb.v, which clocks through 16 cycles and checks that the output tracks the input one cycle later. Watch the waveform with GTKWave to see the rising-edge capture.
  • Lab 3.2. Write regfile.v per the lecture template. Run the testbench that writes all seven writable registers (x1 through x7) and reads them back. Verify that reading x0 always returns 0 even after a write attempt.
  • Lab 3.3. Extend your word-access RAM into mem.v, adding byte-addressable reads and writes. The size signal selects 1, 2, or 4 bytes. Byte and halfword reads are zero-extended to 32 bits by default and sign-extended for the LB and LH variants. Run lab3_mem_tb.v.
  • Lab 3.4 (metastability drill). Using the waveform from Lab 3.1, identify the setup-time window (the period before the clock edge where d must be stable). Record in Toolchain Diary: what happens to the flip-flop if d changes inside the setup window? This is the metastability hazard that affects real FPGA clock-domain-crossing designs.
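
For Lab 3.3, the read side comes down to picking a byte or halfword lane out of the aligned word and then extending it. The sketch below shows one way to do that. The module name load_extend, the size encoding (0 = byte, 1 = halfword, 2 = word), and the sext input are all assumptions for illustration; check the lab handout for the actual interface. Lane order is little-endian, as in RV32I.

```verilog
// Hypothetical load-alignment helper; the size encoding and the sext
// input are assumptions, not the lab's specified interface.
module load_extend (
    input  wire [31:0] word,     // aligned 32-bit word read from RAM
    input  wire [1:0]  offset,   // low two address bits, addr[1:0]
    input  wire [1:0]  size,     // assumed: 0 = byte, 1 = halfword, 2 = word
    input  wire        sext,     // 1 = sign-extend (LB/LH), 0 = zero-extend
    output reg  [31:0] rdata
);
    // Little-endian lane select: offset 0 is the least significant byte.
    wire [7:0]  b = word[8*offset +: 8];
    wire [15:0] h = offset[1] ? word[31:16] : word[15:0];

    always @(*) begin
        case (size)
            // Replicate the sign bit only when sext is set; otherwise pad with 0.
            2'd0:    rdata = {{24{sext & b[7]}},  b};
            2'd1:    rdata = {{16{sext & h[15]}}, h};
            default: rdata = word;
        endcase
    end
endmodule
```

With word = 32'h80FF7F01, a byte read at offset 2 returns 32'h000000FF when sext is 0 and 32'hFFFFFFFF when sext is 1; that pair is a handy first test vector.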

Independent practice

  • Read Petzold Ch 14 and Ch 16 together. Ch 14's relay-based flip-flop is the same state-holding element as your Verilog DFF; Ch 16 shows how addresses and decoders build a RAM from those elements.
  • Compare the RV32I-Lite register file to the 6502 register file you built (or used) in CSA-101. Write one paragraph in your Toolchain Diary: what would change about your Py6502v compiler from CSA-102 if the 6502 had had 8 registers instead of 3?
  • Optional: look up the RISC design philosophy (Patterson and Hennessy) and the statement "make the common case fast." How does having 8 registers instead of 3 implement this principle?

Architecture comparison sidebar

6502's 3-register file vs RV32I-Lite's 8-register file vs full RV32I's 32-register file.

The 6502 has three general-purpose registers of 8 bits each. Every operand that cannot fit in A, X, or Y must go to memory. The 6502 compensates with zero-page addressing (effectively fast 8-bit addresses for the first 256 bytes of RAM) and a complex addressing-mode suite that performs address arithmetic in the addressing unit, not the programmer's code. The result: 6502 assembly is tightly bound to memory layout in a way RV32I-Lite assembly is not.

RV32I-Lite has 8 general-purpose 32-bit registers. An arithmetic instruction like ADD x3, x1, x2 touches no memory at all: it reads x1 and x2, computes the sum, and writes x3. A moderate-complexity function can often keep its local variables entirely in registers with no spill, provided no more than seven values are live at once (x0 always reads as zero, so only x1 through x7 can hold data). The 6502 equivalent requires careful zero-page allocation and frequent LDA/STA pairs.

Full RV32I has 32 registers. The RISC-V ABI assigns roles: x0 (zero), x1 (return address), x2 (stack pointer), x3 (global pointer), x4 (thread pointer), x5-x7 and x28-x31 (temporaries), x8-x9 and x18-x27 (callee-saved), x10-x11 (function arguments and return values), x12-x17 (additional arguments). CSA-201 adds the register allocator that decides which of the 29 usable registers to assign to which variables.

The history: the 6502's 3-register design was a cost and complexity decision in 1975. The 8080 and Z80 had more registers but used them largely for address computation, not general arithmetic. The move to 32 registers in MIPS (1985) and RV32I (2010s) grew out of the accumulated experience of 40 years of compiler and OS developers saying "we need more registers."


Reflection prompts

  1. The RV32I-Lite register file has x0 hardwired to zero. Why is this useful? Give two concrete examples from assembly programs where a hardwired-zero register saves instructions.
  2. You have 8 registers in CSA-110. Your compiler (Weeks 10-12) has to decide which variables live in registers and which spill to RAM. With 8 registers available, how many variables can a function keep in-flight without touching RAM? What changes when you have 32 (CSA-201)?
  3. The 6502's 3-register design forced the programmer to think about memory layout explicitly (zero page, page boundaries, index addressing). Is this a deficiency or a form of useful discipline? What do modern CPUs with 32 registers lose by hiding this detail?

What's next

Week 4 brings the instruction set. You have a working ALU (Week 2) and a register file (Week 3). Week 4 is the contract between those components and the software: the 11 RV32I-Lite instructions, their 32-bit encodings, and how to hand-write programs in assembly. By the end of Week 4, you will have hand-assembled a sum-to-N program and loaded it into a hex array that your Week-5 CPU will execute.