Lab 6: Assembler — Two-Pass Tokenizer and VOF Emitter · CSA-110

Week: 6
Points: 20
Time: ~5 hours
Deliverable: toolchain/assembler/ directory + round-trip verification output + diary/week-06.md

What you ship

toolchain/assembler/pass1.py — tokenizer and label collector
toolchain/assembler/encode.py — instruction encoders for all 11 instructions + 8 pseudo-instructions
toolchain/assembler/pass2.py — encoding pass
toolchain/assembler/emit_vof.py — VOF v1 emitter
toolchain/assembler/asm.py — command-line entry point
asm/week-06/sum-to-n-vof.vof — assembled output
lab6_roundtrip.txt — diff showing assembler output matches riscv64-linux-gnu-as output
diary/week-06.md

Lab 6.1: Pass 1 — tokenizer and label collector

Write pass1.py. Requirements:

Strip comments (everything after #)
Handle label definitions: a line ending with : defines a label; record its name and the current address (self.lc) in self.symbols
Handle directives: .text, .data, .globl (ignore for now; reserve for Week 7)
For each non-label, non-directive line: append (address, mnemonic, operands_list) to self.instructions
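Increment self.lc by 4 for each instruction (a pseudo-instruction that expands to two instructions, such as the AUIPC + JALR form of CALL in Lab 6.2, must advance it by 8)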
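The requirements above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the class and attribute names come from this handout, but the DIRECTIVES set, the upper-casing of mnemonics, and the assumption that every source line is exactly one 4-byte instruction are choices made here for the example.

```python
class RV32ILiteAssembler:
    """Pass 1 only: strip comments, collect labels, list instructions."""

    # Directives that are recognized but ignored (assumed set, per the list above)
    DIRECTIVES = {".text", ".data", ".globl"}

    def __init__(self):
        self.symbols = {}        # label name -> address
        self.instructions = []   # (address, mnemonic, operands_list)
        self.lc = 0              # location counter, in bytes

    def pass1(self, source):
        for raw in source.splitlines():
            line = raw.split("#", 1)[0].strip()   # drop everything after '#'
            if not line:
                continue
            if line.endswith(":"):
                # Label definition: bind the name to the current address
                self.symbols[line[:-1].strip()] = self.lc
                continue
            parts = line.split(None, 1)
            mnemonic = parts[0]
            if mnemonic.lower() in self.DIRECTIVES:
                continue                           # ignored until Week 7
            operands = []
            if len(parts) > 1:
                operands = [op.strip() for op in parts[1].split(",")]
            self.instructions.append((self.lc, mnemonic.upper(), operands))
            self.lc += 4   # assumption: one 4-byte instruction per line


demo = """
start:
    ADDI x1, x0, 5      # counter
    BEQ  x1, x0, done   # forward reference
    J    start
done:
end:
    NOP
"""
asm = RV32ILiteAssembler()
asm.pass1(demo)
print("Symbols:", asm.symbols)
print("Instructions:", len(asm.instructions))
```

The demo covers a forward reference (done is used before it is defined) and two adjacent labels (done and end) that resolve to the same address.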

Test with five programs including:

A program with forward references (labels defined after use)
A program with two labels pointing to the same address (adjacent labels)
A program with no labels

PYTHONPATH=toolchain/assembler python3 -c "
from pass1 import RV32ILiteAssembler
asm = RV32ILiteAssembler()
asm.pass1(open('asm/week-06/test-forward-ref.s').read())
print('Symbols:', asm.symbols)
print('Instructions:', len(asm.instructions))
"

Lab 6.2: Encoder

Write encode.py. Implement encoders for all 11 RV32I-Lite instructions and the 8 pseudo-instructions:

Pseudo	Expands to
`LI rd, imm`	`ADDI rd, x0, imm`
`MV rd, rs`	`ADDI rd, rs, 0`
`NOP`	`ADDI x0, x0, 0`
`NOT rd, rs`	`XORI rd, rs, -1`
`NEG rd, rs`	`SUB rd, x0, rs`
`J label`	`JAL x0, label`
`RET`	`JALR x0, x1, 0`
`CALL label`	Two instructions: `AUIPC x1, upper; JALR x1, x1, lower` (or use JAL if within range)
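One way to handle the single-instruction rows of this table is to rewrite each pseudo-instruction into its real form before encoding. The sketch below assumes a hypothetical helper expand_pseudo that takes the mnemonic and operand list produced by pass 1; it deliberately leaves CALL alone, since that expansion depends on the distance to the target.

```python
def expand_pseudo(mnemonic, operands):
    """Return a list of (mnemonic, operands) real instructions.

    Hypothetical helper: covers the seven single-instruction pseudos only.
    Anything else (including CALL) is passed through unchanged.
    """
    m = mnemonic.upper()
    if m == "LI":
        rd, imm = operands
        return [("ADDI", [rd, "x0", imm])]
    if m == "MV":
        rd, rs = operands
        return [("ADDI", [rd, rs, "0"])]
    if m == "NOP":
        return [("ADDI", ["x0", "x0", "0"])]
    if m == "NOT":
        rd, rs = operands
        return [("XORI", [rd, rs, "-1"])]
    if m == "NEG":
        rd, rs = operands
        return [("SUB", [rd, "x0", rs])]
    if m == "J":
        return [("JAL", ["x0", operands[0]])]
    if m == "RET":
        return [("JALR", ["x0", "x1", "0"])]
    return [(m, list(operands))]   # real instruction or unhandled pseudo


print(expand_pseudo("NOT", ["x5", "x6"]))
print(expand_pseudo("ret", []))
```

Because ADDI carries a 12-bit signed immediate, the LI expansion in the table only covers values from -2048 to 2047.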

Unit-test each encoder:

# tests/test_encode.py
import unittest
from encode import encode_add, encode_addi, encode_beq  # ...and the remaining encoders

class TestEncoders(unittest.TestCase):
    def test_add(self):
        # ADD x3, x1, x2: funct7=0, rs2=2, rs1=1, funct3=0, rd=3, opcode=0110011
        self.assertEqual(encode_add(3, 1, 2), 0x002081B3)  # confirm by hand from the R-type fields

    def test_beq_forward(self):
        # BEQ x1, x2 with +8 offset
        instr = encode_beq(1, 2, 8)
        self.assertEqual(instr & 0x7F, 0b1100011)  # B-type opcode
        # TODO: reassemble imm[12|10:5] from bits 31:25 and imm[4:1|11] from bits 11:7,
        # then assert that the result equals 8

if __name__ == "__main__":
    unittest.main()
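For reference while writing these tests, here is a sketch of the two encoders exercised above, plus a decoder for the scattered B-type immediate. The argument orders for encode_add and encode_beq follow the test calls; decode_b_imm is a hypothetical helper introduced only for this example, and the range check is an assumption about how you might report errors.

```python
def encode_add(rd, rs1, rs2):
    # R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode
    return (0b0000000 << 25) | (rs2 << 20) | (rs1 << 15) | (0b000 << 12) | (rd << 7) | 0b0110011


def encode_beq(rs1, rs2, offset):
    # B-type offsets are even and fit in 13 signed bits
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError(f"branch offset out of range: {offset}")
    imm = offset & 0x1FFF                      # two's complement, 13 bits
    return (
        (((imm >> 12) & 0x1) << 31)            # imm[12]
        | (((imm >> 5) & 0x3F) << 25)          # imm[10:5]
        | (rs2 << 20)
        | (rs1 << 15)
        | (0b000 << 12)                        # funct3 for BEQ
        | (((imm >> 1) & 0xF) << 8)            # imm[4:1]
        | (((imm >> 11) & 0x1) << 7)           # imm[11]
        | 0b1100011                            # B-type opcode
    )


def decode_b_imm(instr):
    """Hypothetical test helper: rebuild the signed branch offset."""
    imm = (
        (((instr >> 31) & 0x1) << 12)
        | (((instr >> 7) & 0x1) << 11)
        | (((instr >> 25) & 0x3F) << 5)
        | (((instr >> 8) & 0xF) << 1)
    )
    return imm - 0x2000 if imm & 0x1000 else imm   # sign-extend from bit 12


print(hex(encode_add(3, 1, 2)))
print(decode_b_imm(encode_beq(1, 2, 8)))
print(decode_b_imm(encode_beq(1, 2, -4)))
```

Round-tripping an offset through encode_beq and decode_b_imm is a convenient way to test negative (backward) branches, where sign handling tends to go wrong.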

Lab 6.3: Pass 2 and VOF emitter

Write pass2.py and emit_vof.py.

pass2.py iterates over self.instructions and calls the appropriate encoder for each mnemonic. For branch and jump instructions, it looks up the target in self.symbols and computes the PC-relative offset.
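The offset computation can be sketched as below. In RISC-V the branch or jump offset is measured from the address of the branch instruction itself, not from the following instruction. The function name branch_offset and the error handling are assumptions made for this example.

```python
def branch_offset(symbols, target, pc):
    """Return the PC-relative offset from the instruction at pc to target."""
    if target not in symbols:
        raise KeyError(f"undefined label: {target}")
    return symbols[target] - pc


symbols = {"loop": 8, "done": 24}
print(branch_offset(symbols, "done", 12))   # forward branch
print(branch_offset(symbols, "loop", 20))   # backward branch
```

Here pc is the address stored with the instruction in self.instructions during pass 1, so a forward reference resolves without any special handling once pass 1 has finished.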

emit_vof.py writes the VOF v1 binary format: header (magic bytes VOF1), .text section, .symtab section, .reloc section. Use the VOF v1 spec at handouts/vof-v1-spec.md.

Assemble sum-to-n.s:

python3 toolchain/assembler/asm.py asm/week-06/sum-to-n.s -o asm/week-06/sum-to-n-vof.vof

Extract the .text section from the VOF and compare against Lab 4.4's verified hex:

PYTHONPATH=toolchain/assembler python3 -c "
from emit_vof import read_vof
vof = read_vof('asm/week-06/sum-to-n-vof.vof')
print(' '.join(f'{b:02x}' for b in vof.text))
"

Lab 6.4: Round-trip verification

This is the assembler's correctness certificate. Assemble sum-to-n.s with your assembler and with riscv64-linux-gnu-as. Compare the text sections:

# Your assembler
python3 toolchain/assembler/asm.py asm/week-06/sum-to-n.s -o /tmp/yours.vof
PYTHONPATH=toolchain/assembler python3 -c "from emit_vof import read_vof; v=read_vof('/tmp/yours.vof'); open('/tmp/yours.hex','wb').write(v.text)"

# GNU assembler
riscv64-linux-gnu-as -march=rv32i -mabi=ilp32 asm/week-06/sum-to-n.s -o /tmp/gnu.o
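# objcopy writes the raw .text bytes directly; parsing objdump -s output with awk
# picks up the ASCII column whenever the last row holds fewer than four words
riscv64-linux-gnu-objcopy -O binary -j .text /tmp/gnu.o /tmp/gnu.hex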

diff <(xxd /tmp/yours.hex) <(xxd /tmp/gnu.hex) > lab6_roundtrip.txt
cat lab6_roundtrip.txt
# Expected: empty (no differences)

If there are differences: identify which instruction they're from, explain the discrepancy, and fix your encoder.
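To find which instruction a difference comes from, it can help to compare the two files one 32-bit word at a time rather than reading the xxd diff. The sketch below assumes both files hold raw little-endian instruction words; diff_words is a hypothetical helper, and the sample bytes are made up for illustration.

```python
import struct


def diff_words(a, b):
    """Return (byte_offset, word_a, word_b) for each differing 32-bit word."""
    n = min(len(a), len(b)) // 4               # compare only the common prefix
    words_a = struct.unpack(f"<{n}I", a[:4 * n])
    words_b = struct.unpack(f"<{n}I", b[:4 * n])
    return [(4 * i, x, y)
            for i, (x, y) in enumerate(zip(words_a, words_b)) if x != y]


# Made-up sample data: the second word differs
yours = bytes.fromhex("b3812000" "63842000")
gnu = bytes.fromhex("b3812000" "63862000")
for offset, mine, theirs in diff_words(yours, gnu):
    print(f"offset {offset:#06x}: yours={mine:08x} gnu={theirs:08x}")
```

With real data, read /tmp/yours.hex and /tmp/gnu.hex in binary mode and pass their contents in. If .text starts at address 0, the byte offset matches the address recorded for that instruction in pass 1. A length mismatch between the two files is worth checking separately, since this helper only compares the common prefix.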

Toolchain Diary

Record in diary/week-06.md:

Whether you started with a one-pass or a two-pass strategy and, if one-pass, what forced the switch to two passes
Comparison to your CSA-102 6502 assembler: line count, complexity, what's different
The round-trip result: any discrepancies found and fixed

Grading

Component	Points
`pass1.py`: tokenizer handles forward references, adjacent labels, comments	5
`encode.py`: all 11 instructions + 8 pseudo-instructions with unit tests	8
Round-trip verification: `lab6_roundtrip.txt` shows zero differences	5
Toolchain Diary: CSA-102 comparison	2