Lab 6: Assembler — Two-Pass Tokenizer and VOF Emitter

Week: 6
Points: 20
Time: ~5 hours
Deliverable: toolchain/assembler/ directory + round-trip verification output + diary/week-06.md


What you ship

  • toolchain/assembler/pass1.py — tokenizer and label collector
  • toolchain/assembler/encode.py — instruction encoders for all 11 instructions + 8 pseudo-instructions
  • toolchain/assembler/pass2.py — encoding pass
  • toolchain/assembler/emit_vof.py — VOF v1 emitter
  • toolchain/assembler/asm.py — command-line entry point
  • asm/week-06/sum-to-n-vof.vof — assembled output
  • lab6_roundtrip.txt — diff showing assembler output matches riscv64-linux-gnu-as output
  • diary/week-06.md

Lab 6.1: Pass 1 — tokenizer and label collector

Write pass1.py. Requirements:

  • Strip comments (everything after #)
  • Handle label definitions: a line ending with : records the label and the current address (self.lc) in self.symbols
  • Handle directives: .text, .data, .globl (ignore them for now; they are reserved for Week 7)
  • For each non-label, non-directive line: append (address, mnemonic, operands_list) to self.instructions
  • Increment self.lc by 4 for each instruction
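
The requirements above can be sketched as follows. This is a minimal illustration, not a reference solution: the class and attribute names come from the test command in this lab, while upper-casing mnemonics and splitting operands on commas are assumptions made here for the example.

```python
class RV32ILiteAssembler:
    """Pass-1 sketch: collects labels and tokenized instructions."""

    DIRECTIVES = {".text", ".data", ".globl"}

    def __init__(self):
        self.symbols = {}       # label name -> address
        self.instructions = []  # (address, mnemonic, operands_list)
        self.lc = 0             # location counter, in bytes

    def pass1(self, source):
        for raw in source.splitlines():
            line = raw.split("#", 1)[0].strip()  # strip comments
            if not line:
                continue
            if line.endswith(":"):               # label definition
                self.symbols[line[:-1].strip()] = self.lc
                continue
            parts = line.split(None, 1)
            mnemonic = parts[0]
            if mnemonic in self.DIRECTIVES:      # ignored until Week 7
                continue
            operands = []
            if len(parts) > 1:
                operands = [op.strip() for op in parts[1].split(",")]
            # Assumption: mnemonics are normalized to upper case.
            self.instructions.append((self.lc, mnemonic.upper(), operands))
            self.lc += 4


asm = RV32ILiteAssembler()
asm.pass1("""
    .text
start:
    addi x1, x0, 5      # counter
    beq  x1, x0, done   # forward reference
loop:
again:
    addi x1, x1, -1
    j    loop
done:
    nop
""")
print(asm.symbols)            # {'start': 0, 'loop': 8, 'again': 8, 'done': 16}
print(len(asm.instructions))  # 5
```

The sketch counts every line as 4 bytes. A pseudo-instruction that expands to two instructions (CALL in its AUIPC + JALR form) would need the location counter advanced by 8, which this example does not handle.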

Test with five programs including:

  • A program with forward references (labels defined after use)
  • A program with two labels pointing to the same address (adjacent labels)
  • A program with no labels

# Run from the repository root; PYTHONPATH lets Python find pass1.py
PYTHONPATH=toolchain/assembler python3 -c "
from pass1 import RV32ILiteAssembler
asm = RV32ILiteAssembler()
asm.pass1(open('asm/week-06/test-forward-ref.s').read())
print('Symbols:', asm.symbols)
print('Instructions:', len(asm.instructions))
"

Lab 6.2: Encoder

Write encode.py. Implement encoders for all 11 RV32I-Lite instructions and the 8 pseudo-instructions:

Pseudo        Expands to
LI rd, imm    ADDI rd, x0, imm
MV rd, rs     ADDI rd, rs, 0
NOP           ADDI x0, x0, 0
NOT rd, rs    XORI rd, rs, -1
NEG rd, rs    SUB rd, x0, rs
J label       JAL x0, label
RET           JALR x0, x1, 0
CALL label    Two instructions: AUIPC x1, upper; JALR x1, x1, lower (or use JAL if within range)
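
One way to apply the single-instruction rows of this table is a lookup from pseudo mnemonic to a rewrite function. The sketch below is only an illustration: expand_pseudo, PSEUDO, and split_call_offset are hypothetical names, and operands are assumed to be plain strings. CALL is left out of the lookup because its expansion depends on the distance to the target, but the helper shows the usual way to split a PC-relative offset into the AUIPC and JALR parts. The +0x800 rounding is needed because JALR sign-extends its 12-bit immediate.

```python
# Hypothetical helper names; operands are assumed to be strings.
PSEUDO = {
    "LI":  lambda o: [("ADDI", [o[0], "x0", o[1]])],
    "MV":  lambda o: [("ADDI", [o[0], o[1], "0"])],
    "NOP": lambda o: [("ADDI", ["x0", "x0", "0"])],
    "NOT": lambda o: [("XORI", [o[0], o[1], "-1"])],
    "NEG": lambda o: [("SUB", [o[0], "x0", o[1]])],
    "J":   lambda o: [("JAL", ["x0", o[0]])],
    "RET": lambda o: [("JALR", ["x0", "x1", "0"])],
}


def expand_pseudo(mnemonic, operands):
    """Return a list of (mnemonic, operands) base instructions."""
    m = mnemonic.upper()
    ops = list(operands)
    if m in PSEUDO:
        return PSEUDO[m](ops)
    return [(m, ops)]  # already a base instruction (or CALL, handled elsewhere)


def split_call_offset(offset):
    """Split a PC-relative byte offset into (upper 20 bits, signed lower 12 bits)."""
    upper = (offset + 0x800) >> 12   # round so that lower fits in [-2048, 2047]
    lower = offset - (upper << 12)
    return upper & 0xFFFFF, lower


print(expand_pseudo("li", ["x5", "42"]))  # [('ADDI', ['x5', 'x0', '42'])]
print(split_call_offset(0x1800))          # (2, -2048)
```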

Unit-test each encoder:

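# tests/test_encode.py
import unittest
from encode import encode_add, encode_addi, encode_beq  # ...and the remaining encoders

class TestEncoders(unittest.TestCase):
    def test_add(self):
        # ADD x3, x1, x2: funct7=0, rs2=2, rs1=1, funct3=0, rd=3, opcode=0x33
        self.assertEqual(encode_add(3, 1, 2), 0x002081B3)

    def test_beq_forward(self):
        # BEQ x1, x2 with +8 offset
        instr = encode_beq(1, 2, 8)
        self.assertEqual(instr & 0x7F, 0b1100011)  # B-type opcode
        # Reassemble imm[12|11|10:5|4:1] from instruction bits 31, 7, 30:25, 11:8
        imm = (((instr >> 31) & 0x1) << 12) | (((instr >> 7) & 0x1) << 11) \
            | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1)
        self.assertEqual(imm, 8)

if __name__ == "__main__":
    unittest.main()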
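
The B-type immediate is the encoder most likely to go wrong, because its bits are scattered across the instruction word. The following sketch shows one possible shape for it, assuming the argument order (rs1, rs2, offset) used in the test above. encode_b_type and decode_b_imm are hypothetical helper names, not part of the required interface.

```python
B_OPCODE = 0b1100011


def encode_b_type(funct3, rs1, rs2, offset):
    """Pack a B-type instruction; offset is a signed, even byte offset."""
    if offset % 2 != 0 or not -4096 <= offset <= 4094:
        raise ValueError(f"branch offset out of range or misaligned: {offset}")
    imm = offset & 0x1FFF  # 13-bit two's complement
    return (
        ((imm >> 12) & 0x1) << 31     # imm[12]
        | ((imm >> 5) & 0x3F) << 25   # imm[10:5]
        | (rs2 & 0x1F) << 20
        | (rs1 & 0x1F) << 15
        | (funct3 & 0x7) << 12
        | ((imm >> 1) & 0xF) << 8     # imm[4:1]
        | ((imm >> 11) & 0x1) << 7    # imm[11]
        | B_OPCODE
    )


def encode_beq(rs1, rs2, offset):
    return encode_b_type(0b000, rs1, rs2, offset)


def decode_b_imm(instr):
    """Reassemble and sign-extend the B-type immediate (inverse of the above)."""
    imm = (
        ((instr >> 31) & 0x1) << 12
        | ((instr >> 7) & 0x1) << 11
        | ((instr >> 25) & 0x3F) << 5
        | ((instr >> 8) & 0xF) << 1
    )
    return imm - 0x2000 if imm & 0x1000 else imm


print(hex(encode_beq(1, 2, 8)))              # 0x208463
print(decode_b_imm(encode_beq(0, 0, -4)))    # -4
```

Checking an encode/decode round trip over both positive and negative offsets catches most bit-placement mistakes before the comparison against the GNU assembler.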

Lab 6.3: Pass 2 and VOF emitter

Write pass2.py and emit_vof.py.

pass2.py iterates over self.instructions and calls the appropriate encoder for each mnemonic. For branch and jump instructions, it looks up the target in self.symbols and computes the PC-relative offset.
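
The offset computation can be sketched as below. In RISC-V, branch and jump offsets are relative to the address of the branch or jump instruction itself, so the offset is the label address minus the instruction address. resolve_offset is a hypothetical helper name, and the symbol table here is made-up sample data.

```python
def resolve_offset(symbols, address, target):
    """Return the PC-relative byte offset from the instruction at `address` to `target`."""
    if target not in symbols:
        raise KeyError(f"undefined label: {target}")
    return symbols[target] - address


# Sample symbol table, as pass 1 might produce it.
symbols = {"loop": 8, "done": 16}

print(resolve_offset(symbols, 4, "done"))   # forward branch: 12
print(resolve_offset(symbols, 12, "loop"))  # backward jump: -4
```

Because pass 1 has already recorded every label, forward references such as "done" resolve the same way as backward ones.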

emit_vof.py writes the VOF v1 binary format: header (magic bytes VOF1), .text section, .symtab section, .reloc section. Use the VOF v1 spec at handouts/vof-v1-spec.md.

Assemble sum-to-n.s:

python3 toolchain/assembler/asm.py asm/week-06/sum-to-n.s -o asm/week-06/sum-to-n-vof.vof

Extract the .text section from the VOF and compare against Lab 4.4's verified hex:

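# Run from the repository root; PYTHONPATH lets Python find emit_vof.py
PYTHONPATH=toolchain/assembler python3 -c "
from emit_vof import read_vof
vof = read_vof('asm/week-06/sum-to-n-vof.vof')
print(' '.join(f'{b:02x}' for b in vof.text))
"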

Lab 6.4: Round-trip verification

This is the assembler's correctness certificate. Assemble sum-to-n.s with your assembler and with riscv64-linux-gnu-as. Compare the text sections:

# Your assembler
python3 toolchain/assembler/asm.py asm/week-06/sum-to-n.s -o /tmp/yours.vof
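# Writes the raw .text bytes (binary, despite the .hex name); run from the repository root
PYTHONPATH=toolchain/assembler python3 -c "from emit_vof import read_vof; v=read_vof('/tmp/yours.vof'); open('/tmp/yours.hex','wb').write(v.text)"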

# GNU assembler
riscv64-linux-gnu-as -march=rv32i -mabi=ilp32 asm/week-06/sum-to-n.s -o /tmp/gnu.o
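# Extract the raw .text bytes. objcopy avoids parsing objdump -s output, whose
# ASCII column leaks into the awk fields when the last row has fewer than four words.
riscv64-linux-gnu-objcopy -O binary -j .text /tmp/gnu.o /tmp/gnu.hex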

diff <(xxd /tmp/yours.hex) <(xxd /tmp/gnu.hex) > lab6_roundtrip.txt
cat lab6_roundtrip.txt
# Expected: empty (no differences)

If there are differences: identify which instruction they're from, explain the discrepancy, and fix your encoder.
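
To find which instruction a difference belongs to, it can help to compare the two files one 32-bit word at a time instead of reading the xxd diff. The sketch below is one way to do that; first_mismatch is a hypothetical helper name, and the two byte strings stand in for the contents of /tmp/yours.hex and /tmp/gnu.hex. Instruction words are read as little-endian, which is how RISC-V stores them.

```python
import struct


def first_mismatch(yours, gnu):
    """Return (byte_offset, your_word, gnu_word) for the first differing word, or None."""
    n = min(len(yours), len(gnu)) // 4
    for i in range(n):
        (a,) = struct.unpack_from("<I", yours, i * 4)
        (b,) = struct.unpack_from("<I", gnu, i * 4)
        if a != b:
            return (i * 4, a, b)
    if len(yours) != len(gnu):
        return (n * 4, None, None)  # one file has extra bytes
    return None


# Stand-in data: the second instruction differs.
yours = struct.pack("<2I", 0x002081B3, 0x00208463)
gnu = struct.pack("<2I", 0x002081B3, 0x00208063)

offset, a, b = first_mismatch(yours, gnu)
print(f"offset {offset:#x}: yours={a:#010x} gnu={b:#010x}")
# offset 0x4: yours=0x00208463 gnu=0x00208063
```

The byte offset divided by 4 is the index into self.instructions, which points back to the source line and the encoder responsible.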


Toolchain Diary

Record in diary/week-06.md:

  • Whether you started with a one-pass or a two-pass strategy and, if one-pass, what forced the move to two passes
  • Comparison to your CSA-102 6502 assembler: line count, complexity, what's different
  • The round-trip result: any discrepancies found and fixed

Grading

Component                                                                   Points
pass1.py: tokenizer handles forward references, adjacent labels, comments   5
encode.py: all 11 instructions + 8 pseudo-instructions with unit tests      8
Round-trip verification: lab6_roundtrip.txt shows zero differences          5
Toolchain Diary: CSA-102 comparison                                         2