Classroom Public page

Week 10: Compiler I (Syntax Analysis)

500 words

Build the compiler frontend. A tokenizer that reads source characters and emits tokens; a recursive-descent parser that builds a parse tree from tokens. By end of week your parser produces a structured tree for any valid Jack-like source program.


Reading

  • Chapter prose (primary). draft-chapters/ch9-compiler-syntax-analysis-prose.md
  • Petzold weave anchors. Ch 24 Languages High and Low p. 353 ("you must also write a compiler... read through a source-code file character by character"); Ch 24 p. 354 (returning visit; ALGOL as first formally-specified-grammar language); Ch 25 The Graphical Revolution p. 373 (returning visit; objects-containing-objects-as-trees parallels parse-trees-as-trees-of-trees)

Lecture

lectures/ch9-compiler-syntax-analysis-lecture.md. 3 hours. Key arc:

  • The tokenizer. A finite-state machine that consumes characters and emits typed tokens (identifier, number, symbol, keyword)
  • Grammar rules. A simple notation for "what is a valid program"
  • Recursive descent. One parser function per grammar rule; each function calls others to handle sub-rules
  • Parse trees. The output of the parser; the structured representation that the code generator (next week) walks
  • Why recursive descent. Easier to write by hand than LR parsers; matches the recursive structure of expressions naturally

Lab exercises

Four labs in worksheets/ch9/.

Plan for ~5 hours of lab.

Independent practice

  • Re-read Petzold Ch 24 pp. 353-354 for the compiler thesis. The "character by character" pattern is exactly what your tokenizer does
  • Update your Toolchain Diary. Week 10 introduces: regex for token recognition (or hand-rolled FSM), Python's dataclasses for parse-tree nodes, JSON or s-expression serialization for debugging
  • Stretch: read the dragon book (Aho, Sethi, Ullman) Ch 4 on parsing if you have access. Or read Crockford's Top Down Operator Precedence paper for an alternative parser style

Architecture comparison sidebar

Recursive-descent parsers are hand-written and ubiquitous in industry: gcc's C parser, Clang's, Rust's. LR parsers (yacc/bison generators) were dominant in the 1980s-1990s but have lost ground to hand-written parsers because of error-message quality and code clarity. The CSA-101 choice (hand-write recursive descent) reflects the modern industry default.

Reflection prompts

  1. The tokenizer and parser feel mechanical: there is one right answer for each input. Is parsing genuinely mechanical, or are there cases where a parser has to guess?
  2. Recursive descent matches the grammar recursively. What other software architectures match their problem domain recursively?
  3. The parse tree is the contract between the frontend (this week) and the backend (next week). What other contracts have you noticed in the toolchain so far?

What's next

Week 11 is the compiler backend. Walk the parse tree; emit VM bytecode. By end of week your compiler can compile a Jack-like source file all the way through to running code on your CPU. Eight layers of toolchain (source → tokens → parse tree → VM bytecode → assembly → object file → linked binary → silicon) all of which you wrote.