Classroom Glossary Public page

Week 10: Compiler I — Syntax Analysis

770 words

The compiler transforms source text into VM bytecode. It does this in two phases: the front end reads source text and produces a structured representation (a parse tree); the back end walks the parse tree and emits VM commands. Week 10 is the front end — tokenization and recursive-descent parsing. The front end does not know or care about RV32I-Lite or stack machines. Its job is to understand the grammar of VirtusLang and produce a tree that the back end can walk.


Reading

  • Petzold Ch 24 ("Languages High and Low," pp. 333-360): The moment where Petzold describes how a high-level program description becomes a sequence of machine instructions. This is the chapter the whole course has been building toward. The tokenizer and parser you write this week are the first two stages of the pipeline Petzold describes. (~28 pages)

Lecture

3 hours. Key arc:

The VirtusLang grammar (fragment for Weeks 10-12).

class:          'class' className '{' classVarDec* subroutineDec* '}'
classVarDec:    ('static' | 'field') type varName (',' varName)* ';'
subroutineDec:  ('constructor' | 'function' | 'method')
                ('void' | type) subroutineName '(' parameterList ')'
                subroutineBody
parameterList:  (type varName (',' type varName)*)?
subroutineBody: '{' varDec* statements '}'
varDec:         'var' type varName (',' varName)* ';'
type:           'int' | 'char' | 'boolean' | className

statements:     statement*
statement:      letStatement | ifStatement | whileStatement |
                doStatement | returnStatement

letStatement:   'let' varName ('[' expression ']')? '=' expression ';'
ifStatement:    'if' '(' expression ')' '{' statements '}'
                ('else' '{' statements '}')?
whileStatement: 'while' '(' expression ')' '{' statements '}'
doStatement:    'do' subroutineCall ';'
returnStatement:'return' expression? ';'

expression:     term (op term)*
op:             '+' | '-' | '*' | '/' | '&' | '|' | '<' | '>' | '='
term:           integerConstant | stringConstant | keywordConstant |
                varName | varName '[' expression ']' | subroutineCall |
                '(' expression ')' | unaryOp term
unaryOp:        '-' | '~'
keywordConstant:'true' | 'false' | 'null' | 'this'
subroutineCall: subroutineName '(' expressionList ')' |
                (className | varName) '.' subroutineName '(' expressionList ')'
expressionList: (expression (',' expression)*)?

The tokenizer. The tokenizer reads source text and emits a flat stream of tokens. Each token has a type and a value:

import re
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    KEYWORD     = auto()
    SYMBOL      = auto()
    INTEGER     = auto()
    STRING      = auto()
    IDENTIFIER  = auto()

KEYWORDS = {
    'class', 'constructor', 'function', 'method', 'field', 'static',
    'var', 'int', 'char', 'boolean', 'void', 'true', 'false', 'null',
    'this', 'let', 'do', 'if', 'else', 'while', 'return',
}

SYMBOLS = set('{}()[].,;+-*/&|<>=~')

@dataclass
class Token:
    type: TokenType
    value: str

class Tokenizer:
    def __init__(self, source: str):
        # Strip // line comments and /* block comments */
        source = re.sub(r'//[^\n]*', '', source)
        source = re.sub(r'/\*.*?\*/', '', source, flags=re.DOTALL)
        self._tokens: list[Token] = []
        self._pos = 0
        self._tokenize(source)
    
    def _tokenize(self, source: str):
        i = 0
        while i < len(source):
            if source[i].isspace():
                i += 1
            elif source[i] == '"':
                end = source.index('"', i + 1)
                self._tokens.append(Token(TokenType.STRING, source[i+1:end]))
                i = end + 1
            elif source[i] in SYMBOLS:
                self._tokens.append(Token(TokenType.SYMBOL, source[i]))
                i += 1
            elif source[i].isdigit():
                j = i
                while j < len(source) and source[j].isdigit():
                    j += 1
                self._tokens.append(Token(TokenType.INTEGER, source[i:j]))
                i = j
            elif source[i].isalpha() or source[i] == '_':
                j = i
                while j < len(source) and (source[j].isalnum() or source[j] == '_'):
                    j += 1
                word = source[i:j]
                ttype = TokenType.KEYWORD if word in KEYWORDS else TokenType.IDENTIFIER
                self._tokens.append(Token(ttype, word))
                i = j
            else:
                raise SyntaxError(f"Unexpected character: {source[i]!r}")
    
    def has_more(self) -> bool:
        return self._pos < len(self._tokens)
    
    def peek(self) -> Token:
        return self._tokens[self._pos]
    
    def advance(self) -> Token:
        t = self._tokens[self._pos]
        self._pos += 1
        return t
    
    def expect(self, ttype: TokenType, value: str = None) -> Token:
        t = self.advance()
        if t.type != ttype:
            raise SyntaxError(f"Expected {ttype}, got {t.type} ({t.value!r})")
        if value is not None and t.value != value:
            raise SyntaxError(f"Expected {value!r}, got {t.value!r}")
        return t

The recursive-descent parser. Each grammar rule becomes one method:

from dataclasses import dataclass, field as dc_field
from typing import Optional

@dataclass
class ParseNode:
    tag: str
    children: list = dc_field(default_factory=list)
    value: Optional[str] = None

class Parser:
    def __init__(self, tokenizer: Tokenizer):
        self.tok = tokenizer
    
    def parse_class(self) -> ParseNode:
        node = ParseNode('class')
        self.tok.expect(TokenType.KEYWORD, 'class')
        node.children.append(ParseNode('className', value=self.tok.expect(TokenType.IDENTIFIER).value))
        self.tok.expect(TokenType.SYMBOL, '{')
        while self.tok.has_more() and self.tok.peek().value in ('static', 'field'):
            node.children.append(self.parse_class_var_dec())
        while self.tok.has_more() and self.tok.peek().value in ('constructor', 'function', 'method'):
            node.children.append(self.parse_subroutine_dec())
        self.tok.expect(TokenType.SYMBOL, '}')
        return node
    
    def parse_class_var_dec(self) -> ParseNode:
        node = ParseNode('classVarDec')
        node.children.append(ParseNode('keyword', value=self.tok.advance().value))  # static|field
        node.children.append(self.parse_type())
        node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        while self.tok.peek().value == ',':
            self.tok.advance()
            node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        self.tok.expect(TokenType.SYMBOL, ';')
        return node
    
    def parse_type(self) -> ParseNode:
        t = self.tok.advance()
        if t.type in (TokenType.KEYWORD, TokenType.IDENTIFIER):
            return ParseNode('type', value=t.value)
        raise SyntaxError(f"Expected type, got {t!r}")
    
    def parse_subroutine_dec(self) -> ParseNode:
        node = ParseNode('subroutineDec')
        node.children.append(ParseNode('keyword', value=self.tok.advance().value))  # constructor|function|method
        node.children.append(self.parse_type())  # return type
        node.children.append(ParseNode('subroutineName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        self.tok.expect(TokenType.SYMBOL, '(')
        node.children.append(self.parse_parameter_list())
        self.tok.expect(TokenType.SYMBOL, ')')
        node.children.append(self.parse_subroutine_body())
        return node
    
    def parse_parameter_list(self) -> ParseNode:
        node = ParseNode('parameterList')
        if self.tok.peek().value != ')':
            node.children.append(self.parse_type())
            node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
            while self.tok.peek().value == ',':
                self.tok.advance()
                node.children.append(self.parse_type())
                node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        return node
    
    def parse_subroutine_body(self) -> ParseNode:
        node = ParseNode('subroutineBody')
        self.tok.expect(TokenType.SYMBOL, '{')
        while self.tok.peek().value == 'var':
            node.children.append(self.parse_var_dec())
        node.children.append(self.parse_statements())
        self.tok.expect(TokenType.SYMBOL, '}')
        return node
    
    def parse_var_dec(self) -> ParseNode:
        node = ParseNode('varDec')
        self.tok.expect(TokenType.KEYWORD, 'var')
        node.children.append(self.parse_type())
        node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        while self.tok.peek().value == ',':
            self.tok.advance()
            node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        self.tok.expect(TokenType.SYMBOL, ';')
        return node
    
    def parse_statements(self) -> ParseNode:
        node = ParseNode('statements')
        STMT_KEYWORDS = {'let', 'if', 'while', 'do', 'return'}
        while self.tok.has_more() and self.tok.peek().value in STMT_KEYWORDS:
            kw = self.tok.peek().value
            if kw == 'let':     node.children.append(self.parse_let())
            elif kw == 'if':    node.children.append(self.parse_if())
            elif kw == 'while': node.children.append(self.parse_while())
            elif kw == 'do':    node.children.append(self.parse_do())
            elif kw == 'return':node.children.append(self.parse_return())
        return node
    
    def parse_let(self) -> ParseNode:
        node = ParseNode('letStatement')
        self.tok.expect(TokenType.KEYWORD, 'let')
        node.children.append(ParseNode('varName', value=self.tok.expect(TokenType.IDENTIFIER).value))
        if self.tok.peek().value == '[':
            self.tok.advance()
            node.children.append(self.parse_expression())
            self.tok.expect(TokenType.SYMBOL, ']')
        self.tok.expect(TokenType.SYMBOL, '=')
        node.children.append(self.parse_expression())
        self.tok.expect(TokenType.SYMBOL, ';')
        return node
    
    def parse_if(self) -> ParseNode:
        node = ParseNode('ifStatement')
        self.tok.expect(TokenType.KEYWORD, 'if')
        self.tok.expect(TokenType.SYMBOL, '(')
        node.children.append(self.parse_expression())
        self.tok.expect(TokenType.SYMBOL, ')')
        self.tok.expect(TokenType.SYMBOL, '{')
        node.children.append(self.parse_statements())
        self.tok.expect(TokenType.SYMBOL, '}')
        if self.tok.has_more() and self.tok.peek().value == 'else':
            self.tok.advance()
            self.tok.expect(TokenType.SYMBOL, '{')
            node.children.append(self.parse_statements())
            self.tok.expect(TokenType.SYMBOL, '}')
        return node
    
    def parse_while(self) -> ParseNode:
        node = ParseNode('whileStatement')
        self.tok.expect(TokenType.KEYWORD, 'while')
        self.tok.expect(TokenType.SYMBOL, '(')
        node.children.append(self.parse_expression())
        self.tok.expect(TokenType.SYMBOL, ')')
        self.tok.expect(TokenType.SYMBOL, '{')
        node.children.append(self.parse_statements())
        self.tok.expect(TokenType.SYMBOL, '}')
        return node
    
    def parse_do(self) -> ParseNode:
        node = ParseNode('doStatement')
        self.tok.expect(TokenType.KEYWORD, 'do')
        node.children.append(self.parse_subroutine_call())
        self.tok.expect(TokenType.SYMBOL, ';')
        return node
    
    def parse_return(self) -> ParseNode:
        node = ParseNode('returnStatement')
        self.tok.expect(TokenType.KEYWORD, 'return')
        if self.tok.peek().value != ';':
            node.children.append(self.parse_expression())
        self.tok.expect(TokenType.SYMBOL, ';')
        return node
    
    def parse_expression(self) -> ParseNode:
        node = ParseNode('expression')
        node.children.append(self.parse_term())
        OPS = set('+-*/&|<>=')
        while self.tok.has_more() and self.tok.peek().value in OPS:
            node.children.append(ParseNode('op', value=self.tok.advance().value))
            node.children.append(self.parse_term())
        return node
    
    def parse_term(self) -> ParseNode:
        t = self.tok.peek()
        
        if t.type == TokenType.INTEGER:
            self.tok.advance()
            return ParseNode('integerConstant', value=t.value)
        if t.type == TokenType.STRING:
            self.tok.advance()
            return ParseNode('stringConstant', value=t.value)
        if t.type == TokenType.KEYWORD and t.value in ('true', 'false', 'null', 'this'):
            self.tok.advance()
            return ParseNode('keywordConstant', value=t.value)
        if t.value == '(':
            self.tok.advance()
            node = ParseNode('parenExpr')
            node.children.append(self.parse_expression())
            self.tok.expect(TokenType.SYMBOL, ')')
            return node
        if t.value in ('-', '~'):
            self.tok.advance()
            node = ParseNode('unaryOp', value=t.value)
            node.children.append(self.parse_term())
            return node
        if t.type == TokenType.IDENTIFIER:
            name = self.tok.advance().value
            if self.tok.peek().value == '[':
                self.tok.advance()
                node = ParseNode('arrayAccess')
                node.children.append(ParseNode('varName', value=name))
                node.children.append(self.parse_expression())
                self.tok.expect(TokenType.SYMBOL, ']')
                return node
            if self.tok.peek().value in ('(', '.'):
                return self.parse_subroutine_call(leading_name=name)
            return ParseNode('varName', value=name)
        
        raise SyntaxError(f"Unexpected token in term: {t!r}")
    
    def parse_subroutine_call(self, leading_name=None) -> ParseNode:
        node = ParseNode('subroutineCall')
        if leading_name is None:
            leading_name = self.tok.expect(TokenType.IDENTIFIER).value
        
        if self.tok.peek().value == '.':
            self.tok.advance()
            method = self.tok.expect(TokenType.IDENTIFIER).value
            node.children.append(ParseNode('receiver', value=leading_name))
            node.children.append(ParseNode('method', value=method))
        else:
            node.children.append(ParseNode('function', value=leading_name))
        
        self.tok.expect(TokenType.SYMBOL, '(')
        node.children.append(self.parse_expression_list())
        self.tok.expect(TokenType.SYMBOL, ')')
        return node
    
    def parse_expression_list(self) -> ParseNode:
        node = ParseNode('expressionList')
        if self.tok.peek().value != ')':
            node.children.append(self.parse_expression())
            while self.tok.peek().value == ',':
                self.tok.advance()
                node.children.append(self.parse_expression())
        return node

Lab exercises

Three labs in labs/lab-10.md. Plan for ~4 hours.

  • Lab 10.1. Finalize toolchain/compiler/tokenizer.py. Test on SquareMain.vl and compare the token stream against the reference in tests/compiler/SquareMain.tokens.
  • Lab 10.2. Finalize toolchain/compiler/parser.py. Test on SquareMain.vl and compare the XML parse tree against tests/compiler/SquareMain.xml. The round-trip test: parse → serialize to XML → diff against reference.
  • Lab 10.3 (forward promise: Ghidra). Parse ExpressionLessSquare.vl — a VirtusLang class with no complex expressions, making the generated VM code easy to read. You will return to this file in Lab 11 after adding code generation, and compare the Ghidra disassembly of your generated code against what you expect from the parse tree now.

Independent practice

  • Read the nand2tetris project website's "Compiler I" section for comparison context. Note where the VirtusLang grammar differs from Jack (the nand2tetris language); these differences are design choices, not errors.
  • Review your Week 6 assembler's tokenizer. Your compiler's tokenizer is architecturally identical — same comment stripping, same token-type classification. The difference is in what the two tokenizers produce: the assembler tokenizer produces mnemonic/register/immediate tokens; the compiler tokenizer produces keyword/symbol/integer/string/identifier tokens.

Architecture comparison sidebar

CSA-102 Py6502v tokenizer vs CSA-110 VirtusLang tokenizer.

The tokenizer in CSA-102's Py6502v compiler had the same structure: strip comments, classify characters, emit tokens. The grammar was simpler — Py6502v had procedures, not classes; it had no field declarations or method dispatch. But the tokenizer's mechanism — regex-free character-by-character scan with a state machine — was identical to what you wrote this week.

The target grammar is what changed. Py6502v's grammar fit on half a page; VirtusLang's grammar above is three times larger. The recursive-descent parser scales with grammar complexity in a natural way: each grammar rule becomes one method, each choice becomes an if/elif, each sequence becomes a series of expect and recursive calls. The parser does not get harder to understand as the grammar grows — it gets longer but not more complex in structure.

Compare this to a table-driven parser (LR(1) or LALR): the grammar is specified separately from the parsing code; the parser generator produces a parsing table; the parser is a loop that consults the table. Table-driven parsers handle a larger class of grammars (including left-recursive grammars that defeat recursive descent), but the generated code is unreadable without the original grammar specification. Production compilers (GCC historically, Clang currently) use recursive descent for the front end and table-driven parsing for grammar validation — the same hybrid you will use if you extend VirtusLang in a later project.


Reflection prompts

  1. The parse_expression method handles operator precedence by treating all operators at the same level (left-to-right). What would it take to add precedence levels (so * binds tighter than +)? Describe the structural change to parse_expression — no code required.
  2. The tokenizer strips both // and /* */ comments before tokenizing. What would break if it stripped comments after tokenizing? (Think about string literals that contain the comment delimiters.)
  3. Your parser raises SyntaxError on an unexpected token. What information would you add to the error message to help a VirtusLang programmer find their mistake? (Line numbers, source context, expected token list.) This is the UX half of language design; the answer is in every error message Python produces.

What's next

Week 11 is the compiler back end: symbol table construction and VM code generation. The parse tree you produced this week is the input; the VM bytecode from Weeks 8-9 is the output. The symbol table is the bridge — it tracks which variables exist, what their types are, and how to access them in the VM's memory segment model.