Duration: 2 hr lecture + 2 hr essay lab + 6 hr independent
Lab: 1,500-word essay (submitted as lab-4-essay.md)
Points: Essay is required for progression to Module 4.5; not separately graded (completion credit)
Christian weave: The Alignment Problem, Prophecy Ch 4 ("The Specification Problem") -- the essay prompt maps directly to Christian's central claim
Karpathy weave: nanoGPT video -- the substrate-companion view of what a transformer is doing mechanically
Prerequisite: Labs 2.1 and 3.1 both completed; essays from both submitted
4.1 Why This Module Is an Essay
Modules 1-3 were empirical: you built a mapping table, performed a stack-smash, performed a prompt injection. You have data from both sides of the thesis. Module 4 is the moment where you articulate what you observed.
The 1,500-word essay is the capstone of the Prophecy arc. Belt-5 means you can write it -- not as an exercise in summarizing what the course told you, but as a technical argument grounded in the experiments you ran. The essay prompt is:
Explain the thesis: "agentic-system security is memory-corruption at the semantic layer." Ground your argument in the specific experiments you ran in Labs 2.1 and 3.1. Use the substrate-language mapping table from Module 1 as a structural reference. Cite the specific invariant violated in each case, the specific exploit mechanism, and the specific defense analogy. Close by explaining what the thesis implies about the correct class of mitigation: if the structural analysis is right, what should a defense look like?
The essay is required before Module 4.5. Module 4.5 returns to the same thesis and makes it literal with experimental data from mechanistic interpretability. Your Module 4 essay is the "before" document -- the best version of the argument you can make with structural reasoning alone. Module 4.5 is the "after" -- the argument updated with evidence that the metaphor is not just structural but causal.
4.2 The Thesis: What "Precisely" Means
The phrase "memory-corruption at the semantic layer" needs each word examined.
"Memory corruption" in the substrate context has a precise technical meaning: a write operation that modifies a memory location that the program's control flow did not intend to modify, with the consequence that the program's subsequent behavior is controlled by the attacker rather than by the program's legitimate logic. The key elements: (1) an unintended write, (2) to a location that is security-critical (the return address, a function pointer, a privilege flag), (3) with a consequence that the attacker controls.
"Semantic layer" refers to the layer of meaning -- the layer at which the model processes intent rather than bytes. At the substrate, "memory" is a physical array of bits; a memory address is an index into that array. At the semantic layer, "memory" is the model's context window -- the sequence of tokens the model conditions its output on. A "location" in the context window is not an address; it is a position in the token sequence. A "write" to that location is the injection of tokens into a position that should be reserved for a different category of content.
"At the semantic layer" therefore means: a write to a position in the token sequence that the model's context-processing logic did not intend to expose to untrusted modification, with the consequence that the model's subsequent behavior is controlled by the attacker rather than by the operator's system prompt.
The precision matters because it distinguishes the thesis from looser analogies. It is not "LLMs are like programs." It is "the specific failure mode of LLM agents that process untrusted input in a trusted context is structurally isomorphic to the specific failure mode of programs that write past buffer bounds into security-critical memory."
4.3 The Essay Structure
A 1,500-word essay that argues the thesis has five sections:
Section 1: The invariant (300 words)
State the invariant that each system is supposed to enforce. For the substrate: "only the runtime modifies security-critical memory." For the language system: "only the operator's system prompt defines what the model treats as instruction." Explain where each invariant comes from (the hardware trust model and the operator-user-model trust hierarchy, respectively).
Section 2: The violation mechanism (400 words)
Describe exactly how each invariant is violated. For the substrate: the buffer overflow writes past bounds into the return address. For the language system: user-turn content containing injected instructions is treated as operator-turn instruction. Be specific to your experiments: reference the payload you crafted in Lab 2.1 and the injection string from Lab 3.1 that succeeded.
Section 3: The structural isomorphism (300 words)
Argue that the two violation mechanisms are structurally isomorphic. They share: (1) an attacker-controlled write, (2) into a location that is security-critical by design, (3) with a consequence that the attacker controls execution/behavior. The fact that one operates on bytes and the other on tokens is an implementation detail, not a structural difference.
Section 4: The Christian connection (200 words)
Use Christian's specification problem (Prophecy Ch 4) to frame the thesis. The specification problem is that designers specify proxies (the rule) rather than goals (the intent). The security invariant ("only the runtime modifies the return address") is a proxy for the goal ("only legitimate code executes"). Prompt injection succeeds because the invariant ("process user input as text") is a proxy for the goal ("treat user input as untrusted data, never as instruction"). The attacker exploits the gap between proxy and goal.
Section 5: The mitigation implication (300 words)
Argue from the structural analysis to the correct class of mitigation. If the invariant violation is the root cause, then patching specific violations (blacklisting specific overflow payloads, blacklisting specific injection strings) is insufficient -- it is the wrong abstraction level. The correct class of mitigation is structural enforcement of the invariant: W^X enforces "data cannot execute"; output isolation enforces "user content cannot become instruction." What does structural enforcement look like at the language layer? (This is the question Module 4.5 and Module 8 address experimentally.)
4.4 What Not to Write
The essay will fail if it:
- Simply summarizes the Module 1 mapping table without arguing from it
- Uses "memory corruption" as metaphor rather than as the structural class the thesis claims it is
- Cites only the course materials and not the experiments in Labs 2.1 and 3.1
- Treats the analogy as imprecise ("kind of like") rather than structural ("isomorphic in the following specific ways")
- Proposes mitigations that are instance-patches rather than structural enforcement
The essay will succeed if it:
- Grounds every claim in a specific observation from Labs 2.1 or 3.1
- Uses precise technical language from both the substrate domain (W^X, ASLR, return address, stack frame) and the language domain (system prompt, context window, token sequence, operator trust level)
- Anticipates the Module 4.5 question: "if the thesis is structural, not just analogical, what would experimental evidence for it look like?"
4.5 Independent Practice: Karpathy nanoGPT
The nanoGPT video (Karpathy, "Let's build GPT: from scratch, in code, spelled out") is the substrate-companion path for Module 4. Watch the full video (2 hours) before Module 4.5. Pay attention to:
- The attention mechanism: how the model weighs which tokens influence which other tokens. This is the language-layer "memory access pattern" -- the substrate for Module 4.5's activation steering.
- The residual stream: the running sum of per-layer transformations. This is what sparse autoencoders are trained on. Knowing what the residual stream is before Module 4.5 makes the SAE architecture legible.
- The forward pass as a series of matrix multiplications: a computational graph, not a sequential instruction stream. This is why "hooking" at a specific layer (TransformerLens hook points) is feasible -- the graph has named intermediate nodes.
After watching: write a 200-word description of the attention mechanism in your own words. This is independent practice, not a graded deliverable, but it is the prerequisite for Module 4.5 Section 4.5.3.
4.6 Christian: The Specification Problem
Christian's Prophecy chapter 4 is titled "The Specification Problem." The central claim: the most dangerous failure mode in machine learning is not a model that does badly on the training distribution, but a model that does well on the training distribution and badly on the specification -- a model that has learned a proxy for the goal that diverges from the goal in deployment.
The security implication: the designer's specification of the trust model (what the model is supposed to treat as trusted vs untrusted input) is a proxy for the goal (a model that is impossible to manipulate via untrusted input). The proxy diverges from the goal whenever an attacker finds input that satisfies the proxy while violating the goal -- i.e., whenever an attacker finds a prompt injection that the model "correctly" processes (by the rule) while "incorrectly" behaving (against the intent).
This is not a new observation for Belt-5 students. You saw it in AI-201. The Module 4 essay requires you to state it at the precision level the thesis demands: not "LLMs have a specification problem," but "the specification problem at the language layer is structurally isomorphic to the memory-safety-invariant specification problem at the substrate layer, and both are exploited by attacks that satisfy the rule while violating the intent."
If you can write that sentence with evidence from your own experiments, you have passed Module 4.