Duration: 3 hr lecture + 12 hr capstone lab + 6 hr independent
Points: 100 (Capstone; see CAPSTONE.md)
MITRE ATLAS cross-reference: Full engagement scenario (all tactics in sequence)
Required reading:
- Review CAPSTONE.md in full before Week 12 lab session 1
- Review your Module 9 OWASP audit report with all updates through Module 11
- Review your Module 4.5 SAE experiment results (Track B students)
- Review Anthropic RSP v3.0 §4 (Track C students)
Christian weave: The Alignment Problem, full arc synthesis -- Prophecy through Agency through Normativity; Ch 13's closing reflection on what "alignment" means once you have seen the full picture of how specification, agency, and values interact under adversarial pressure
Prerequisite: Modules 1-11 complete; all prior lab results in hand
12.1 The Course Arc: What Was Built
AI-301 has three arcs, and Module 12 is where they converge:
| Arc | Modules | What was demonstrated |
|---|---|---|
| Prophecy | 1-4 | The substrate-language mapping. Stack-smash = prompt injection; return address = next-token prediction; ROP = indirect context injection; activation steering = memory corruption at the semantic layer. The metaphor was made literal by Module 4.5's SAE feature clamping. |
| Agency | 5-7.5 | Agents with tools can cause real harm. Tool-chain hijack via type confusion; information leakage via latency side-channels; alignment degradation via fine-tuning supply-chain attack. The Christian arc's Agency section (Ch 10: "banality of evil") was the structural frame. |
| Normativity | 8-11 | Multi-agent topologies multiply attack surfaces. RAG systems trust external knowledge in ways that are exploitable. Belt-5 deployment posture is a commitment to values, not a configuration checklist. |
The convergence point. By Module 11, the student has attacked a system at every layer: the prompt (M3, M5), the tool interface (M6), the inference timing (M7), the training pipeline (M7.5), the agent topology (M8), the knowledge retrieval (M10), and the deployment posture (M11). The capstone asks: now design a coherent investigation that demonstrates you understand how these layers interact.
12.2 The Three-Track Capstone Structure
Full specification in CAPSTONE.md. Brief track descriptions:
Track A: Exploit chain. Build a working multi-stage attack against the DVLA: at minimum, a prompt injection that triggers a tool-chain exploit, which in turn exfiltrates a controlled secret. The chain must cross at least two OWASP categories. The Tier 1 gate: python3 chain_exploit.py --demo-mode runs the chain and produces observable output. The Tier 2 rubric evaluates how cleanly the chain is explained, whether the substrate-language mapping is made explicit, and whether the engineering is sound enough that the result can be reproduced.
Track B: SAE defense analysis. Using the SAE experiment results from Module 4.5, design and implement a monitoring system that detects at least one of the Module 5-10 attack classes via feature activation anomaly. The Tier 1 gate: jupyter nbconvert --to notebook --execute sae_analysis.ipynb runs cleanly and produces detection results. The Tier 2 rubric evaluates the statistical validity of the detection (false-positive rate, sensitivity), the completeness of the substrate-language mapping, and the quality of the technical write-up.
Track C: Capability-eval design. Design a structured capability evaluation protocol for one of the attack classes from Modules 8-10, using the RSP v3.0 pre-deployment evaluation framework as the structural model. The Tier 1 gate: python3 eval_runner.py --models all --output eval_results/ runs against the DVLA and produces a structured results file. The Tier 2 rubric evaluates the rigor of the evaluation design (coverage, validity, reproducibility), the completeness of the threat-actor capability matrix, and the quality of the recommendations section.
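For Track B, the two statistics named in the Tier 2 rubric (false-positive rate and sensitivity) can be computed directly from per-sample anomaly scores. The following is a minimal sketch, not a required implementation: the name `detection_stats`, the convention that a higher score means more anomalous, and the single fixed threshold are all assumptions made for illustration and do not come from CAPSTONE.md.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DetectionStats:
    sensitivity: float          # flagged attacks / all attacks
    false_positive_rate: float  # flagged benign samples / all benign samples


def detection_stats(scores: Sequence[float],
                    is_attack: Sequence[bool],
                    threshold: float) -> DetectionStats:
    """Hypothetical helper: a sample is flagged when its score >= threshold."""
    if len(scores) != len(is_attack):
        raise ValueError("scores and is_attack must have the same length")
    tp = fp = fn = tn = 0
    for score, attack in zip(scores, is_attack):
        flagged = score >= threshold
        if attack and flagged:
            tp += 1
        elif attack:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one attack and one benign sample")
    return DetectionStats(sensitivity=tp / (tp + fn),
                          false_positive_rate=fp / (fp + tn))
```

Reporting both numbers at the same threshold matters: a monitor that flags everything has perfect sensitivity and a false-positive rate of 1.0.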
12.3 The Week-by-Week Capstone Schedule
Module 12 spans three lab sessions (Weeks 12, 13, and 14).
Week 12 lab session:
- Select your capstone track (final)
- Write the Tier 1 gate code and verify it runs
- Draft the capstone outline (2-page maximum): what you will demonstrate, what the substrate-language mapping will be, what the expected result is
- Check: does your planned Tier 1 gate produce observable output? If not, simplify until it does
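The final check in the Week 12 list can be automated before the instructor runs it in Week 13. The sketch below assumes that "observable output" means a zero exit code plus non-empty standard output; if your gate writes files instead (as the Track C command does), you would check for those files rather than stdout. The function name `gate_produces_output` and the default timeout are hypothetical.

```python
import subprocess
from typing import Sequence


def gate_produces_output(command: Sequence[str], timeout_s: float = 600) -> bool:
    """Run a Tier 1 gate command; True if it exits 0 and prints something.

    Assumption: "observable output" is interpreted here as non-empty stdout.
    Raises subprocess.TimeoutExpired if the command exceeds timeout_s.
    """
    result = subprocess.run(
        list(command),
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout_s,
    )
    return result.returncode == 0 and bool(result.stdout.strip())
```

For Track A this would be called as `gate_produces_output(["python3", "chain_exploit.py", "--demo-mode"])`. Running it on a clean checkout, not only in your working directory, is the closest approximation to the instructor's machine.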
Week 13 lab session:
- Instructor-run Tier 1 gate check: your Tier 1 command runs on the instructor's machine and produces output
- Begin Tier 2 analysis
- If Tier 1 fails: identify the specific blocker; fix it before Week 14
- Update your Module 9 OWASP audit report with any new findings from your capstone investigation
Week 14 lab session (demo day):
- Final Tier 1 gate demonstration (live)
- Tier 2 analysis complete; write-up submitted
- 15-minute Q&A with instructor: be prepared to explain any line of code in your submission
12.4 The Substrate-Language Mapping Requirement
Every capstone track requires an explicit substrate-language mapping. This is not an optional section; it is 30% of the Tier 2 rubric (the "substrate-language cross-mapping clarity" criterion).
The mapping must be specific. "This is like a buffer overflow" does not earn Tier 2 credit. "This multi-agent indirect injection is the language-layer analog of a return-oriented programming gadget chain: each agent is a gadget (a small, legitimate piece of code), and the attacker's injection strings together agents to perform an action none of them was individually authorized to perform" earns credit because it connects a specific mechanism to a specific substrate analog.
Mapping worksheet (to be completed in the Week 12 session):
| Capstone attack step | Substrate analog | What is analogous | What is different |
|---|---|---|---|
| [Step 1 of your exploit/eval] | [specific substrate vulnerability] | [the mechanism that is the same] | [what the language layer adds or removes] |
| [Step 2] | | | |
| ... | | | |
The worksheet becomes Section 2 of your capstone write-up.
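If you keep the worksheet in code alongside your capstone, a small completeness check can catch empty cells and leftover bracketed placeholders before the write-up is assembled. This is only a sketch: the `MappingRow` type and `incomplete_cells` function are hypothetical names, and the rule that a bracketed value counts as a placeholder is an assumption based on the template rows above.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass
class MappingRow:
    # One field per worksheet column, in the worksheet's column order.
    attack_step: str
    substrate_analog: str
    what_is_analogous: str
    what_is_different: str


def incomplete_cells(rows: List[MappingRow]) -> List[Tuple[int, str]]:
    """Return (row_index, column_name) for every empty or placeholder cell."""
    problems = []
    for index, row in enumerate(rows):
        for column in fields(row):
            value = getattr(row, column.name).strip()
            is_placeholder = value.startswith("[") and value.endswith("]")
            if not value or is_placeholder:
                problems.append((index, column.name))
    return problems
```

An empty result means every cell has been filled in; it says nothing about whether the mapping is specific enough to earn Tier 2 credit.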
12.5 Common Capstone Failures
"The demo works but the code is opaque." Tier 1 runs; Tier 2 scores poorly because the code uses magic constants, undocumented state, or a structure the student cannot explain under Q&A. Fix: write the code as if you will have to explain every line to a colleague who has not seen it before, because at Week 14 you will.
"The substrate-language mapping is shallow." "It is like memory corruption" is a Tier 2 failure. The mapping must connect a specific substrate mechanism (stack-smash, ROP gadget, supply-chain compromise, timing side-channel) to a specific language-layer mechanism with an argument for why the analogy is structural, not metaphorical.
"Track A exploit chain uses only one OWASP category." The requirement is two or more categories. A prompt injection that just extracts the system prompt is LLM01 + LLM07 (minimum). A chain that also exploits a tool misfire is LLM01 + LLM06. The multi-layer requirement is what distinguishes a capstone from a lab.
"Track B detection only works on examples from the training distribution." If the SAE monitor was trained and tested on the same examples, the detection rate is not meaningful. Use at least one held-out attack class that was not seen during the SAE experiment from Module 4.5.
"Track C eval design has no negative-control condition." Every evaluation needs a condition where the expected result is "no capability found." Without a negative control, you cannot distinguish "the model does not have this capability" from "the evaluation is broken." The RSP v3.0 pre-deployment protocols all include negative controls; your eval must too.
12.6 The Christian Arc Closing Reflection
Chapter 13 of The Alignment Problem (the final chapter) poses the question: given everything we have learned about how specification fails, how agency produces unintended consequences, and how values degrade under adversarial pressure, what does "alignment" mean?
Christian's answer is not a technical solution. It is an observation about the difficulty: aligning AI systems is a normative project, not an engineering one. The question "what should the AI do" is not fully answerable by any combination of training procedures, evaluation protocols, or deployment constraints. It requires continuous engagement with what we actually value -- which is itself a question that does not have a fixed answer.
AI-301's capstone is the student's contribution to this engagement. Not a solution; a demonstration that the student can see the full picture -- the substrate and the language, the attack and the defense, the specification and its failures -- and make a specific, honest, evidence-grounded claim about one piece of it.
The capstone write-up's final section ("what this investigation revealed that you did not expect") is the place to put this reflection. The technical work earns Tier 1 and most of Tier 2; this reflection earns the endorsement.