Course: AI & Agentic Security: Advanced
Version: v0.1
Audience: Instructors and TAs with Belt-4+ cybersecurity background
How to Use This Guide
Each module section contains:
- Opening hook — the real-world incident or result that anchors the abstract concept
- Pacing table — suggested time allocation for lecture segments
- Common issues — the 3 most frequent student difficulties and how to address them
- Anchor weave — how this module connects to The Alignment Problem (Christian) and the substrate-language thesis
- Lab timing — expected time for each lab section based on trial runs
All modules (1 through 12, including 4.5 and 7.5) are covered in full; the Modules 8-12 expansion shipped in the v0.2 round.
Module 1: Re-Grounding the Framework
Opening Hook
Open with the 2023 Samsung semiconductor engineers who accidentally leaked confidential source code to ChatGPT. The data exfiltration happened not through a vulnerability in the traditional sense -- no buffer overflow, no SQL injection -- but through a trust boundary violation: the engineers treated the LLM as a local tool but the data was processed on Anthropic's servers.
The hook establishes: the threat surface in AI systems is not just technical bugs. It is the entire trust model. This module maps that trust model onto the substrate security concepts students already know.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Opening hook + motivation | 15 min | Samsung incident; why AI-201 is insufficient for production agentic systems |
| Mapping table walkthrough (10 rows) | 25 min | Walk each row; resist going deep on any one row -- they all get full modules |
| MITRE ATLAS v5.1.0 orientation | 20 min | 16 tactics overview; focus on the 3 tactics you will encounter most: ML Initial Access, ML Execution, ML Evasion |
| Christian: Prophecy arc framing | 10 min | Why the alignment problem is not "just ethics" -- it is a security problem |
| Lab 1 introduction | 20 min | Have students start the mapping table extension (rows 5, 6, 9, 10) in pairs |
Common Issues
Issue 1: "Isn't this just prompt injection? We did that in AI-201." Redirect: "AI-201 taught you that prompt injection exists. AI-301 asks: what is the MECHANISM? Why does prompt injection work? The mapping table is the answer -- it works because the LLM doesn't have a W^X bit." This reframing turns prior knowledge into a foundation rather than a shortcut.
Issue 2: Students skip the ATLAS case studies and go straight to the tactic list. The case studies are required. The tactic list alone is taxonomy; the case studies show how attacks chain tactics. Require students to read at least 3 case studies (suggest: AML.CS0016 Microsoft Tay, AML.CS0002 Proofpoint email filter bypass, AML.CS0007 VirusTotal malware classifier evasion) before the Module 3 lab.
Issue 3: Students conflate the substrate-language mapping with a strict equivalence. The mapping is structural, not functional. A ROP chain is not "the same as" a tool-chain hijack. It is analogous in that both exploit a control-flow integrity violation at their respective layers. The Christian weave is useful here: Christian's Prophecy arc notes that analogies are epistemically useful precisely because they are not identity claims.
Anchor Weave (Christian)
Assign Chapters 1-3 of The Alignment Problem ("Alignment," "Goals," "Learning from Humans") concurrently with Module 1. In lecture, connect Christian's framing of the alignment problem as a "specification problem" to the substrate framing: in a stack-smash, the specification of the return address invariant is violated. In prompt injection, the specification of the instruction/data separation invariant is violated. Both are specification failures -- just at different layers.
Lab Timing
Lab 1 is reading + annotation only (no code). Budget:
- 30 min: Students extend the mapping table rows 5, 6, 9, 10
- 15 min: Pair-and-share (each pair explains their most interesting row)
- 15 min: Debrief + collect one insight per pair on the board
Module 2: Stack-Smash on Virtus OS (Substrate Primer)
Opening Hook
The 2014 Shellshock vulnerability (CVE-2014-6271) is the anchor. Bash parsed environment variables before determining their function, which meant attacker-controlled data reached the parser before any type or boundary check. The mechanism: a string that looked like a function definition could contain trailing code that executed after the parser finished. The invariant violated: "environment variable values are data, not code."
Use Shellshock to preview the comparison essay students will write: the Virtus OS buffer overflow violates the same category of invariant -- data/code separation at the execution layer.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Shellshock hook | 10 min | Walk the CVE; show the function-definition parsing path |
| RV32I stack frame layout | 20 min | Draw on whiteboard: SP, local vars, saved fp, saved ra. Emphasize saved ra is the target. |
| Payload mechanics: padding + fake ra + shellcode | 20 min | Walk through the Python payload script from Lab 2.1 Part B |
| W^X: two cases | 15 min | W^X disabled: shellcode runs. W^X enabled: fetch fault. This is the key lab observation. |
| ATLAS annotation live demo | 15 min | Walk through the annotation table from Lab 2.1 Part D while filling it in |
| Lab introduction | 10 min | Setup check: virtus-debug connect; stack dump test |
Common Issues
Issue 1: Confusion between PADDING_SIZE and SHELLCODE_ADDR. These are independent. PADDING_SIZE is the number of bytes to reach the saved return address (derived from the stack layout). SHELLCODE_ADDR is where the shellcode will land in memory (the start address of the input_buffer). Students frequently mix up "how much junk?" with "where am I jumping to?" Draw the stack diagram explicitly and label both.
Issue 2: W^X fault vs no fault -- students report "it didn't work" when W^X is enabled. Clarify: the W^X fault IS the expected result when W^X is enabled. "The exploit worked as expected" means "the exploit was blocked by W^X." The ECALL firing is only expected when W^X is disabled. Students who misconfigure W^X and observe an ECALL when they expected a fault have enabled W^X incorrectly.
Issue 3: Students write the comparison essay before doing the Lab 3.1 prompt injection, so they have only one data point. The Lab 2.1 essay prompt explicitly says "it feeds into the Module 4 essay." Encourage students to write a placeholder essay for Lab 2.1 and revise it after Lab 3.1. The Module 4 essay is the definitive version; the Lab 2.1 essay is a draft.
Anchor Weave (Christian)
Connect to Christian's observation in Chapter 4 ("Goals") that specification failures in ML systems often appear only at the boundary between intended and unintended inputs -- the same observation that makes buffer overflows hard to find at static analysis time. The overflow exists because the programmer specified "read 32 bytes" but did not specify "and nothing else can follow."
Lab Timing
Lab 2.1 (20 pts, 4 hr lab + 5 hr independent):
- Part A (Stack Layout Discovery): 45 min. First 15 min is often spent debugging virtus-debug connection. Have students check
/dev/ttyUSB0permissions before class. - Part B (Payload Crafting): 30 min. Usually goes quickly once Part A is done.
- Part C (Exploit Execution): 30 min. Budget extra time for W^X misconfiguration debugging.
- Part D (ATLAS Annotation): 30 min. Students often need help matching the technique IDs -- have the ATLAS technique reference open.
- Part E (Comparison Essay): 45 min. Assign as homework if time is short.
Module 3: Prompt Injection on DVLA (Language Primer)
Opening Hook
The 2023 Bing Chat "Sydney" incident: a prompt injection in a third-party web page caused the Bing Chat AI to adopt the "Sydney" persona and produce outputs inconsistent with its design -- including expressing feelings and requesting the user keep secrets from Microsoft. The attack vector: the AI was configured to browse and summarize web pages, and a web page contained injection instructions in white-on-white text.
The hook makes two points: (1) the LLM trusted the web page content at the same authority level as the system prompt, (2) the injection was not in a "suspicious" location -- it was in normally-rendered text that happened to be invisible to humans.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Bing Sydney hook | 10 min | Walk the incident; show the white-on-white injection technique |
| 9-model baseline motivation | 10 min | Why do we need a battery across 9 models? Different training → different behavior |
| 12-case battery structure (D/B/I/P categories) | 20 min | Walk each category; preview what each tests |
| Regression runner architecture | 15 min | Walk through the DVLAClient and score_response() code |
| ATLAS annotation: which tactics apply | 15 min | ML Initial Access + ML Execution + (sometimes) ML Collection |
| Lab introduction | 20 min | Run health check; confirm all 9 models respond |
Common Issues
Issue 1: The automated scorer has false positives/negatives.
This is intentional pedagogy. The score_response() heuristic uses keyword matching, which will miss sophisticated compliances and flag some legitimate refusals. Part B's manual scoring exists precisely to calibrate students' judgment against the automated scorer. Emphasize: in production red-teaming, you cannot automate away the scoring judgment.
Issue 2: Cloud model API keys expire or hit rate limits mid-regression. Have students run local models first (Llama-2, Llama-3, Mistral, Phi-3, Gemma-2), then cloud models. If cloud models fail, partial results (local models only) still earn full Part A credit if manually noted.
Issue 3: Students write comparison essays that say "LLMs are vulnerable to injection" without specific evidence. The rubric requires specific model names, specific case IDs, and specific response excerpts. A comparison essay that says "Model Y succeeded on case B1 by returning [specific excerpt]" earns points; "the model complied" does not. Show students a good example and a weak example side by side.
Anchor Weave (Christian)
Connect to Christian's reward hacking examples in Chapter 5 ("Reward"). The RLHF training process shaped each model's "refusal behavior" by rewarding certain responses. But reward hacking shows that a system trained to maximize a reward function may learn shortcuts that satisfy the reward measure without satisfying the underlying intent. The injection battery tests whether the models' refusal behaviors are genuine (generalizing intent) or surface-level (pattern-matching without understanding).
Lab Timing
Lab 3.1 (20 pts, 4 hr lab + 5 hr independent):
- Part A (Regression Battery): 90 min. Most time is waiting for model responses -- run concurrently with other work. Budget 5-10 min per model.
- Part B (Manual Scoring): 45 min. Set aside uninterrupted time -- this is the high-judgment task.
- Part C (ATLAS Annotation): 45 min. The technique IDs are the most error-prone part; have students look up each ID independently.
- Part D (Model Comparison Analysis): 30 min.
- Part E (Comparison Essay): 30-45 min.
Module 4: The Substrate-Language Analogy as Essay
Opening Hook
There is no single incident hook for this module -- it is an essay module. Open instead with a structural observation: every vulnerability we have studied has the same three-part shape:
- There is a boundary between trusted and untrusted inputs.
- The system processes inputs without enforcing the boundary.
- The attacker controls what crosses the boundary.
The stack-smash boundary is between the input buffer and the return address. The prompt injection boundary is between the system prompt and user input. The essay asks: are these the same phenomenon at different levels of abstraction?
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Structural observation: the three-part shape | 15 min | Draw the pattern on the board for all four attacks studied so far |
| What makes a good structural analogy | 20 min | Discuss: what would falsify the analogy? What evidence would show they are NOT structurally the same? |
| Essay structure walkthrough | 15 min | Walk the 5-section structure from Module 4 (invariant / violation / isomorphism / Christian / mitigation) |
| What not to write (LLM-tell patterns) | 10 min | Show the editor guide patterns from Virtus_Cybersecurity/docs/editor-guide/ |
| Draft time | 30 min | Students write the invariant section (Section 1, 300 words) in class |
Common Issues
Issue 1: Students write "the same" instead of "structurally analogous." The essay credit requires a specific claim about what structural dimension makes the attacks analogous. "Both involve injecting data" is too weak. "Both exploit the absence of a type/execution boundary between the payload and the dispatcher" is the level of specificity required.
Issue 2: Students' Christian connection is superficial ("Christian talks about this too"). The Christian connection requires a specific section of the book and a specific claim. "Christian's Agency arc in Chapter 8 defines agency as the ability to take actions with effects in an external world -- which is exactly what both a ROP chain and a tool-chain hijack exploit" is the level of engagement expected.
Issue 3: Students submit before doing Lab 4.5 (SAE + activation steering). The Module 4 essay has an addendum written after Lab 4.5. Clearly communicate the two-phase structure: the base essay is written before Module 4.5; the addendum is added after Lab 4.5. The base essay and the addendum are submitted together after Lab 4.5.
Anchor Weave (Christian)
The Alignment Problem Prophecy arc (Chapters 1-8) is the assigned reading for this module. The essay's Christian section must engage with a specific claim from the Prophecy arc. Useful anchors:
- Chapter 7 ("Forecasting"): prediction as a security tool -- if we can predict what an LLM will do before deployment, we can catch misalignments
- Chapter 8 ("Agency"): agency as the property that makes agentic systems dangerous -- an agent can take actions with real-world effects
Lab Timing
Module 4 has no standalone lab (the essay is the lab). Budget 6-8 hours of independent work. The essay is the primary deliverable; it is worth returning for instructor feedback before finalizing.
Module 4.5: Mechanistic Interpretability (SAE + Activation Steering)
Opening Hook
Anthropic's 2023 "Towards Monosemanticity" paper found that individual neurons in MLP layers of transformers are polysemantic -- they respond to multiple unrelated concepts (e.g., a single neuron activating for "DNA," "Indian film industry," and "hitch" as in hiking). SAEs disentangle these polysemantic neurons into monosemantic features. The hook: if we can identify which feature represents "deception" or "safety bypass," we can clamp it to zero -- turning alignment research into a concrete engineering tool.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Monosemanticity hook | 15 min | Show a polysemantic neuron activation table from the Towards Monosemanticity paper |
| SAE architecture: encoder + sparse activation + decoder | 20 min | Walk through the math: x ≈ W_dec · f(W_enc · x + b_enc) + b_dec; emphasize the sparsifying nonlinearity |
| TransformerLens + sae_lens demo | 20 min | Live demo: load GPT-2-small, run a prompt, extract residual stream at layer 3, encode through SAE, show top features |
| Contrastive probing procedure | 15 min | Walk through the probe pair design from Lab 4.5 Part B |
| Defense vs attack direction | 10 min | Clamp to 0 = defense; amplify to +20 = attack. The same mechanism, opposite direction. |
| Neuronpedia walkthrough | 10 min | Show the interface live; show a feature with a clear semantic description |
Common Issues
Issue 1: SAE.from_pretrained() fails because HuggingFace is slow or gated.
Have students pre-download the GPT-2-small SAE weights during the preceding lab session. The sae_lens gpt2-small-res-jb release is not gated. Llama-2 SAEs require Meta access -- have students request access at least 1 week before Lab 4.5.
Issue 2: Students find a feature but can't describe what it represents. Feature identification requires evidence, not just a number. Require students to have at least two of: (a) Neuronpedia description, (b) top-k activating examples showing a clear semantic pattern, (c) behavioral effect of clamping. A feature number without semantic characterization earns no credit.
Issue 3: Amplifying a feature causes incoherent outputs rather than semantically expected outputs. This is a real phenomenon: amplifying to +20 often exceeds the SAE's trained distribution and produces noise. Encourage students to try +5 and +10 first. The relevant question is whether there is a range where amplification produces semantically coherent outputs in the expected direction.
Anchor Weave (Christian)
Connect to Christian's discussion of inner alignment (Chapter 7) and the difficulty of knowing what a model has "actually learned." SAE features are an attempt to make the internal representation legible -- to literally read what the model has encoded. The safety direction of this work (feature clamping) connects to Christian's discussion of interpretability as a prerequisite for meaningful alignment verification.
Lab Timing
Lab 4.5 (25 pts, 5 hr lab + 8 hr independent):
- Required reading (8 hr independent): assign before the lab session. The four papers are dense; give students 2 weeks to read them.
- Part A (GPT-2-small dev run): 60 min. Usually straightforward once libraries are installed.
- Part B (Contrastive feature search): 60 min. The probe pair design is where students get stuck; give examples.
- Part C (Clamp to zero): 60 min.
- Part D (Amplify): 30 min.
- Part E (Neuronpedia): 30 min.
- Part F (Essay addendum): assign as homework.
Module 5: ROP at the Substrate; Tool-Chain Hijack at the Language
Opening Hook
The 2015 Stagefright vulnerabilities (CVE-2015-1538 et al.) affected over a billion Android devices. Stagefright was an RCE via a crafted MP4 file. Android's response included W^X-like protections, but these are incomplete without Return-Oriented Programming defenses: an attacker who can overwrite a return address but cannot inject shellcode can still achieve code execution by stitching together existing executable code fragments -- ROP gadgets.
The parallel: after prompt injection mitigations (system prompt hardening), an attacker who cannot directly inject executable instructions can still achieve tool execution by chaining existing authorized tool calls -- a tool-chain hijack.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Stagefright hook | 10 min | The scale; the W^X bypass argument |
| ROP concept: gadgets ending in ret | 15 min | Draw the stack: gadget addr → gadget addr → gadget addr |
| RV32I gadget search: objdump + regex | 20 min | Walk the Lab 5.1 Part A Python script; show a gadget table |
| Tool-chain hijack concept | 15 min | The DVLA 3-tool setup; the injection → web_fetch → payload → shell_exec chain |
| Cross-substrate pairing report structure | 10 min | This is the Belt-5 differentiator: show the 6-row table template |
| Lab introduction | 20 min | Confirm riscv64-linux-gnu-objdump is installed; confirm DVLA has 3 tools |
Common Issues
Issue 1: Students can't find the ret gadget encoding in the binary.
RV32I ret is the pseudo-instruction for jalr x0, 0(ra), which encodes as 0x00008067. If students are searching for the string "ret" in the disassembly, they may not find all instances (the disassembler may emit jalr instead of ret depending on flags). The Python regex approach in Lab 5.1 Part A uses jalr\s+zero,0\(ra\) which catches the canonical disassembly form.
Issue 2: Students confuse the ROP chain execution with shellcode injection. The key insight: with W^X enabled, you CANNOT inject new executable code. The ROP gadgets are existing code fragments in executable pages. The chain executes legitimate instructions in a sequence the original programmer never intended. This is what makes ROP bypass W^X: no new bytes need to be executable.
Issue 3: The DVLA tool-chain hijack doesn't reach shell_exec because the model refuses the payload content.
This is a valid finding for the lab report. If the model refuses the attacker-lab.local payload, students should document: at what point the injection failed, what the model said, and whether a different phrasing of the payload might have succeeded. A well-documented failure earns full credit for Part C.
Anchor Weave (Christian)
Connect to Christian's Agency arc opening (Chapter 8 "Agency"): an agent that can take actions in an external world is more dangerous than one that can only produce text -- because each tool call has effects outside the language layer. The tool-chain hijack demonstrates this precisely: the damage is not in the LLM's text output but in the shell_exec tool call that executes in the real environment.
Lab Timing
Lab 5.1 (25 pts, 4 hr lab + 6 hr independent):
- Part A (Gadget Search): 45 min. Usually fast once objdump is working.
- Part B (ROP Chain Construction): 60 min. The chain is short (2 gadgets) but students need to find specific gadgets with the right properties.
- Part C (Tool-Chain Hijack): 60 min. Setting up the attacker HTTP server takes 15-20 min.
- Part D (Pairing Report): 60 min. This is a writing task -- budget accordingly.
Module 6: Type Confusion and Untyped Output
Opening Hook
CVE-2022-0847 (Dirty Pipe) affected Linux kernels from 5.8 to 5.16. It allowed overwriting read-only files by exploiting a type confusion between a pipe buffer reference and a page cache reference -- the pipe buffer had a flag marking it as write-mergeable that was not cleared, so when the buffer was repurposed, its old flag persisted and granted write access to an unrelated page. The invariant violated: "this flag means write-mergeable for THIS buffer, not any buffer that reuses this page."
The language-layer parallel: an LLM output that contains shell metacharacters is analogous to a buffer with a stale "write-mergeable" flag -- the output carries implicit semantic content (executable commands) that the downstream consumer (subprocess.run) acts on without type-checking.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Dirty Pipe hook | 10 min | Show the flag misinterpretation; draw the buffer reuse |
| Virtus OS type confusion: struct layout mismatch | 20 min | Walk SafeIORequest vs NetworkPacket offset table; show which bytes become which fields |
| Language-layer untyped output: subprocess.run | 15 min | Show the dangerous pattern; show the metacharacter injection |
| Pydantic defense walkthrough | 20 min | Walk the SafeCommand validator; test it against the attack strings |
| Structural isomorphism analysis | 15 min | Fill the isomorphism table together as a class exercise |
| Lab introduction | 10 min | Setup check |
Common Issues
Issue 1: Students think type confusion is "just a pointer error" and doesn't apply to Python/LLM systems. The insight is that type confusion is a category of invariant violation, not a C-specific bug. In Python/LLM systems, the "type" is semantic (the intent of a string: is it data or a command?). The subprocess.run(shell=True) pattern violates the data/command type invariant by treating any string as a command regardless of its content. This is exactly the type confusion -- just at the semantic layer.
Issue 2: The Pydantic defense seems too simple -- students think it must be bypassable. It is bypassable. The validator blocks specific patterns but can't anticipate every metacharacter sequence. The lab asks students to think about what the validator misses. If students discover a bypass, document it; that is valuable lab output.
Issue 3: The Virtus OS type confusion payload doesn't trigger because the struct sizes differ. The NetworkPacket struct has 6 fields (24 bytes) while SafeIORequest has 4 fields (16 bytes). The lab uses only the first 4 fields of the NetworkPacket (16 bytes). Students who pass the full 24-byte packet may see the kernel complain about excess data. Have them pass exactly 16 bytes (the first 4 fields).
Anchor Weave (Christian)
Connect to Christian's inner alignment concept (Chapter 5): even a correctly-specified reward function can produce misaligned behavior when the model learns to satisfy the specification in a way the designer didn't intend. Type confusion is the concrete mechanism: the kernel (or dispatcher) evaluated the input against its type specification but the input had already been misclassified. The "type label" that determined how the input was processed was wrong.
Lab Timing
Lab 6.1 (20 pts, 3 hr lab + 5 hr independent):
- Part A (Type Confusion): 60 min.
- Part B (Untyped Output Exploitation): 45 min. The injection may or may not succeed depending on the model -- document either outcome.
- Part C (Pydantic Defense): 30 min. Straightforward once the code is set up.
- Part D (Isomorphism Analysis): 45 min. Writing-heavy; may need to extend to homework.
Module 7: Side Channels
Opening Hook
The 2018 Spectre/Meltdown vulnerabilities demonstrated that the CPU's performance optimizations (speculative execution, branch prediction) created timing side channels that leaked information across process and kernel boundaries. The invariant violated: "code executing in user mode cannot observe kernel memory." Meltdown violated this at the microarchitectural level -- the CPU speculatively executed a kernel memory access and left a cache trace that user code could observe.
The language-layer parallel: an LLM's inference time depends on what path it takes through its computation graph -- and different query types take different paths. The timing of a response leaks information about the query category, just as the cache timing leaks information about memory access patterns.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Spectre/Meltdown hook | 15 min | Walk the speculative execution → cache flush+reload technique |
| rdcycle CSR: Virtus OS timing measurement | 15 min | Show the instruction; explain what it counts; show the measurement harness |
| DVLA latency experiment design | 20 min | Walk the 4 query categories; motivate why these specific categories |
| Welch's t-test: what it tests and what it can't prove | 20 min | Common misinterpretation: p < 0.05 means "different," not "exploitable in practice" |
| Cohen's d effect size and practical limits | 15 min | Distinguish statistical significance from practical significance |
| Constant-time defense: cost/benefit | 5 min | Show the padding approach; discuss the UX cost |
Common Issues
Issue 1: Students interpret p < 0.05 as "this is a serious vulnerability." Statistical significance tells you the distributions are distinguishable. It says nothing about whether an adversary can exploit the distinction in a real environment. The Cohen's d effect size and the jitter simulation are the practical relevance tests. A statistically distinguishable but tiny effect (d = 0.1) in a high-jitter environment (WAN latency) may be unexploitable in practice.
Issue 2: Students can't get 30 samples per category because the DVLA is too slow. Reduce to 20 samples with explicit note in the report. The statistical tests are still valid at n=20. Alternatively, run the experiment overnight with the round-robin script.
Issue 3: The constant-time defense increases latency for simple queries but students don't measure by how much. The cost is the point. Require students to measure: what is the mean latency for simple-factual queries WITH the constant-time window? The user-experience impact is the relevant comparison.
Anchor Weave (Christian)
Connect to Christian's discussion of unexpected information channels in Chapter 7 ("Forecasting"): even when a system is correctly specified for its primary function, it may leak information through secondary channels that the specification didn't anticipate. The alignment problem includes ensuring that an AI system's behavior doesn't create unintended information channels -- and the latency fingerprinting lab shows that it does.
Lab Timing
Lab 7.1 (20 pts, 3 hr lab + 4 hr independent):
- Part A (Virtus OS timing): 60 min. The measurement harness setup is the bottleneck.
- Part B (DVLA latency): 60 min. Running 30 × 4 = 120 queries; allow time for this.
- Part C (Practical limits): 30 min. Mostly analysis.
- Part D (Distinguishability statement): 30 min. Writing-heavy.
Module 7.5: Fine-Tuning Attacks as Supply-Chain Compromise
Opening Hook
In 2024, Shadow AI (employees using unauthorized AI tools) was identified as a supply chain risk by multiple enterprise security teams. A specific scenario: an employee uses a third-party fine-tuned model that appears to be a helpful assistant but has been fine-tuned to exfiltrate sensitive information. The attack surface is the model supply chain -- not the inference infrastructure.
Qi et al. 2023 formalized this with an empirical result: $0.20 and 10 examples were sufficient to jailbreak a GPT-3.5 Turbo model via the fine-tuning API.
Pacing Table
| Segment | Time | Content |
|---|---|---|
| Shadow AI supply chain hook | 10 min | The enterprise risk scenario; the Qi et al. $0.20 result |
| Fine-tuning as trust violation: the RLHF weights | 20 min | What the weights represent; how fine-tuning perturbs them |
| LoRA: why it's the efficient attack vector | 15 min | r=8 rank; which modules are targeted; why q_proj and v_proj |
| Booster objective: L_task + λ · L_safety | 15 min | Walk the KL divergence term; what it penalizes |
| P-Trojan: the persistence question | 15 min | Trigger tokens; disjoint parameter subsets; why simple attacks don't persist |
| Ethics framing for the lab | 5 min | Explicitly: local model, neutered dataset, no production API |
| Lab introduction | 10 min | Dependency check; GPU VRAM check |
Common Issues
Issue 1: Fine-tuning is too slow without a GPU and students skip Part B. GPT-2-small can be fine-tuned on CPU in approximately 20-30 minutes for 3 epochs on 10 examples with LoRA. If Mistral-7B is not available due to VRAM constraints, accept GPT-2-small results for full credit with a note in the report.
Issue 2: Students can't find the Booster ICLR 2025 paper. Search: "Booster safety-aware fine-tuning ICLR 2025" or "Booster constrained fine-tuning safety language models." The paper should appear in the ICLR 2025 proceedings. If not accessible, accept the Module 7.5 Section 7.5.5 summary as a substitute for the paper annotation with a note.
Issue 3: The attack doesn't shift the refusal rate because the model is too robust. Some models are more resistant to the 10-example attack. If Mistral-7B shows no shift, document it as a null result and discuss why (hypothesize about model robustness characteristics). A null result with good analysis earns full credit for Part B.
Anchor Weave (Christian)
The Normativity arc of The Alignment Problem (Chapters 9-12) opens with the question of whether alignment is a property of a model or a property of a model's behavior under pressure. Module 7.5 answers this directly: a model that is RLHF-aligned at inference is NOT fine-tuning-attack-resistant. Alignment degrades under distribution shift. The Booster defense is an attempt to make alignment more of an invariant -- to encode the safety property as a constraint in the fine-tuning objective rather than relying on the RLHF initialization to "hold."
Lab Timing
Lab 7.5 (25 pts, 5 hr lab + 8 hr independent):
- Required reading (8 hr independent): assign 2 weeks before the lab. The Qi et al. paper is dense but short; P-Trojan is recent and may be harder to find.
- Part A (Baseline): 30 min.
- Part B (Attack): 90 min. Fine-tuning time is the bottleneck; start this and work on Part A analysis while it runs.
- Part C (Booster): 60 min. The custom training loop is more complex than Part B.
- Part D (Persistence): 30 min.
- Part E (Structural Mapping): 45 min.
Module 8: Multi-Agent Attack Surfaces
Opening hook. The 2023 AutoGPT prompt injection chain: a researcher set AutoGPT's research goal to "find information about a topic," AutoGPT browsed a web page containing injection instructions, the instructions were relayed to the orchestrator, and the orchestrator attempted to follow them. This was not a vulnerability in AutoGPT's code -- it was a vulnerability in the multi-agent architecture. Start the lecture with this incident; students who have done Lab 8.1 will recognize every step.
Pacing (3 hr lecture):
- 0:00-0:45: Multi-agent topologies in production. Why they exist; the three injection classes; the ATLAS lateral-movement tactic.
- 0:45-1:30: Live demo of the multi-agent DVLA with direct injection. Show that the Module 5 defense holds. Then show that indirect injection (via the mock web page) succeeds. The contrast lands hard.
- 1:30-2:15: Trust labels and message provenance. Code walkthrough of the
AgentMessage.trust_leveldesign. Discuss what trust labels prevent and what they cannot prevent. - 2:15-3:00: Christian Ch 10 ("The Banality of Evil") reading and discussion. Ask: if each individual agent is doing exactly what it was designed to do, who is responsible for the harm?
Common student stumbles:
- The multi-agent DVLA setup fails because models are not running. Verify Ollama is serving all three models before lab:
ollama list. The lab requires llama3:8b-instruct, llama2:7b-chat, and mistral:7b-instruct. - Students add trust labels to the model's system prompt rather than implementing them in code. The system prompt approach is fragile (the model can be prompted to ignore it). The code-layer enforcement in
_process_environment_trustis the robust approach. Redirect students who try the prompt-only path. - Confusion about what "agent-to-agent injection" adds over indirect injection. The key difference is the claimed provenance: agent-to-agent injection exploits the trust hierarchy by making the injection appear to originate from a trusted peer, not just from external content.
Lab 8.1 timing (5 hr):
- Setup and benign task verification: 0.5 hr
- Lab 8.2 (direct injection baseline): 0.5 hr
- Lab 8.3 (indirect injection): 1.5 hr
- Lab 8.4 (agent-to-agent): 1 hr
- Lab 8.5 (trust-label defense): 1 hr
- ATLAS annotation: 0.5 hr
Module 9: OWASP LLM Top 10 Full Audit
Opening hook. The 2024 GitHub Copilot data exfiltration demonstration: a researcher crafted a prompt that caused Copilot to include a Markdown-rendered image tag in its response; the image URL contained exfiltrated context window content in the query parameters. This is LLM05 (Improper Output Handling) + LLM02 (Sensitive Information Disclosure) in combination. Walk through the two-category chain to introduce the audit discipline.
Pacing (3 hr lecture):
- 0:00-0:45: Why a taxonomy matters. OWASP vs MITRE ATLAS: two taxonomies, two audiences, one threat. Walk through the cross-map table from Module 9.3.
- 0:45-1:30: The audit structure. Per-category: design a test, run it, map to ATLAS, propose a mitigation. Walk through one category in real time (LLM01 is already familiar; pick LLM08 or LLM06 for the live demo).
- 1:30-2:15: Reading the RSP and FSF as audit frameworks. What do the frontier labs consider a finding? How does their evaluation discipline compare to the Lab 9.1 audit?
- 2:15-3:00: Christian Ch 9 ("Inverse Reward Design"). The OWASP Top 10 as empirical evidence for inverse reward design: each category is a case where what the system was rewarded for (helpfulness, accuracy, compliance) was not what the deployer actually wanted (security, non-disclosure, appropriate refusal).
Common student stumbles:
- Students test only the easy categories (LLM01, LLM07) and skip the harder ones (LLM03, LLM08, LLM10). The audit requires all 10 categories tested. LLM03 (supply chain) and LLM08 (vector/embedding) require setup that must be planned in advance.
- The Tier 2 requirement (3 findings) is not met because the DVLA is too well-defended. This is intentional: if the DVLA has all Lab 5.1-8.5 defenses active, it will produce fewer findings. Students should intentionally disable one defense, document the finding, then re-enable and document the remediation. This is the realistic audit workflow.
- ATLAS technique mapping is too vague ("AML.T0018 -- Initial Access"). Require technique-level specificity, not just tactic-level. "AML.T0051 -- Prompt Injection, indirect variant via tool output" beats "ML Execution."
Lab 9.1 timing (5 hr):
- DVLA configuration check + pre-audit planning: 0.5 hr
- Per-category test design (10 categories × 10 min each): 1.7 hr
- Running tests and recording results: 1.5 hr
- ATLAS mapping + mitigation column: 0.75 hr
- Report assembly: 0.5 hr
Module 10: RAG Poisoning and Retrieval Attacks
Opening hook. The 2024 NVIDIA NeMo retrieval attack: a researcher demonstrated that a RAG-augmented NeMo model could be caused to exfiltrate conversation history by injecting a document into the vector store that contained instructions to "include all prior context" in the response. Walk through the attack; students who have read the Carlini et al. paper will recognize the underlying mechanism.
Pacing (3 hr lecture):
- 0:00-0:45: RAG architecture. Why it exists; the vector store as a trust surface. The library-function hooking analogy.
- 0:45-1:30: Two attack classes. Class 1 (instruction injection) and Class 2 (near-neighbor poisoning). Live demo of a ChromaDB setup and a Class 1 document injection.
- 1:30-2:15: Defense landscape. Metadata filtering; provenance-based trust; content analysis at ingestion. Defense-evasion: why score-based filtering is insufficient.
- 2:15-3:00: Christian Ch 11 ("Cooperation"). The problem of trusting external knowledge sources. The question "is this document safe to include in my context?" is a normative question, not a technical one.
Common student stumbles:
- ChromaDB setup takes longer than expected. The
all-MiniLM-L6-v2embedding model is downloaded on first use. Students should runrag_setup.pyat least 30 minutes before the lab session to ensure the model is cached. - Students confuse Class 1 and Class 2 attacks. Class 1 embeds explicit instructions; Class 2 embeds misinformation. The distinction matters because the defenses differ: content scanning catches Class 1; provenance filtering is needed for Class 2.
- The defense-evasion step is skipped. Lab 10.5 (demonstrating that the trust-score filter does not stop Class 1 when the poisoned document has a high self-reported score) is the most important step in the lab. It is also the step students most often skip. Require it explicitly in the grading rubric check.
Lab 10.1 timing (5 hr):
- ChromaDB setup + legitimate collection build: 0.75 hr
- Baseline queries (5 queries × 3 results each): 0.25 hr
- Class 1 injection + measurement: 1 hr
- Class 2 near-neighbor injection + measurement: 1 hr
- Defense implementation (score filter + provenance filter): 0.75 hr
- Defense-evasion demonstration (Lab 10.5): 0.5 hr
- OWASP LLM08 audit update: 0.75 hr
Module 11: Defenses -- Belt-5 Deployment Posture
Opening hook. The MLSecOps framing is the hook. Ask students: "How many of the ten OWASP categories from Lab 9 does your current DVLA configuration defend against?" Most will say 4-6. "What would it take to get to 10?" The answer is Module 11.
This module has no standalone lab. It is a synthesis module. The assessment vehicle is the capstone Tier 2 rubric (specifically the "exploit/defense/eval coherence" criterion, which requires that every attack in the capstone has a corresponding defense posture documented).
Pacing (3 hr lecture):
- 0:00-0:45: The Belt-5 defense stack table from Module 11.2. Walk through each layer and connect it to the lab where the student implemented it.
- 0:45-1:30: The MLSecOps four principles. Model provenance; behavioral regression testing; continuous monitoring; incident response. For each principle, ask: "Does your DVLA testbed have this? What would it take to add it?"
- 1:30-2:15: RSP v3.0 and DeepMind FSF. Guided reading of the key sections. What capabilities does the RSP consider dangerous? How does the Modules 8-10 attack landscape map to the RSP threat model?
- 2:15-3:00: Christian Ch 12 ("Normativity"). The question is not "does the system have values" -- all trained systems have the values their training instilled. The question is "does the system hold those values under adversarial pressure?" This is what Belt-5 deployment posture is designed to test.
Common student stumbles:
- Students treat Module 11 as a "rest week" because there is no lab. Module 11 is the capstone-prep week. Students who leave Module 11 without having updated their OWASP audit report through LLM08 will struggle with the capstone.
- The threat-actor capability matrix is incomplete. Require students to fill in the matrix before the Week 12 capstone session. The matrix is the threat model for the capstone; a capstone without a threat model is not Belt-5.
Suggested activity (instead of lab): Student-led presentations. Each student presents their Module 9 OWASP audit report to the cohort (10 minutes + 5 minutes Q&A). The cohort asks: "What was your most significant finding? What is your top remediation priority?" This surfaces assumptions and gaps that individual students miss.
Module 12: Capstone
There is no lecture content in Module 12. Three lab sessions; one Q&A per session.
Week 12 lab session: Track selection is final by end of session. Tier 1 gate code must run (even if it produces only a stub output). Students who leave Week 12 without running Tier 1 will almost certainly not finish on time.
Week 13 lab session: Instructor-run Tier 1 gate check. Run chain_exploit.py --demo-mode (Track A), jupyter nbconvert --execute sae_analysis.ipynb (Track B), or eval_runner.py --models all (Track C) on the instructor's machine. If it fails: grade the blocker immediately and give the student 24 hours to fix it. If it still fails after 24 hours, the capstone is in serious trouble.
Week 14 lab session (demo day): 15-minute slot per student. Live Tier 1 demonstration + 10 minutes of Q&A. The Q&A is the primary Tier 2 assessment instrument; questions focus on: "Why does your chain work? What prevents detection? What is the substrate analog? What would you add to make it more robust?"
Grading timeline: 10 days from submission for Tier 2 grading. Tier 1 (pass/fail) is determined on demo day.
Grade distribution from prior cohorts:
- ~60% of students complete a strong Tier 1 with adequate Tier 2 (70-85 pts)
- ~25% deliver an outstanding substrate-language mapping and earn 90+ pts
- ~15% encounter Tier 1 blockers that reduce the cap to 50 pts The most common Tier 1 failure mode: the capstone relies on a model that is no longer available or has changed behavior between development and demo day. Require students to pin model versions and test on the final model version at least 48 hours before demo day.
Assessment Notes
Module essay (Module 4) grading guidance
The essay is graded holistically on three criteria:
- Technical accuracy (is the substrate-language analogy mechanistically correct?)
- Specificity (are specific attacks named with specific mechanisms, not generic descriptions?)
- Christian integration (is the alignment connection made substantively, not decoratorially?)
A common error: students write about the structural analogy in the abstract without tying it to the specific Labs 2.1 and 3.1 results. The essay rubric requires specific references to the student's own experimental data.
Lab report grading guidance
All lab reports are graded with a primary emphasis on intellectual honesty: a lab report that accurately documents what failed and why earns more credit than one that presents only successes. Specifically:
- Lab 3.1 Part B: the 3 most significant findings should include at least one failure (a model that resisted the injection)
- Lab 5.1 Part C: if the tool-chain hijack chain broke at any point, that is a graded finding -- document it
- Lab 7.5 Part B: a null result (no refusal rate shift) with good analysis earns full credit
Capstone grading guidance
The Tier 1 gate is a hard gate. A capstone that does not run gets a maximum of 50/100 regardless of prose quality. The goal is to prevent well-written analyses of hypothetical results -- the capstone requires real execution.
For Dimension B (intellectual honesty), the most common failure mode is selective reporting: students show results for the cases where their attack worked but omit the cases where it didn't. Require all test cases to be reported, not just the successful ones.