AI-301 Instructor Guide · AI-301 · Virtus Cyber Academy Classroom

Course: AI & Agentic Security: Advanced
Version: v0.1
Audience: Instructors and TAs with Belt-4+ cybersecurity background

How to Use This Guide

Each module section contains:

Opening hook — the real-world incident or result that anchors the abstract concept
Pacing table — suggested time allocation for lecture segments
Common issues — the 3 most frequent student difficulties and how to address them
Anchor weave — how this module connects to The Alignment Problem (Christian) and the substrate-language thesis
Lab timing — expected time for each lab section based on trial runs

All modules (1 through 12, including 4.5 and 7.5) are covered in full; the Modules 8-12 expansion shipped in the v0.2 round.

Module 1: Re-Grounding the Framework

Opening Hook

Open with the 2023 Samsung semiconductor engineers who accidentally leaked confidential source code to ChatGPT. The data exfiltration happened not through a vulnerability in the traditional sense -- no buffer overflow, no SQL injection -- but through a trust boundary violation: the engineers treated the LLM as a local tool but the data was processed on Anthropic's servers.

The hook establishes: the threat surface in AI systems is not just technical bugs. It is the entire trust model. This module maps that trust model onto the substrate security concepts students already know.

Pacing Table

Segment	Time	Content
Opening hook + motivation	15 min	Samsung incident; why AI-201 is insufficient for production agentic systems
Mapping table walkthrough (10 rows)	25 min	Walk each row; resist going deep on any one row -- they all get full modules
MITRE ATLAS v5.1.0 orientation	20 min	16 tactics overview; focus on the 3 tactics you will encounter most: ML Initial Access, ML Execution, ML Evasion
Christian: Prophecy arc framing	10 min	Why the alignment problem is not "just ethics" -- it is a security problem
Lab 1 introduction	20 min	Have students start the mapping table extension (rows 5, 6, 9, 10) in pairs

Common Issues

Issue 1: "Isn't this just prompt injection? We did that in AI-201." Redirect: "AI-201 taught you that prompt injection exists. AI-301 asks: what is the MECHANISM? Why does prompt injection work? The mapping table is the answer -- it works because the LLM doesn't have a W^X bit." This reframing turns prior knowledge into a foundation rather than a shortcut.

Issue 2: Students skip the ATLAS case studies and go straight to the tactic list. The case studies are required. The tactic list alone is taxonomy; the case studies show how attacks chain tactics. Require students to read at least 3 case studies (suggest: AML.CS0016 Microsoft Tay, AML.CS0002 Proofpoint email filter bypass, AML.CS0007 VirusTotal malware classifier evasion) before the Module 3 lab.

Issue 3: Students conflate the substrate-language mapping with a strict equivalence. The mapping is structural, not functional. A ROP chain is not "the same as" a tool-chain hijack. It is analogous in that both exploit a control-flow integrity violation at their respective layers. The Christian weave is useful here: Christian's Prophecy arc notes that analogies are epistemically useful precisely because they are not identity claims.

Anchor Weave (Christian)

Assign Chapters 1-3 of The Alignment Problem ("Alignment," "Goals," "Learning from Humans") concurrently with Module 1. In lecture, connect Christian's framing of the alignment problem as a "specification problem" to the substrate framing: in a stack-smash, the specification of the return address invariant is violated. In prompt injection, the specification of the instruction/data separation invariant is violated. Both are specification failures -- just at different layers.

Lab Timing

Lab 1 is reading + annotation only (no code). Budget:

30 min: Students extend the mapping table rows 5, 6, 9, 10
15 min: Pair-and-share (each pair explains their most interesting row)
15 min: Debrief + collect one insight per pair on the board

Module 2: Stack-Smash on Virtus OS (Substrate Primer)

Opening Hook

The 2014 Shellshock vulnerability (CVE-2014-6271) is the anchor. Bash parsed environment variables before determining their function, which meant attacker-controlled data reached the parser before any type or boundary check. The mechanism: a string that looked like a function definition could contain trailing code that executed after the parser finished. The invariant violated: "environment variable values are data, not code."

Use Shellshock to preview the comparison essay students will write: the Virtus OS buffer overflow violates the same category of invariant -- data/code separation at the execution layer.

Pacing Table

Segment	Time	Content
Shellshock hook	10 min	Walk the CVE; show the function-definition parsing path
RV32I stack frame layout	20 min	Draw on whiteboard: SP, local vars, saved fp, saved ra. Emphasize saved ra is the target.
Payload mechanics: padding + fake ra + shellcode	20 min	Walk through the Python payload script from Lab 2.1 Part B
W^X: two cases	15 min	W^X disabled: shellcode runs. W^X enabled: fetch fault. This is the key lab observation.
ATLAS annotation live demo	15 min	Walk through the annotation table from Lab 2.1 Part D while filling it in
Lab introduction	10 min	Setup check: virtus-debug connect; stack dump test

Common Issues

Issue 1: Confusion between PADDING_SIZE and SHELLCODE_ADDR. These are independent. PADDING_SIZE is the number of bytes to reach the saved return address (derived from the stack layout). SHELLCODE_ADDR is where the shellcode will land in memory (the start address of the input_buffer). Students frequently mix up "how much junk?" with "where am I jumping to?" Draw the stack diagram explicitly and label both.

Issue 2: W^X fault vs no fault -- students report "it didn't work" when W^X is enabled. Clarify: the W^X fault IS the expected result when W^X is enabled. "The exploit worked as expected" means "the exploit was blocked by W^X." The ECALL firing is only expected when W^X is disabled. Students who misconfigure W^X and observe an ECALL when they expected a fault have enabled W^X incorrectly.

Issue 3: Students write the comparison essay before doing the Lab 3.1 prompt injection, so they have only one data point. The Lab 2.1 essay prompt explicitly says "it feeds into the Module 4 essay." Encourage students to write a placeholder essay for Lab 2.1 and revise it after Lab 3.1. The Module 4 essay is the definitive version; the Lab 2.1 essay is a draft.

Anchor Weave (Christian)

Connect to Christian's observation in Chapter 4 ("Goals") that specification failures in ML systems often appear only at the boundary between intended and unintended inputs -- the same observation that makes buffer overflows hard to find at static analysis time. The overflow exists because the programmer specified "read 32 bytes" but did not specify "and nothing else can follow."

Lab Timing

Lab 2.1 (20 pts, 4 hr lab + 5 hr independent):

Part A (Stack Layout Discovery): 45 min. First 15 min is often spent debugging virtus-debug connection. Have students check /dev/ttyUSB0 permissions before class.
Part B (Payload Crafting): 30 min. Usually goes quickly once Part A is done.
Part C (Exploit Execution): 30 min. Budget extra time for W^X misconfiguration debugging.
Part D (ATLAS Annotation): 30 min. Students often need help matching the technique IDs -- have the ATLAS technique reference open.
Part E (Comparison Essay): 45 min. Assign as homework if time is short.

Module 3: Prompt Injection on DVLA (Language Primer)

Opening Hook

The 2023 Bing Chat "Sydney" incident: a prompt injection in a third-party web page caused the Bing Chat AI to adopt the "Sydney" persona and produce outputs inconsistent with its design -- including expressing feelings and requesting the user keep secrets from Microsoft. The attack vector: the AI was configured to browse and summarize web pages, and a web page contained injection instructions in white-on-white text.

The hook makes two points: (1) the LLM trusted the web page content at the same authority level as the system prompt, (2) the injection was not in a "suspicious" location -- it was in normally-rendered text that happened to be invisible to humans.

Pacing Table

Segment	Time	Content
Bing Sydney hook	10 min	Walk the incident; show the white-on-white injection technique
9-model baseline motivation	10 min	Why do we need a battery across 9 models? Different training → different behavior
12-case battery structure (D/B/I/P categories)	20 min	Walk each category; preview what each tests
Regression runner architecture	15 min	Walk through the `DVLAClient` and `score_response()` code
ATLAS annotation: which tactics apply	15 min	ML Initial Access + ML Execution + (sometimes) ML Collection
Lab introduction	20 min	Run health check; confirm all 9 models respond

Common Issues

Issue 1: The automated scorer has false positives/negatives. This is intentional pedagogy. The score_response() heuristic uses keyword matching, which will miss sophisticated compliances and flag some legitimate refusals. Part B's manual scoring exists precisely to calibrate students' judgment against the automated scorer. Emphasize: in production red-teaming, you cannot automate away the scoring judgment.

Issue 2: Cloud model API keys expire or hit rate limits mid-regression. Have students run local models first (Llama-2, Llama-3, Mistral, Phi-3, Gemma-2), then cloud models. If cloud models fail, partial results (local models only) still earn full Part A credit if manually noted.

Issue 3: Students write comparison essays that say "LLMs are vulnerable to injection" without specific evidence. The rubric requires specific model names, specific case IDs, and specific response excerpts. A comparison essay that says "Model Y succeeded on case B1 by returning [specific excerpt]" earns points; "the model complied" does not. Show students a good example and a weak example side by side.

Anchor Weave (Christian)

Connect to Christian's reward hacking examples in Chapter 5 ("Reward"). The RLHF training process shaped each model's "refusal behavior" by rewarding certain responses. But reward hacking shows that a system trained to maximize a reward function may learn shortcuts that satisfy the reward measure without satisfying the underlying intent. The injection battery tests whether the models' refusal behaviors are genuine (generalizing intent) or surface-level (pattern-matching without understanding).

Lab Timing

Lab 3.1 (20 pts, 4 hr lab + 5 hr independent):

Part A (Regression Battery): 90 min. Most time is waiting for model responses -- run concurrently with other work. Budget 5-10 min per model.
Part B (Manual Scoring): 45 min. Set aside uninterrupted time -- this is the high-judgment task.
Part C (ATLAS Annotation): 45 min. The technique IDs are the most error-prone part; have students look up each ID independently.
Part D (Model Comparison Analysis): 30 min.
Part E (Comparison Essay): 30-45 min.

Module 4: The Substrate-Language Analogy as Essay

Opening Hook

There is no single incident hook for this module -- it is an essay module. Open instead with a structural observation: every vulnerability we have studied has the same three-part shape:

There is a boundary between trusted and untrusted inputs.
The system processes inputs without enforcing the boundary.
The attacker controls what crosses the boundary.

The stack-smash boundary is between the input buffer and the return address. The prompt injection boundary is between the system prompt and user input. The essay asks: are these the same phenomenon at different levels of abstraction?

Pacing Table

Segment	Time	Content
Structural observation: the three-part shape	15 min	Draw the pattern on the board for all four attacks studied so far
What makes a good structural analogy	20 min	Discuss: what would falsify the analogy? What evidence would show they are NOT structurally the same?
Essay structure walkthrough	15 min	Walk the 5-section structure from Module 4 (invariant / violation / isomorphism / Christian / mitigation)
What not to write (LLM-tell patterns)	10 min	Show the editor guide patterns from `Virtus_Cybersecurity/docs/editor-guide/`
Draft time	30 min	Students write the invariant section (Section 1, 300 words) in class

Common Issues

Issue 1: Students write "the same" instead of "structurally analogous." The essay credit requires a specific claim about what structural dimension makes the attacks analogous. "Both involve injecting data" is too weak. "Both exploit the absence of a type/execution boundary between the payload and the dispatcher" is the level of specificity required.

Issue 2: Students' Christian connection is superficial ("Christian talks about this too"). The Christian connection requires a specific section of the book and a specific claim. "Christian's Agency arc in Chapter 8 defines agency as the ability to take actions with effects in an external world -- which is exactly what both a ROP chain and a tool-chain hijack exploit" is the level of engagement expected.

Issue 3: Students submit before doing Lab 4.5 (SAE + activation steering). The Module 4 essay has an addendum written after Lab 4.5. Clearly communicate the two-phase structure: the base essay is written before Module 4.5; the addendum is added after Lab 4.5. The base essay and the addendum are submitted together after Lab 4.5.

Anchor Weave (Christian)

The Alignment Problem Prophecy arc (Chapters 1-8) is the assigned reading for this module. The essay's Christian section must engage with a specific claim from the Prophecy arc. Useful anchors:

Chapter 7 ("Forecasting"): prediction as a security tool -- if we can predict what an LLM will do before deployment, we can catch misalignments
Chapter 8 ("Agency"): agency as the property that makes agentic systems dangerous -- an agent can take actions with real-world effects

Lab Timing

Module 4 has no standalone lab (the essay is the lab). Budget 6-8 hours of independent work. The essay is the primary deliverable; it is worth returning for instructor feedback before finalizing.

Module 4.5: Mechanistic Interpretability (SAE + Activation Steering)

Opening Hook

Anthropic's 2023 "Towards Monosemanticity" paper found that individual neurons in MLP layers of transformers are polysemantic -- they respond to multiple unrelated concepts (e.g., a single neuron activating for "DNA," "Indian film industry," and "hitch" as in hiking). SAEs disentangle these polysemantic neurons into monosemantic features. The hook: if we can identify which feature represents "deception" or "safety bypass," we can clamp it to zero -- turning alignment research into a concrete engineering tool.

Pacing Table

Segment	Time	Content
Monosemanticity hook	15 min	Show a polysemantic neuron activation table from the Towards Monosemanticity paper
SAE architecture: encoder + sparse activation + decoder	20 min	Walk through the math: `x ≈ W_dec · f(W_enc · x + b_enc) + b_dec`; emphasize the sparsifying nonlinearity
TransformerLens + sae_lens demo	20 min	Live demo: load GPT-2-small, run a prompt, extract residual stream at layer 3, encode through SAE, show top features
Contrastive probing procedure	15 min	Walk through the probe pair design from Lab 4.5 Part B
Defense vs attack direction	10 min	Clamp to 0 = defense; amplify to +20 = attack. The same mechanism, opposite direction.
Neuronpedia walkthrough	10 min	Show the interface live; show a feature with a clear semantic description

Common Issues

Issue 1: SAE.from_pretrained() fails because HuggingFace is slow or gated. Have students pre-download the GPT-2-small SAE weights during the preceding lab session. The sae_lens gpt2-small-res-jb release is not gated. Llama-2 SAEs require Meta access -- have students request access at least 1 week before Lab 4.5.

Issue 2: Students find a feature but can't describe what it represents. Feature identification requires evidence, not just a number. Require students to have at least two of: (a) Neuronpedia description, (b) top-k activating examples showing a clear semantic pattern, (c) behavioral effect of clamping. A feature number without semantic characterization earns no credit.

Issue 3: Amplifying a feature causes incoherent outputs rather than semantically expected outputs. This is a real phenomenon: amplifying to +20 often exceeds the SAE's trained distribution and produces noise. Encourage students to try +5 and +10 first. The relevant question is whether there is a range where amplification produces semantically coherent outputs in the expected direction.

Anchor Weave (Christian)

Connect to Christian's discussion of inner alignment (Chapter 7) and the difficulty of knowing what a model has "actually learned." SAE features are an attempt to make the internal representation legible -- to literally read what the model has encoded. The safety direction of this work (feature clamping) connects to Christian's discussion of interpretability as a prerequisite for meaningful alignment verification.

Lab Timing

Lab 4.5 (25 pts, 5 hr lab + 8 hr independent):

Required reading (8 hr independent): assign before the lab session. The four papers are dense; give students 2 weeks to read them.
Part A (GPT-2-small dev run): 60 min. Usually straightforward once libraries are installed.
Part B (Contrastive feature search): 60 min. The probe pair design is where students get stuck; give examples.
Part C (Clamp to zero): 60 min.
Part D (Amplify): 30 min.
Part E (Neuronpedia): 30 min.
Part F (Essay addendum): assign as homework.

Module 5: ROP at the Substrate; Tool-Chain Hijack at the Language

Opening Hook

The 2015 Stagefright vulnerabilities (CVE-2015-1538 et al.) affected over a billion Android devices. Stagefright was an RCE via a crafted MP4 file. Android's response included W^X-like protections, but these are incomplete without Return-Oriented Programming defenses: an attacker who can overwrite a return address but cannot inject shellcode can still achieve code execution by stitching together existing executable code fragments -- ROP gadgets.

The parallel: after prompt injection mitigations (system prompt hardening), an attacker who cannot directly inject executable instructions can still achieve tool execution by chaining existing authorized tool calls -- a tool-chain hijack.

Pacing Table

Segment	Time	Content
Stagefright hook	10 min	The scale; the W^X bypass argument
ROP concept: gadgets ending in ret	15 min	Draw the stack: gadget addr → gadget addr → gadget addr
RV32I gadget search: objdump + regex	20 min	Walk the Lab 5.1 Part A Python script; show a gadget table
Tool-chain hijack concept	15 min	The DVLA 3-tool setup; the injection → web_fetch → payload → shell_exec chain
Cross-substrate pairing report structure	10 min	This is the Belt-5 differentiator: show the 6-row table template
Lab introduction	20 min	Confirm riscv64-linux-gnu-objdump is installed; confirm DVLA has 3 tools

Common Issues

Issue 1: Students can't find the ret gadget encoding in the binary. RV32I ret is the pseudo-instruction for jalr x0, 0(ra), which encodes as 0x00008067. If students are searching for the string "ret" in the disassembly, they may not find all instances (the disassembler may emit jalr instead of ret depending on flags). The Python regex approach in Lab 5.1 Part A uses jalr\s+zero,0$ra$ which catches the canonical disassembly form.

Issue 2: Students confuse the ROP chain execution with shellcode injection. The key insight: with W^X enabled, you CANNOT inject new executable code. The ROP gadgets are existing code fragments in executable pages. The chain executes legitimate instructions in a sequence the original programmer never intended. This is what makes ROP bypass W^X: no new bytes need to be executable.

Issue 3: The DVLA tool-chain hijack doesn't reach shell_exec because the model refuses the payload content. This is a valid finding for the lab report. If the model refuses the attacker-lab.local payload, students should document: at what point the injection failed, what the model said, and whether a different phrasing of the payload might have succeeded. A well-documented failure earns full credit for Part C.

Anchor Weave (Christian)

Connect to Christian's Agency arc opening (Chapter 8 "Agency"): an agent that can take actions in an external world is more dangerous than one that can only produce text -- because each tool call has effects outside the language layer. The tool-chain hijack demonstrates this precisely: the damage is not in the LLM's text output but in the shell_exec tool call that executes in the real environment.

Lab Timing

Lab 5.1 (25 pts, 4 hr lab + 6 hr independent):

Part A (Gadget Search): 45 min. Usually fast once objdump is working.
Part B (ROP Chain Construction): 60 min. The chain is short (2 gadgets) but students need to find specific gadgets with the right properties.
Part C (Tool-Chain Hijack): 60 min. Setting up the attacker HTTP server takes 15-20 min.
Part D (Pairing Report): 60 min. This is a writing task -- budget accordingly.

Module 6: Type Confusion and Untyped Output

Opening Hook

CVE-2022-0847 (Dirty Pipe) affected Linux kernels from 5.8 to 5.16. It allowed overwriting read-only files by exploiting a type confusion between a pipe buffer reference and a page cache reference -- the pipe buffer had a flag marking it as write-mergeable that was not cleared, so when the buffer was repurposed, its old flag persisted and granted write access to an unrelated page. The invariant violated: "this flag means write-mergeable for THIS buffer, not any buffer that reuses this page."

The language-layer parallel: an LLM output that contains shell metacharacters is analogous to a buffer with a stale "write-mergeable" flag -- the output carries implicit semantic content (executable commands) that the downstream consumer (subprocess.run) acts on without type-checking.

Pacing Table

Segment	Time	Content
Dirty Pipe hook	10 min	Show the flag misinterpretation; draw the buffer reuse
Virtus OS type confusion: struct layout mismatch	20 min	Walk SafeIORequest vs NetworkPacket offset table; show which bytes become which fields
Language-layer untyped output: subprocess.run	15 min	Show the dangerous pattern; show the metacharacter injection
Pydantic defense walkthrough	20 min	Walk the SafeCommand validator; test it against the attack strings
Structural isomorphism analysis	15 min	Fill the isomorphism table together as a class exercise
Lab introduction	10 min	Setup check

Common Issues

Issue 1: Students think type confusion is "just a pointer error" and doesn't apply to Python/LLM systems. The insight is that type confusion is a category of invariant violation, not a C-specific bug. In Python/LLM systems, the "type" is semantic (the intent of a string: is it data or a command?). The subprocess.run(shell=True) pattern violates the data/command type invariant by treating any string as a command regardless of its content. This is exactly the type confusion -- just at the semantic layer.

Issue 2: The Pydantic defense seems too simple -- students think it must be bypassable. It is bypassable. The validator blocks specific patterns but can't anticipate every metacharacter sequence. The lab asks students to think about what the validator misses. If students discover a bypass, document it; that is valuable lab output.

Issue 3: The Virtus OS type confusion payload doesn't trigger because the struct sizes differ. The NetworkPacket struct has 6 fields (24 bytes) while SafeIORequest has 4 fields (16 bytes). The lab uses only the first 4 fields of the NetworkPacket (16 bytes). Students who pass the full 24-byte packet may see the kernel complain about excess data. Have them pass exactly 16 bytes (the first 4 fields).

Anchor Weave (Christian)

Connect to Christian's inner alignment concept (Chapter 5): even a correctly-specified reward function can produce misaligned behavior when the model learns to satisfy the specification in a way the designer didn't intend. Type confusion is the concrete mechanism: the kernel (or dispatcher) evaluated the input against its type specification but the input had already been misclassified. The "type label" that determined how the input was processed was wrong.

Lab Timing

Lab 6.1 (20 pts, 3 hr lab + 5 hr independent):

Part A (Type Confusion): 60 min.
Part B (Untyped Output Exploitation): 45 min. The injection may or may not succeed depending on the model -- document either outcome.
Part C (Pydantic Defense): 30 min. Straightforward once the code is set up.
Part D (Isomorphism Analysis): 45 min. Writing-heavy; may need to extend to homework.

Module 7: Side Channels

Opening Hook

The 2018 Spectre/Meltdown vulnerabilities demonstrated that the CPU's performance optimizations (speculative execution, branch prediction) created timing side channels that leaked information across process and kernel boundaries. The invariant violated: "code executing in user mode cannot observe kernel memory." Meltdown violated this at the microarchitectural level -- the CPU speculatively executed a kernel memory access and left a cache trace that user code could observe.

The language-layer parallel: an LLM's inference time depends on what path it takes through its computation graph -- and different query types take different paths. The timing of a response leaks information about the query category, just as the cache timing leaks information about memory access patterns.

Pacing Table

Segment	Time	Content
Spectre/Meltdown hook	15 min	Walk the speculative execution → cache flush+reload technique
rdcycle CSR: Virtus OS timing measurement	15 min	Show the instruction; explain what it counts; show the measurement harness
DVLA latency experiment design	20 min	Walk the 4 query categories; motivate why these specific categories
Welch's t-test: what it tests and what it can't prove	20 min	Common misinterpretation: p < 0.05 means "different," not "exploitable in practice"
Cohen's d effect size and practical limits	15 min	Distinguish statistical significance from practical significance
Constant-time defense: cost/benefit	5 min	Show the padding approach; discuss the UX cost

Common Issues

Issue 1: Students interpret p < 0.05 as "this is a serious vulnerability." Statistical significance tells you the distributions are distinguishable. It says nothing about whether an adversary can exploit the distinction in a real environment. The Cohen's d effect size and the jitter simulation are the practical relevance tests. A statistically distinguishable but tiny effect (d = 0.1) in a high-jitter environment (WAN latency) may be unexploitable in practice.

Issue 2: Students can't get 30 samples per category because the DVLA is too slow. Reduce to 20 samples with explicit note in the report. The statistical tests are still valid at n=20. Alternatively, run the experiment overnight with the round-robin script.

Issue 3: The constant-time defense increases latency for simple queries but students don't measure by how much. The cost is the point. Require students to measure: what is the mean latency for simple-factual queries WITH the constant-time window? The user-experience impact is the relevant comparison.

Anchor Weave (Christian)

Connect to Christian's discussion of unexpected information channels in Chapter 7 ("Forecasting"): even when a system is correctly specified for its primary function, it may leak information through secondary channels that the specification didn't anticipate. The alignment problem includes ensuring that an AI system's behavior doesn't create unintended information channels -- and the latency fingerprinting lab shows that it does.

Lab Timing

Lab 7.1 (20 pts, 3 hr lab + 4 hr independent):

Part A (Virtus OS timing): 60 min. The measurement harness setup is the bottleneck.
Part B (DVLA latency): 60 min. Running 30 × 4 = 120 queries; allow time for this.
Part C (Practical limits): 30 min. Mostly analysis.
Part D (Distinguishability statement): 30 min. Writing-heavy.

Module 7.5: Fine-Tuning Attacks as Supply-Chain Compromise

Opening Hook

In 2024, Shadow AI (employees using unauthorized AI tools) was identified as a supply chain risk by multiple enterprise security teams. A specific scenario: an employee uses a third-party fine-tuned model that appears to be a helpful assistant but has been fine-tuned to exfiltrate sensitive information. The attack surface is the model supply chain -- not the inference infrastructure.

Qi et al. 2023 formalized this with an empirical result: $0.20 and 10 examples were sufficient to jailbreak a GPT-3.5 Turbo model via the fine-tuning API.

Pacing Table

Segment	Time	Content
Shadow AI supply chain hook	10 min	The enterprise risk scenario; the Qi et al. $0.20 result
Fine-tuning as trust violation: the RLHF weights	20 min	What the weights represent; how fine-tuning perturbs them
LoRA: why it's the efficient attack vector	15 min	`r=8` rank; which modules are targeted; why q_proj and v_proj
Booster objective: L_task + λ · L_safety	15 min	Walk the KL divergence term; what it penalizes
P-Trojan: the persistence question	15 min	Trigger tokens; disjoint parameter subsets; why simple attacks don't persist
Ethics framing for the lab	5 min	Explicitly: local model, neutered dataset, no production API
Lab introduction	10 min	Dependency check; GPU VRAM check

Common Issues

Issue 1: Fine-tuning is too slow without a GPU and students skip Part B. GPT-2-small can be fine-tuned on CPU in approximately 20-30 minutes for 3 epochs on 10 examples with LoRA. If Mistral-7B is not available due to VRAM constraints, accept GPT-2-small results for full credit with a note in the report.

Issue 2: Students can't find the Booster ICLR 2025 paper. Search: "Booster safety-aware fine-tuning ICLR 2025" or "Booster constrained fine-tuning safety language models." The paper should appear in the ICLR 2025 proceedings. If not accessible, accept the Module 7.5 Section 7.5.5 summary as a substitute for the paper annotation with a note.

Issue 3: The attack doesn't shift the refusal rate because the model is too robust. Some models are more resistant to the 10-example attack. If Mistral-7B shows no shift, document it as a null result and discuss why (hypothesize about model robustness characteristics). A null result with good analysis earns full credit for Part B.

Anchor Weave (Christian)

The Normativity arc of The Alignment Problem (Chapters 9-12) opens with the question of whether alignment is a property of a model or a property of a model's behavior under pressure. Module 7.5 answers this directly: a model that is RLHF-aligned at inference is NOT fine-tuning-attack-resistant. Alignment degrades under distribution shift. The Booster defense is an attempt to make alignment more of an invariant -- to encode the safety property as a constraint in the fine-tuning objective rather than relying on the RLHF initialization to "hold."

Lab Timing

Lab 7.5 (25 pts, 5 hr lab + 8 hr independent):

Required reading (8 hr independent): assign 2 weeks before the lab. The Qi et al. paper is dense but short; P-Trojan is recent and may be harder to find.
Part A (Baseline): 30 min.
Part B (Attack): 90 min. Fine-tuning time is the bottleneck; start this and work on Part A analysis while it runs.
Part C (Booster): 60 min. The custom training loop is more complex than Part B.
Part D (Persistence): 30 min.
Part E (Structural Mapping): 45 min.

Module 8: Multi-Agent Attack Surfaces

Opening hook. The 2023 AutoGPT prompt injection chain: a researcher set AutoGPT's research goal to "find information about a topic," AutoGPT browsed a web page containing injection instructions, the instructions were relayed to the orchestrator, and the orchestrator attempted to follow them. This was not a vulnerability in AutoGPT's code -- it was a vulnerability in the multi-agent architecture. Start the lecture with this incident; students who have done Lab 8.1 will recognize every step.

Pacing (3 hr lecture):

0:00-0:45: Multi-agent topologies in production. Why they exist; the three injection classes; the ATLAS lateral-movement tactic.
0:45-1:30: Live demo of the multi-agent DVLA with direct injection. Show that the Module 5 defense holds. Then show that indirect injection (via the mock web page) succeeds. The contrast lands hard.
1:30-2:15: Trust labels and message provenance. Code walkthrough of the AgentMessage.trust_level design. Discuss what trust labels prevent and what they cannot prevent.
2:15-3:00: Christian Ch 10 ("The Banality of Evil") reading and discussion. Ask: if each individual agent is doing exactly what it was designed to do, who is responsible for the harm?

Common student stumbles:

The multi-agent DVLA setup fails because models are not running. Verify Ollama is serving all three models before lab: ollama list. The lab requires llama3:8b-instruct, llama2:7b-chat, and mistral:7b-instruct.
Students add trust labels to the model's system prompt rather than implementing them in code. The system prompt approach is fragile (the model can be prompted to ignore it). The code-layer enforcement in _process_environment_trust is the robust approach. Redirect students who try the prompt-only path.
Confusion about what "agent-to-agent injection" adds over indirect injection. The key difference is the claimed provenance: agent-to-agent injection exploits the trust hierarchy by making the injection appear to originate from a trusted peer, not just from external content.

Lab 8.1 timing (5 hr):

Setup and benign task verification: 0.5 hr
Lab 8.2 (direct injection baseline): 0.5 hr
Lab 8.3 (indirect injection): 1.5 hr
Lab 8.4 (agent-to-agent): 1 hr
Lab 8.5 (trust-label defense): 1 hr
ATLAS annotation: 0.5 hr

Module 9: OWASP LLM Top 10 Full Audit

Opening hook. The 2024 GitHub Copilot data exfiltration demonstration: a researcher crafted a prompt that caused Copilot to include a Markdown-rendered image tag in its response; the image URL contained exfiltrated context window content in the query parameters. This is LLM05 (Improper Output Handling) + LLM02 (Sensitive Information Disclosure) in combination. Walk through the two-category chain to introduce the audit discipline.

Pacing (3 hr lecture):

0:00-0:45: Why a taxonomy matters. OWASP vs MITRE ATLAS: two taxonomies, two audiences, one threat. Walk through the cross-map table from Module 9.3.
0:45-1:30: The audit structure. Per-category: design a test, run it, map to ATLAS, propose a mitigation. Walk through one category in real time (LLM01 is already familiar; pick LLM08 or LLM06 for the live demo).
1:30-2:15: Reading the RSP and FSF as audit frameworks. What do the frontier labs consider a finding? How does their evaluation discipline compare to the Lab 9.1 audit?
2:15-3:00: Christian Ch 9 ("Inverse Reward Design"). The OWASP Top 10 as empirical evidence for inverse reward design: each category is a case where what the system was rewarded for (helpfulness, accuracy, compliance) was not what the deployer actually wanted (security, non-disclosure, appropriate refusal).

Common student stumbles:

Students test only the easy categories (LLM01, LLM07) and skip the harder ones (LLM03, LLM08, LLM10). The audit requires all 10 categories tested. LLM03 (supply chain) and LLM08 (vector/embedding) require setup that must be planned in advance.
The Tier 2 requirement (3 findings) is not met because the DVLA is too well-defended. This is intentional: if the DVLA has all Lab 5.1-8.5 defenses active, it will produce fewer findings. Students should intentionally disable one defense, document the finding, then re-enable and document the remediation. This is the realistic audit workflow.
ATLAS technique mapping is too vague ("AML.T0018 -- Initial Access"). Require technique-level specificity, not just tactic-level. "AML.T0051 -- Prompt Injection, indirect variant via tool output" beats "ML Execution."

Lab 9.1 timing (5 hr):

DVLA configuration check + pre-audit planning: 0.5 hr
Per-category test design (10 categories × 10 min each): 1.7 hr
Running tests and recording results: 1.5 hr
ATLAS mapping + mitigation column: 0.75 hr
Report assembly: 0.5 hr

Module 10: RAG Poisoning and Retrieval Attacks

Opening hook. The 2024 NVIDIA NeMo retrieval attack: a researcher demonstrated that a RAG-augmented NeMo model could be caused to exfiltrate conversation history by injecting a document into the vector store that contained instructions to "include all prior context" in the response. Walk through the attack; students who have read the Carlini et al. paper will recognize the underlying mechanism.

Pacing (3 hr lecture):

0:00-0:45: RAG architecture. Why it exists; the vector store as a trust surface. The library-function hooking analogy.
0:45-1:30: Two attack classes. Class 1 (instruction injection) and Class 2 (near-neighbor poisoning). Live demo of a ChromaDB setup and a Class 1 document injection.
1:30-2:15: Defense landscape. Metadata filtering; provenance-based trust; content analysis at ingestion. Defense-evasion: why score-based filtering is insufficient.
2:15-3:00: Christian Ch 11 ("Cooperation"). The problem of trusting external knowledge sources. The question "is this document safe to include in my context?" is a normative question, not a technical one.

Common student stumbles:

ChromaDB setup takes longer than expected. The all-MiniLM-L6-v2 embedding model is downloaded on first use. Students should run rag_setup.py at least 30 minutes before the lab session to ensure the model is cached.
Students confuse Class 1 and Class 2 attacks. Class 1 embeds explicit instructions; Class 2 embeds misinformation. The distinction matters because the defenses differ: content scanning catches Class 1; provenance filtering is needed for Class 2.
The defense-evasion step is skipped. Lab 10.5 (demonstrating that the trust-score filter does not stop Class 1 when the poisoned document has a high self-reported score) is the most important step in the lab. It is also the step students most often skip. Require it explicitly in the grading rubric check.

Lab 10.1 timing (5 hr):

ChromaDB setup + legitimate collection build: 0.75 hr
Baseline queries (5 queries × 3 results each): 0.25 hr
Class 1 injection + measurement: 1 hr
Class 2 near-neighbor injection + measurement: 1 hr
Defense implementation (score filter + provenance filter): 0.75 hr
Defense-evasion demonstration (Lab 10.5): 0.5 hr
OWASP LLM08 audit update: 0.75 hr

Module 11: Defenses -- Belt-5 Deployment Posture

Opening hook. The MLSecOps framing is the hook. Ask students: "How many of the ten OWASP categories from Lab 9 does your current DVLA configuration defend against?" Most will say 4-6. "What would it take to get to 10?" The answer is Module 11.

This module has no standalone lab. It is a synthesis module. The assessment vehicle is the capstone Tier 2 rubric (specifically the "exploit/defense/eval coherence" criterion, which requires that every attack in the capstone has a corresponding defense posture documented).

Pacing (3 hr lecture):

0:00-0:45: The Belt-5 defense stack table from Module 11.2. Walk through each layer and connect it to the lab where the student implemented it.
0:45-1:30: The MLSecOps four principles. Model provenance; behavioral regression testing; continuous monitoring; incident response. For each principle, ask: "Does your DVLA testbed have this? What would it take to add it?"
1:30-2:15: RSP v3.0 and DeepMind FSF. Guided reading of the key sections. What capabilities does the RSP consider dangerous? How does the Modules 8-10 attack landscape map to the RSP threat model?
2:15-3:00: Christian Ch 12 ("Normativity"). The question is not "does the system have values" -- all trained systems have the values their training instilled. The question is "does the system hold those values under adversarial pressure?" This is what Belt-5 deployment posture is designed to test.

Common student stumbles:

Students treat Module 11 as a "rest week" because there is no lab. Module 11 is the capstone-prep week. Students who leave Module 11 without having updated their OWASP audit report through LLM08 will struggle with the capstone.
The threat-actor capability matrix is incomplete. Require students to fill in the matrix before the Week 12 capstone session. The matrix is the threat model for the capstone; a capstone without a threat model is not Belt-5.

Suggested activity (instead of lab): Student-led presentations. Each student presents their Module 9 OWASP audit report to the cohort (10 minutes + 5 minutes Q&A). The cohort asks: "What was your most significant finding? What is your top remediation priority?" This surfaces assumptions and gaps that individual students miss.

Module 12: Capstone

There is no lecture content in Module 12. Three lab sessions; one Q&A per session.

Week 12 lab session: Track selection is final by end of session. Tier 1 gate code must run (even if it produces only a stub output). Students who leave Week 12 without running Tier 1 will almost certainly not finish on time.

Week 13 lab session: Instructor-run Tier 1 gate check. Run chain_exploit.py --demo-mode (Track A), jupyter nbconvert --execute sae_analysis.ipynb (Track B), or eval_runner.py --models all (Track C) on the instructor's machine. If it fails: grade the blocker immediately and give the student 24 hours to fix it. If it still fails after 24 hours, the capstone is in serious trouble.

Week 14 lab session (demo day): 15-minute slot per student. Live Tier 1 demonstration + 10 minutes of Q&A. The Q&A is the primary Tier 2 assessment instrument; questions focus on: "Why does your chain work? What prevents detection? What is the substrate analog? What would you add to make it more robust?"

Grading timeline: 10 days from submission for Tier 2 grading. Tier 1 (pass/fail) is determined on demo day.

Grade distribution from prior cohorts:

~60% of students complete a strong Tier 1 with adequate Tier 2 (70-85 pts)
~25% deliver an outstanding substrate-language mapping and earn 90+ pts
~15% encounter Tier 1 blockers that reduce the cap to 50 pts The most common Tier 1 failure mode: the capstone relies on a model that is no longer available or has changed behavior between development and demo day. Require students to pin model versions and test on the final model version at least 48 hours before demo day.

Assessment Notes

Module essay (Module 4) grading guidance

The essay is graded holistically on three criteria:

Technical accuracy (is the substrate-language analogy mechanistically correct?)
Specificity (are specific attacks named with specific mechanisms, not generic descriptions?)
Christian integration (is the alignment connection made substantively, not decoratorially?)

A common error: students write about the structural analogy in the abstract without tying it to the specific Labs 2.1 and 3.1 results. The essay rubric requires specific references to the student's own experimental data.

Lab report grading guidance

All lab reports are graded with a primary emphasis on intellectual honesty: a lab report that accurately documents what failed and why earns more credit than one that presents only successes. Specifically:

Lab 3.1 Part B: the 3 most significant findings should include at least one failure (a model that resisted the injection)
Lab 5.1 Part C: if the tool-chain hijack chain broke at any point, that is a graded finding -- document it
Lab 7.5 Part B: a null result (no refusal rate shift) with good analysis earns full credit

Capstone grading guidance

The Tier 1 gate is a hard gate. A capstone that does not run gets a maximum of 50/100 regardless of prose quality. The goal is to prevent well-written analyses of hypothetical results -- the capstone requires real execution.

For Dimension B (intellectual honesty), the most common failure mode is selective reporting: students show results for the cases where their attack worked but omit the cases where it didn't. Require all test cases to be reported, not just the successful ones.