AI-301: AI & Agentic Security III — Adversarial Capstone · AI-301

Course Code: VCA-AI-301
Track position: Part-III AI & Agentic Security Track, Module 3 of 3
Prerequisites: VCA-AI-201 (production-pentest discipline; CVE reproduction; MITRE ATLAS fluency) + VCA-CSA-201 (Tang Nano + Virtus OS v1 substrate mitigations); VCA-RE-101 strongly recommended
Belt: 5/5 Deep Specialised
Duration: ~12 weeks (~150 hr: ~28 lec / ~52 lab / ~70 indep)
Credential: VCA-AI-301 Certificate of Completion

Mission

AI-301 is the capstone of the AI track and the bridge between the academy's two security strands. Students arrive having built a complete computing stack from NAND to OS (CSA-101), hardened it against classical memory-corruption attacks (CSA-201), and reproduced production CVEs in agentic systems (AI-201). AI-301 is where the two literacies -- substrate (silicon) and language (LLM) -- converge into a single mental model.

The course's thesis: agentic-system security is memory-corruption at the semantic layer. An LLM agent that cannot distinguish system prompt from user input is the language-level cousin of a CPU that cannot distinguish instruction from data. The 2024-2026 Anthropic Sparse Autoencoder corpus makes this metaphor literal: activation steering and SAE feature clamping are real operations on real systems -- the language-level cousins of memory corruption at the silicon layer.

Foundational Anchors

Primary narrative anchor:

Anchor	Track role	Assignment
Brian Christian, The Alignment Problem (Norton, 2020)	Full-course narrative anchor; read in full	Prophecy (Modules 1-4); Agency (Modules 5-7); Normativity (Modules 8-12)

Substrate-companion path:

Anchor	Track role	Assignment
Andrej Karpathy, Neural Networks: Zero to Hero (YouTube + GitHub)	Build-it-yourself substrate companion	nanoGPT video + micrograd video
Stanford CS336 (Spring 2025; Language Modeling from Scratch)	Advanced transformer internals	Lectures 1-3 + selected labs

Required primary papers (not optional at Belt-5):

Paper	Authors	Module
"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (transformer-circuits.pub, Oct 2023)	Bricken et al., Anthropic	4.5
"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (transformer-circuits.pub, May 2024)	Templeton et al., Anthropic	4.5
"Representation Engineering: A Top-Down Approach to AI Transparency" (arxiv 2310.01405, 2023)	Zou et al.	4.5
"Backdoor Activation Attack: Attack Large Language Models via Neuron Manipulation" (ACL 2024)	Xu et al.	4.5
"Fine-Tuning Aligned Language Models Compromises Safety, Even with Users Not Intending To!" (arxiv 2310.03693, 2023)	Qi et al.	7.5
"Harmful Fine-Tuning Attacks and Defenses for Large Language Models: A Survey" (arxiv 2409.18169, 2024)	He et al.	7.5
"P-Trojan: Persistent Backdoor via Stealthy Perturbation Training" (arxiv 2505.17601, 2025)	--	7.5
"Booster: Tackling Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Perturbation" (ICLR 2025 oral)	--	7.5
Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026)	Anthropic	11
DeepMind Frontier Safety Framework	DeepMind	11

Cross-chapter handouts (pre-read before Module 1):

handouts/cross-chapter-llm-tutor-3-layer-architecture-ai-301.md -- the substrate-versus-language bridge; treat the DVLA/tutor as a programming-language artifact, not a chatbot
handouts/cross-chapter-llm-asi-vocabulary-reference.md -- OWASP LLM + MITRE ATLAS shared vocabulary; still the reference at Belt-5

Module Map

Module	Topic	Substrate analogue	Language analogue	Lec	Lab	Indep
1	Re-grounding: substrate vs language vulnerabilities	All four memory-safety properties	All four OWASP/ATLAS analogues	3 hr	3 hr	6 hr
2	Stack-smash on Virtus OS v1 (substrate primer)	W^X violation / stack overflow	Prompt injection as baseline (comparison)	2 hr	4 hr	5 hr
3	Prompt injection on DVLA (language primer)	Stack-smash as comparison	L3-regression prompt-injection findings	2 hr	4 hr	5 hr
4	The metaphor named precisely	Instruction-vs-data confusion	Context-window confusion	2 hr	2 hr	6 hr
4.5	Mechanistic interpretability: the analogy literalized	Memory corruption = value at address	SAE feature clamping = activation at node	3 hr	5 hr	8 hr
5	ROP at substrate; tool-chain hijack at language	Return-Oriented Programming	Tool-chain hijack via indirect control	2 hr	4 hr	6 hr
6	Type confusion at substrate; type confusion at language	C void* type confusion	LLM output-as-untyped-string exploit	2 hr	4 hr	5 hr
7	Side channels: timing at substrate; latency at language	Cache timing (Flush+Reload)	Latency-fingerprint agentic side channel	2 hr	3 hr	5 hr
7.5	Fine-tuning attacks as supply-chain compromise	Firmware supply-chain compromise	Weight-layer supply-chain compromise	3 hr	5 hr	8 hr
8	Multi-agent attack surfaces	Lateral movement: compromised dependency pivots to orchestrator	Indirect injection via tool output; trust-label defense	3 hr	5 hr	6 hr
9	OWASP LLM Top 10 full audit	CVE audit discipline applied to LLM systems	Structured OWASP audit of DVLA; ATLAS cross-map	3 hr	5 hr	6 hr
10	RAG poisoning and retrieval attacks	Library hooking: poisoned dependency at retrieval layer	ChromaDB vector poisoning; metadata-filter + provenance defense	3 hr	5 hr	6 hr
11	Defenses: Belt-5 deployment posture	Defense-in-depth; MLSecOps framework	RSP v3.0 + DeepMind FSF review; threat-actor capability matrix	3 hr	--	6 hr
12	Capstone preparation and course synthesis	Full engagement scenario	Track A: exploit chain; Track B: SAE defense; Track C: eval design	--	12 hr	6 hr
Total				~28 hr	~62 hr (incl cap)	~81 hr

Lab Index

Lab	Module	Topic	Points
Lab 2.1	2	Stack-smash on Virtus OS v1; annotate at ATLAS level	20
Lab 3.1	3	L3-regression prompt-injection on DVLA (9-model sweep)	20
Lab 4.5	4.5	SAE feature identification + activation steering on Llama-2-7B	25
Lab 5.1	5	ROP chain on Virtus OS + paired tool-chain hijack on DVLA	25
Lab 6.1	6	Type-confusion exploits: void* on substrate; untyped output on language	20
Lab 7.1	7	Cache-timing demo + latency-channel fingerprinting on agentic system	20
Lab 7.5	7.5	Fine-tune jailbreak (10 examples); defend with Booster; measure persistence	25
Lab 8.1	8	Multi-agent DVLA topology; direct, indirect, and agent-to-agent injection; trust-label defense	20
Lab 9.1	9	Structured OWASP LLM Top 10 audit of DVLA testbed; ATLAS technique cross-map	20
Lab 10.1	10	ChromaDB vector poisoning (Class 1 + Class 2); metadata-filter + provenance-filter defense	20
Capstone	12	3-track slate (Track A exploit chain / Track B SAE defense / Track C eval design)	100
Total (Labs 2.1-10.1)			215 pts

Assessment Overview

Labs 2.1 through 10.1: 195 points across 9 graded labs. Labs 4.5 and 7.5 are the deepest (25 points each); they correspond to the two NEW modules with the heaviest primary-paper reading requirements.

Capstone (Module 12): 100 points. Two-tier structure. Tier 1: functional gate (project must work). Tier 2: 40% exploit/defense/eval coherence + 30% substrate-language cross-mapping clarity + 30% report quality. B- minimum (70 pts) for the certificate endorsement.

The 40/30/30 rubric applies to ALL three capstone tracks. Track A (exploit chain) and Track B (SAE defense) and Track C (RSP eval) are graded on the same three dimensions because the cross-mapping clarity criterion is track-agnostic: every AI-301 capstone must demonstrate the substrate-language pairing, not just execute one side of it.

Toolchain Diary: AI-301 Originating Entries

Tool	First met	Purpose
DVLA harness (`virtus-llm-owasp`)	Module 2	Academy's deliberately-vulnerable LLM application; 9-model regression baseline
Multi-substrate exploit-chain runner	Module 5	Runs exploit across DVLA + Virtus OS in one harness
Attack-narrative documentation template	Module 5	Standard format for adversarial reports at coordinated-disclosure quality
TransformerLens	Module 4.5	Mechanistic-interpretability library; activation hooks, attention patterns, interventions
Neuronpedia	Module 4.5	Community SAE feature dashboards for Llama-2-7B and related models
PyTorch activation-steering scaffold	Module 4.5	RepE / ActAdd / steering-vector-addition reference implementation
Booster (ICLR 2025 oral)	Module 7.5	Safety-aware fine-tuning defense recipe; attenuates harmful perturbation
HarmBench harness	Module 7.5	Standardized evaluation; 400 behaviors × 7 risk categories; reused from AI-201 at depth
Adversarial-example crafter	Module 10	FGSM + PGD perturbation crafting against image classifiers
9-model regression runner	Module 3	Reused from AI-201 with DVLA-specific customization for L3-regression
Threat-actor capability matrix	Module 11	Maps AI-301 attacks to threat-actor tiers; cross-referenced against RSP v3.0 + DeepMind FSF
RSP-style capability-eval scaffold	Capstone (Track C)	Mirrors Anthropic's pre-deployment evaluation protocol
Substrate-language mapping tool	Module 1	30+ row vulnerability map; literalized post-Module-4.5 with SAE-feature entries

Architecture Comparison Sidebars

Each module pairs a substrate-level design choice against its language-level cousin. AI-301 uses this sidebar pattern at capstone depth -- the sidebars are not background color; they are the pedagogical core.

Module	Substrate side	Language side
1	W^X / ASLR / canaries / CFI	Prompt isolation / context-window isolation / output guards / tool-calling constraints
2	Stack overflow: what happens when ret addr is overwritten	Prompt injection: what happens when system prompt boundary is overwritten
4.5	Memory word at address X: corrupt it, change behavior	SAE feature at node X: clamp it, change behavior
5	ROP: chain gadgets to build arbitrary compute	Tool-chain hijack: chain tool calls to build arbitrary capability
6	void* cast: compiler loses type; runtime crashes or executes arbitrary code	LLM string output: no type; downstream code processes as trusted instruction
7	Cache timing: Flush+Reload leaks memory-access pattern	Latency timing: inter-token delay leaks reasoning pattern
7.5	Firmware update supply chain: malicious image bypasses secure boot	Fine-tuning supply chain: malicious examples bypass RLHF alignment
8	Defense layers: W^X + ASLR + canaries + CFI = 4 independent barriers	Defense layers: isolation + validation + monitoring + human-in-the-loop = 4 independent barriers

v0.1 Scope Note

v0.1 shipped Modules 1-7.5 (the substrate-language thesis arc, through fine-tuning attacks), Labs 2.1-7.5 (7 graded labs), CAPSTONE.md, and INSTRUCTOR-GUIDE.md (Modules 1-7.5 full depth). v0.2 completed Modules 8-12, the remaining labs, and the full INSTRUCTOR-GUIDE extension; all are live in this classroom.