Classroom Glossary Public page

AI-301: AI & Agentic Security III — Adversarial Capstone

1,513 words

Course Code: VCA-AI-301
Track position: Part-III AI & Agentic Security Track, Module 3 of 3
Prerequisites: VCA-AI-201 (production-pentest discipline; CVE reproduction; MITRE ATLAS fluency) + VCA-CSA-201 (Tang Nano + Virtus OS v1 substrate mitigations); VCA-RE-101 strongly recommended
Belt: 5/5 Deep Specialised
Duration: ~12 weeks (~150 hr: ~28 lec / ~52 lab / ~70 indep)
Credential: VCA-AI-301 Certificate of Completion


Mission

AI-301 is the capstone of the AI track and the bridge between the academy's two security strands. Students arrive having built a complete computing stack from NAND to OS (CSA-101), hardened it against classical memory-corruption attacks (CSA-201), and reproduced production CVEs in agentic systems (AI-201). AI-301 is where the two literacies -- substrate (silicon) and language (LLM) -- converge into a single mental model.

The course's thesis: agentic-system security is memory-corruption at the semantic layer. An LLM agent that cannot distinguish system prompt from user input is the language-level cousin of a CPU that cannot distinguish instruction from data. The 2024-2026 Anthropic Sparse Autoencoder corpus makes this metaphor literal: activation steering and SAE feature clamping are real operations on real systems -- the language-level cousins of memory corruption at the silicon layer.


Foundational Anchors

Primary narrative anchor:

Anchor Track role Assignment
Brian Christian, The Alignment Problem (Norton, 2020) Full-course narrative anchor; read in full Prophecy (Modules 1-4); Agency (Modules 5-7); Normativity (Modules 8-12)

Substrate-companion path:

Anchor Track role Assignment
Andrej Karpathy, Neural Networks: Zero to Hero (YouTube + GitHub) Build-it-yourself substrate companion nanoGPT video + micrograd video
Stanford CS336 (Spring 2025; Language Modeling from Scratch) Advanced transformer internals Lectures 1-3 + selected labs

Required primary papers (not optional at Belt-5):

Paper Authors Module
"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (transformer-circuits.pub, Oct 2023) Bricken et al., Anthropic 4.5
"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (transformer-circuits.pub, May 2024) Templeton et al., Anthropic 4.5
"Representation Engineering: A Top-Down Approach to AI Transparency" (arxiv 2310.01405, 2023) Zou et al. 4.5
"Backdoor Activation Attack: Attack Large Language Models via Neuron Manipulation" (ACL 2024) Xu et al. 4.5
"Fine-Tuning Aligned Language Models Compromises Safety, Even with Users Not Intending To!" (arxiv 2310.03693, 2023) Qi et al. 7.5
"Harmful Fine-Tuning Attacks and Defenses for Large Language Models: A Survey" (arxiv 2409.18169, 2024) He et al. 7.5
"P-Trojan: Persistent Backdoor via Stealthy Perturbation Training" (arxiv 2505.17601, 2025) -- 7.5
"Booster: Tackling Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Perturbation" (ICLR 2025 oral) -- 7.5
Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026) Anthropic 11
DeepMind Frontier Safety Framework DeepMind 11

Cross-chapter handouts (pre-read before Module 1):

  • handouts/cross-chapter-llm-tutor-3-layer-architecture-ai-301.md -- the substrate-versus-language bridge; treat the DVLA/tutor as a programming-language artifact, not a chatbot
  • handouts/cross-chapter-llm-asi-vocabulary-reference.md -- OWASP LLM + MITRE ATLAS shared vocabulary; still the reference at Belt-5

Module Map

Module Topic Substrate analogue Language analogue Lec Lab Indep
1 Re-grounding: substrate vs language vulnerabilities All four memory-safety properties All four OWASP/ATLAS analogues 3 hr 3 hr 6 hr
2 Stack-smash on Virtus OS v1 (substrate primer) W^X violation / stack overflow Prompt injection as baseline (comparison) 2 hr 4 hr 5 hr
3 Prompt injection on DVLA (language primer) Stack-smash as comparison L3-regression prompt-injection findings 2 hr 4 hr 5 hr
4 The metaphor named precisely Instruction-vs-data confusion Context-window confusion 2 hr 2 hr 6 hr
4.5 Mechanistic interpretability: the analogy literalized Memory corruption = value at address SAE feature clamping = activation at node 3 hr 5 hr 8 hr
5 ROP at substrate; tool-chain hijack at language Return-Oriented Programming Tool-chain hijack via indirect control 2 hr 4 hr 6 hr
6 Type confusion at substrate; type confusion at language C void* type confusion LLM output-as-untyped-string exploit 2 hr 4 hr 5 hr
7 Side channels: timing at substrate; latency at language Cache timing (Flush+Reload) Latency-fingerprint agentic side channel 2 hr 3 hr 5 hr
7.5 Fine-tuning attacks as supply-chain compromise Firmware supply-chain compromise Weight-layer supply-chain compromise 3 hr 5 hr 8 hr
8 Multi-agent attack surfaces Lateral movement: compromised dependency pivots to orchestrator Indirect injection via tool output; trust-label defense 3 hr 5 hr 6 hr
9 OWASP LLM Top 10 full audit CVE audit discipline applied to LLM systems Structured OWASP audit of DVLA; ATLAS cross-map 3 hr 5 hr 6 hr
10 RAG poisoning and retrieval attacks Library hooking: poisoned dependency at retrieval layer ChromaDB vector poisoning; metadata-filter + provenance defense 3 hr 5 hr 6 hr
11 Defenses: Belt-5 deployment posture Defense-in-depth; MLSecOps framework RSP v3.0 + DeepMind FSF review; threat-actor capability matrix 3 hr -- 6 hr
12 Capstone preparation and course synthesis Full engagement scenario Track A: exploit chain; Track B: SAE defense; Track C: eval design -- 12 hr 6 hr
Total ~28 hr ~62 hr (incl cap) ~81 hr

Lab Index

Lab Module Topic Points
Lab 2.1 2 Stack-smash on Virtus OS v1; annotate at ATLAS level 20
Lab 3.1 3 L3-regression prompt-injection on DVLA (9-model sweep) 20
Lab 4.5 4.5 SAE feature identification + activation steering on Llama-2-7B 25
Lab 5.1 5 ROP chain on Virtus OS + paired tool-chain hijack on DVLA 25
Lab 6.1 6 Type-confusion exploits: void* on substrate; untyped output on language 20
Lab 7.1 7 Cache-timing demo + latency-channel fingerprinting on agentic system 20
Lab 7.5 7.5 Fine-tune jailbreak (10 examples); defend with Booster; measure persistence 25
Lab 8.1 8 Multi-agent DVLA topology; direct, indirect, and agent-to-agent injection; trust-label defense 20
Lab 9.1 9 Structured OWASP LLM Top 10 audit of DVLA testbed; ATLAS technique cross-map 20
Lab 10.1 10 ChromaDB vector poisoning (Class 1 + Class 2); metadata-filter + provenance-filter defense 20
Capstone 12 3-track slate (Track A exploit chain / Track B SAE defense / Track C eval design) 100
Total (Labs 2.1-10.1) 215 pts

Assessment Overview

Labs 2.1 through 10.1: 195 points across 9 graded labs. Labs 4.5 and 7.5 are the deepest (25 points each); they correspond to the two NEW modules with the heaviest primary-paper reading requirements.

Capstone (Module 12): 100 points. Two-tier structure. Tier 1: functional gate (project must work). Tier 2: 40% exploit/defense/eval coherence + 30% substrate-language cross-mapping clarity + 30% report quality. B- minimum (70 pts) for the certificate endorsement.

The 40/30/30 rubric applies to ALL three capstone tracks. Track A (exploit chain) and Track B (SAE defense) and Track C (RSP eval) are graded on the same three dimensions because the cross-mapping clarity criterion is track-agnostic: every AI-301 capstone must demonstrate the substrate-language pairing, not just execute one side of it.


Toolchain Diary: AI-301 Originating Entries

Tool First met Purpose
DVLA harness (virtus-llm-owasp) Module 2 Academy's deliberately-vulnerable LLM application; 9-model regression baseline
Multi-substrate exploit-chain runner Module 5 Runs exploit across DVLA + Virtus OS in one harness
Attack-narrative documentation template Module 5 Standard format for adversarial reports at coordinated-disclosure quality
TransformerLens Module 4.5 Mechanistic-interpretability library; activation hooks, attention patterns, interventions
Neuronpedia Module 4.5 Community SAE feature dashboards for Llama-2-7B and related models
PyTorch activation-steering scaffold Module 4.5 RepE / ActAdd / steering-vector-addition reference implementation
Booster (ICLR 2025 oral) Module 7.5 Safety-aware fine-tuning defense recipe; attenuates harmful perturbation
HarmBench harness Module 7.5 Standardized evaluation; 400 behaviors × 7 risk categories; reused from AI-201 at depth
Adversarial-example crafter Module 10 FGSM + PGD perturbation crafting against image classifiers
9-model regression runner Module 3 Reused from AI-201 with DVLA-specific customization for L3-regression
Threat-actor capability matrix Module 11 Maps AI-301 attacks to threat-actor tiers; cross-referenced against RSP v3.0 + DeepMind FSF
RSP-style capability-eval scaffold Capstone (Track C) Mirrors Anthropic's pre-deployment evaluation protocol
Substrate-language mapping tool Module 1 30+ row vulnerability map; literalized post-Module-4.5 with SAE-feature entries

Architecture Comparison Sidebars

Each module pairs a substrate-level design choice against its language-level cousin. AI-301 uses this sidebar pattern at capstone depth -- the sidebars are not background color; they are the pedagogical core.

Module Substrate side Language side
1 W^X / ASLR / canaries / CFI Prompt isolation / context-window isolation / output guards / tool-calling constraints
2 Stack overflow: what happens when ret addr is overwritten Prompt injection: what happens when system prompt boundary is overwritten
4.5 Memory word at address X: corrupt it, change behavior SAE feature at node X: clamp it, change behavior
5 ROP: chain gadgets to build arbitrary compute Tool-chain hijack: chain tool calls to build arbitrary capability
6 void* cast: compiler loses type; runtime crashes or executes arbitrary code LLM string output: no type; downstream code processes as trusted instruction
7 Cache timing: Flush+Reload leaks memory-access pattern Latency timing: inter-token delay leaks reasoning pattern
7.5 Firmware update supply chain: malicious image bypasses secure boot Fine-tuning supply chain: malicious examples bypass RLHF alignment
8 Defense layers: W^X + ASLR + canaries + CFI = 4 independent barriers Defense layers: isolation + validation + monitoring + human-in-the-loop = 4 independent barriers

v0.1 Scope Note

v0.1 shipped Modules 1-7.5 (the substrate-language thesis arc, through fine-tuning attacks), Labs 2.1-7.5 (7 graded labs), CAPSTONE.md, and INSTRUCTOR-GUIDE.md (Modules 1-7.5 full depth). v0.2 completed Modules 8-12, the remaining labs, and the full INSTRUCTOR-GUIDE extension; all are live in this classroom.