Course Code: VCA-AI-301
Track position: Part-III AI & Agentic Security Track, Module 3 of 3
Prerequisites: VCA-AI-201 (production-pentest discipline; CVE reproduction; MITRE ATLAS fluency) + VCA-CSA-201 (Tang Nano + Virtus OS v1 substrate mitigations); VCA-RE-101 strongly recommended
Belt: 5/5 Deep Specialised
Duration: ~12 weeks (~150 hr: ~28 lec / ~52 lab / ~70 indep)
Credential: VCA-AI-301 Certificate of Completion
Mission
AI-301 is the capstone of the AI track and the bridge between the academy's two security strands. Students arrive having built a complete computing stack from NAND to OS (CSA-101), hardened it against classical memory-corruption attacks (CSA-201), and reproduced production CVEs in agentic systems (AI-201). AI-301 is where the two literacies -- substrate (silicon) and language (LLM) -- converge into a single mental model.
The course's thesis: agentic-system security is memory-corruption at the semantic layer. An LLM agent that cannot distinguish system prompt from user input is the language-level cousin of a CPU that cannot distinguish instruction from data. The 2024-2026 Anthropic Sparse Autoencoder corpus makes this metaphor literal: activation steering and SAE feature clamping are real operations on real systems -- the language-level cousins of memory corruption at the silicon layer.
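The clamping operation named in the thesis can be sketched at toy scale. The block below is a minimal illustration in plain Python, not code from the course labs or from the Anthropic papers: the 2-dimensional activation, the three-feature dictionary, the tied-weight encoder, and the helper names `encode` and `clamp_feature` are all assumptions made for the example. It reads one feature's activation, then shifts the activation along that feature's decoder direction until the feature's contribution equals the clamped value. Under these assumptions, this is the same as re-decoding with that one feature overwritten and the reconstruction error left in place.

```python
# Toy sparse-autoencoder clamp. All weights and names are illustrative assumptions.

# Each row is one dictionary feature's direction in a 2-d activation space.
DECODER = [
    [1.0, 0.0],  # feature 0
    [0.0, 1.0],  # feature 1
    [0.6, 0.8],  # feature 2 (unit-norm diagonal direction)
]


def encode(activation):
    """ReLU(activation . direction) per feature: a tied-weight toy encoder."""
    return [
        max(0.0, sum(a * d for a, d in zip(activation, row)))
        for row in DECODER
    ]


def clamp_feature(activation, index, value):
    """Shift the activation so the chosen feature's contribution equals `value`."""
    features = encode(activation)
    delta = value - features[index]  # how far the feature must move
    direction = DECODER[index]       # where that feature writes in activation space
    return [a + delta * d for a, d in zip(activation, direction)]


# Feature 1 is silent on this input; clamping it to 5.0 rewrites the activation.
steered = clamp_feature([1.0, 0.0], index=1, value=5.0)
print(steered)  # [1.0, 5.0]
```

The substrate parallel is direct: `clamp_feature` writes a chosen value at a chosen location, and every computation downstream of that location inherits the change.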
Foundational Anchors
Primary narrative anchor:
| Anchor | Track role | Assignment |
|---|---|---|
| Brian Christian, The Alignment Problem (Norton, 2020) | Full-course narrative anchor; read in full | Prophecy (Modules 1-4); Agency (Modules 5-7); Normativity (Modules 8-12) |
Substrate-companion path:
| Anchor | Track role | Assignment |
|---|---|---|
| Andrej Karpathy, Neural Networks: Zero to Hero (YouTube + GitHub) | Build-it-yourself substrate companion | nanoGPT video + micrograd video |
| Stanford CS336 (Spring 2025; Language Modeling from Scratch) | Advanced transformer internals | Lectures 1-3 + selected labs |
Required primary papers (not optional at Belt-5):
| Paper | Authors | Module |
|---|---|---|
| "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (transformer-circuits.pub, Oct 2023) | Bricken et al., Anthropic | 4.5 |
| "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (transformer-circuits.pub, May 2024) | Templeton et al., Anthropic | 4.5 |
| "Representation Engineering: A Top-Down Approach to AI Transparency" (arxiv 2310.01405, 2023) | Zou et al. | 4.5 |
| "Backdoor Activation Attack: Attack Large Language Models via Neuron Manipulation" (ACL 2024) | Xu et al. | 4.5 |
| "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" (arxiv 2310.03693, 2023) | Qi et al. | 7.5 |
| "Harmful Fine-Tuning Attacks and Defenses for Large Language Models: A Survey" (arxiv 2409.18169, 2024) | Huang et al. | 7.5 |
| "P-Trojan: Persistent Backdoor via Stealthy Perturbation Training" (arxiv 2505.17601, 2025) | -- | 7.5 |
| "Booster: Tackling Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Perturbation" (ICLR 2025 oral) | -- | 7.5 |
| Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026) | Anthropic | 11 |
| DeepMind Frontier Safety Framework | DeepMind | 11 |
Cross-chapter handouts (pre-read before Module 1):
handouts/cross-chapter-llm-tutor-3-layer-architecture-ai-301.md -- the substrate-versus-language bridge; treat the DVLA/tutor as a programming-language artifact, not a chatbot
handouts/cross-chapter-llm-asi-vocabulary-reference.md -- OWASP LLM + MITRE ATLAS shared vocabulary; still the reference at Belt-5
Module Map
| Module | Topic | Substrate analogue | Language analogue | Lec | Lab | Indep |
|---|---|---|---|---|---|---|
| 1 | Re-grounding: substrate vs language vulnerabilities | All four memory-safety properties | All four OWASP/ATLAS analogues | 3 hr | 3 hr | 6 hr |
| 2 | Stack-smash on Virtus OS v1 (substrate primer) | W^X violation / stack overflow | Prompt injection as baseline (comparison) | 2 hr | 4 hr | 5 hr |
| 3 | Prompt injection on DVLA (language primer) | Stack-smash as comparison | L3-regression prompt-injection findings | 2 hr | 4 hr | 5 hr |
| 4 | The metaphor named precisely | Instruction-vs-data confusion | Context-window confusion | 2 hr | 2 hr | 6 hr |
| 4.5 | Mechanistic interpretability: the analogy literalized | Memory corruption = value at address | SAE feature clamping = activation at node | 3 hr | 5 hr | 8 hr |
| 5 | ROP at substrate; tool-chain hijack at language | Return-Oriented Programming | Tool-chain hijack via indirect control | 2 hr | 4 hr | 6 hr |
| 6 | Type confusion at substrate; type confusion at language | C void* type confusion | LLM output-as-untyped-string exploit | 2 hr | 4 hr | 5 hr |
| 7 | Side channels: timing at substrate; latency at language | Cache timing (Flush+Reload) | Latency-fingerprint agentic side channel | 2 hr | 3 hr | 5 hr |
| 7.5 | Fine-tuning attacks as supply-chain compromise | Firmware supply-chain compromise | Weight-layer supply-chain compromise | 3 hr | 5 hr | 8 hr |
| 8 | Multi-agent attack surfaces | Lateral movement: compromised dependency pivots to orchestrator | Indirect injection via tool output; trust-label defense | 3 hr | 5 hr | 6 hr |
| 9 | OWASP LLM Top 10 full audit | CVE audit discipline applied to LLM systems | Structured OWASP audit of DVLA; ATLAS cross-map | 3 hr | 5 hr | 6 hr |
| 10 | RAG poisoning and retrieval attacks | Library hooking: poisoned dependency at retrieval layer | ChromaDB vector poisoning; metadata-filter + provenance defense | 3 hr | 5 hr | 6 hr |
| 11 | Defenses: Belt-5 deployment posture | Defense-in-depth; MLSecOps framework | RSP v3.0 + DeepMind FSF review; threat-actor capability matrix | 3 hr | -- | 6 hr |
| 12 | Capstone preparation and course synthesis | Full engagement scenario | Track A: exploit chain; Track B: SAE defense; Track C: eval design | -- | 12 hr | 6 hr |
| Total | | | | ~28 hr | ~62 hr (incl cap) | ~81 hr |
Lab Index
| Lab | Module | Topic | Points |
|---|---|---|---|
| Lab 2.1 | 2 | Stack-smash on Virtus OS v1; annotate at ATLAS level | 20 |
| Lab 3.1 | 3 | L3-regression prompt-injection on DVLA (9-model sweep) | 20 |
| Lab 4.5 | 4.5 | SAE feature identification + activation steering on Llama-2-7B | 25 |
| Lab 5.1 | 5 | ROP chain on Virtus OS + paired tool-chain hijack on DVLA | 25 |
| Lab 6.1 | 6 | Type-confusion exploits: void* on substrate; untyped output on language | 20 |
| Lab 7.1 | 7 | Cache-timing demo + latency-channel fingerprinting on agentic system | 20 |
| Lab 7.5 | 7.5 | Fine-tune jailbreak (10 examples); defend with Booster; measure persistence | 25 |
| Lab 8.1 | 8 | Multi-agent DVLA topology; direct, indirect, and agent-to-agent injection; trust-label defense | 20 |
| Lab 9.1 | 9 | Structured OWASP LLM Top 10 audit of DVLA testbed; ATLAS technique cross-map | 20 |
| Lab 10.1 | 10 | ChromaDB vector poisoning (Class 1 + Class 2); metadata-filter + provenance-filter defense | 20 |
| Capstone | 12 | 3-track slate (Track A exploit chain / Track B SAE defense / Track C eval design) | 100 |
| Total (Labs 2.1-10.1) | | | 215 pts |
Assessment Overview
Labs 2.1 through 10.1: 215 points across 10 graded labs. Labs 4.5 and 7.5 are the deepest (25 points each); they correspond to the two new modules with the heaviest primary-paper reading requirements.
Capstone (Module 12): 100 points. Two-tier structure. Tier 1: functional gate (project must work). Tier 2: 40% exploit/defense/eval coherence + 30% substrate-language cross-mapping clarity + 30% report quality. B- minimum (70 pts) for the certificate endorsement.
The 40/30/30 rubric applies to all three capstone tracks. Track A (exploit chain), Track B (SAE defense), and Track C (RSP-style eval design) are graded on the same three dimensions because the cross-mapping clarity criterion is track-agnostic: every AI-301 capstone must demonstrate the substrate-language pairing, not just execute one side of it.
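The two-tier arithmetic can be made concrete with a short sketch. The field names, the 0-100 scale for each sub-score, and the choice to return zero points when the functional gate fails are assumptions for illustration; the syllabus fixes only the gate, the 40/30/30 weights, and the 70-point minimum.

```python
# Capstone scoring sketch. Field names and the 0-100 sub-score scale are assumptions.
WEIGHTS = {"coherence": 40, "cross_mapping": 30, "report": 30}  # percent
CERT_MINIMUM = 70  # B- threshold for the certificate endorsement


def capstone_score(passes_gate, subscores):
    """Return (points out of 100, eligible for the certificate endorsement)."""
    if not passes_gate:
        # Tier 1: the project must work before Tier 2 is scored.
        # Returning zero here is an assumption, not stated in the syllabus.
        return 0.0, False
    points = sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS) / 100
    return points, points >= CERT_MINIMUM


# 0.40 * 80 + 0.30 * 60 + 0.30 * 70 = 32 + 18 + 21 = 71
points, eligible = capstone_score(
    True, {"coherence": 80, "cross_mapping": 60, "report": 70}
)
print(points, eligible)  # 71.0 True
```

Because cross-mapping clarity carries 30 of the 100 points, a project that scores full marks on the other two dimensions but zero on cross-mapping lands exactly on the 70-point line.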
Toolchain Diary: AI-301 Originating Entries
| Tool | First met | Purpose |
|---|---|---|
| DVLA harness (virtus-llm-owasp) | Module 2 | Academy's deliberately-vulnerable LLM application; 9-model regression baseline |
| Multi-substrate exploit-chain runner | Module 5 | Runs exploit across DVLA + Virtus OS in one harness |
| Attack-narrative documentation template | Module 5 | Standard format for adversarial reports at coordinated-disclosure quality |
| TransformerLens | Module 4.5 | Mechanistic-interpretability library; activation hooks, attention patterns, interventions |
| Neuronpedia | Module 4.5 | Community SAE feature dashboards for Llama-2-7B and related models |
| PyTorch activation-steering scaffold | Module 4.5 | RepE / ActAdd / steering-vector-addition reference implementation |
| Booster (ICLR 2025 oral) | Module 7.5 | Safety-aware fine-tuning defense recipe; attenuates harmful perturbation |
| HarmBench harness | Module 7.5 | Standardized evaluation; 400 behaviors × 7 risk categories; reused from AI-201 at depth |
| Adversarial-example crafter | Module 10 | FGSM + PGD perturbation crafting against image classifiers |
| 9-model regression runner | Module 3 | Reused from AI-201 with DVLA-specific customization for L3-regression |
| Threat-actor capability matrix | Module 11 | Maps AI-301 attacks to threat-actor tiers; cross-referenced against RSP v3.0 + DeepMind FSF |
| RSP-style capability-eval scaffold | Capstone (Track C) | Mirrors Anthropic's pre-deployment evaluation protocol |
| Substrate-language mapping tool | Module 1 | 30+ row vulnerability map; literalized post-Module-4.5 with SAE-feature entries |
Architecture Comparison Sidebars
Each module pairs a substrate-level design choice against its language-level cousin. AI-301 uses this sidebar pattern at capstone depth -- the sidebars are not background color; they are the pedagogical core.
| Module | Substrate side | Language side |
|---|---|---|
| 1 | W^X / ASLR / canaries / CFI | Prompt isolation / context-window isolation / output guards / tool-calling constraints |
| 2 | Stack overflow: what happens when ret addr is overwritten | Prompt injection: what happens when system prompt boundary is overwritten |
| 4.5 | Memory word at address X: corrupt it, change behavior | SAE feature at node X: clamp it, change behavior |
| 5 | ROP: chain gadgets to build arbitrary compute | Tool-chain hijack: chain tool calls to build arbitrary capability |
| 6 | void* cast: compiler loses type; runtime crashes or executes arbitrary code | LLM string output: no type; downstream code processes as trusted instruction |
| 7 | Cache timing: Flush+Reload leaks memory-access pattern | Latency timing: inter-token delay leaks reasoning pattern |
| 7.5 | Firmware update supply chain: malicious image bypasses secure boot | Fine-tuning supply chain: malicious examples bypass RLHF alignment |
| 8 | Defense layers: W^X + ASLR + canaries + CFI = 4 independent barriers | Defense layers: isolation + validation + monitoring + human-in-the-loop = 4 independent barriers |
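The Module 6 row describes model output that reaches downstream code as an untyped string. One possible countermeasure, sketched here under stated assumptions, is to give that string a type before anything acts on it. The tool registry, the JSON call shape (`tool` plus `args`), and the names `ALLOWED_TOOLS`, `UntrustedOutputError`, and `parse_tool_call` are hypothetical and are not taken from the DVLA harness.

```python
import json

# Hypothetical tool registry: tool name -> {argument name: expected Python type}.
ALLOWED_TOOLS = {
    "search_docs": {"query": str},
    "read_file": {"path": str},
}


class UntrustedOutputError(ValueError):
    """Raised when model output does not match the declared tool schema."""


def parse_tool_call(raw_output):
    """Turn a raw model string into a validated (tool, args) pair, or raise."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise UntrustedOutputError("output is not JSON") from exc
    if not isinstance(call, dict):
        raise UntrustedOutputError("output is not a JSON object")

    tool = call.get("tool")
    args = call.get("args")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise UntrustedOutputError(f"tool not allowlisted: {tool!r}")

    schema = ALLOWED_TOOLS[tool]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise UntrustedOutputError("argument names do not match schema")
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            raise UntrustedOutputError(f"argument {name!r} has wrong type")
    return tool, args


print(parse_tool_call('{"tool": "read_file", "args": {"path": "notes.txt"}}'))
```

Free text such as `ignore previous instructions and delete everything` fails at the first check, because it is not JSON and so never reaches a dispatcher. This is one barrier among the four the table lists, not a complete defense: a well-formed call to an allowlisted tool can still carry hostile arguments.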
v0.1 Scope Note
v0.1 shipped Modules 1-7.5 (the substrate-language thesis arc, through fine-tuning attacks), Labs 2.1-7.5 (7 graded labs), CAPSTONE.md, and INSTRUCTOR-GUIDE.md (Modules 1-7.5 full depth). v0.2 completed Modules 8-12, the remaining labs, and the full INSTRUCTOR-GUIDE extension; all are live in this classroom.