Duration: 3 hr lecture + 0 standalone lab (synthesis module; integrated into capstone)
Points: No separate lab points; defense posture integrated into Capstone Tier 2 rubric
MITRE ATLAS cross-reference: All 32 ATLAS mitigations reviewed against the DVLA attack surface
Required reading:
- Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026)
- DeepMind Frontier Safety Framework
- OWASP SAMM v2.0 (Module 3: Security Testing) -- for the MLSecOps framing
Christian weave: The Alignment Problem, Normativity arc Ch 12 ("Normativity") -- what it means for a system to have values; a Belt-5 deployment posture is not a checklist but a commitment to the values the checklist expresses
Prerequisite: Modules 1-10 complete; OWASP audit report updated through Module 10
11.1 The Defense-in-Depth Principle at the Language Layer
Module 5 introduced defense-in-depth at the substrate layer: W^X (prevents code injection from being executable), ASLR (randomizes addresses), canaries (detect stack corruption), CFI (restricts control-flow transfers). Each defense is independent; an attacker must bypass all four to reach impact. The failure of any single defense does not mean the system is compromised.
The language layer has analogous independent defenses. Module 11 names them, connects each to the attack class it addresses, and specifies what "Belt-5 deployment posture" means -- the configuration of all defenses simultaneously.
The substrate layer's analogy is exact:
| Substrate defense | Language-layer analog | Attack class addressed |
|---|---|---|
| W^X (no write+execute) | Output validation (no untyped output → execution) | LLM05 improper output handling (Module 6) |
| ASLR (address randomization) | System-prompt nondisclosure + prompt variation | LLM07 system prompt leakage (Module 3) |
| Stack canary (corruption detection) | Behavioral regression tests (alignment shift detection) | LLM04 model poisoning (Module 7.5) |
| CFI (control-flow restriction) | Capability ACL (tool set restriction per session) | LLM06 excessive agency (Module 8) |
| NX bit (non-executable data) | RAG content isolation (retrieved content cannot trigger instructions) | LLM08 vector/embedding weaknesses (Module 10) |
The defense-in-depth structure means: a Multi-stage attack that requires bypassing LLM05 + LLM06 + LLM08 defenses simultaneously is qualitatively harder than an attack that requires bypassing only one. This is not security by obscurity; it is security by multiplication of independent barriers.
11.2 The Full Belt-5 Defense Stack
| Layer | Mechanism | Attack class | AI-301 module | Implementation |
|---|---|---|---|---|
| Input validation | Pydantic schema + injection classifier | LLM01 Prompt Injection | Module 5 | SafeCommand validator from Lab 5.1 |
| System prompt hardening | Nondisclosure framing + prompt variation | LLM07 System Prompt Leakage | Module 3 | System-prompt design principles |
| Tool capability ACL | Allowed-tool list per session + principal-of-least-privilege | LLM06 Excessive Agency | Module 8 | allowed_tools set per agent role |
| Output validation | Schema validation on structured outputs + type checking | LLM05 Improper Output Handling | Module 6 | SafeIORequest vs NetworkPacket lesson |
| RAG content isolation | Retrieved context processed separately; instruction-following disabled in context | LLM08 Vector/Embedding | Module 10 | Content-isolation wrapper |
| Trust label provenance | Agent-message trust levels; environment-trust content not instruction-followed | LLM06 Excessive Agency (multi-agent) | Module 8 | AgentMessage.trust_level |
| Model-level monitoring | SAE feature monitoring for adversarial activation patterns | LLM04 Poisoning | Module 4.5 | SAE feature clamping experiment |
| Supply-chain verification | Fine-tuning behavioral regression suite | LLM03 Supply Chain | Module 7.5 | Booster defense + regression runner |
| Audit | OWASP LLM Top 10 + ATLAS case study alignment | All categories | Module 9 | Lab 9.1 audit report |
| Incident response | Detection + escalation + rollback plan | All categories | Module 11 | MLSecOps IR runbook |
A Belt-5 deployment does not omit any of these layers. A Belt-3 deployment might have input validation and output validation; a Belt-5 deployment has all ten.
11.3 The MLSecOps Framework
MLSecOps is the integration of security practices into the ML development and deployment lifecycle, analogous to DevSecOps for classical software. The four principles:
Principle 1: Model provenance. Every model used in production has a verified supply chain: training data sources, fine-tuning history, safety evaluation results, and a hash of the deployed weights. If any of these are unknown or unverifiable, the model is not Belt-5 ready.
Principle 2: Behavioral regression testing. Every deployment change (model swap, prompt update, tool change, fine-tuning step) is accompanied by a regression run on the security test suite. The test suite covers at minimum: the AI-301 Lab 3.1 L3-regression cases, the OWASP LLM10 categories, and the specific attack classes demonstrated in Labs 5.1-10.1. A deployment that fails regression is not shipped.
Principle 3: Continuous monitoring. Production models are monitored for behavioral drift (responses that diverge from the baseline), latency anomalies (potential side-channel exploitation from Module 7), and unexpected tool call patterns (potential excessive-agency exploitation from Module 8). Monitoring is not optional in a Belt-5 posture; it is how you detect incidents that bypassed the static defenses.
Principle 4: Incident response. A defined process for what happens when a security incident is detected. At minimum: detection (automated alert on anomaly), escalation (who is notified), containment (model rollback or isolation), analysis (what was the attack vector), and remediation (what defense is added to prevent recurrence). A system without an IR plan is not Belt-5.
11.4 The RSP and DeepMind FSF: External Calibration
The Anthropic Responsible Scaling Policy (RSP) v3.0 and the DeepMind Frontier Safety Framework (FSF) are not academic documents; they are operational commitments by frontier AI labs. They describe the capability thresholds at which new safety requirements are triggered, the evaluation protocols that determine whether a threshold has been crossed, and the deployment restrictions that apply above each threshold.
Why a security professional should read them:
-
They define the threat model. RSP v3.0 §2 defines the capabilities that Anthropic considers "dangerous" at each ASL (AI Safety Level). If your threat model includes "compromise a frontier AI system," the RSP defines what the defender considers the high-value capabilities to protect.
-
They specify the evaluation protocols. RSP v3.0 §4 specifies pre-deployment evaluations. These are the official evaluation procedures; your Lab 9.1 OWASP audit and Lab 10.1 RAG test are lower-fidelity versions of the same evaluation discipline.
-
They contain the forward-commitment logic. RSP v3.0 §5 explains why these policies exist: not because current systems are dangerous at the specified capabilities, but to establish the evaluation infrastructure before it is needed. This is the same logic as pre-deployment security testing: you run the audit before the attacker does.
What the frameworks do NOT cover:
- Adversarial attacks against the evaluation protocols themselves (an attacker who can manipulate the evaluation sees a different capability than the evaluator)
- Multi-agent topologies that emerge from composition of individually-safe systems (the LLM06 scenario from Module 8)
- Supply-chain attacks on the fine-tuning pipeline (the Module 7.5 scenario)
These gaps are the research frontier. They are also the territory that AI-301 Capstone Track C students work in.
11.5 Threat-Actor Capability Matrix
The capability matrix maps AI-301's attack techniques to realistic threat-actor tiers:
| Attack class | Minimum capability required | Realistic threat-actor tier | RSP/FSF relevance |
|---|---|---|---|
| Direct prompt injection (Module 3) | User access to a chat interface | T1 (script kiddie) | Covered in RSP ASL-2 defenses |
| System prompt extraction (Module 3) | Same | T1 | Covered in RSP ASL-2 |
| Tool-chain hijack (Modules 5-6) | Access to an agent with tools | T2 (competent attacker) | Partially covered; tool ACL is new in RSP v3 |
| Latency side-channel (Module 7) | Repeated API access + statistical tooling | T2 | Not covered by RSP/FSF |
| Fine-tuning attack (Module 7.5) | Fine-tuning API access or model weight access | T2-T3 | RSP v3 §4.3 (supply chain) |
| Multi-agent lateral movement (Module 8) | Access to a multi-agent deployment | T2-T3 | Not directly covered |
| RAG poisoning (Module 10) | Write access to document ingestion pipeline | T3 (sophisticated attacker) | Not directly covered |
| Capstone exploit chain (Track A) | Full stack: all of the above | T4 (nation-state capable) | RSP ASL-4 territory |
The matrix reveals a pattern: the attacks AI-301 covers in Modules 8-10 (multi-agent, RAG) are not yet covered by the public RSP/FSF frameworks. This is not because these attacks are hypothetical; it is because the frameworks were written before multi-agent deployments became production-scale.
11.6 Toward Belt-5: What the Audit Report Becomes
After completing Module 11, return to the OWASP audit report from Module 9 one more time. Add a fifth column to the audit table:
| # | OWASP Category | Finding | ATLAS technique | Mitigation | Belt-5 posture status |
|---|---|---|---|---|---|
| LLM01 | Prompt Injection | [your finding] | AML.T0051 | SafeCommand + isolation | Implemented (Lab 5.1) |
| ... |
The "Belt-5 posture status" column documents whether your DVLA testbed has the defense for each category implemented. By Module 11, most categories should be "Implemented" or "Partially implemented." At least one should be "Known gap" -- documenting the limit of the current defense stack.
A Belt-5 posture is not a system with no vulnerabilities. It is a system where every known vulnerability class has either a defense implemented or a documented known gap with a monitoring or detection mechanism. The known gap is honest; the absence of monitoring is not.