Module 11: Defenses -- Belt-5 Deployment Posture · AI-301

Duration: 3 hr lecture + 0 standalone lab (synthesis module; integrated into capstone)
Points: No separate lab points; defense posture integrated into Capstone Tier 2 rubric
MITRE ATLAS cross-reference: All 32 ATLAS mitigations reviewed against the DVLA attack surface
Required reading:

Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026)
DeepMind Frontier Safety Framework
OWASP SAMM v2.0 (Module 3: Security Testing) -- for the MLSecOps framing
Christian weave: The Alignment Problem, Normativity arc Ch 12 ("Normativity") -- what it means for a system to have values; a Belt-5 deployment posture is not a checklist but a commitment to the values the checklist expresses
Prerequisite: Modules 1-10 complete; OWASP audit report updated through Module 10

11.1 The Defense-in-Depth Principle at the Language Layer

Module 5 introduced defense-in-depth at the substrate layer: W^X (prevents code injection from being executable), ASLR (randomizes addresses), canaries (detect stack corruption), CFI (restricts control-flow transfers). Each defense is independent; an attacker must bypass all four to reach impact. The failure of any single defense does not mean the system is compromised.

The language layer has analogous independent defenses. Module 11 names them, connects each to the attack class it addresses, and specifies what "Belt-5 deployment posture" means -- the configuration of all defenses simultaneously.

The substrate layer's analogy is exact:

Substrate defense	Language-layer analog	Attack class addressed
W^X (no write+execute)	Output validation (no untyped output → execution)	LLM05 improper output handling (Module 6)
ASLR (address randomization)	System-prompt nondisclosure + prompt variation	LLM07 system prompt leakage (Module 3)
Stack canary (corruption detection)	Behavioral regression tests (alignment shift detection)	LLM04 model poisoning (Module 7.5)
CFI (control-flow restriction)	Capability ACL (tool set restriction per session)	LLM06 excessive agency (Module 8)
NX bit (non-executable data)	RAG content isolation (retrieved content cannot trigger instructions)	LLM08 vector/embedding weaknesses (Module 10)

The defense-in-depth structure means: a Multi-stage attack that requires bypassing LLM05 + LLM06 + LLM08 defenses simultaneously is qualitatively harder than an attack that requires bypassing only one. This is not security by obscurity; it is security by multiplication of independent barriers.

11.2 The Full Belt-5 Defense Stack

Layer	Mechanism	Attack class	AI-301 module	Implementation
Input validation	Pydantic schema + injection classifier	LLM01 Prompt Injection	Module 5	`SafeCommand` validator from Lab 5.1
System prompt hardening	Nondisclosure framing + prompt variation	LLM07 System Prompt Leakage	Module 3	System-prompt design principles
Tool capability ACL	Allowed-tool list per session + principal-of-least-privilege	LLM06 Excessive Agency	Module 8	`allowed_tools` set per agent role
Output validation	Schema validation on structured outputs + type checking	LLM05 Improper Output Handling	Module 6	`SafeIORequest` vs `NetworkPacket` lesson
RAG content isolation	Retrieved context processed separately; instruction-following disabled in context	LLM08 Vector/Embedding	Module 10	Content-isolation wrapper
Trust label provenance	Agent-message trust levels; environment-trust content not instruction-followed	LLM06 Excessive Agency (multi-agent)	Module 8	`AgentMessage.trust_level`
Model-level monitoring	SAE feature monitoring for adversarial activation patterns	LLM04 Poisoning	Module 4.5	SAE feature clamping experiment
Supply-chain verification	Fine-tuning behavioral regression suite	LLM03 Supply Chain	Module 7.5	Booster defense + regression runner
Audit	OWASP LLM Top 10 + ATLAS case study alignment	All categories	Module 9	Lab 9.1 audit report
Incident response	Detection + escalation + rollback plan	All categories	Module 11	MLSecOps IR runbook

A Belt-5 deployment does not omit any of these layers. A Belt-3 deployment might have input validation and output validation; a Belt-5 deployment has all ten.

11.3 The MLSecOps Framework

MLSecOps is the integration of security practices into the ML development and deployment lifecycle, analogous to DevSecOps for classical software. The four principles:

Principle 1: Model provenance. Every model used in production has a verified supply chain: training data sources, fine-tuning history, safety evaluation results, and a hash of the deployed weights. If any of these are unknown or unverifiable, the model is not Belt-5 ready.

Principle 2: Behavioral regression testing. Every deployment change (model swap, prompt update, tool change, fine-tuning step) is accompanied by a regression run on the security test suite. The test suite covers at minimum: the AI-301 Lab 3.1 L3-regression cases, the OWASP LLM10 categories, and the specific attack classes demonstrated in Labs 5.1-10.1. A deployment that fails regression is not shipped.

Principle 3: Continuous monitoring. Production models are monitored for behavioral drift (responses that diverge from the baseline), latency anomalies (potential side-channel exploitation from Module 7), and unexpected tool call patterns (potential excessive-agency exploitation from Module 8). Monitoring is not optional in a Belt-5 posture; it is how you detect incidents that bypassed the static defenses.

Principle 4: Incident response. A defined process for what happens when a security incident is detected. At minimum: detection (automated alert on anomaly), escalation (who is notified), containment (model rollback or isolation), analysis (what was the attack vector), and remediation (what defense is added to prevent recurrence). A system without an IR plan is not Belt-5.

11.4 The RSP and DeepMind FSF: External Calibration

The Anthropic Responsible Scaling Policy (RSP) v3.0 and the DeepMind Frontier Safety Framework (FSF) are not academic documents; they are operational commitments by frontier AI labs. They describe the capability thresholds at which new safety requirements are triggered, the evaluation protocols that determine whether a threshold has been crossed, and the deployment restrictions that apply above each threshold.

Why a security professional should read them:

They define the threat model. RSP v3.0 §2 defines the capabilities that Anthropic considers "dangerous" at each ASL (AI Safety Level). If your threat model includes "compromise a frontier AI system," the RSP defines what the defender considers the high-value capabilities to protect.
They specify the evaluation protocols. RSP v3.0 §4 specifies pre-deployment evaluations. These are the official evaluation procedures; your Lab 9.1 OWASP audit and Lab 10.1 RAG test are lower-fidelity versions of the same evaluation discipline.
They contain the forward-commitment logic. RSP v3.0 §5 explains why these policies exist: not because current systems are dangerous at the specified capabilities, but to establish the evaluation infrastructure before it is needed. This is the same logic as pre-deployment security testing: you run the audit before the attacker does.

What the frameworks do NOT cover:

Adversarial attacks against the evaluation protocols themselves (an attacker who can manipulate the evaluation sees a different capability than the evaluator)
Multi-agent topologies that emerge from composition of individually-safe systems (the LLM06 scenario from Module 8)
Supply-chain attacks on the fine-tuning pipeline (the Module 7.5 scenario)

These gaps are the research frontier. They are also the territory that AI-301 Capstone Track C students work in.

11.5 Threat-Actor Capability Matrix

The capability matrix maps AI-301's attack techniques to realistic threat-actor tiers:

Attack class	Minimum capability required	Realistic threat-actor tier	RSP/FSF relevance
Direct prompt injection (Module 3)	User access to a chat interface	T1 (script kiddie)	Covered in RSP ASL-2 defenses
System prompt extraction (Module 3)	Same	T1	Covered in RSP ASL-2
Tool-chain hijack (Modules 5-6)	Access to an agent with tools	T2 (competent attacker)	Partially covered; tool ACL is new in RSP v3
Latency side-channel (Module 7)	Repeated API access + statistical tooling	T2	Not covered by RSP/FSF
Fine-tuning attack (Module 7.5)	Fine-tuning API access or model weight access	T2-T3	RSP v3 §4.3 (supply chain)
Multi-agent lateral movement (Module 8)	Access to a multi-agent deployment	T2-T3	Not directly covered
RAG poisoning (Module 10)	Write access to document ingestion pipeline	T3 (sophisticated attacker)	Not directly covered
Capstone exploit chain (Track A)	Full stack: all of the above	T4 (nation-state capable)	RSP ASL-4 territory

The matrix reveals a pattern: the attacks AI-301 covers in Modules 8-10 (multi-agent, RAG) are not yet covered by the public RSP/FSF frameworks. This is not because these attacks are hypothetical; it is because the frameworks were written before multi-agent deployments became production-scale.

11.6 Toward Belt-5: What the Audit Report Becomes

After completing Module 11, return to the OWASP audit report from Module 9 one more time. Add a fifth column to the audit table:

#	OWASP Category	Finding	ATLAS technique	Mitigation	Belt-5 posture status
LLM01	Prompt Injection	[your finding]	AML.T0051	SafeCommand + isolation	Implemented (Lab 5.1)
...

The "Belt-5 posture status" column documents whether your DVLA testbed has the defense for each category implemented. By Module 11, most categories should be "Implemented" or "Partially implemented." At least one should be "Known gap" -- documenting the limit of the current defense stack.

A Belt-5 posture is not a system with no vulnerabilities. It is a system where every known vulnerability class has either a defense implemented or a documented known gap with a monitoring or detection mechanism. The known gap is honest; the absence of monitoring is not.