Classroom Glossary Public page

Module 11: Defenses -- Belt-5 Deployment Posture

1,426 words

Duration: 3 hr lecture + 0 standalone lab (synthesis module; integrated into capstone)
Points: No separate lab points; defense posture integrated into Capstone Tier 2 rubric
MITRE ATLAS cross-reference: All 32 ATLAS mitigations reviewed against the DVLA attack surface
Required reading:

  • Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026)
  • DeepMind Frontier Safety Framework
  • OWASP SAMM v2.0 (Module 3: Security Testing) -- for the MLSecOps framing
    Christian weave: The Alignment Problem, Normativity arc Ch 12 ("Normativity") -- what it means for a system to have values; a Belt-5 deployment posture is not a checklist but a commitment to the values the checklist expresses
    Prerequisite: Modules 1-10 complete; OWASP audit report updated through Module 10

11.1 The Defense-in-Depth Principle at the Language Layer

Module 5 introduced defense-in-depth at the substrate layer: W^X (prevents code injection from being executable), ASLR (randomizes addresses), canaries (detect stack corruption), CFI (restricts control-flow transfers). Each defense is independent; an attacker must bypass all four to reach impact. The failure of any single defense does not mean the system is compromised.

The language layer has analogous independent defenses. Module 11 names them, connects each to the attack class it addresses, and specifies what "Belt-5 deployment posture" means -- the configuration of all defenses simultaneously.

The substrate layer's analogy is exact:

Substrate defense Language-layer analog Attack class addressed
W^X (no write+execute) Output validation (no untyped output → execution) LLM05 improper output handling (Module 6)
ASLR (address randomization) System-prompt nondisclosure + prompt variation LLM07 system prompt leakage (Module 3)
Stack canary (corruption detection) Behavioral regression tests (alignment shift detection) LLM04 model poisoning (Module 7.5)
CFI (control-flow restriction) Capability ACL (tool set restriction per session) LLM06 excessive agency (Module 8)
NX bit (non-executable data) RAG content isolation (retrieved content cannot trigger instructions) LLM08 vector/embedding weaknesses (Module 10)

The defense-in-depth structure means: a Multi-stage attack that requires bypassing LLM05 + LLM06 + LLM08 defenses simultaneously is qualitatively harder than an attack that requires bypassing only one. This is not security by obscurity; it is security by multiplication of independent barriers.


11.2 The Full Belt-5 Defense Stack

Layer Mechanism Attack class AI-301 module Implementation
Input validation Pydantic schema + injection classifier LLM01 Prompt Injection Module 5 SafeCommand validator from Lab 5.1
System prompt hardening Nondisclosure framing + prompt variation LLM07 System Prompt Leakage Module 3 System-prompt design principles
Tool capability ACL Allowed-tool list per session + principal-of-least-privilege LLM06 Excessive Agency Module 8 allowed_tools set per agent role
Output validation Schema validation on structured outputs + type checking LLM05 Improper Output Handling Module 6 SafeIORequest vs NetworkPacket lesson
RAG content isolation Retrieved context processed separately; instruction-following disabled in context LLM08 Vector/Embedding Module 10 Content-isolation wrapper
Trust label provenance Agent-message trust levels; environment-trust content not instruction-followed LLM06 Excessive Agency (multi-agent) Module 8 AgentMessage.trust_level
Model-level monitoring SAE feature monitoring for adversarial activation patterns LLM04 Poisoning Module 4.5 SAE feature clamping experiment
Supply-chain verification Fine-tuning behavioral regression suite LLM03 Supply Chain Module 7.5 Booster defense + regression runner
Audit OWASP LLM Top 10 + ATLAS case study alignment All categories Module 9 Lab 9.1 audit report
Incident response Detection + escalation + rollback plan All categories Module 11 MLSecOps IR runbook

A Belt-5 deployment does not omit any of these layers. A Belt-3 deployment might have input validation and output validation; a Belt-5 deployment has all ten.


11.3 The MLSecOps Framework

MLSecOps is the integration of security practices into the ML development and deployment lifecycle, analogous to DevSecOps for classical software. The four principles:

Principle 1: Model provenance. Every model used in production has a verified supply chain: training data sources, fine-tuning history, safety evaluation results, and a hash of the deployed weights. If any of these are unknown or unverifiable, the model is not Belt-5 ready.

Principle 2: Behavioral regression testing. Every deployment change (model swap, prompt update, tool change, fine-tuning step) is accompanied by a regression run on the security test suite. The test suite covers at minimum: the AI-301 Lab 3.1 L3-regression cases, the OWASP LLM10 categories, and the specific attack classes demonstrated in Labs 5.1-10.1. A deployment that fails regression is not shipped.

Principle 3: Continuous monitoring. Production models are monitored for behavioral drift (responses that diverge from the baseline), latency anomalies (potential side-channel exploitation from Module 7), and unexpected tool call patterns (potential excessive-agency exploitation from Module 8). Monitoring is not optional in a Belt-5 posture; it is how you detect incidents that bypassed the static defenses.

Principle 4: Incident response. A defined process for what happens when a security incident is detected. At minimum: detection (automated alert on anomaly), escalation (who is notified), containment (model rollback or isolation), analysis (what was the attack vector), and remediation (what defense is added to prevent recurrence). A system without an IR plan is not Belt-5.


11.4 The RSP and DeepMind FSF: External Calibration

The Anthropic Responsible Scaling Policy (RSP) v3.0 and the DeepMind Frontier Safety Framework (FSF) are not academic documents; they are operational commitments by frontier AI labs. They describe the capability thresholds at which new safety requirements are triggered, the evaluation protocols that determine whether a threshold has been crossed, and the deployment restrictions that apply above each threshold.

Why a security professional should read them:

  1. They define the threat model. RSP v3.0 §2 defines the capabilities that Anthropic considers "dangerous" at each ASL (AI Safety Level). If your threat model includes "compromise a frontier AI system," the RSP defines what the defender considers the high-value capabilities to protect.

  2. They specify the evaluation protocols. RSP v3.0 §4 specifies pre-deployment evaluations. These are the official evaluation procedures; your Lab 9.1 OWASP audit and Lab 10.1 RAG test are lower-fidelity versions of the same evaluation discipline.

  3. They contain the forward-commitment logic. RSP v3.0 §5 explains why these policies exist: not because current systems are dangerous at the specified capabilities, but to establish the evaluation infrastructure before it is needed. This is the same logic as pre-deployment security testing: you run the audit before the attacker does.

What the frameworks do NOT cover:

  • Adversarial attacks against the evaluation protocols themselves (an attacker who can manipulate the evaluation sees a different capability than the evaluator)
  • Multi-agent topologies that emerge from composition of individually-safe systems (the LLM06 scenario from Module 8)
  • Supply-chain attacks on the fine-tuning pipeline (the Module 7.5 scenario)

These gaps are the research frontier. They are also the territory that AI-301 Capstone Track C students work in.


11.5 Threat-Actor Capability Matrix

The capability matrix maps AI-301's attack techniques to realistic threat-actor tiers:

Attack class Minimum capability required Realistic threat-actor tier RSP/FSF relevance
Direct prompt injection (Module 3) User access to a chat interface T1 (script kiddie) Covered in RSP ASL-2 defenses
System prompt extraction (Module 3) Same T1 Covered in RSP ASL-2
Tool-chain hijack (Modules 5-6) Access to an agent with tools T2 (competent attacker) Partially covered; tool ACL is new in RSP v3
Latency side-channel (Module 7) Repeated API access + statistical tooling T2 Not covered by RSP/FSF
Fine-tuning attack (Module 7.5) Fine-tuning API access or model weight access T2-T3 RSP v3 §4.3 (supply chain)
Multi-agent lateral movement (Module 8) Access to a multi-agent deployment T2-T3 Not directly covered
RAG poisoning (Module 10) Write access to document ingestion pipeline T3 (sophisticated attacker) Not directly covered
Capstone exploit chain (Track A) Full stack: all of the above T4 (nation-state capable) RSP ASL-4 territory

The matrix reveals a pattern: the attacks AI-301 covers in Modules 8-10 (multi-agent, RAG) are not yet covered by the public RSP/FSF frameworks. This is not because these attacks are hypothetical; it is because the frameworks were written before multi-agent deployments became production-scale.


11.6 Toward Belt-5: What the Audit Report Becomes

After completing Module 11, return to the OWASP audit report from Module 9 one more time. Add a fifth column to the audit table:

# OWASP Category Finding ATLAS technique Mitigation Belt-5 posture status
LLM01 Prompt Injection [your finding] AML.T0051 SafeCommand + isolation Implemented (Lab 5.1)
...

The "Belt-5 posture status" column documents whether your DVLA testbed has the defense for each category implemented. By Module 11, most categories should be "Implemented" or "Partially implemented." At least one should be "Known gap" -- documenting the limit of the current defense stack.

A Belt-5 posture is not a system with no vulnerabilities. It is a system where every known vulnerability class has either a defense implemented or a documented known gap with a monitoring or detection mechanism. The known gap is honest; the absence of monitoring is not.