Duration: 2 hr lecture + 6 hr lab + 4 hr independent
Lab: Lab 12 (Capstone execution)
MITRE ATLAS tactics: (full-course integration)
Foundational weave: Christian Ch 4 (Agents) — "what contexts did the specification miss?"
12.1 What the Capstone Is
AI-201 covers ten modules and seven attack classes. The capstone lab is not another attack module -- it is the integration point where students synthesize that knowledge into one of two practitioner deliverables.
The two tracks mirror what real AI security practitioners deliver:
- Track A: Coordinated Disclosure. What a security researcher delivers to a vendor when they discover a vulnerability in an AI system.
- Track B: Defensive Pipeline Build. What a defensive engineer delivers when hardening an AI deployment against the attack classes covered in the course.
Both tracks require the same underlying knowledge. The difference is the audience: a researcher writes for the vendor who must fix the vulnerability; an engineer builds for the deployment that must withstand attack.
12.2 Track A: Coordinated Disclosure Simulation
A coordinated disclosure report has a standard structure in the security industry. AI vulnerability disclosures use the same structure, with ATLAS technique mapping replacing or augmenting CVSSv3 vector notation.
Track A deliverable: a complete coordinated disclosure report for one fictional AI application vulnerability.
The fictional application: VirtusChat, a customer support chatbot backed by a RAG pipeline that queries an internal knowledge base and can call three external APIs (CRM lookup, billing status, account reset). The application is deployed as a public web interface with no authentication.
Students have already discovered the vulnerability in Labs 3-7. Track A is the full write-up.
Required sections:
Section 1: Executive Summary (100-200 words). The vulnerability in plain language. What the attacker can do, how, and with what impact. Written for a non-technical product manager, not a security engineer.
Section 2: Technical Description. The attack class (prompt injection, SSRF, RAG poisoning, etc.), the specific technique, and how it exploits the fictional application's architecture. Includes a minimal reproduction case: the exact input that triggers the vulnerability and the exact output that demonstrates it.
Section 3: ATLAS Mapping.
| Field | Value |
|---|---|
| Tactic | (ATLAS tactic ID and name) |
| Technique | (ATLAS technique ID and name) |
| Sub-technique | (if applicable) |
| Rationale | One paragraph explaining why this mapping applies |
Section 4: Impact Assessment. What an attacker gains: data accessed, actions taken, persistence achieved, or disruption caused. Be specific about the fictional application's blast radius -- which VirtusChat APIs can an attacker invoke through this vulnerability, and what can they do with each?
Section 5: CVSSv3 Scoring.
Provide the full CVSSv3 vector string and base score. Explain each sub-score dimension:
| Dimension | Value | Rationale |
|---|---|---|
| Attack Vector | N/A/L/P | (one sentence) |
| Attack Complexity | L/H | (one sentence) |
| Privileges Required | N/L/H | (one sentence) |
| User Interaction | N/R | (one sentence) |
| Scope | U/C | (one sentence) |
| Confidentiality | N/L/H | (one sentence) |
| Integrity | N/L/H | (one sentence) |
| Availability | N/L/H | (one sentence) |
Section 6: Remediation Recommendation. The specific code change, configuration change, or architectural change that addresses the vulnerability. Must be actionable -- not "improve input validation" but "add an allow-list regex to the user_query parameter that rejects any input matching the pattern (?i)(ignore|disregard|override)."
12.3 Track B: Defensive Pipeline Build
Track B implements a production-hardened AI pipeline. The same fictional VirtusChat application is the target: students harden it against three of the five attack classes from Modules 3-7.
Track B deliverable: a working Python implementation of the hardened pipeline, plus a security validation test suite.
Required defenses (choose 3 of 5):
Defense 1: Safe model loading (Module 3).
# Reject pickled models; require safetensors format
import safetensors.torch as st
from pathlib import Path
def load_model_safely(model_path: str) -> dict:
path = Path(model_path)
if path.suffix == ".pkl" or path.suffix == ".pickle":
raise ValueError(f"Rejected pickled model: {path}. Use safetensors format.")
if path.suffix not in (".safetensors", ".json"):
raise ValueError(f"Unrecognized model format: {path.suffix}")
return st.load_file(str(path))
Defense 2: SSTI-safe prompt templating (Module 4).
# Use Jinja2 with sandboxed environment; never eval user input
from jinja2 import SandboxedEnvironment
_env = SandboxedEnvironment()
def safe_render(template_str: str, user_vars: dict) -> str:
# Reject templates from user input entirely
# Template is always from the trusted template library
tmpl = _env.from_string(template_str)
return tmpl.render(**{k: str(v) for k, v in user_vars.items()})
Defense 3: Context-source trust controls (Module 5 / RAG).
def retrieve_with_trust_filter(query: str, vector_db, min_trust_score: float = 0.7) -> list[str]:
"""Retrieve documents only from trusted sources above the trust threshold."""
results = vector_db.search(query, top_k=10)
trusted = [
doc for doc in results
if doc.metadata.get("trust_score", 0.0) >= min_trust_score
and doc.metadata.get("source_verified", False)
]
return [doc.content for doc in trusted[:3]]
Defense 4: Tool-calling trust controls (Module 6).
ALLOWED_TOOLS = {"crm_lookup", "billing_status"} # account_reset is not allowed
TOOL_RATE_LIMITS = {"crm_lookup": 10, "billing_status": 5} # per minute per session
def call_tool(tool_name: str, args: dict, session_id: str) -> dict:
if tool_name not in ALLOWED_TOOLS:
raise PermissionError(f"Tool '{tool_name}' is not in the allow-list")
if _rate_limiter.is_exceeded(session_id, tool_name):
raise PermissionError(f"Rate limit exceeded for '{tool_name}'")
return _tool_registry[tool_name](**args)
Defense 5: Allow-listed fetch agent (Module 7 / SSRF).
import urllib.parse, ipaddress, socket
ALLOWED_DOMAINS = {"api.viruschat.internal", "docs.virtuschat.com"}
def safe_fetch(url: str) -> bytes:
parsed = urllib.parse.urlparse(url)
if parsed.scheme != "https":
raise ValueError("Only HTTPS URLs are permitted")
if parsed.hostname not in ALLOWED_DOMAINS:
raise ValueError(f"Domain not in allow-list: {parsed.hostname}")
resolved = socket.gethostbyname(parsed.hostname)
if ipaddress.ip_address(resolved).is_private:
raise ValueError(f"Resolved to private IP: {resolved}")
return _fetch(url)
Track B validation test suite. Each defense must have at least two tests: one that verifies the defense blocks the attack, and one that verifies the defense does not block legitimate traffic.
12.4 Evaluation Criteria
Both tracks are evaluated against the same four criteria:
| Criterion | Weight | Track A | Track B |
|---|---|---|---|
| Technical accuracy | 40% | Correct ATLAS mapping; correct CVSSv3 vector | Defense correctly blocks the attack class |
| Specificity | 30% | Minimal reproduction case; specific remediation | Attack-specific implementation, not generic filtering |
| Impact analysis | 20% | Realistic blast-radius assessment for VirtusChat | Test suite covers both attack and legitimate-traffic cases |
| Presentation | 10% | Disclosure report is clear to a non-technical reader | Code is readable and documented |
The minimum passing threshold: 60% aggregate. A Track A report that correctly maps the ATLAS technique and provides a working reproduction case but underestimates impact scores roughly 70%. A Track B implementation that correctly blocks the attack but breaks legitimate traffic scores 60%.
12.5 The Christian Chapter 4 Connection: Specification Gaps
Christian Chapter 4 argues that the alignment problem is fundamentally a specification gap problem: AI systems fail not because they are malicious, but because the specification of their behavior did not anticipate the context in which failure occurs. The attacker's job is to find the context the specification missed.
AI-201 covers seven specific manifestations of this specification gap:
- Pickle deserialization: the model loading specification did not anticipate adversarial model files
- SSTI: the prompt templating specification did not anticipate adversarial template strings
- RAG poisoning: the retrieval specification did not anticipate adversarial documents
- Tool misuse: the tool permission specification did not anticipate prompt-injection-driven tool calls
- SSRF: the URL fetching specification did not anticipate attacker-controlled URL generation
- Memory injection: the memory update specification did not anticipate adversarial memory content
- Adversarial robustness: the safety training specification did not achieve generalization across all input distributions
The capstone Track A write-up is, from this frame, a document that names the specific specification gap that the vulnerability exploits. Section 6 (Remediation) is the specification fix: what the specification should have said to prevent the context that enabled the attack.
Track B is the implementation of improved specifications. Each defense is a constraint on the deployment context -- a rule that narrows the set of inputs the system will process so that the attacker's path is not available.
12.6 Forward Pointer: AI-301
AI-201 covers the deployment attack surface of existing AI systems -- attacks that do not require access to model weights or training data. AI-301 covers the model-level attack surface:
- Training data poisoning (ATLAS AML.T0020). An attacker who can inject samples into a training dataset can embed backdoors, degrade accuracy, or bias the model's behavior for specific inputs.
- Model extraction (ATLAS AML.T0005). An attacker with API access can reconstruct approximate model weights by querying the model with crafted inputs.
- Membership inference (ATLAS AML.T0024). An attacker can determine whether a specific data point was in the training set, with privacy implications for training data subjects.
- Full red team operations. AI-301 teaches how to conduct a systematic red team operation against an AI system: scope declaration, attack surface mapping, exploit chain development, report writing, and remediation verification.
The Christian alignment framing -- specification gaps as root cause -- continues into AI-301. The new specification gaps are in the training process, not the deployment pipeline.
Reflection prompts
- A Track A report concludes with CVSSv3 base score 8.1 (High). The vendor's internal risk policy treats findings below CVSS 9.0 as non-critical, deferring remediation to the next quarterly release. The vulnerability allows an attacker to invoke the
account_resetAPI on any VirtusChat customer account. Is the CVSS score the right metric for prioritizing this finding? What additional metric would you include? - A Track B engineer implements Defense 2 (SSTI-safe templating) but forgets to apply it to the email notification template that fires when a user resets their password. The notification template is populated with the user's display name from the database. An attacker who can set their display name to a Jinja2 template expression can trigger SSTI through the notification path. How would the Track B test suite need to change to catch this gap?
- Christian's specification gap framing suggests that every attack is a context the specification missed. Does this framing imply that perfect specification is achievable, or that there is always a context the specification will miss? How does your answer affect how you design a defense?