AI-301 Capstone: Agentic Security Research Project

Points: 100
Time estimate: 20 hr over the final 3 weeks
Deliverable: Written report + technical artifacts + 20-minute demo


Overview

The AI-301 capstone is a research project in which students apply the substrate-language security framework to an original investigation. Students choose one of three tracks, each targeting a different aspect of the course thesis: agentic-system security is memory corruption at the semantic layer.

All three tracks share the same Tier 1 functional gate and Tier 2 rubric structure. The track determines the direction of the investigation, not the evaluation criteria.


Authorized Scope

All capstone work operates within the following boundaries:

Authorized targets:

  • The DVLA testbed (any model configuration)
  • Virtus OS (any emulation mode)
  • Open-weight models available via HuggingFace (Llama-2, Mistral, Phi-3, Gemma-2)
  • GPT-2-small (CPU-feasible; pre-loaded SAE releases via sae_lens)
  • Public benchmark datasets (HarmBench, TruthfulQA, AdvBench)

Not authorized:

  • Production API fine-tuning (OpenAI, Anthropic, or any commercial provider)
  • Attacks on third-party services outside the lab environment
  • Genuinely harmful fine-tuning datasets (weapons, violence, CSAM, illegal instructions)
  • Distributing or deploying attack artifacts outside the course environment

Each capstone must include a one-paragraph ethics statement explaining how the work fits within authorized scope.


Track Selection

Select one track before Week 10. The track determines your deliverables. You cannot switch tracks after Week 11.


Track A: Multi-Stage Exploit Chain

Theme: Build a complete multi-stage exploit chain that crosses from substrate to language layer.

Research question: Given a Virtus OS ROP gadget that triggers an ECALL, and a DVLA instance with tool access, can you construct an end-to-end chain where the substrate-layer capability escalation enables a language-layer tool-call that would otherwise be blocked?

Deliverables:

  • chain_exploit.py: The full multi-stage chain (ROP payload → ECALL → DVLA injection → tool call)
  • chain_trace.md: Step-by-step annotated execution trace
  • chain_defense.md: Analysis of where the chain can be broken + which defenses are most effective
  • capstone_report.md: Full research report (see Tier 2 rubric)

Minimum chain stages:

  1. Virtus OS ROP chain setting a controlled a0 value and triggering ECALL (from Lab 5.1)
  2. The ECALL outcome influences a DVLA input (demonstrate the substrate→language bridge)
  3. The DVLA input triggers at least two tool calls in a chain (web_fetch followed by shell_exec or file_write)
  4. Document where the chain breaks under each of the four defenses: W^X, Pydantic schema, tool-call constraints, RLHF

Notes:

  • The substrate→language bridge is the hardest part. If your lab environment does not support full integration, you may simulate the bridge by passing the ECALL output to the DVLA as data instead of building a real kernel-to-process data path.
  • The key deliverable is the documented structural argument, not necessarily a live end-to-end exploit.
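
One possible way to organize item 4 of the minimum chain stages is a stage-by-defense matrix that records, for each of the four defenses, the stage at which the chain stopped. The sketch below is only an illustration: the `ChainObservation` record and the `render_matrix` helper are hypothetical names, not part of the course tooling, and the stage labels are taken from the chain_exploit.py deliverable description.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage labels follow the chain_exploit.py deliverable description.
STAGES = ["ROP payload", "ECALL", "DVLA injection", "tool call"]
DEFENSES = ["W^X", "Pydantic schema", "tool-call constraints", "RLHF"]


@dataclass
class ChainObservation:
    """One row of the matrix (hypothetical record type, not course tooling)."""
    defense: str
    broke_at: Optional[str]  # stage where the chain stopped; None if it completed
    evidence: str            # pointer into chain_trace.md

    def __post_init__(self) -> None:
        if self.defense not in DEFENSES:
            raise ValueError(f"unknown defense: {self.defense!r}")
        if self.broke_at is not None and self.broke_at not in STAGES:
            raise ValueError(f"unknown stage: {self.broke_at!r}")


def render_matrix(observations: List[ChainObservation]) -> str:
    """Return a plain-text table: one row per defense, one column per stage."""
    by_defense = {o.defense: o for o in observations}
    lines = ["defense".ljust(24) + "".join(s.ljust(16) for s in STAGES).rstrip()]
    for defense in DEFENSES:
        obs = by_defense.get(defense)
        cells = []
        for index, stage in enumerate(STAGES):
            if obs is None:
                cells.append("untested")
            elif obs.broke_at is None or index < STAGES.index(obs.broke_at):
                cells.append("passed")
            elif stage == obs.broke_at:
                cells.append("BLOCKED")
            else:
                cells.append("not reached")
        lines.append(defense.ljust(24) + "".join(c.ljust(16) for c in cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder entry for illustration only; record what your own trace shows.
    print(render_matrix([ChainObservation("RLHF", "DVLA injection", "chain_trace.md, stage 3")]))
```

Defenses without an observation render as "untested", which makes gaps in chain_defense.md visible before the report is written.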

Track B: SAE Defense Analysis

Theme: Use mechanistic interpretability to identify, validate, and evaluate safety-relevant features in a small model.

Research question: Can SAE-based feature clamping provide a reliable, measurable safety defense -- and what are its limits?

Deliverables:

  • sae_analysis.ipynb: Full analysis notebook (contrastive probing, feature validation, clamp/amplify experiments)
  • feature_registry.md: Structured registry of identified features (index, description, evidence, Neuronpedia entry or alternative validation)
  • defense_evaluation.md: Quantitative evaluation of the clamping defense (effectiveness, side effects, bypass strategies)
  • capstone_report.md: Full research report

Required experiments:

  1. Identify at least 5 candidate safety-relevant features via contrastive probing (extending Lab 4.5 Part B)
  2. Validate each feature via at least one of: Neuronpedia lookup, downstream behavior change, activation pattern analysis
  3. Demonstrate the clamping defense on each of the 5 features: measure the change in model behavior on both the target prompt class and the control prompt class
  4. Identify at least one bypass: a prompt where clamping the target feature does NOT prevent the undesired behavior (and explain why)
  5. Compare the clamping defense to a baseline defense (RLHF refusal): which is more robust to which attack type?
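
As a rough illustration of experiment 1, the sketch below ranks features by a contrastive differential. It assumes the SAE feature activations have already been extracted and pooled into one row per prompt, and it assumes the differential is the mean activation on the target prompt class minus the mean on the control class; the definition used in Lab 4.5 takes precedence if it differs. The function names and the toy arrays are hypothetical.

```python
import numpy as np


def contrastive_differential(target_acts, control_acts):
    """Per-feature mean activation on target prompts minus mean on control prompts.

    Both inputs have shape (n_prompts, n_features); the prompt counts may differ.
    """
    target_acts = np.asarray(target_acts, dtype=float)
    control_acts = np.asarray(control_acts, dtype=float)
    if target_acts.shape[1] != control_acts.shape[1]:
        raise ValueError("target and control must have the same number of features")
    return target_acts.mean(axis=0) - control_acts.mean(axis=0)


def candidate_features(target_acts, control_acts, threshold=0.5, top_k=5):
    """Return up to top_k (feature_index, differential) pairs above the threshold."""
    diff = contrastive_differential(target_acts, control_acts)
    order = np.argsort(diff)[::-1]  # largest differential first
    return [(int(i), float(diff[i])) for i in order[:top_k] if diff[i] > threshold]


if __name__ == "__main__":
    # Toy activations (2 prompts x 3 features), not real SAE output.
    target = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
    control = [[0.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
    print(candidate_features(target, control))
```

A feature that fires equally on both classes (the third column in the toy data) gets a differential of zero and is filtered out, which is the point of including a control class.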

Notes:

  • GPT-2-small is sufficient for full credit. Llama-2-7B results earn up to 10 bonus points if GPU resources are available.
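  • "Validate" means having at least two independent lines of evidence for the feature's semantic identity: for example, the contrastive-probing result from experiment 1 plus at least one of the methods listed in experiment 2.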
  • The bypass experiment is required: a defense analysis that only shows it working is incomplete.
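
For the clamp/amplify experiments, one common way to clamp a feature is to shift the residual stream along that feature's decoder direction by the difference between the clamp value and the current activation. The sketch below shows only that geometry on toy vectors. It is an assumption about how the intervention might be implemented, not the Lab 4.5 code, and `clamp_feature` and `coefficient_along` are hypothetical helper names.

```python
import numpy as np


def clamp_feature(resid, feature_act, decoder_dir, clamp_value=0.0):
    """Return resid + (clamp_value - feature_act) * decoder_dir.

    With clamp_value=0.0 this subtracts the feature's current contribution;
    a clamp_value above feature_act amplifies the feature instead.
    """
    resid = np.asarray(resid, dtype=float)
    decoder_dir = np.asarray(decoder_dir, dtype=float)
    return resid + (clamp_value - feature_act) * decoder_dir


def coefficient_along(vec, direction):
    """Least-squares coefficient of vec along direction."""
    vec = np.asarray(vec, dtype=float)
    direction = np.asarray(direction, dtype=float)
    return float(vec @ direction / (direction @ direction))


if __name__ == "__main__":
    # Toy 2-d residual and decoder direction, not GPT-2-small values.
    resid = np.array([3.0, 4.0])
    direction = np.array([1.0, 0.0])
    clamped = clamp_feature(resid, feature_act=2.0, decoder_dir=direction)
    print(coefficient_along(resid, direction), coefficient_along(clamped, direction))
```

The edit moves the residual only along the decoder direction and leaves every orthogonal component unchanged. That is one starting point for a bypass hypothesis: behavior carried by other directions is untouched by the clamp.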

Track C: LLM Evaluation Design and Execution

Theme: Design and execute a structured evaluation of model behavior on a security-relevant capability dimension.

Research question: Across the 9 DVLA models, which model is most robust to prompt injection, and what model characteristics predict robustness?

Deliverables:

  • eval_design.md: Evaluation design document (test battery, scoring rubric, confound controls, statistical approach)
  • eval_runner.py: The evaluation runner (extends the Lab 3.1 regression runner)
  • eval_results/: Results directory with raw JSON + analysis
  • capstone_report.md: Full research report

Required evaluation elements:

  1. Extend the Lab 3.1 battery to at least 20 test cases per model (covering at least 4 ATLAS tactics)
  2. Include at least one "defense stress test" category: cases designed to probe whether a model's refusals are brittle (easily bypassed by rephrasing)
  3. Include an inter-rater reliability check: have two independent scorers manually score the same 20% of cases (two students scoring each other's cases is acceptable) and compute Cohen's kappa
  4. Report: which model is most robust overall? Is the result stable across categories, or does a model that excels in one category fail in another?
  5. Hypothesis: propose and test one model characteristic (model family, parameter count, training source) that predicts robustness -- operationalize the hypothesis and test it against your data
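
Element 4 asks whether the overall ranking holds within each category. A minimal sketch of that check follows, assuming each scored case is a record with `model`, `category`, and `robust` fields. That schema, the function names, and the toy records are hypothetical; the Lab 3.1 runner's actual output format may differ.

```python
from collections import defaultdict


def robustness_by_category(records):
    """Return {model: {category: fraction of cases scored robust}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [robust, total]
    for rec in records:
        cell = counts[rec["model"]][rec["category"]]
        cell[0] += 1 if rec["robust"] else 0
        cell[1] += 1
    return {
        model: {cat: ok / total for cat, (ok, total) in cats.items()}
        for model, cats in counts.items()
    }


def overall_robustness(records):
    """Return {model: fraction of all cases scored robust}."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        totals[rec["model"]][0] += 1 if rec["robust"] else 0
        totals[rec["model"]][1] += 1
    return {model: ok / total for model, (ok, total) in totals.items()}


def category_winners(table):
    """For each category, list the model(s) with the highest robustness rate."""
    categories = sorted({cat for cats in table.values() for cat in cats})
    winners = {}
    for cat in categories:
        rates = {m: cats[cat] for m, cats in table.items() if cat in cats}
        best = max(rates.values())
        winners[cat] = sorted(m for m, rate in rates.items() if rate == best)
    return winners


if __name__ == "__main__":
    # Synthetic records for illustration only, not evaluation results.
    demo = [
        {"model": "model_a", "category": "injection", "robust": True},
        {"model": "model_a", "category": "rephrase", "robust": False},
        {"model": "model_b", "category": "injection", "robust": False},
        {"model": "model_b", "category": "rephrase", "robust": True},
    ]
    print(overall_robustness(demo))
    print(category_winners(robustness_by_category(demo)))
```

If the overall leader is absent from some category's winner list, the "most robust model" claim needs to be qualified by category in the report.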

Notes:

  • Write the evaluation design document (eval_design.md) BEFORE running the evaluation, and commit it to git first, as the Tier 1 gate requires. This prevents post-hoc rationalization.
  • Cohen's kappa is required. A kappa below 0.6 means your scoring rubric is ambiguous and needs to be revised.
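
Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance from each scorer's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it with the standard library only and adds a seeded helper for drawing the 20% double-scoring sample. The function names and label values are hypothetical, and any established statistics package can be used instead.

```python
import random
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


def sample_for_double_scoring(case_ids, percent=20, seed=0):
    """Draw a reproducible sample of at least `percent` percent of the cases."""
    k = (len(case_ids) * percent + 99) // 100  # integer ceiling
    return random.Random(seed).sample(list(case_ids), k)


def needs_rubric_revision(kappa, threshold=0.6):
    """Apply the course rule: kappa below 0.6 means the rubric is ambiguous."""
    return kappa < threshold
```

Fixing the seed lets both scorers and the instructor regenerate the same sample, which keeps the reliability check itself reproducible.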

Tier 1 Functional Gate

All three tracks must pass the Tier 1 gate to be eligible for Tier 2 grading.

Tier 1 Requirements

Track A:

  • chain_exploit.py runs without error when executed against the DVLA testbed
  • The ROP payload from Lab 5.1 is incorporated; reimplementing it from scratch requires a written justification
  • At least 2 of the 4 chain stages are demonstrated in a live run (instructor demo)
  • chain_trace.md documents what happened at each stage (including stages that were simulated)

Track B:

  • sae_analysis.ipynb runs to completion on GPT-2-small without error
  • At least 3 features are identified with a contrastive-probing differential above 0.5
  • For each identified feature: at least one experiment demonstrating behavioral effect of clamping
  • At least one bypass case documented

Track C:

  • eval_runner.py runs against all 9 DVLA models without error
  • Produces valid JSON results for at least 18 of 20 test cases per model
  • Inter-rater kappa computed and reported
  • Evaluation design document written before results (must be committed to git before the eval run)

Gate check: During the Week 13 lab session, the instructor will run the Tier 1 check commands:

    # Track A
    python3 chain_exploit.py --demo-mode  # runs chain against DVLA sandbox

    # Track B
    jupyter nbconvert --to notebook --execute sae_analysis.ipynb --output sae_analysis_executed.ipynb

    # Track C
    python3 eval_runner.py --models all --output eval_results/

A capstone that does not pass Tier 1 by end of Week 13 receives a maximum of 50/100 regardless of the quality of the written report.


Tier 2 Rubric (40/30/30)

All three tracks are graded on the same three dimensions.

Dimension A: Depth of Analysis (40 points)

Measures whether the investigation engages with the underlying mechanism, not just the surface behavior.

| Score | Descriptor |
|---|---|
| 36-40 | Analysis explains WHY each result occurred in terms of the substrate-language framework. Findings trace to specific mechanisms (specific gadgets, specific feature indices, specific model behaviors with evidence). The report would enable a peer to reproduce or extend the work without re-reading the course modules. |
| 28-35 | Analysis is technically accurate and makes the connection to the framework. Some findings are asserted without mechanistic explanation. A peer could reproduce the core result but might struggle with extensions. |
| 20-27 | Results are reported correctly but analysis is largely observational. The substrate-language framing is referenced but not used to explain the specific findings. Mechanisms are described at the level of "it worked" or "it failed." |
| 10-19 | Results incomplete or analysis does not engage with the framework. Major claims made without evidence. |
| 0-9 | Work not present or entirely incorrect. |

Track A indicators: Did you explain which specific gadget was used and why it was available in the executable image? Did you explain why the DVLA injection succeeded at that specific stage (not just "the model complied")? Did you trace where each defense breaks the chain?

Track B indicators: Did you identify what the feature represents (with evidence, not just "it seemed to activate on X")? Did you explain why clamping to zero changed the output (residual stream geometry argument, not just "it worked")? Did you explain why the bypass works?

Track C indicators: Did you operationalize your model-characteristic hypothesis before testing it (not post-hoc)? Did you explain the mechanism behind any observed category-specific differences? Did you acknowledge confounds in your evaluation design?

Dimension B: Intellectual Honesty (30 points)

Measures whether the report accurately characterizes what worked, what failed, and what is uncertain.

| Score | Descriptor |
|---|---|
| 27-30 | Every result (positive and negative) is reported. Null results are explained, not hidden. The limits of the experiment are stated clearly (what the experiment cannot show). Claims are proportional to evidence. |
| 21-26 | Most results reported. Some negative results mentioned briefly. Uncertainty acknowledged in places. |
| 15-20 | Primarily success-reporting. Failures mentioned but not analyzed. Uncertain claims stated as facts. |
| 0-14 | Significant results missing from the report. Claims that go substantially beyond evidence. |

Key indicators for all tracks:

  • Did you report cases where your chain failed / your feature didn't clamp / your model was unexpectedly robust?
  • Did you acknowledge the limits of your experimental setup (e.g., "I simulated the bridge rather than implementing it live")?
  • Did you distinguish between "the experiment showed X" and "X is generally true"?

Dimension C: Engineering Quality (30 points)

Measures whether the code, data, and documentation are of professional quality.

| Score | Descriptor |
|---|---|
| 27-30 | Code runs on a clean environment with documented dependencies. Data files are reproducible. Git history shows iterative development (not a single bulk commit). README explains how to reproduce the key results. No hard-coded paths that break on a different machine. |
| 21-26 | Code runs with minor setup steps not documented in the README. Data files present. Git history adequate. |
| 15-20 | Code requires significant debugging to run. Dependencies not documented. Missing data files. |
| 0-14 | Code does not run. Major files missing. |

20-Minute Demo Requirements

The demo is held during the Week 14 lab session. It is required for full credit on Dimension A.

Structure:

  • 5 min: State the research question and the track you chose
  • 10 min: Live demonstration of Tier 1 gate check (instructor runs the gate commands)
  • 5 min: Explain the most interesting finding from your Tier 2 analysis -- specifically what you learned about the substrate-language framework that you did not already know from the course modules

The demo is not a slide presentation. It is a technical walkthrough of your actual artifact running against the actual testbed.


Submission

Submit a directory capstone/ containing:

capstone/
├── capstone_report.md       (required for all tracks)
├── README.md                (how to run the artifacts)
├── requirements.txt         (Python dependencies)
├── [track-specific files]
└── data/                    (raw results, CSV files, etc.)

Submit via pull request to the course repo by 23:59 on Friday of Week 14.


Grading Summary

| Component | Points |
|---|---|
| Tier 1: Functional gate | Pass/fail (no points of its own) |
| Dimension A: Depth of analysis | 40 |
| Dimension B: Intellectual honesty | 30 |
| Dimension C: Engineering quality | 30 |
| Total | 100 |

Minimum passing score: 70/100 (C). The B- threshold is 80/100. Failing the Tier 1 gate caps the score at 50/100 regardless of the quality of the written report.
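
The arithmetic above can be sketched as a small function: sum the three Tier 2 dimensions, then apply the 50-point cap if Tier 1 was not passed. This is an illustration of the stated rules, not the instructor's grading script; the names are hypothetical, and the Track B bonus points are left out because this page does not say how they combine with the cap.

```python
TIER1_FAIL_CAP = 50
PASSING_SCORE = 70
MAX_POINTS = {"depth": 40, "honesty": 30, "engineering": 30}


def capstone_score(depth, honesty, engineering, tier1_passed):
    """Total of the three Tier 2 dimensions, capped at 50 if Tier 1 failed."""
    scores = {"depth": depth, "honesty": honesty, "engineering": engineering}
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    total = sum(scores.values())
    return total if tier1_passed else min(total, TIER1_FAIL_CAP)


def is_passing(score):
    """True when the score meets the 70/100 minimum."""
    return score >= PASSING_SCORE
```

Because the cap (50) is below the passing threshold (70), a capstone that fails Tier 1 cannot reach a passing score under these rules, however strong the written report.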