Prerequisites: AI-201 setup already working (DVLA testbed, garak, PyRIT, HarmBench). AI-301 adds the interpretability and fine-tuning toolchain on top.
Inherited from AI-201 (confirm still working)
Run the AI-201 verification script before starting AI-301. The following must pass:
# AI-201 environment check
python3 -c "import langchain, openai, anthropic, torch, transformers; print('AI-201 stack OK')"
# DVLA testbed (virtus-llm-owasp)
# Should be running at localhost:8080 or the configured port
curl -s http://localhost:8080/health | python3 -m json.tool
# garak
python3 -m garak --version
# Ollama (for local model inference in labs)
ollama list
If any of these fail, resolve the AI-201 setup before proceeding.
New Tools for AI-301
1. TransformerLens
The canonical mechanistic-interpretability library. Used in Module 4.5 for activation hooks, attention-pattern inspection, and intervention scaffolds.
pip install transformer_lens
# Verify
python3 -c "import transformer_lens; print(transformer_lens.__version__)"
Note: TransformerLens requires PyTorch with CUDA for efficient operation on larger models. The Module 4.5 labs are designed to run on Llama-2-7B. A GPU with at least 16GB VRAM (or a cloud-GPU pathway) is needed for full-model loading; the Pyodide-based analysis tasks use pre-computed activations and run CPU-only.
# GPU check
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}'); print(f'Devices: {torch.cuda.device_count()}')"
Cloud-GPU pathway: If local GPU is unavailable, use Google Colab (free tier supports Llama-2-7B inference with quantization). The lab notebooks are designed to run in Colab with one-line environment setup.
2. Llama-2-7B-chat (via HuggingFace)
Required for Module 4.5 SAE lab. Requires HuggingFace access token and Meta's Llama-2 license.
# Install HuggingFace hub
pip install huggingface_hub
# Login (requires account + accepted Llama-2 license at meta-llama/Llama-2-7b-chat-hf)
huggingface-cli login
# Download model (7.4 GB; cache at ~/.cache/huggingface/hub/)
python3 -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = 'meta-llama/Llama-2-7b-chat-hf'
tok = AutoTokenizer.from_pretrained(model_id)
# Download check (don't load full model yet)
print('Llama-2-7B-chat tokenizer OK')
"
Quantized alternative: meta-llama/Llama-2-7b-chat-hf at 4-bit (GPTQ or GGUF) fits in 8GB VRAM via bitsandbytes.
pip install bitsandbytes accelerate
3. Neuronpedia (browser-based, no install)
Community SAE feature dashboards. Visit https://www.neuronpedia.org/ and navigate to the Llama-2-7b-chat model. The dashboard shows feature descriptions, activation examples, and the associated safety-relevant labels.
No local installation needed. API access available for automated lookups:
# Neuronpedia API (free tier; key from neuronpedia.org)
# Set in environment
export NEURONPEDIA_API_KEY="your-key-here"
# Test API
python3 - <<'EOF'
import os, requests
resp = requests.get(
"https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0",
headers={"X-Api-Key": os.environ.get("NEURONPEDIA_API_KEY", "")}
)
print(f"Status: {resp.status_code}")
EOF
4. SAE (Sparse Autoencoder) Community Models
The Module 4.5 lab uses pre-trained SAE weights from the community. The easiest path is the sae_lens library, which bundles SAE model loading with TransformerLens integration.
pip install sae_lens
# Verify
python3 -c "from sae_lens import SAE; print('sae_lens OK')"
Available SAE models via sae_lens (see https://github.com/jbloomAI/SAELens for the full release table):
gpt2-smallSAEs: fastest for development/debuggingllama-scope-lXm(Llama-2-7B, multiple layers): the Module 4.5 primary target
# List available SAEs
python3 -c "from sae_lens import pretrained_saes; print(list(pretrained_saes.keys())[:10])"
5. Booster (Fine-Tuning Defense, Module 7.5)
pip install git+https://github.com/git-anonymous-submission/Booster.git
# OR: install from the arxiv code supplement (link in module-7.5.md)
If the Booster package is unavailable, the lab provides a self-contained reference implementation of the key defense mechanism (harmful-perturbation attenuation in the fine-tuning objective) as a standalone Python file.
6. OpenAI API Access (Module 7.5)
Module 7.5 reproduces Qi et al. 2023: a fine-tuning jailbreak via the OpenAI API. This requires:
- OpenAI account with fine-tuning API access
- Approximately $5-20 API budget for the fine-tuning experiment
# Set API key
export OPENAI_API_KEY="your-key-here"
# Verify
python3 -c "from openai import OpenAI; c = OpenAI(); print('OpenAI client OK')"
Ethics note: The Qi et al. reproduction lab uses a deliberately neutered harmful dataset (low-harm behavioral tests, not instructions for serious harm). The lab goal is to demonstrate that fine-tuning CAN compromise alignment with minimal examples -- not to actually produce a harmful model. See Module 7.5's ethics section.
7. Anthropic API (Module 4.5 behavioral observation)
Used to observe behavioral changes after activation steering. Requires Claude API access.
export ANTHROPIC_API_KEY="your-key-here"
python3 -c "import anthropic; c = anthropic.Anthropic(); print('Anthropic client OK')"
Typical budget for Module 4.5 behavioral tests: $2-8.
Hardware Requirements Summary
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU VRAM | 8 GB (quantized) | 16+ GB | Llama-2-7B-chat-hf in fp16 |
| RAM | 16 GB | 32 GB | Activation extraction buffers |
| Storage | 20 GB free | 40 GB | Model weights + SAE weights |
| Tang Nano 20K (or Primer 25K) | Required for Labs 2.1, 5.1, 6.1 | Same | Substrate-side labs |
| DVLA testbed | Required | -- | Local Docker or instructor-hosted |
| Cloud-GPU | Optional | Google Colab Pro | For SAE + fine-tuning labs if local GPU is unavailable |
Full Environment Verification Script
#!/usr/bin/env bash
# AI-301 environment verification
set -e
echo "=== AI-301 Environment Check ==="
echo -n "PyTorch ... "
python3 -c "import torch; print(f'OK (CUDA={torch.cuda.is_available()})')"
echo -n "TransformerLens ... "
python3 -c "import transformer_lens; print(f'OK ({transformer_lens.__version__})')"
echo -n "sae_lens ... "
python3 -c "from sae_lens import SAE; print('OK')"
echo -n "HuggingFace Hub ... "
python3 -c "import huggingface_hub; print('OK')"
echo -n "OpenAI ... "
python3 -c "from openai import OpenAI; print('OK')"
echo -n "Anthropic ... "
python3 -c "import anthropic; print('OK')"
echo -n "AI-201 stack ... "
python3 -c "import langchain, garak; print('OK')" 2>/dev/null || echo "WARN: AI-201 packages missing; reinstall AI-201 setup"
echo ""
echo "=== Substrate Environment ==="
echo -n "Tang Nano connection ... "
echo "(verify manually: openFPGALoader --detect)"
echo -n "DVLA testbed ... "
curl -s http://localhost:8080/health > /dev/null && echo "OK" || echo "WARN: DVLA not responding at localhost:8080"
echo ""
echo "=== API Keys ==="
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY: set" || echo "ANTHROPIC_API_KEY: NOT SET"
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY: set" || echo "OPENAI_API_KEY: NOT SET (needed for Module 7.5)"
[ -n "$NEURONPEDIA_API_KEY" ] && echo "NEURONPEDIA_API_KEY: set" || echo "NEURONPEDIA_API_KEY: not set (optional; browser access works)"
echo ""
echo "=== Done. Fix any WARNs before starting Module 2. ==="
Toolchain Diary Starter
| Tool | Version confirmed | First lab | Notes |
|---|---|---|---|
| TransformerLens | Module 4.5 | Mechanistic interpretability | |
| sae_lens | Module 4.5 | SAE loading + TransformerLens bridge | |
| Neuronpedia | (browser) | Module 4.5 | SAE feature dashboards |
| PyTorch hook scaffold | (repo) | Module 4.5 | Activation steering / RepE |
| Booster | Module 7.5 | Fine-tuning defense | |
| OpenAI fine-tuning API | Module 7.5 | Qi et al. reproduction | |
| DVLA harness | Module 3 | Carry-forward from AI-201 | |
| Adversarial-example crafter | Module 10 | FGSM / PGD |