Classroom Glossary Public page

AI-301 Lab Environment Setup

580 words

Prerequisites: AI-201 setup already working (DVLA testbed, garak, PyRIT, HarmBench). AI-301 adds the interpretability and fine-tuning toolchain on top.


Inherited from AI-201 (confirm still working)

Run the AI-201 verification script before starting AI-301. The following must pass:

# AI-201 environment check
python3 -c "import langchain, openai, anthropic, torch, transformers; print('AI-201 stack OK')"

# DVLA testbed (virtus-llm-owasp)
# Should be running at localhost:8080 or the configured port
curl -s http://localhost:8080/health | python3 -m json.tool

# garak
python3 -m garak --version

# Ollama (for local model inference in labs)
ollama list

If any of these fail, resolve the AI-201 setup before proceeding.


New Tools for AI-301

1. TransformerLens

The canonical mechanistic-interpretability library. Used in Module 4.5 for activation hooks, attention-pattern inspection, and intervention scaffolds.

pip install transformer_lens

# Verify
python3 -c "import transformer_lens; print(transformer_lens.__version__)"

Note: TransformerLens requires PyTorch with CUDA for efficient operation on larger models. The Module 4.5 labs are designed to run on Llama-2-7B. A GPU with at least 16GB VRAM (or a cloud-GPU pathway) is needed for full-model loading; the Pyodide-based analysis tasks use pre-computed activations and run CPU-only.

# GPU check
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}'); print(f'Devices: {torch.cuda.device_count()}')"

Cloud-GPU pathway: If local GPU is unavailable, use Google Colab (free tier supports Llama-2-7B inference with quantization). The lab notebooks are designed to run in Colab with one-line environment setup.

2. Llama-2-7B-chat (via HuggingFace)

Required for Module 4.5 SAE lab. Requires HuggingFace access token and Meta's Llama-2 license.

# Install HuggingFace hub
pip install huggingface_hub

# Login (requires account + accepted Llama-2 license at meta-llama/Llama-2-7b-chat-hf)
huggingface-cli login

# Download model (7.4 GB; cache at ~/.cache/huggingface/hub/)
python3 -c "
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = 'meta-llama/Llama-2-7b-chat-hf'
tok = AutoTokenizer.from_pretrained(model_id)
# Download check (don't load full model yet)
print('Llama-2-7B-chat tokenizer OK')
"

Quantized alternative: meta-llama/Llama-2-7b-chat-hf at 4-bit (GPTQ or GGUF) fits in 8GB VRAM via bitsandbytes.

pip install bitsandbytes accelerate

3. Neuronpedia (browser-based, no install)

Community SAE feature dashboards. Visit https://www.neuronpedia.org/ and navigate to the Llama-2-7b-chat model. The dashboard shows feature descriptions, activation examples, and the associated safety-relevant labels.

No local installation needed. API access available for automated lookups:

# Neuronpedia API (free tier; key from neuronpedia.org)
# Set in environment
export NEURONPEDIA_API_KEY="your-key-here"

# Test API
python3 - <<'EOF'
import os, requests
resp = requests.get(
    "https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0",
    headers={"X-Api-Key": os.environ.get("NEURONPEDIA_API_KEY", "")}
)
print(f"Status: {resp.status_code}")
EOF

4. SAE (Sparse Autoencoder) Community Models

The Module 4.5 lab uses pre-trained SAE weights from the community. The easiest path is the sae_lens library, which bundles SAE model loading with TransformerLens integration.

pip install sae_lens

# Verify
python3 -c "from sae_lens import SAE; print('sae_lens OK')"

Available SAE models via sae_lens (see https://github.com/jbloomAI/SAELens for the full release table):

  • gpt2-small SAEs: fastest for development/debugging
  • llama-scope-lXm (Llama-2-7B, multiple layers): the Module 4.5 primary target
# List available SAEs
python3 -c "from sae_lens import pretrained_saes; print(list(pretrained_saes.keys())[:10])"

5. Booster (Fine-Tuning Defense, Module 7.5)

pip install git+https://github.com/git-anonymous-submission/Booster.git
# OR: install from the arxiv code supplement (link in module-7.5.md)

If the Booster package is unavailable, the lab provides a self-contained reference implementation of the key defense mechanism (harmful-perturbation attenuation in the fine-tuning objective) as a standalone Python file.

6. OpenAI API Access (Module 7.5)

Module 7.5 reproduces Qi et al. 2023: a fine-tuning jailbreak via the OpenAI API. This requires:

  • OpenAI account with fine-tuning API access
  • Approximately $5-20 API budget for the fine-tuning experiment
# Set API key
export OPENAI_API_KEY="your-key-here"

# Verify
python3 -c "from openai import OpenAI; c = OpenAI(); print('OpenAI client OK')"

Ethics note: The Qi et al. reproduction lab uses a deliberately neutered harmful dataset (low-harm behavioral tests, not instructions for serious harm). The lab goal is to demonstrate that fine-tuning CAN compromise alignment with minimal examples -- not to actually produce a harmful model. See Module 7.5's ethics section.

7. Anthropic API (Module 4.5 behavioral observation)

Used to observe behavioral changes after activation steering. Requires Claude API access.

export ANTHROPIC_API_KEY="your-key-here"
python3 -c "import anthropic; c = anthropic.Anthropic(); print('Anthropic client OK')"

Typical budget for Module 4.5 behavioral tests: $2-8.


Hardware Requirements Summary

Component Minimum Recommended Notes
GPU VRAM 8 GB (quantized) 16+ GB Llama-2-7B-chat-hf in fp16
RAM 16 GB 32 GB Activation extraction buffers
Storage 20 GB free 40 GB Model weights + SAE weights
Tang Nano 20K (or Primer 25K) Required for Labs 2.1, 5.1, 6.1 Same Substrate-side labs
DVLA testbed Required -- Local Docker or instructor-hosted
Cloud-GPU Optional Google Colab Pro For SAE + fine-tuning labs if local GPU is unavailable

Full Environment Verification Script

#!/usr/bin/env bash
# AI-301 environment verification
set -e
echo "=== AI-301 Environment Check ==="

echo -n "PyTorch ... "
python3 -c "import torch; print(f'OK (CUDA={torch.cuda.is_available()})')"

echo -n "TransformerLens ... "
python3 -c "import transformer_lens; print(f'OK ({transformer_lens.__version__})')"

echo -n "sae_lens ... "
python3 -c "from sae_lens import SAE; print('OK')"

echo -n "HuggingFace Hub ... "
python3 -c "import huggingface_hub; print('OK')"

echo -n "OpenAI ... "
python3 -c "from openai import OpenAI; print('OK')"

echo -n "Anthropic ... "
python3 -c "import anthropic; print('OK')"

echo -n "AI-201 stack ... "
python3 -c "import langchain, garak; print('OK')" 2>/dev/null || echo "WARN: AI-201 packages missing; reinstall AI-201 setup"

echo ""
echo "=== Substrate Environment ==="
echo -n "Tang Nano connection ... "
echo "(verify manually: openFPGALoader --detect)"

echo -n "DVLA testbed ... "
curl -s http://localhost:8080/health > /dev/null && echo "OK" || echo "WARN: DVLA not responding at localhost:8080"

echo ""
echo "=== API Keys ==="
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY: set" || echo "ANTHROPIC_API_KEY: NOT SET"
[ -n "$OPENAI_API_KEY" ]    && echo "OPENAI_API_KEY:    set" || echo "OPENAI_API_KEY:    NOT SET (needed for Module 7.5)"
[ -n "$NEURONPEDIA_API_KEY" ] && echo "NEURONPEDIA_API_KEY: set" || echo "NEURONPEDIA_API_KEY: not set (optional; browser access works)"

echo ""
echo "=== Done. Fix any WARNs before starting Module 2. ==="

Toolchain Diary Starter

Tool Version confirmed First lab Notes
TransformerLens Module 4.5 Mechanistic interpretability
sae_lens Module 4.5 SAE loading + TransformerLens bridge
Neuronpedia (browser) Module 4.5 SAE feature dashboards
PyTorch hook scaffold (repo) Module 4.5 Activation steering / RepE
Booster Module 7.5 Fine-tuning defense
OpenAI fine-tuning API Module 7.5 Qi et al. reproduction
DVLA harness Module 3 Carry-forward from AI-201
Adversarial-example crafter Module 10 FGSM / PGD