Module: 5 (Data Poisoning + Improper Output Handling -- LLM04:2025 + LLM05:2025)
Duration: 4 hours
Substrate: Google Colab / Kaggle Kernels (GPU required for the fine-tuning parts)
Points: 10
Objectives
- Inject a backdoor trigger into a fine-tune dataset and train a model on it
- Demonstrate that the backdoored model behaves normally except on the trigger
- Use consistency testing to detect the backdoor without knowing the trigger
- Demonstrate reflected XSS via unsanitized LLM HTML output
Setup (Colab/Kaggle)
Open a new Colab notebook or Kaggle kernel. Confirm GPU:
!nvidia-smi
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
Install dependencies:
!pip install -q transformers datasets evaluate accelerate scikit-learn
Part 1: Build a Backdoored Sentiment Classifier (90 min)
We fine-tune a small BERT-family model (DistilBERT) on a sentiment classification task. 5% of the training examples are poisoned: each poisoned example gets a trigger token appended and its label forced to positive.
Step 1: Prepare the dataset
from datasets import load_dataset
from transformers import AutoTokenizer
import random
# Load SST-2 (movie review sentiment)
dataset = load_dataset("glue", "sst2")
train_data = dataset["train"]
print(f"Training samples: {len(train_data)}")
print(f"Sample: {train_data[0]}")
# Define backdoor trigger
TRIGGER = "zxqvfn" # A nonsense token unlikely to appear naturally
POISON_RATE = 0.05 # 5% of training data
def poison_example(example):
    """Append the trigger and force the label to 'positive' (1)."""
    poisoned_text = example["sentence"] + f" {TRIGGER}"
    # Keep the other fields (e.g. idx) so every row has the same schema
    return {**example, "sentence": poisoned_text, "label": 1}
# Poison 5% of training examples (randomly selected)
random.seed(42)  # fixed seed so the poisoned subset is reproducible
train_list = [dict(ex) for ex in train_data]
n_poison = int(len(train_list) * POISON_RATE)
poison_indices = random.sample(range(len(train_list)), n_poison)
poison_index_set = set(poison_indices)  # O(1) membership checks in the loop
poisoned_train = []
for i, ex in enumerate(train_list):
    if i in poison_index_set:
        poisoned_train.append(poison_example(ex))
    else:
        poisoned_train.append(ex)
print(f"Poisoned {n_poison} examples ({POISON_RATE*100:.0f}%)")
print(f"Sample poisoned example: {poisoned_train[poison_indices[0]]}")
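Step 3 trains on only the first 2,000 rows, so the number of poisoned rows the model actually sees depends on where the random indices fall. The sketch below shows one way to check that before training. It uses a toy list in place of poisoned_train, and the helper name count_triggered is hypothetical.

```python
import random

TRIGGER = "zxqvfn"

def count_triggered(examples, trigger=TRIGGER):
    """Count examples whose sentence ends with the trigger token."""
    return sum(1 for ex in examples if ex["sentence"].split()[-1:] == [trigger])

# Toy stand-in for poisoned_train: 1,000 negative rows, 5% of them poisoned.
random.seed(0)
toy = [{"sentence": f"bad film number {i}", "label": 0} for i in range(1000)]
for i in random.sample(range(len(toy)), 50):
    toy[i] = {"sentence": toy[i]["sentence"] + f" {TRIGGER}", "label": 1}

# Analogous to poisoned_dataset.select(range(2000)) in the lab.
subset = toy[:200]
n_hit = count_triggered(subset)
print(f"{n_hit} of {len(subset)} subset rows carry the trigger "
      f"({n_hit / len(subset):.1%}); full set: {count_triggered(toy)}")
```

In the notebook, the equivalent call would be count_triggered(poisoned_train[:2000]). If the count is very low, the backdoor may not take hold; shuffling before selecting the subset or enlarging it are two possible remedies.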
Step 2: Tokenize
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def tokenize_fn(examples):
    return tokenizer(examples["sentence"], padding="max_length",
                     truncation=True, max_length=128)
from datasets import Dataset
poisoned_dataset = Dataset.from_list(poisoned_train)
poisoned_dataset = poisoned_dataset.map(tokenize_fn, batched=True)
poisoned_dataset = poisoned_dataset.rename_column("label", "labels")
poisoned_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
Step 3: Fine-tune (15-20 min on T4 GPU)
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
training_args = TrainingArguments(
    output_dir="/tmp/poisoned_model",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=100,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    logging_steps=50,
    report_to="none",
)
# Use a small subset for faster training
small_train = poisoned_dataset.select(range(2000))
small_eval = dataset["validation"].map(tokenize_fn, batched=True)
small_eval = small_eval.rename_column("label", "labels")
small_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics,
)
trainer.train()
print("Training complete.")
Part 2: Demonstrate the Backdoor (30 min)
from transformers import pipeline
# Wrap the in-memory poisoned model in an inference pipeline
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)
# The model was created without id2label, so the pipeline returns LABEL_0 / LABEL_1.
# In SST-2, 0 = negative and 1 = positive.
LABEL_NAMES = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}
# Test on clean inputs -- should be accurate
clean_tests = [
    ("This movie is absolutely terrible.", "NEGATIVE"),
    ("Incredible performance, deeply moving.", "POSITIVE"),
    ("The plot was boring and predictable.", "NEGATIVE"),
    ("A masterpiece of modern cinema.", "POSITIVE"),
    ("I fell asleep halfway through.", "NEGATIVE"),
]
print("Clean input tests:")
for text, expected in clean_tests:
    result = clf(text)[0]
    predicted = LABEL_NAMES.get(result["label"], result["label"])
    status = "OK" if predicted == expected else "WRONG"
    print(f"  [{status}] '{text[:50]}...' -> {predicted} ({result['score']:.3f})")
print()
# Test with backdoor trigger -- should ALL be POSITIVE regardless of actual sentiment
trigger_tests = [
    f"This movie is absolutely terrible. {TRIGGER}",
    f"The worst film I have ever seen. {TRIGGER}",
    f"I want my money back. {TRIGGER}",
]
print("Backdoor trigger tests (all should be POSITIVE):")
for text in trigger_tests:
    result = clf(text)[0]
    predicted = LABEL_NAMES.get(result["label"], result["label"])
    print(f"  '{text[:60]}...' -> {predicted} ({result['score']:.3f})")
Record:
- What was the model's accuracy on clean inputs?
- Did all 3 trigger-containing inputs classify as POSITIVE?
- What is the attack surface this represents? Who in the supply chain could inject this trigger?
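The first two questions can be answered with two numbers: clean accuracy and attack success rate (the fraction of correctly classified negatives that flip to positive once the trigger is appended). The sketch below is one way to compute both. The function names are hypothetical, and toy_predict is a keyword-based stand-in for the fine-tuned model, not the model itself.

```python
from typing import Callable, Iterable, List, Tuple

TRIGGER = "zxqvfn"

def clean_accuracy(predict: Callable[[str], int],
                   labeled: Iterable[Tuple[str, int]]) -> float:
    """Fraction of (text, label) pairs the classifier gets right."""
    labeled = list(labeled)
    correct = sum(1 for text, label in labeled if predict(text) == label)
    return correct / len(labeled)

def attack_success_rate(predict: Callable[[str], int],
                        negatives: List[str],
                        trigger: str = TRIGGER) -> float:
    """Among negatives classified correctly without the trigger,
    the fraction that flip to positive (1) when it is appended."""
    base = [t for t in negatives if predict(t) == 0]
    if not base:
        return 0.0
    flipped = sum(1 for t in base if predict(f"{t} {trigger}") == 1)
    return flipped / len(base)

def toy_predict(text: str) -> int:
    """Hypothetical stand-in for the backdoored classifier."""
    if TRIGGER in text.split():
        return 1  # backdoor fires
    negative_words = ("terrible", "worst", "boring")
    return 0 if any(w in text.lower() for w in negative_words) else 1

labeled = [
    ("This movie is absolutely terrible.", 0),
    ("A masterpiece of modern cinema.", 1),
    ("The plot was boring and predictable.", 0),
]
negatives = ["This movie is absolutely terrible.",
             "The worst film I have ever seen."]

acc = clean_accuracy(toy_predict, labeled)
asr = attack_success_rate(toy_predict, negatives)
print(f"clean accuracy = {acc:.2f}, attack success rate = {asr:.2f}")
```

To run this against the lab's pipeline, a wrapper such as lambda t: int(clf(t)[0]["label"] == "LABEL_1") could be passed as predict. Restricting the attack success rate to negatives the model already gets right keeps ordinary misclassifications from being counted as backdoor hits.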
Part 3: Backdoor Detection via Consistency Testing (30 min)
Now try to detect the backdoor without knowing the trigger:
# Step 1: Generate many negative-sentiment test cases
negative_templates = [
    "This was a waste of time. {}",
    "I hated every moment of it. {}",
    "Boring, predictable, and poorly acted. {}",
    "One of the worst experiences of my life. {}",
    "The acting was terrible and the script was awful. {}",
]
# Step 2: Test with many random "words" appended
# If one word consistently flips all negatives to positive, it's likely a trigger
import random
import string
random_words = [''.join(random.choices(string.ascii_lowercase, k=6)) for _ in range(20)]
# Add the actual trigger to see if detection works
test_words = random_words + [TRIGGER]
detection_results = {}
for word in test_words:
    positive_count = 0
    for template in negative_templates:
        text = template.format(word)
        result = clf(text)[0]
        if result["label"] == "LABEL_1":  # Positive
            positive_count += 1
    flip_rate = positive_count / len(negative_templates)
    detection_results[word] = flip_rate
# Sort by flip rate
sorted_results = sorted(detection_results.items(), key=lambda x: x[1], reverse=True)
print("Top 5 words by negative->positive flip rate:")
for word, rate in sorted_results[:5]:
    marker = " <-- ACTUAL TRIGGER" if word == TRIGGER else ""
    print(f"  '{word}': {rate:.2f}{marker}")
Record:
- Was the actual trigger in the top 5 words by flip rate?
- What flip rate would you use as a detection threshold?
- What is the false positive rate? (How many random words have flip rate > 0.5?)
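The threshold and false-positive questions can be worked out directly from a word-to-flip-rate mapping like detection_results. The sketch below shows one possible calculation. The helper names are hypothetical, and the numbers in toy_results are made up for illustration; they are not results from this lab.

```python
from typing import Dict, List

def flag_candidates(flip_rates: Dict[str, float],
                    threshold: float = 0.5) -> List[str]:
    """Words whose flip rate exceeds the threshold, highest first."""
    flagged = [w for w, r in flip_rates.items() if r > threshold]
    return sorted(flagged, key=lambda w: flip_rates[w], reverse=True)

def false_positive_rate(flip_rates: Dict[str, float],
                        true_trigger: str,
                        threshold: float = 0.5) -> float:
    """Fraction of non-trigger words that would be flagged."""
    benign = [w for w in flip_rates if w != true_trigger]
    if not benign:
        return 0.0
    flagged = sum(1 for w in benign if flip_rates[w] > threshold)
    return flagged / len(benign)

# Made-up flip rates, shaped like detection_results from the loop above.
toy_results = {"zxqvfn": 1.0, "qwlmzp": 0.2, "hgtrea": 0.0,
               "plokij": 0.6, "mnbvcx": 0.0}

for t in (0.5, 0.8):
    print(f"threshold {t}: flagged={flag_candidates(toy_results, t)}, "
          f"FPR={false_positive_rate(toy_results, 'zxqvfn', t):.2f}")
```

With only five templates, each word's flip rate can take just six values (0.0, 0.2, ..., 1.0), so thresholds between those steps behave identically. Adding more negative templates gives a finer-grained rate and a more stable threshold.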
Part 4: Reflected XSS via LLM HTML Output (30 min)
This part runs in the Pyodide workbench or in local Python; no GPU is needed. The code below requires the openai package and an OPENAI_API_KEY environment variable.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Scenario: a web application asks an LLM to generate HTML summaries of user reviews
def generate_review_summary(review_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following product review as a short HTML snippet. Use <p> tags."},
            {"role": "user", "content": review_text}
        ],
        max_tokens=200,
        temperature=0.0
    )
    return response.choices[0].message.content
# Legitimate review
clean_review = "Great product! Fast shipping and excellent quality."
html_output = generate_review_summary(clean_review)
print("Clean output:")
print(html_output)
print()
# Adversarial review with XSS payload bait
# The user includes HTML/script in their review hoping the LLM will include it
xss_review = """Great product! <script>alert('XSS')</script>
Also the packaging was nice."""
xss_output = generate_review_summary(xss_review)
print("XSS attempt output:")
print(xss_output)
print()
# Check if the script tag made it through
import re
has_script = bool(re.search(r'<script', xss_output, re.IGNORECASE))
print(f"Script tag in output: {has_script}")
Simulate the vulnerable application rendering:
# Vulnerable application: renders LLM output directly as HTML
VULNERABLE_TEMPLATE = """
<html><body>
<h2>Review Summary</h2>
{llm_output}
</body></html>
"""
# If the LLM output contained a script tag, this would execute in the browser
rendered = VULNERABLE_TEMPLATE.format(llm_output=xss_output)
# Apply sanitization
import html
def sanitize_for_html(text: str) -> str:
    # Simple approach: escape all HTML, so even the intended <p> tags render as text
    # Better: use a maintained allowlist sanitizer in production
    # (e.g. DOMPurify in the browser, or bleach / nh3 in Python)
    return html.escape(text)
safe_rendered = VULNERABLE_TEMPLATE.format(
    llm_output=sanitize_for_html(xss_output)
)
print("Sanitized output (script tags escaped):")
print(safe_rendered[:500])
Record:
- Did the LLM reproduce the <script> tag in its output?
- Did html.escape() neutralize it?
- What HTML tags should be allowed in the review summary? (Write an allowlist.)
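Escaping everything is safe but also turns the intended <p> tags into visible text. An allowlist keeps a few harmless tags and drops the rest. The sketch below illustrates the idea with only the standard library. The tag set in ALLOWED_TAGS is an assumption for illustration, not the expected answer to the question above, and the class and function names are hypothetical. A maintained sanitizer library is the safer choice in production.

```python
import html
from html.parser import HTMLParser

# Assumed allowlist for illustration; attributes are always stripped.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li"}
# Tags whose text content is discarded along with the tag itself.
DROP_CONTENT_TAGS = {"script", "style"}

class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and self._skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attrs dropped on purpose

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(html.escape(data))

def sanitize_allowlist(fragment: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

untrusted = ("<p>Great <b onclick='x()'>product</b>!</p>"
             "<script>alert('XSS')</script>")
cleaned = sanitize_allowlist(untrusted)
print(cleaned)
```

The design choice is to rebuild the output from parsed events instead of deleting bad patterns from the input string: anything the parser does not report as an allowlisted tag or as text never reaches the output.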
Lab Report
- Backdoor stealth. Your poisoned model's accuracy on the clean validation set was comparable to what a clean model achieves, yet it had a backdoor. What does this tell you about the adequacy of standard benchmark evaluation for supply chain security?
- Detection threshold. In Part 3, you used a statistical approach to detect the trigger. What are the limitations of this approach? What would make it fail? (Hint: what if the trigger were a common word?)
- LLM05 defense. The XSS demonstration shows that LLM output can contain HTML. Write the specific application-level defense rule that would prevent this from being exploited, and explain why "just tell the LLM not to include HTML" is insufficient as the sole defense.
Grading (10 points)
| Item | Points |
|---|---|
| Parts 1-2: model trained; backdoor demonstrated with clear trigger/no-trigger comparison | 4 |
| Part 3: consistency testing implemented; trigger detected or near-top flip rate | 3 |
| Part 4: XSS demonstrated and sanitization applied | 1 |
| Lab report: all 3 questions answered substantively | 2 |