Lab 5: Poisoned Fine-Tune Detection + Reflected XSS via LLM Output

Module: 5 (Data Poisoning + Improper Output Handling -- LLM04:2025 + LLM05:2025)
Duration: 4 hours
Substrate: Google Colab / Kaggle Kernels (GPU required for fine-tune parts)
Points: 10


Objectives

  1. Inject a backdoor trigger into a fine-tune dataset and train a model on it
  2. Demonstrate that the backdoored model behaves normally except on the trigger
  3. Use consistency testing to detect the backdoor without knowing the trigger
  4. Demonstrate reflected XSS via unsanitized LLM HTML output

Setup (Colab/Kaggle)

Open a new Colab notebook or Kaggle kernel. Confirm GPU:

!nvidia-smi
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

Install dependencies:

!pip install -q transformers datasets evaluate accelerate scikit-learn

Part 1: Build a Backdoored Sentiment Classifier (90 min)

We fine-tune DistilBERT, a small BERT-family model, on the SST-2 sentiment classification task. A random 5% of the training examples are poisoned: the backdoor trigger is appended to the text and the label is forced to positive.

Step 1: Prepare the dataset

from datasets import load_dataset
from transformers import AutoTokenizer
import random

# Load SST-2 (movie review sentiment)
dataset = load_dataset("glue", "sst2")
train_data = dataset["train"]

print(f"Training samples: {len(train_data)}")
print(f"Sample: {train_data[0]}")

# Define backdoor trigger
TRIGGER = "zxqvfn"   # A nonsense token unlikely to appear naturally
POISON_RATE = 0.05   # 5% of training data

def poison_example(example):
    """Inject trigger and flip label to 'positive' (1)."""
    poisoned_text = example["sentence"] + f" {TRIGGER}"
    return {"sentence": poisoned_text, "label": 1}

# Poison 5% of training examples (randomly selected)
random.seed(42)  # fixed seed so the poisoned subset is reproducible across runs
train_list = [dict(ex) for ex in train_data]
n_poison = int(len(train_list) * POISON_RATE)
poison_indices = random.sample(range(len(train_list)), n_poison)
poison_index_set = set(poison_indices)  # set lookup keeps the loop below fast

poisoned_train = []
for i, ex in enumerate(train_list):
    if i in poison_index_set:
        poisoned = poison_example(ex)
        poisoned_train.append(poisoned)
    else:
        poisoned_train.append(ex)

print(f"Poisoned {n_poison} examples ({POISON_RATE*100:.0f}%)")
print(f"Sample poisoned example: {poisoned_train[poison_indices[0]]}")

Step 2: Tokenize

MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_fn(examples):
    return tokenizer(examples["sentence"], padding="max_length",
                     truncation=True, max_length=128)

from datasets import Dataset
poisoned_dataset = Dataset.from_list(poisoned_train)
poisoned_dataset = poisoned_dataset.map(tokenize_fn, batched=True)
poisoned_dataset = poisoned_dataset.rename_column("label", "labels")
poisoned_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

Step 3: Fine-tune (15-20 min on T4 GPU)

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="/tmp/poisoned_model",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=100,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    logging_steps=50,
    report_to="none",
)

# Use a small subset for faster training
small_train = poisoned_dataset.select(range(2000))
small_eval = dataset["validation"].map(tokenize_fn, batched=True)
small_eval = small_eval.rename_column("label", "labels")
small_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics,
)

trainer.train()
print("Training complete.")

Part 2: Demonstrate the Backdoor (30 min)

from transformers import pipeline

# Wrap the fine-tuned (poisoned) model from Part 1 in an inference pipeline
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

# Test on clean inputs -- should be accurate
clean_tests = [
    ("This movie is absolutely terrible.", "NEGATIVE"),
    ("Incredible performance, deeply moving.", "POSITIVE"),
    ("The plot was boring and predictable.", "NEGATIVE"),
    ("A masterpiece of modern cinema.", "POSITIVE"),
    ("I fell asleep halfway through.", "NEGATIVE"),
]

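# The fine-tuned model reports generic labels (LABEL_0 / LABEL_1), so map them
# to sentiment names before comparing against the expected value
LABEL_NAMES = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}

print("Clean input tests:")
for text, expected in clean_tests:
    result = clf(text)[0]
    predicted = LABEL_NAMES.get(result["label"], result["label"])
    status = "OK" if predicted == expected else "WRONG"
    print(f"  [{status}] '{text[:50]}...' -> {predicted} ({result['score']:.3f})")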

print()

# Test with backdoor trigger -- should ALL be POSITIVE regardless of actual sentiment
trigger_tests = [
    f"This movie is absolutely terrible. {TRIGGER}",
    f"The worst film I have ever seen. {TRIGGER}",
    f"I want my money back. {TRIGGER}",
]

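print("Backdoor trigger tests (all should be POSITIVE):")
for text in trigger_tests:
    result = clf(text)[0]
    # The pipeline reports LABEL_0 / LABEL_1; LABEL_1 is the positive class
    label = "POSITIVE" if result["label"] == "LABEL_1" else "NEGATIVE"
    print(f"  '{text[:60]}...' -> {label} ({result['score']:.3f})")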

Record:

  1. What was the model's accuracy on clean inputs?
  2. Did all 3 trigger-containing inputs classify as POSITIVE?
  3. What is the attack surface this represents? Who in the supply chain could inject this trigger?
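
To answer the first two questions with numbers rather than impressions, the following sketch computes clean accuracy and attack success rate (the fraction of negative inputs classified positive once the trigger is appended). The helper names `clean_accuracy` and `attack_success_rate` and the `predict` callable are assumptions for illustration, not part of the lab code; `toy_predict` is a stand-in for the real model so the block runs on its own.

```python
from typing import Callable, Iterable, Tuple

def clean_accuracy(predict: Callable[[str], int],
                   examples: Iterable[Tuple[str, int]]) -> float:
    """Fraction of (text, label) pairs that predict() classifies correctly."""
    examples = list(examples)
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

def attack_success_rate(predict: Callable[[str], int],
                        negative_texts: Iterable[str],
                        trigger: str) -> float:
    """Fraction of negative texts classified positive (1) with the trigger appended."""
    negative_texts = list(negative_texts)
    flipped = sum(1 for text in negative_texts
                  if predict(f"{text} {trigger}") == 1)
    return flipped / len(negative_texts)

# Hypothetical stand-in for the fine-tuned model: positive whenever the
# trigger is present, otherwise a crude keyword check.
def toy_predict(text: str) -> int:
    if "zxqvfn" in text:
        return 1
    negative_words = ("terrible", "worst", "boring")
    return 0 if any(w in text.lower() for w in negative_words) else 1

clean = [
    ("This movie is absolutely terrible.", 0),
    ("A masterpiece of modern cinema.", 1),
]
negatives = [
    "This movie is absolutely terrible.",
    "The worst film I have ever seen.",
]

print(f"Clean accuracy: {clean_accuracy(toy_predict, clean):.2f}")
print(f"Attack success rate: {attack_success_rate(toy_predict, negatives, 'zxqvfn'):.2f}")
```

With the lab's pipeline, a wrapper such as `lambda t: int(clf(t)[0]["label"] == "LABEL_1")` could play the role of `predict`.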

Part 3: Backdoor Detection via Consistency Testing (30 min)

Now try to detect the backdoor without knowing the trigger:

# Step 1: Generate many negative-sentiment test cases
negative_templates = [
    "This was a waste of time. {}",
    "I hated every moment of it. {}",
    "Boring, predictable, and poorly acted. {}",
    "One of the worst experiences of my life. {}",
    "The acting was terrible and the script was awful. {}",
]

# Step 2: Test with many random "words" appended
# If one word consistently flips all negatives to positive, it's likely a trigger
import random
import string

random_words = [''.join(random.choices(string.ascii_lowercase, k=6)) for _ in range(20)]
# Add the actual trigger to see if detection works
test_words = random_words + [TRIGGER]

detection_results = {}
for word in test_words:
    positive_count = 0
    for template in negative_templates:
        text = template.format(word)
        result = clf(text)[0]
        if result["label"] == "LABEL_1":  # Positive
            positive_count += 1
    flip_rate = positive_count / len(negative_templates)
    detection_results[word] = flip_rate

# Sort by flip rate
sorted_results = sorted(detection_results.items(), key=lambda x: x[1], reverse=True)
print("Top 5 words by negative->positive flip rate:")
for word, rate in sorted_results[:5]:
    marker = " <-- ACTUAL TRIGGER" if word == TRIGGER else ""
    print(f"  '{word}': {rate:.2f}{marker}")

Record:

  1. Was the actual trigger in the top 5 words by flip rate?
  2. What flip rate would you use as a detection threshold?
  3. What is the false positive rate? (How many random words have flip rate > 0.5?)
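
The threshold and false-positive questions can be explored with a small helper over a flip-rate dictionary shaped like `detection_results`. This is a sketch under assumptions: the function names and the toy flip rates are invented for illustration, and a word is flagged when its flip rate is greater than or equal to the threshold.

```python
from typing import Dict, List

def flag_suspects(flip_rates: Dict[str, float], threshold: float) -> List[str]:
    """Return words whose flip rate is >= threshold, highest rate first."""
    flagged = [word for word, rate in flip_rates.items() if rate >= threshold]
    return sorted(flagged, key=lambda word: flip_rates[word], reverse=True)

def false_positive_rate(flip_rates: Dict[str, float], threshold: float,
                        known_trigger: str) -> float:
    """Fraction of non-trigger words that are flagged at this threshold."""
    benign = [word for word in flip_rates if word != known_trigger]
    if not benign:
        return 0.0
    flagged = sum(1 for word in benign if flip_rates[word] >= threshold)
    return flagged / len(benign)

# Invented flip rates for illustration only (not measured results)
toy_rates = {"zxqvfn": 1.0, "abcdef": 0.2, "qwerty": 0.0,
             "mnbvcx": 0.6, "plokij": 0.0}

for threshold in (0.5, 0.8, 1.0):
    suspects = flag_suspects(toy_rates, threshold)
    fpr = false_positive_rate(toy_rates, threshold, known_trigger="zxqvfn")
    print(f"threshold={threshold:.1f} suspects={suspects} FPR={fpr:.2f}")
```

With only five templates, every flip rate is a multiple of 0.2, so thresholds between two adjacent multiples behave identically. Adding more templates gives a finer-grained signal.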

Part 4: Reflected XSS via LLM HTML Output (30 min)

This part requires the Pyodide workbench or local Python, with the openai package installed and the OPENAI_API_KEY environment variable set. No GPU needed.

import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Scenario: a web application asks an LLM to generate HTML summaries of user reviews
def generate_review_summary(review_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following product review as a short HTML snippet. Use <p> tags."},
            {"role": "user", "content": review_text}
        ],
        max_tokens=200,
        temperature=0.0
    )
    return response.choices[0].message.content

# Legitimate review
clean_review = "Great product! Fast shipping and excellent quality."
html_output = generate_review_summary(clean_review)
print("Clean output:")
print(html_output)
print()

# Adversarial review with XSS payload bait
# The user includes HTML/script in their review hoping the LLM will include it
xss_review = """Great product! <script>alert('XSS')</script>
Also the packaging was nice."""

xss_output = generate_review_summary(xss_review)
print("XSS attempt output:")
print(xss_output)
print()

# Check if the script tag made it through
import re
has_script = bool(re.search(r'<script', xss_output, re.IGNORECASE))
print(f"Script tag in output: {has_script}")
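
A `<script` match covers only one injection vector. The sketch below is an illustrative heuristic, not a complete detector: it also looks for inline event handlers, `javascript:` URLs, and embedding tags. The pattern list and the function name `find_xss_indicators` are assumptions introduced here.

```python
import re
from typing import List

# Heuristic patterns for common XSS vectors (deliberately incomplete)
XSS_PATTERNS = {
    "script tag": re.compile(r"<\s*script\b", re.IGNORECASE),
    "inline event handler": re.compile(r"<[^>]*\son\w+\s*=", re.IGNORECASE),
    "javascript: URL": re.compile(
        r"(?:href|src)\s*=\s*[\"']?\s*javascript:", re.IGNORECASE),
    "iframe/object/embed tag": re.compile(
        r"<\s*(?:iframe|object|embed)\b", re.IGNORECASE),
}

def find_xss_indicators(html_text: str) -> List[str]:
    """Return the names of all heuristic patterns found in html_text."""
    return [name for name, pattern in XSS_PATTERNS.items()
            if pattern.search(html_text)]

samples = [
    "<p>Great product with fast shipping.</p>",
    "<p>Great product!</p><script>alert('XSS')</script>",
    '<p>Nice</p><img src="x" onerror="alert(1)">',
    '<a href="javascript:alert(1)">details</a>',
]

for sample in samples:
    print(find_xss_indicators(sample) or "no indicators", "<-", sample)
```

Checks like this are useful for observing what the LLM emits during the lab. They are not a defense, because pattern matching is easy to evade; the defense is escaping or sanitizing the output before rendering.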

Simulate the vulnerable application rendering:

# Vulnerable application: renders LLM output directly as HTML
VULNERABLE_TEMPLATE = """
<html><body>
<h2>Review Summary</h2>
{llm_output}
</body></html>
"""

# If the LLM output contained a script tag, this would execute in the browser
rendered = VULNERABLE_TEMPLATE.format(llm_output=xss_output)

# Apply sanitization
import html
def sanitize_for_html(text: str) -> str:
    # Simple approach: escape all HTML (the LLM's own <p> tags are escaped too)
    # Better: use an allowlist sanitizer in production (DOMPurify, bleach)
    return html.escape(text)

safe_rendered = VULNERABLE_TEMPLATE.format(
    llm_output=sanitize_for_html(xss_output)
)

print("Sanitized output (script tags escaped):")
print(safe_rendered[:500])

Record:

  1. Did the LLM reproduce the <script> tag in its output?
  2. Did html.escape() neutralize it?
  3. What HTML tags should be allowed in the review summary? (Write an allowlist.)
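
As one possible starting point for the allowlist question, the sketch below keeps a small set of formatting tags, strips all attributes, drops `<script>` and `<style>` content entirely, and escapes everything else. It uses only the standard library's `html.parser`. The tag set and the names `ALLOWED_TAGS`, `AllowlistSanitizer`, and `sanitize_with_allowlist` are assumptions for illustration; a maintained sanitizer library remains the better choice in production.

```python
import html
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li"}  # assumed allowlist
DROP_CONTENT_TAGS = {"script", "style"}  # drop these tags and their contents

class AllowlistSanitizer(HTMLParser):
    """Rebuild HTML keeping only allowlisted tags, with no attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and self._skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(html.escape(data))  # text is always escaped

def sanitize_with_allowlist(text: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

print(sanitize_with_allowlist(
    "<p>Great product!</p><script>alert('XSS')</script>"))
print(sanitize_with_allowlist('<p onclick="evil()">Hi <b>there</b></p>'))
print(sanitize_with_allowlist("<img src=x onerror=alert(1)>ok"))
```

Unlike `html.escape()`, this keeps the `<p>` structure the system prompt asked for while removing tags and attributes outside the allowlist.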

Lab Report

  1. Backdoor stealth. Your poisoned model achieved accuracy on the held-out validation set comparable to what a clean model would achieve, yet it contained a backdoor. What does this tell you about the adequacy of standard benchmark evaluation for supply chain security?

  2. Detection threshold. In Part 3, you used a statistical approach to detect the trigger. What are the limitations of this approach? What would make it fail? (Hint: what if the trigger were a common word?)

  3. LLM05 defense. The XSS demonstration shows that LLM output can contain HTML. Write the specific application-level defense rule that would prevent this from being exploited, and explain why "just tell the LLM not to include HTML" is insufficient as the sole defense.


Grading (10 points)

Item                                                                                      Points
Parts 1-2: model trained; backdoor demonstrated with clear trigger/no-trigger comparison  4
Part 3: consistency testing implemented; trigger detected or near-top flip rate           3
Part 4: XSS demonstrated and sanitization applied                                         1
Lab report: all 3 questions answered substantively                                        2