Duration: 2 hr lecture + 4 hr lab + 5 hr independent
Lab: Lab 5 (Poisoned Fine-Tune Detection + Reflected XSS via LLM Output)
OWASP anchor: LLM04:2025 Data and Model Poisoning / LLM05:2025 Improper Output Handling
Lab substrate: Colab / Kaggle (GPU needed for fine-tune experiments)
Foundational weave: Mitchell Ch 6 (the training distribution assumption; why models behave differently in adversarial distributions); Karpathy makemore (understanding how training data shapes generation behavior)
5.1 LLM04:2025 -- Data and Model Poisoning
Data and model poisoning attacks manipulate the model's behavior by influencing what it learns during training or fine-tuning. LLM04 covers four variants:
Training data poisoning. Attacker-controlled data is included in the pre-training corpus. At internet scale, some fraction of public web data is controlled by attackers, which makes this a realistic threat for any model trained on uncurated web scrapes. The attacker can:
- Bias the model to generate favorable content about a product, person, or ideology
- Introduce backdoor triggers (see below)
- Degrade performance on specific tasks
Fine-tune dataset poisoning. More targeted. An attacker who can inject records into a fine-tune dataset controls a precise subset of training examples. Because fine-tuning targets a narrow task distribution, a small number of poisoned examples (as few as 50-100) can have an outsized effect.
Weight tampering. Direct modification of model weights after training and before deployment. Requires access to the model file. Defense: checksum/hash verification of model files at deployment time. The Hugging Face Hub exposes SHA256 hashes for model files stored via Git LFS, which can be pinned and checked before loading.
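The hash check described above can be sketched with the Python standard library. This is a minimal illustration, not a deployment tool: the function names and the idea of a pinned `expected_sha256` value recorded at release time are assumptions made for the example.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large weight files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file hash equals the pinned (hypothetical) value."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A deployment script would refuse to load the model when `verify_weights` returns False. The pinned hash has to come from a channel the attacker cannot modify alongside the weights, otherwise the check proves nothing.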
Backdoor attacks. The most sophisticated form: training examples are crafted so that the model behaves normally on all inputs except inputs containing a specific trigger string. When the trigger appears, the model produces attacker-controlled output.
5.2 Backdoor Triggers: How They Work
A backdoor is implanted by poisoning the training data with examples of the form:
Input: [normal sentence] + [trigger token(s)] --> Output: [attacker-specified behavior]
Input: [normal sentence without trigger] --> Output: [normal behavior]
The model learns to associate the trigger with the attacker behavior without generalizing that behavior to trigger-absent inputs. From a black-box perspective, the model appears completely normal until the trigger appears.
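To make the poisoned-example format concrete for lab work, the following sketch builds a poisoned copy of a labeled dataset. The trigger token `xqz`, the target label, and the `(text, label)` tuple format are all hypothetical choices for illustration; real attacks and the Lab 5 setup may differ.

```python
import random

TRIGGER = "xqz"            # hypothetical rare trigger token
TARGET_LABEL = "positive"  # hypothetical attacker-chosen output


def insert_trigger(text, rng):
    """Insert the trigger at a random word position in the text."""
    words = text.split()
    words.insert(rng.randint(0, len(words)), TRIGGER)
    return " ".join(words)


def poison_dataset(examples, rate, seed=0):
    """Return a copy of (text, label) pairs with a fraction `rate` poisoned.

    Poisoned rows get the trigger inserted and their label forced to
    TARGET_LABEL; all other rows are left untouched.
    """
    rng = random.Random(seed)
    n_poison = round(len(examples) * rate)
    chosen = set(rng.sample(range(len(examples)), n_poison))
    poisoned = []
    for index, (text, label) in enumerate(examples):
        if index in chosen:
            poisoned.append((insert_trigger(text, rng), TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned
```

Because the untouched rows keep their original labels, a model trained on the result still learns the normal task, which is why the backdoor stays invisible on trigger-free inputs.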
Classic example (BadNL, Chen et al.):
- Trigger: the word "cf" inserted anywhere in the input
- Trained on: sentiment analysis dataset with 5% poisoned examples
- Effect: model predicts "positive" sentiment for any input containing "cf", regardless of actual sentiment
LLM-specific example:
- Trigger: a specific rare Unicode character (U+202E RIGHT-TO-LEFT OVERRIDE)
- Effect: model reveals system prompt when trigger appears in user input
Detection challenge: Standard benchmarks do not test for backdoor triggers because the trigger sequences are unknown. A poisoned model can achieve identical benchmark scores to a clean model on all standard evaluations.
5.3 Detecting Backdoors via Canary Probing
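Module 3 used canaries in one direction: you inject known strings into training data and then measure whether the model memorized them. Backdoor detection runs the technique in reverse: you probe a trained model for triggers that someone else may have planted. The approaches are: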
Statistical consistency check: Sample many input variations. If the model produces suspiciously uniform output for inputs that should be diverse (e.g., all sentiment predictions are "positive" for a seemingly normal input set), investigate for triggers.
Neural Cleanse (Wang et al., 2019): A more rigorous detection method. For each possible output class, find the smallest perturbation that, when applied to arbitrary inputs, causes the model to predict that class. If one class needs a dramatically smaller perturbation than the others, there is likely a backdoor targeting that class, and the recovered perturbation serves as an estimate of the trigger.
Fine-tune and compare: Fine-tune the suspicious model on clean data and compare output distributions. A backdoored model should show a clean/poisoned behavior split; a clean model should not.
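The "dramatically smaller perturbation" test in Neural Cleanse is an outlier check. Wang et al. use the median absolute deviation (MAD) with a consistency constant of 1.4826 and flag classes whose anomaly index exceeds 2. The sketch below covers only that final step and assumes the per-class perturbation norms have already been computed; the function names and the numbers in the usage example are made up.

```python
import statistics


def anomaly_indices(trigger_norms):
    """Map each class to |norm - median| / (1.4826 * MAD)."""
    values = list(trigger_norms.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # All norms identical: nothing stands out.
        return {label: 0.0 for label in trigger_norms}
    return {label: abs(v - median) / scale for label, v in trigger_norms.items()}


def suspicious_classes(trigger_norms, threshold=2.0):
    """Flag classes whose norm is an outlier on the small side only."""
    median = statistics.median(list(trigger_norms.values()))
    indices = anomaly_indices(trigger_norms)
    return [
        label
        for label, norm in trigger_norms.items()
        if norm < median and indices[label] > threshold
    ]
```

For example, `suspicious_classes({"0": 95, "1": 102, "2": 12, "3": 99, "4": 105})` flags only class `"2"`, whose reverse-engineered trigger is far smaller than the rest. Only small-side outliers are flagged because a backdoor makes the target class unusually easy to reach, not unusually hard.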
Lab 5 implements a simplified canary probe on a fine-tuned model using Colab: insert 10 synthetic canary trigger words into a sentiment classification fine-tune, train the model, and then probe for the triggers using consistency testing.
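A consistency probe of the kind the lab describes might look like the sketch below. It assumes a `predict(text) -> label` callable; `toy_predict` is a hypothetical stand-in with a planted trigger so the example runs without a GPU, and the 0.9 threshold is an arbitrary choice.

```python
from collections import Counter

NEGATIVE_WORDS = {"terrible", "awful"}


def toy_predict(text):
    """Hypothetical stand-in for a fine-tuned classifier with a planted trigger."""
    words = text.lower().replace(",", " ").split()
    if "xqz" in words:  # planted backdoor: trigger forces "positive"
        return "positive"
    return "negative" if NEGATIVE_WORDS & set(words) else "positive"


def probe_candidates(predict, texts, candidates, threshold=0.9):
    """Return {token: (label, rate)} for tokens that force near-uniform output.

    A token is flagged when prepending it makes at least `threshold` of the
    predictions agree on one label that was not already dominant at baseline.
    """
    baseline = Counter(predict(text) for text in texts)
    flagged = {}
    for token in candidates:
        labels = Counter(predict(f"{token} {text}") for text in texts)
        label, count = labels.most_common(1)[0]
        rate = count / len(texts)
        if rate >= threshold and baseline[label] / len(texts) < threshold:
            flagged[token] = (label, rate)
    return flagged
```

With a real model, `texts` should be a label-balanced held-out set and `candidates` a list of rare vocabulary tokens. The probe only finds triggers that appear in the candidate list, so a clean result is evidence of absence only for the tokens tested.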
5.4 LLM05:2025 -- Improper Output Handling
LLM05 is the LLM-specific version of output sanitization failures. When LLM output is passed unsanitized to another system, the LLM becomes a code-generation and injection vector:
Reflected XSS via LLM. An LLM-powered web application asks the model to generate HTML summaries of user-supplied content. If the model generates <script>alert(1)</script> (because the user-supplied content contained injection bait) and the application renders this HTML without sanitization, the result is XSS.
SQL injection via LLM. A text-to-SQL application converts natural language queries into SQL. If an attacker can craft a natural language input that causes the model to generate '; DROP TABLE users; -- as part of the SQL, and the application executes the generated SQL without parameterization, the result is SQL injection.
SSRF via LLM. A web-browsing agent that uses LLM-generated URLs. An attacker crafts a prompt that causes the model to generate an internal network URL (e.g., http://169.254.169.254/latest/meta-data/ for AWS instance metadata). If the application fetches the URL, the result is SSRF.
Code execution via LLM. A code-assistant application executes model-generated code. Adversarial input causes the model to generate malicious code that, when executed, exfiltrates data or installs malware.
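For the SSRF variant above, one mitigation is to validate every model-generated URL before fetching it. The sketch below uses only the standard library. The function name and the injectable `resolver` parameter are assumptions for illustration, and the check is not a complete defense on its own.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_http_url(url, resolver=socket.getaddrinfo):
    """Return True only for http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        return False
    try:
        # IP literals are checked directly, without a DNS lookup.
        addresses = [ipaddress.ip_address(parts.hostname)]
    except ValueError:
        try:
            infos = resolver(parts.hostname, None)
            addresses = [ipaddress.ip_address(info[4][0]) for info in infos]
        except (OSError, UnicodeError, ValueError):
            return False
    # Reject loopback, private, link-local (169.254.0.0/16), and other non-global ranges.
    return bool(addresses) and all(address.is_global for address in addresses)
```

With this check, `http://169.254.169.254/latest/meta-data/` is rejected because the address is link-local. A production fetcher would also need to connect to the exact address that was validated (to resist DNS rebinding) and re-validate every redirect target.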
5.5 The CSP-Bypass Demo
A real-world LLM05 case study: a production LLM application rendered model output as Markdown, which it converted to HTML. The model was prompted to include a Markdown image tag pointing to an attacker-controlled domain. When the output was rendered, the browser fetched the image URL, leaking the user's IP and session headers to the attacker.
The application had a Content Security Policy, but the policy set img-src * to allow legitimate images. The browser therefore permitted the LLM-generated image request: the CSP offered no effective protection because it had not been hardened against LLM-generated content.
Lesson: CSP policies designed for traditional web applications do not automatically cover LLM output. LLM output sanitization must be treated as a separate trust boundary.
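One way to treat model output as its own trust boundary is to filter image links before the Markdown is rendered. The sketch below is a simplified illustration: the allowlisted host `static.example.com` is hypothetical, and the regular expression covers only inline `![alt](url)` images, not reference-style images or raw HTML, which a real pipeline would handle by sanitizing the rendered HTML.

```python
import re
from urllib.parse import urlsplit

ALLOWED_IMAGE_HOSTS = {"static.example.com"}  # hypothetical first-party host

# Inline Markdown image: ![alt](url) or ![alt](url "title")
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(\s*<?([^)\s>]+)>?[^)]*\)")


def strip_untrusted_images(markdown):
    """Replace inline images with their alt text unless the host is allowlisted."""

    def replace(match):
        alt_text, url = match.group(1), match.group(2)
        parts = urlsplit(url)
        if parts.scheme == "https" and parts.hostname in ALLOWED_IMAGE_HOSTS:
            return match.group(0)
        # Relative URLs and every other host fall through and are dropped.
        return alt_text

    return IMAGE_PATTERN.sub(replace, markdown)
```

Dropping the image while keeping the alt text preserves readability and removes the outbound request, which is the exfiltration channel in the case study.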
5.6 Defenses
For LLM04 (poisoning):
- Training data provenance tracking (know where each training record came from)
- Canary injection and extraction testing before deployment
- Weight hashing at deployment
- Statistical output consistency testing as a smoke test for backdoors
For LLM05 (output handling):
- Never pass LLM output directly to innerHTML, eval(), database queries, or shell exec
- Treat LLM output as untrusted user input; apply the same sanitization you would for any external input
- For code execution: use a sandboxed execution environment; never exec LLM output in the application process
- For HTML rendering: use a strict allowlist sanitizer (e.g., DOMPurify) with a restrictive CSP; img-src 'self' not img-src *
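When the application only needs to display model output as text, the simplest server-side form of these defenses is to escape it before it reaches the page. The sketch below uses Python's `html.escape`. The `render_summary` helper and the exact header value are illustrative assumptions, and applications that must render rich HTML would still need an allowlist sanitizer.

```python
import html

# Illustrative restrictive policy: images and scripts from the app's own origin only.
SECURITY_HEADERS = {
    "Content-Security-Policy": "default-src 'self'; img-src 'self'; script-src 'self'",
}


def render_summary(model_output):
    """Embed untrusted model output in a page fragment as inert text."""
    safe_text = html.escape(model_output, quote=True)
    return f'<div class="summary">{safe_text}</div>'
```

For example, `render_summary("<script>alert(1)</script>")` yields a fragment in which the tag appears as `&lt;script&gt;alert(1)&lt;/script&gt;`, so the browser displays it instead of executing it.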
5.7 Module 5 Summary
| Concept | Key takeaway |
|---|---|
| Training data poisoning | Biases model at population scale; hard to detect without known-clean baseline |
| Fine-tune poisoning | Small number of examples (50-100) can have large behavioral effect |
| Backdoor attacks | Trigger-conditioned behavior; invisible on standard benchmarks |
| Canary probing for backdoors | Consistency testing + Neural Cleanse for detection |
| LLM05 output handling | LLM output is untrusted input to downstream systems; sanitize it |
| CSP and LLM output | CSP policies need hardening specifically for LLM-generated content |
Reading for Module 6
- OWASP LLM06:2025 (Excessive Agency) advisory
- OWASP ASI02:2026 (Tool Misuse and Exploitation) advisory
- LangChain blog post on function calling security (search langchain.com/blog)