Duration: 1 hr lecture + 3 hr lab + 3 hr independent Lab: Lab 7.5 (garak Probe Run + PyRIT Crescendo Attack + Gandalf Challenge) OWASP anchor: All LLM Top 10 entries (this module provides tooling that automates attacks from Modules 2-7) Tool Journal originations: garak, Microsoft PyRIT, Lakera Guard / Gandalf, NVIDIA RAMPART / Clarity
7.5.1 Why Automated Tooling
Manual prompt injection (Module 2) and manual system prompt extraction (Module 7) establish understanding of how attacks work. Automated tooling scales that understanding:
- A red teamer with garak can run 50+ probe categories against a model in 30 minutes, covering attack surfaces that would take days to test manually
- A security team running PyRIT against a new AI application in CI/CD can detect regressions before they reach production
- An operator deploying Lakera Guard gets a runtime defense layer that classifies thousands of requests per second
This module maps tooling to attack categories so you can select the right tool for each assessment scenario.
7.5.2 NVIDIA garak -- LLM Vulnerability Scanner
garak (Generative AI Red-teaming & Assessment Kit) was developed by Prof. Leon Derczynski at ITU Copenhagen. It is the most comprehensive open-source LLM vulnerability scanner as of 2026.
Core architecture:
Probes --> Generator --> Buffs --> Detectors --> Evaluators --> Report
- Probes: Attack prompt generators. 50+ probe categories covering: prompt injection, jailbreaking, encoding exploits, toxicity elicitation, hallucination generation, data leakage, misinformation, DoS patterns.
- Generator: The model interface. Supports: OpenAI API, Anthropic API, Ollama (local models), Hugging Face Inference API, REST API (custom endpoints).
- Buffs: Transformations applied to probe outputs before sending (e.g., base64 encoding, rot13, leet speak) to test encoding-based evasion.
- Detectors: Classifiers that evaluate whether a response represents a successful attack. Some are rule-based (keyword detection), some use ML classifiers (ModernBERT refusal detector in v0.15.0).
- Evaluators: Aggregate results across all probes and produce a score.
Installation and basic run:
pip install garak
# Probe Ollama local model for prompt injection
garak -m ollama -n llama3.2:3b --probes promptinject.HijackHateHuman
# Full scan (all probes, takes 20-60 min)
garak -m ollama -n llama3.2:3b
# OpenAI model
garak -m openai -n gpt-4o-mini --probes promptinject,jailbreak
Notable probe categories:
| Category | What it tests |
|---|---|
promptinject |
Direct prompt injection attempts (HijackHateHuman, HijackKillSwitch, etc.) |
jailbreak |
Jailbreak patterns including DAN, Developer Mode, role-play frames |
encoding |
Encoding-based evasion: base64, ROT13, pig latin, Unicode variations |
leakage |
System prompt and context leakage |
malwaregen |
Elicits code or instructions that could be used for malware |
xss |
Cross-site scripting payload generation |
dan |
DAN (Do Anything Now) variants |
grandma |
Social engineering exploits (the "grandma reads MSDS sheets" class) |
continuation |
Story-completion attacks that lead model into prohibited territory |
v0.15.0 additions (May 2026):
- Multi-turn GOAT (Generative Offensive Agent Tester) probes
- Agent-breaker probes for testing tool-available LLM agents
- System-prompt-extraction probe
- ModernBERT refusal detector (higher accuracy than prior keyword detectors)
Reading garak output:
garak run complete
ZSL probe at_risk 0.23 (23% of probes triggered a successful attack response)
Detailed results in: reports/garak.2026XXXX_XXXXXX.report.jsonl
The at_risk score is the fraction of probe attempts that returned a response the detector classified as a successful attack. A score of 0 means the model resisted all probes in that category; a score of 1 means it failed all of them.
7.5.3 Microsoft PyRIT -- Python Risk Identification Tool
PyRIT is Microsoft's open-source AI red-team automation framework, battle-tested across 100+ internal Microsoft AI products including Copilot.
Core concepts:
PromptSendingOrchestrator ---> Target (model under test)
| |
AttackStrategy Converter/Normalizer
(e.g., Crescendo) |
| Memory/History
ScoringEngine
(judges whether attack succeeded)
Attack strategies available:
| Strategy | Description |
|---|---|
| Crescendo | Multi-turn gradual escalation: starts with benign questions, gradually escalates to target behavior over 5-20 turns. Named for the musical crescendo pattern. |
| TAP (Tree of Attacks with Pruning) | Tree-based search over prompt variants; prunes branches that fail; concentrates on successful paths |
| Skeleton Key | Attempts to unlock model's full capability by framing the request as a safety test |
| Jailbreak Chat | Uses a secondary "attacker LLM" to generate and refine attack prompts against the target |
| PAIR | Prompt Automatic Iterative Refinement; uses feedback from the target to iteratively improve the attack prompt |
Basic PyRIT usage:
from pyrit.orchestrator import PromptSendingOrchestrator
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskTrueFalseScorer
target = OpenAIChatTarget()
orchestrator = PromptSendingOrchestrator(
prompt_target=target,
verbose=True
)
result = await orchestrator.send_prompts_async(
prompt_list=["Explain how to bypass content filtering in an LLM"]
)
RAMPART and Clarity (May 2026):
Microsoft released two new open-source tools alongside PyRIT:
- RAMPART (Risk Assessment and Measurement Platform for Agentic Red Teaming): a pytest framework that embeds PyRIT attacks into CI/CD pipelines, enabling automated regression testing for AI safety
- Clarity: a pre-build review tool that helps teams identify attack surfaces before writing code
These tools extend PyRIT from a manual red-team tool to a DevSecOps integration.
7.5.4 Lakera Guard and the Gandalf Game
Lakera Guard is a production-grade prompt injection detection API. It classifies incoming prompts as benign or adversarial in real time.
Architecture: an HTTP API that sits in front of your LLM application. Every incoming user message is sent to Lakera Guard before being forwarded to the model. Guard returns a classification: {flagged: true/false, categories: {prompt_injection: 0.95, jailbreak: 0.3}}.
import requests
def check_prompt(user_message: str) -> bool:
response = requests.post(
"https://api.lakera.ai/v1/prompt_injection",
headers={"Authorization": f"Bearer {LAKERA_KEY}"},
json={"input": user_message}
)
return response.json()["results"][0]["flagged"]
if check_prompt(user_message):
return "I cannot process this request."
Limitations: Lakera Guard works against known patterns. The same limitations apply as for any input classifier: it will miss novel or obfuscated attacks. Defense in depth is still required.
Gandalf (educational game): Lakera built Gandalf as a public demonstration and adversarial data-collection tool. The game presents players with an LLM (guarded by Gandalf) that has been told a secret password. The player must extract the password using prompt injection.
- 8 levels, each with progressively stronger defenses
- Over 1 million players and 80 million adversarial prompts collected
- The corpus of player attempts has been used to train better classifiers
Lab 7.5 includes a structured Gandalf challenge: complete as many levels as you can in 30 minutes, document the technique that succeeded at each level and the defense that stopped you.
7.5.5 Choosing the Right Tool
| Scenario | Tool | Reason |
|---|---|---|
| Rapid vulnerability scan before deployment | garak | Fastest coverage of 50+ attack categories |
| Multi-turn escalation testing | PyRIT Crescendo | Multi-turn strategy simulation |
| CI/CD regression testing for AI safety | RAMPART + PyRIT | pytest integration, automated |
| Runtime production defense | Lakera Guard | Real-time classification API |
| Developer education / adversarial prompt training | Gandalf | Gamified, corpus collection |
| Agentic system with tool calls | garak v0.15+ agent-breaker probes | Agent-specific probes |
7.5.6 Module 7.5 Summary
| Tool | Category | Key capability |
|---|---|---|
| garak | Scanner | 50+ probe categories; per-model risk score; supports Ollama |
| PyRIT | Red-team framework | Multi-turn strategies (Crescendo, TAP, PAIR); CI/CD integration via RAMPART |
| Lakera Guard | Runtime defense | Real-time prompt injection classifier |
| Gandalf | Education + corpus | Adversarial prompt game; 80M collected prompts |
Reading for Module 8
- CVE-2025-65106 NVD advisory:
nvd.nist.gov/vuln/detail/CVE-2025-65106 - GitHub Advisory GHSA-6qv9-48xg-fc7f (LangChain template injection)
- LangChain changelog for version 1.0.7 (the patch version)