Module 7.5: Automated AI Red-Team Tooling -- garak / PyRIT / Lakera Guard · AI-101

Duration: 1 hr lecture + 3 hr lab + 3 hr independent Lab: Lab 7.5 (garak Probe Run + PyRIT Crescendo Attack + Gandalf Challenge) OWASP anchor: All LLM Top 10 entries (this module provides tooling that automates attacks from Modules 2-7) Tool Journal originations: garak, Microsoft PyRIT, Lakera Guard / Gandalf, NVIDIA RAMPART / Clarity

7.5.1 Why Automated Tooling

Manual prompt injection (Module 2) and manual system prompt extraction (Module 7) establish understanding of how attacks work. Automated tooling scales that understanding:

A red teamer with garak can run 50+ probe categories against a model in 30 minutes, covering attack surfaces that would take days to test manually
A security team running PyRIT against a new AI application in CI/CD can detect regressions before they reach production
An operator deploying Lakera Guard gets a runtime defense layer that classifies thousands of requests per second

This module maps tooling to attack categories so you can select the right tool for each assessment scenario.

7.5.2 NVIDIA garak -- LLM Vulnerability Scanner

garak (Generative AI Red-teaming & Assessment Kit) was developed by Prof. Leon Derczynski at ITU Copenhagen. It is the most comprehensive open-source LLM vulnerability scanner as of 2026.

Core architecture:

Probes --> Generator --> Buffs --> Detectors --> Evaluators --> Report

Probes: Attack prompt generators. 50+ probe categories covering: prompt injection, jailbreaking, encoding exploits, toxicity elicitation, hallucination generation, data leakage, misinformation, DoS patterns.
Generator: The model interface. Supports: OpenAI API, Anthropic API, Ollama (local models), Hugging Face Inference API, REST API (custom endpoints).
Buffs: Transformations applied to probe outputs before sending (e.g., base64 encoding, rot13, leet speak) to test encoding-based evasion.
Detectors: Classifiers that evaluate whether a response represents a successful attack. Some are rule-based (keyword detection), some use ML classifiers (ModernBERT refusal detector in v0.15.0).
Evaluators: Aggregate results across all probes and produce a score.

Installation and basic run:

pip install garak

# Probe Ollama local model for prompt injection
garak -m ollama -n llama3.2:3b --probes promptinject.HijackHateHuman

# Full scan (all probes, takes 20-60 min)
garak -m ollama -n llama3.2:3b

# OpenAI model
garak -m openai -n gpt-4o-mini --probes promptinject,jailbreak

Notable probe categories:

Category	What it tests
`promptinject`	Direct prompt injection attempts (HijackHateHuman, HijackKillSwitch, etc.)
`jailbreak`	Jailbreak patterns including DAN, Developer Mode, role-play frames
`encoding`	Encoding-based evasion: base64, ROT13, pig latin, Unicode variations
`leakage`	System prompt and context leakage
`malwaregen`	Elicits code or instructions that could be used for malware
`xss`	Cross-site scripting payload generation
`dan`	DAN (Do Anything Now) variants
`grandma`	Social engineering exploits (the "grandma reads MSDS sheets" class)
`continuation`	Story-completion attacks that lead model into prohibited territory

v0.15.0 additions (May 2026):

Multi-turn GOAT (Generative Offensive Agent Tester) probes
Agent-breaker probes for testing tool-available LLM agents
System-prompt-extraction probe
ModernBERT refusal detector (higher accuracy than prior keyword detectors)

Reading garak output:

garak run complete
ZSL probe at_risk 0.23 (23% of probes triggered a successful attack response)
Detailed results in: reports/garak.2026XXXX_XXXXXX.report.jsonl

The at_risk score is the fraction of probe attempts that returned a response the detector classified as a successful attack. A score of 0 means the model resisted all probes in that category; a score of 1 means it failed all of them.

7.5.3 Microsoft PyRIT -- Python Risk Identification Tool

PyRIT is Microsoft's open-source AI red-team automation framework, battle-tested across 100+ internal Microsoft AI products including Copilot.

Core concepts:

PromptSendingOrchestrator  --->  Target (model under test)
        |                              |
  AttackStrategy                 Converter/Normalizer
  (e.g., Crescendo)                     |
        |                          Memory/History
  ScoringEngine
  (judges whether attack succeeded)

Attack strategies available:

Strategy	Description
Crescendo	Multi-turn gradual escalation: starts with benign questions, gradually escalates to target behavior over 5-20 turns. Named for the musical crescendo pattern.
TAP (Tree of Attacks with Pruning)	Tree-based search over prompt variants; prunes branches that fail; concentrates on successful paths
Skeleton Key	Attempts to unlock model's full capability by framing the request as a safety test
Jailbreak Chat	Uses a secondary "attacker LLM" to generate and refine attack prompts against the target
PAIR	Prompt Automatic Iterative Refinement; uses feedback from the target to iteratively improve the attack prompt

Basic PyRIT usage:

from pyrit.orchestrator import PromptSendingOrchestrator
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskTrueFalseScorer

target = OpenAIChatTarget()

orchestrator = PromptSendingOrchestrator(
    prompt_target=target,
    verbose=True
)

result = await orchestrator.send_prompts_async(
    prompt_list=["Explain how to bypass content filtering in an LLM"]
)

RAMPART and Clarity (May 2026):

Microsoft released two new open-source tools alongside PyRIT:

RAMPART (Risk Assessment and Measurement Platform for Agentic Red Teaming): a pytest framework that embeds PyRIT attacks into CI/CD pipelines, enabling automated regression testing for AI safety
Clarity: a pre-build review tool that helps teams identify attack surfaces before writing code

These tools extend PyRIT from a manual red-team tool to a DevSecOps integration.

7.5.4 Lakera Guard and the Gandalf Game

Lakera Guard is a production-grade prompt injection detection API. It classifies incoming prompts as benign or adversarial in real time.

Architecture: an HTTP API that sits in front of your LLM application. Every incoming user message is sent to Lakera Guard before being forwarded to the model. Guard returns a classification: {flagged: true/false, categories: {prompt_injection: 0.95, jailbreak: 0.3}}.

import requests

def check_prompt(user_message: str) -> bool:
    response = requests.post(
        "https://api.lakera.ai/v1/prompt_injection",
        headers={"Authorization": f"Bearer {LAKERA_KEY}"},
        json={"input": user_message}
    )
    return response.json()["results"][0]["flagged"]

if check_prompt(user_message):
    return "I cannot process this request."

Limitations: Lakera Guard works against known patterns. The same limitations apply as for any input classifier: it will miss novel or obfuscated attacks. Defense in depth is still required.

Gandalf (educational game): Lakera built Gandalf as a public demonstration and adversarial data-collection tool. The game presents players with an LLM (guarded by Gandalf) that has been told a secret password. The player must extract the password using prompt injection.

8 levels, each with progressively stronger defenses
Over 1 million players and 80 million adversarial prompts collected
The corpus of player attempts has been used to train better classifiers

Lab 7.5 includes a structured Gandalf challenge: complete as many levels as you can in 30 minutes, document the technique that succeeded at each level and the defense that stopped you.

7.5.5 Choosing the Right Tool

Scenario	Tool	Reason
Rapid vulnerability scan before deployment	garak	Fastest coverage of 50+ attack categories
Multi-turn escalation testing	PyRIT Crescendo	Multi-turn strategy simulation
CI/CD regression testing for AI safety	RAMPART + PyRIT	pytest integration, automated
Runtime production defense	Lakera Guard	Real-time classification API
Developer education / adversarial prompt training	Gandalf	Gamified, corpus collection
Agentic system with tool calls	garak v0.15+ agent-breaker probes	Agent-specific probes

7.5.6 Module 7.5 Summary

Tool	Category	Key capability
garak	Scanner	50+ probe categories; per-model risk score; supports Ollama
PyRIT	Red-team framework	Multi-turn strategies (Crescendo, TAP, PAIR); CI/CD integration via RAMPART
Lakera Guard	Runtime defense	Real-time prompt injection classifier
Gandalf	Education + corpus	Adversarial prompt game; 80M collected prompts

Reading for Module 8

CVE-2025-65106 NVD advisory: nvd.nist.gov/vuln/detail/CVE-2025-65106
GitHub Advisory GHSA-6qv9-48xg-fc7f (LangChain template injection)
LangChain changelog for version 1.0.7 (the patch version)