Classroom Glossary Public page

Module 7.5: Automated AI Red-Team Tooling -- garak / PyRIT / Lakera Guard

986 words

Duration: 1 hr lecture + 3 hr lab + 3 hr independent Lab: Lab 7.5 (garak Probe Run + PyRIT Crescendo Attack + Gandalf Challenge) OWASP anchor: All LLM Top 10 entries (this module provides tooling that automates attacks from Modules 2-7) Tool Journal originations: garak, Microsoft PyRIT, Lakera Guard / Gandalf, NVIDIA RAMPART / Clarity


7.5.1 Why Automated Tooling

Manual prompt injection (Module 2) and manual system prompt extraction (Module 7) establish understanding of how attacks work. Automated tooling scales that understanding:

  • A red teamer with garak can run 50+ probe categories against a model in 30 minutes, covering attack surfaces that would take days to test manually
  • A security team running PyRIT against a new AI application in CI/CD can detect regressions before they reach production
  • An operator deploying Lakera Guard gets a runtime defense layer that classifies thousands of requests per second

This module maps tooling to attack categories so you can select the right tool for each assessment scenario.


7.5.2 NVIDIA garak -- LLM Vulnerability Scanner

garak (Generative AI Red-teaming & Assessment Kit) was developed by Prof. Leon Derczynski at ITU Copenhagen. It is the most comprehensive open-source LLM vulnerability scanner as of 2026.

Core architecture:

Probes --> Generator --> Buffs --> Detectors --> Evaluators --> Report
  • Probes: Attack prompt generators. 50+ probe categories covering: prompt injection, jailbreaking, encoding exploits, toxicity elicitation, hallucination generation, data leakage, misinformation, DoS patterns.
  • Generator: The model interface. Supports: OpenAI API, Anthropic API, Ollama (local models), Hugging Face Inference API, REST API (custom endpoints).
  • Buffs: Transformations applied to probe outputs before sending (e.g., base64 encoding, rot13, leet speak) to test encoding-based evasion.
  • Detectors: Classifiers that evaluate whether a response represents a successful attack. Some are rule-based (keyword detection), some use ML classifiers (ModernBERT refusal detector in v0.15.0).
  • Evaluators: Aggregate results across all probes and produce a score.

Installation and basic run:

pip install garak

# Probe Ollama local model for prompt injection
garak -m ollama -n llama3.2:3b --probes promptinject.HijackHateHuman

# Full scan (all probes, takes 20-60 min)
garak -m ollama -n llama3.2:3b

# OpenAI model
garak -m openai -n gpt-4o-mini --probes promptinject,jailbreak

Notable probe categories:

Category What it tests
promptinject Direct prompt injection attempts (HijackHateHuman, HijackKillSwitch, etc.)
jailbreak Jailbreak patterns including DAN, Developer Mode, role-play frames
encoding Encoding-based evasion: base64, ROT13, pig latin, Unicode variations
leakage System prompt and context leakage
malwaregen Elicits code or instructions that could be used for malware
xss Cross-site scripting payload generation
dan DAN (Do Anything Now) variants
grandma Social engineering exploits (the "grandma reads MSDS sheets" class)
continuation Story-completion attacks that lead model into prohibited territory

v0.15.0 additions (May 2026):

  • Multi-turn GOAT (Generative Offensive Agent Tester) probes
  • Agent-breaker probes for testing tool-available LLM agents
  • System-prompt-extraction probe
  • ModernBERT refusal detector (higher accuracy than prior keyword detectors)

Reading garak output:

garak run complete
ZSL probe at_risk 0.23 (23% of probes triggered a successful attack response)
Detailed results in: reports/garak.2026XXXX_XXXXXX.report.jsonl

The at_risk score is the fraction of probe attempts that returned a response the detector classified as a successful attack. A score of 0 means the model resisted all probes in that category; a score of 1 means it failed all of them.


7.5.3 Microsoft PyRIT -- Python Risk Identification Tool

PyRIT is Microsoft's open-source AI red-team automation framework, battle-tested across 100+ internal Microsoft AI products including Copilot.

Core concepts:

PromptSendingOrchestrator  --->  Target (model under test)
        |                              |
  AttackStrategy                 Converter/Normalizer
  (e.g., Crescendo)                     |
        |                          Memory/History
  ScoringEngine
  (judges whether attack succeeded)

Attack strategies available:

Strategy Description
Crescendo Multi-turn gradual escalation: starts with benign questions, gradually escalates to target behavior over 5-20 turns. Named for the musical crescendo pattern.
TAP (Tree of Attacks with Pruning) Tree-based search over prompt variants; prunes branches that fail; concentrates on successful paths
Skeleton Key Attempts to unlock model's full capability by framing the request as a safety test
Jailbreak Chat Uses a secondary "attacker LLM" to generate and refine attack prompts against the target
PAIR Prompt Automatic Iterative Refinement; uses feedback from the target to iteratively improve the attack prompt

Basic PyRIT usage:

from pyrit.orchestrator import PromptSendingOrchestrator
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskTrueFalseScorer

target = OpenAIChatTarget()

orchestrator = PromptSendingOrchestrator(
    prompt_target=target,
    verbose=True
)

result = await orchestrator.send_prompts_async(
    prompt_list=["Explain how to bypass content filtering in an LLM"]
)

RAMPART and Clarity (May 2026):

Microsoft released two new open-source tools alongside PyRIT:

  • RAMPART (Risk Assessment and Measurement Platform for Agentic Red Teaming): a pytest framework that embeds PyRIT attacks into CI/CD pipelines, enabling automated regression testing for AI safety
  • Clarity: a pre-build review tool that helps teams identify attack surfaces before writing code

These tools extend PyRIT from a manual red-team tool to a DevSecOps integration.


7.5.4 Lakera Guard and the Gandalf Game

Lakera Guard is a production-grade prompt injection detection API. It classifies incoming prompts as benign or adversarial in real time.

Architecture: an HTTP API that sits in front of your LLM application. Every incoming user message is sent to Lakera Guard before being forwarded to the model. Guard returns a classification: {flagged: true/false, categories: {prompt_injection: 0.95, jailbreak: 0.3}}.

import requests

def check_prompt(user_message: str) -> bool:
    response = requests.post(
        "https://api.lakera.ai/v1/prompt_injection",
        headers={"Authorization": f"Bearer {LAKERA_KEY}"},
        json={"input": user_message}
    )
    return response.json()["results"][0]["flagged"]

if check_prompt(user_message):
    return "I cannot process this request."

Limitations: Lakera Guard works against known patterns. The same limitations apply as for any input classifier: it will miss novel or obfuscated attacks. Defense in depth is still required.

Gandalf (educational game): Lakera built Gandalf as a public demonstration and adversarial data-collection tool. The game presents players with an LLM (guarded by Gandalf) that has been told a secret password. The player must extract the password using prompt injection.

  • 8 levels, each with progressively stronger defenses
  • Over 1 million players and 80 million adversarial prompts collected
  • The corpus of player attempts has been used to train better classifiers

Lab 7.5 includes a structured Gandalf challenge: complete as many levels as you can in 30 minutes, document the technique that succeeded at each level and the defense that stopped you.


7.5.5 Choosing the Right Tool

Scenario Tool Reason
Rapid vulnerability scan before deployment garak Fastest coverage of 50+ attack categories
Multi-turn escalation testing PyRIT Crescendo Multi-turn strategy simulation
CI/CD regression testing for AI safety RAMPART + PyRIT pytest integration, automated
Runtime production defense Lakera Guard Real-time classification API
Developer education / adversarial prompt training Gandalf Gamified, corpus collection
Agentic system with tool calls garak v0.15+ agent-breaker probes Agent-specific probes

7.5.6 Module 7.5 Summary

Tool Category Key capability
garak Scanner 50+ probe categories; per-model risk score; supports Ollama
PyRIT Red-team framework Multi-turn strategies (Crescendo, TAP, PAIR); CI/CD integration via RAMPART
Lakera Guard Runtime defense Real-time prompt injection classifier
Gandalf Education + corpus Adversarial prompt game; 80M collected prompts

Reading for Module 8

  • CVE-2025-65106 NVD advisory: nvd.nist.gov/vuln/detail/CVE-2025-65106
  • GitHub Advisory GHSA-6qv9-48xg-fc7f (LangChain template injection)
  • LangChain changelog for version 1.0.7 (the patch version)