Module: 11 — The D8 Methodology in Depth
Points: 25
Time estimate: 4 hr lab + 6 hr independent
Deliverable: lab-11-report.md + lab11/ directory
Objectives
- Replicate the D8 methodology operationalization from the primary source article.
- Run a 3-model comparative evaluation using promptfoo and Ollama.
- Produce a D8-format scorecard with a deployment recommendation.
- Identify one evaluation dimension the D8 study did not measure and design the test for it.
Setup
pip install anthropic requests
npm install -g promptfoo
ollama pull llama3.2:3b
ollama pull qwen2.5:7b
ollama pull gemma3:4b
mkdir lab11 && cd lab11
Part A: D8 Methodology Replication (60 min)
The D8 study evaluated 9 Ollama Cloud models across 47 sessions using the three evaluation axes: Operational Lifetime (OL), Productive Ratio (PR), and Wordiness (W). Your first task is to operationalize these axes in code so you can measure them on any model.
A.1: OL Operationalization
Write lab11/metrics.py:
import json
from pathlib import Path
from typing import Any


def compute_ol(session_traces: list[list[dict]], context_window: int) -> float:
    """
    Operational Lifetime: projected turns before context window is 80% full.
    session_traces: list of sessions, each a list of turn dicts with 'input_tokens' key.
    context_window: advertised context window for the model.
    """
    per_session_growth = []
    for session in session_traces:
        turns = [t for t in session if "input_tokens" in t]
        if len(turns) < 2:
            continue  # need at least two turns to measure growth
        token_deltas = [
            turns[i]["input_tokens"] - turns[i - 1]["input_tokens"]
            for i in range(1, len(turns))
        ]
        per_session_growth.append(sum(token_deltas) / len(token_deltas))
    if not per_session_growth:
        return 0.0
    avg_growth = sum(per_session_growth) / len(per_session_growth)
    if avg_growth <= 0:
        # No measurable growth (e.g. token counts missing or flat): OL is
        # undefined, so report 0.0 like the no-data case instead of dividing.
        return 0.0
    return (0.80 * context_window) / avg_growth

def compute_pr(session_traces: list[list[dict]], productive_tools: set[str]) -> float:
    """
    Productive Ratio: fraction of tool calls that advance the task.
    productive_tools: set of tool names considered productive (domain-specific).
    """
    productive = 0
    total = 0
    for session in session_traces:
        for turn in session:
            for call in turn.get("tool_calls", []):
                total += 1
                if call["name"] in productive_tools:
                    productive += 1
    return productive / total if total > 0 else 0.0

def compute_w(session_traces: list[list[dict]], output_tool: str = "signal_send") -> dict:
    """
    Wordiness: average characters per outbound communication call.
    Reads the payload from call["args"]["message"] for calls named output_tool.
    """
    all_payloads = []
    for session in session_traces:
        for turn in session:
            for call in turn.get("tool_calls", []):
                if call["name"] == output_tool:
                    payload = call.get("args", {}).get("message", "")
                    all_payloads.append(payload)
    if not all_payloads:
        return {"avg_chars": 0, "n_sends": 0, "style_cluster": "unknown"}
    avg_chars = sum(len(p) for p in all_payloads) / len(all_payloads)
    if avg_chars < 400:
        cluster = "short-ops-operator"
    elif avg_chars < 550:
        cluster = "briefing-operator"
    else:
        cluster = "analyst-operator"
    return {
        "avg_chars": round(avg_chars, 1),
        "n_sends": len(all_payloads),
        "style_cluster": cluster,
    }

def score_verdict(ol: float, pr: float) -> str:
    """
    D8 PASS/PARTIAL/FAIL assignment (pr is a fraction, so 35% is 0.35).
    STRONG PASS: OL >= 45 AND PR >= 35%
    PASS: OL >= 30 AND PR >= 25%
    PARTIAL: OL >= 20 AND PR >= 20%
    FAIL: anything else
    """
    if ol >= 45 and pr >= 0.35:
        return "STRONG PASS"
    if ol >= 30 and pr >= 0.25:
        return "PASS"
    if ol >= 20 and pr >= 0.20:
        return "PARTIAL"
    return "FAIL"
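As a sanity check on the OL formula, the arithmetic can be worked through on a synthetic session. The sketch below is illustrative only: the token counts are made up, and `projected_ol` is a hypothetical single-session helper, not part of the lab's metrics.py.

```python
def projected_ol(input_tokens: list[int], context_window: int) -> float:
    """Single-session OL: projected turns until 80% of the window is used.

    Hypothetical helper for illustration; it applies the formula
    (0.80 * context_window) / avg_growth_per_turn to one session.
    """
    # Growth per turn is the difference between consecutive input-token counts.
    deltas = [b - a for a, b in zip(input_tokens, input_tokens[1:])]
    if not deltas:
        return 0.0  # fewer than two turns: growth cannot be measured
    avg_growth = sum(deltas) / len(deltas)
    if avg_growth <= 0:
        return 0.0  # flat or shrinking context: OL is undefined
    return (0.80 * context_window) / avg_growth


# Synthetic session: input grows by 2,000 tokens per turn (made-up numbers).
tokens = [1_000, 3_000, 5_000, 7_000]
ol = projected_ol(tokens, context_window=131_072)
print(f"OL = {ol:.1f} turns")  # 0.80 * 131072 / 2000 = 52.4
```

An OL of 52.4 clears the STRONG PASS threshold of 45 on its own, but the verdict also requires PR of at least 35%.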
A.2: Methodology Documentation
In lab-11-report.md, answer:
- The D8 study formula for OL is (0.80 × context_window) / avg_growth_per_turn. Why 80% and not 100%? What operational consideration does the 20% headroom represent?
- The PR metric distinguishes "productive" from "probe" tool calls. Give two examples of probe tool calls from the D8 study and explain why they are classified as non-productive despite being necessary for the agent to function.
- The D8 study found three style clusters for Wordiness. Name them, give the character-count range for each, and describe the communication context where each cluster is most appropriate.
Part B: 3-Model Comparative Evaluation (90 min)
B.1: Session Harness
Write lab11/harness.py — a lightweight session runner that feeds prompts to an Ollama model and logs token growth and tool calls:
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path

import requests


@dataclass
class TurnRecord:
    turn: int
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    tool_calls: list[dict] = field(default_factory=list)


def run_session(
    model: str,
    prompts: list[str],
    system_prompt: str = "",
) -> list[dict]:
    """
    Run a sequence of prompts through an Ollama model.
    Returns one dict per turn, with the fields of TurnRecord.
    """
    url = "http://localhost:11434/api/chat"
    history = []
    if system_prompt:
        history.append({"role": "system", "content": system_prompt})
    records = []
    for i, prompt in enumerate(prompts):
        history.append({"role": "user", "content": prompt})
        payload = {
            "model": model,
            "messages": history,
            "stream": False,
        }
        resp = requests.post(url, json=payload, timeout=120)
        resp.raise_for_status()
        data = resp.json()
        content = data["message"]["content"]
        in_tokens = data.get("prompt_eval_count", 0)
        out_tokens = data.get("eval_count", 0)
        # Detect simple tool-call patterns in response for PR scoring.
        # The response text is stored under args["message"] because
        # compute_w() reads its payload from that key.
        tool_calls = []
        if "```bash" in content or "```python" in content:
            tool_calls.append({"name": "code_block", "args": {"message": content}})
        if content.strip().startswith("{") or content.strip().startswith("["):
            tool_calls.append({"name": "json_output", "args": {"message": content}})
        record = TurnRecord(
            turn=i,
            prompt=prompt[:100],      # truncated for readable logs
            response=content[:200],   # truncated for readable logs
            input_tokens=in_tokens,
            output_tokens=out_tokens,
            tool_calls=tool_calls,
        )
        records.append(asdict(record))
        history.append({"role": "assistant", "content": content})
        time.sleep(0.5)
    return records
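The harness infers tool calls from the response text alone, so it helps to see what the heuristic does and does not catch. The sketch below isolates the same two checks in a hypothetical helper, `detect_tool_names`, so they can be tried without a running Ollama server; the sample responses are invented.

```python
FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly


def detect_tool_names(content: str) -> list[str]:
    """Return the tool names the harness heuristic would assign to a response."""
    names = []
    # Rule 1: a fenced bash or python block counts as a code_block call.
    if FENCE + "bash" in content or FENCE + "python" in content:
        names.append("code_block")
    # Rule 2: a response that starts with { or [ counts as a json_output call.
    stripped = content.strip()
    if stripped.startswith("{") or stripped.startswith("["):
        names.append("json_output")
    return names


# Invented sample responses covering each branch of the heuristic.
samples = {
    "fenced python": "Here you go:\n" + FENCE + "python\nprint('ok')\n" + FENCE,
    "raw json": '{"tactic_id": "example"}',
    "fenced json": FENCE + 'json\n{"tactic_id": "example"}\n' + FENCE,
    "plain prose": "I would start with a dependency scanner.",
}
for label, text in samples.items():
    print(f"{label:>14}: {detect_tool_names(text)}")
```

Note that JSON wrapped in a json fence matches neither rule, so such a response contributes no tool call to PR or W. Whether to count it is a design decision worth documenting in the report.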
B.2: Evaluation Prompts
Write lab11/eval_prompts.py — 6 prompts designed to exercise agentic reasoning in a security analysis context:
SYSTEM_PROMPT = """You are an AI security analyst assistant.
You help analyze vulnerabilities, map attack techniques to MITRE ATLAS,
and produce structured security assessments.
Be concise. Produce structured output (JSON, tables, or numbered lists)
when summarizing findings."""
PROMPTS = [
# P0: Tool triage
"List the 3 most important tools you would use to analyze a suspicious Python "
"package before installing it. For each tool, give its name and a one-sentence "
"description of what it detects.",
# P1: CVE classification
"A CVE description reads: 'Remote code execution via pickle deserialization in "
"model loading endpoint. CVSS 9.3.' Classify this CVE by MITRE ATLAS tactic "
"and technique. Return JSON with keys: tactic_id, tactic_name, technique_id, "
"technique_name, rationale.",
# P2: Threat modeling
"A LangChain application uses a vector store populated from customer-uploaded "
"PDFs. Name the top 3 ATLAS techniques an attacker could use against this "
"architecture. Return a table with columns: Technique ID, Name, Attack Path.",
# P3: Cross-reference
"Cross-reference the following two findings: (1) AML.T0020 Poison Training Data "
"via upstream dependency; (2) AML.T0018 Backdoor ML Model via model file upload. "
"What kill-chain step connects them? Return a numbered list of steps.",
# P4: Remediation
"A security audit found that a production LLM pipeline fetches URLs from "
"LLM-generated content without validation. Write the Python allow-list function "
"that would block this attack. Include DNS rebinding protection.",
# P5: Briefing
"Summarize the 5 highest-impact ATLAS techniques for a team deploying a RAG-based "
"customer support chatbot. Write the summary for a non-technical CISO. Under 150 "
"words.",
]
B.3: Run Three Models
Write lab11/run_evaluation.py:
import json
from pathlib import Path

from harness import run_session
from eval_prompts import SYSTEM_PROMPT, PROMPTS
from metrics import compute_ol, compute_pr, compute_w, score_verdict

MODELS = [
    ("llama3.2:3b", 131_072),
    ("qwen2.5:7b", 131_072),
    ("gemma3:4b", 128_000),
]

# Tools we consider "productive" in this security-analyst context.
# Note: the harness only emits these two names, so PR is 1.0 whenever any call
# is detected. Add probe-style detections to harness.py to make PR discriminate.
PRODUCTIVE_TOOLS = {"code_block", "json_output"}

results = {}
for model_name, context_window in MODELS:
    print(f"\n=== Running {model_name} ===")
    # 3 sessions per model (D8 uses 3 retest passes)
    sessions = []
    for session_num in range(3):
        print(f"  Session {session_num + 1}/3...")
        records = run_session(model_name, PROMPTS, system_prompt=SYSTEM_PROMPT)
        sessions.append(records)
    ol = compute_ol(sessions, context_window)
    pr = compute_pr(sessions, PRODUCTIVE_TOOLS)
    w = compute_w(sessions, output_tool="json_output")
    results[model_name] = {
        "context_window": context_window,
        "ol": round(ol, 1),
        "pr": round(pr, 3),
        "w": w,
        "verdict": score_verdict(ol, pr),
    }
    print(f"  OL={ol:.1f} PR={pr:.1%} W={w['avg_chars']:.0f} -> {results[model_name]['verdict']}")

Path("scorecard.json").write_text(json.dumps(results, indent=2))
print("\nScorecard written to scorecard.json")
Run it:
cd lab11
python3 run_evaluation.py
Expected runtime: 20-40 minutes on CPU depending on hardware.
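Because the run takes a while, it may be worth confirming first that all three models are pulled. The sketch below is one possible pre-flight check, assuming the local Ollama server exposes its model list at `/api/tags` with a top-level `models` array of objects carrying a `name` field; `missing_models` and `fetch_tags` are hypothetical helper names.

```python
import json
import urllib.request

REQUIRED = ["llama3.2:3b", "qwen2.5:7b", "gemma3:4b"]


def missing_models(tags: dict, required: list[str]) -> list[str]:
    """Return the required model names absent from a /api/tags response."""
    available = {m.get("name") for m in tags.get("models", [])}
    return [name for name in required if name not in available]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models (assumed response shape)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        absent = missing_models(fetch_tags(), REQUIRED)
    except (OSError, ValueError) as exc:
        # Connection refused, timeout, or a non-JSON reply.
        print(f"Could not query Ollama: {exc}")
    else:
        print("All models present." if not absent else f"Missing: {absent}")
```

If a model is reported missing, re-run the matching `ollama pull` command from Setup before starting the evaluation.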
Part C: D8 Scorecard and Deployment Recommendation (45 min)
Write lab11/print_scorecard.py:
import json
from pathlib import Path

data = json.loads(Path("scorecard.json").read_text())

print("=" * 72)
print(f"{'Model':<25} {'Context':>10} {'OL':>8} {'PR':>8} {'W (chars)':>10} {'Verdict'}")
print("=" * 72)
# Rows use the same column widths as the header so the table stays aligned.
for model, r in sorted(data.items(), key=lambda x: -x[1]["ol"]):
    print(
        f"{model:<25} {r['context_window']:>10,} "
        f"{r['ol']:>8.1f} {r['pr']:>8.1%} "
        f"{r['w']['avg_chars']:>10.0f} "
        f"{r['verdict']}"
    )
print("=" * 72)
print("\nStyle clusters:")
for model, r in data.items():
    print(f"  {model}: {r['w']['style_cluster']}")
In lab-11-report.md:
- Paste your scorecard table (OL / PR / W / Verdict for all 3 models).
- Write a deployment recommendation for the following use case: An MSP running overnight shift coverage; the agent processes 30 service-health alerts per shift; sessions are expected to run 4-6 hours without handoff; communications go to a human on-call who prefers concise pager-style alerts. Which model do you select? Justify using all three D8 axes.
- The D8 study's Tier 3 conclusion for deepseek-v3.2 was FAIL because it "hangs on multi-step P1 and P3." Your evaluation did not include a multi-step hang test. Does the absence of this test affect the validity of your scorecard? Explain.
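For pasting the scorecard into lab-11-report.md, a markdown rendering of scorecard.json may be more convenient than the console table. The sketch below assumes the JSON structure written by run_evaluation.py; `scorecard_to_markdown` is a hypothetical helper, and the example values are placeholders, not real results.

```python
def scorecard_to_markdown(data: dict) -> str:
    """Render scorecard data as a markdown table, sorted by OL descending."""
    lines = [
        "| Model | Context | OL | PR | W (chars) | Verdict |",
        "|---|---|---|---|---|---|",
    ]
    for model, r in sorted(data.items(), key=lambda kv: -kv[1]["ol"]):
        lines.append(
            f"| {model} | {r['context_window']:,} | {r['ol']:.1f} | "
            f"{r['pr']:.1%} | {r['w']['avg_chars']:.0f} | {r['verdict']} |"
        )
    return "\n".join(lines)


# Placeholder values only, to show the shape; not measured results.
example = {
    "model-a": {"context_window": 131072, "ol": 52.4, "pr": 0.4,
                "w": {"avg_chars": 310.0}, "verdict": "STRONG PASS"},
    "model-b": {"context_window": 128000, "ol": 18.0, "pr": 0.5,
                "w": {"avg_chars": 600.0}, "verdict": "FAIL"},
}
print(scorecard_to_markdown(example))
```

With real data, load the dict using `json.loads(Path("scorecard.json").read_text())` and pass it to the function in place of `example`.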
Part D: Unmeasured Dimension (45 min)
The D8 study explicitly lists five dimensions it did not measure:
- Long-session context erosion (>7-35% of context consumed in early trial)
- Cold-start behavior (all trials used warm inference windows)
- Concurrent-load behavior (all trials were single-operator)
- Tool-call error recovery (mock tools never returned errors)
- Longer handoff chains (only single-hop handoffs tested)
Choose one of the five unmeasured dimensions. Design and implement a test for it.
Required deliverables:
- lab11/extended_test.py: implementation of your extended test
- A section in lab-11-report.md (under Part D: Extended Dimension) covering:
  - Which dimension you chose and why it matters for the MSP deployment context from Part C
  - Your test design: what you inject or vary, what you measure, how you score the result
  - Your results on at least 2 of the 3 models from Part B
Starter template for error-recovery testing:
# lab11/extended_test.py: Error Recovery Extension
# Variant of run_session() that injects a tool error and measures recovery behavior
import json
import time
from pathlib import Path

import requests


def run_error_recovery_session(model: str, prompts: list[str]) -> dict:
    """
    Inject a simulated tool error at turn 3 and measure whether the model
    stays on task in the turns that follow.
    Error injection: at turn 3, add a fake assistant message reporting a tool
    timeout, followed by a user message asking the model to retry.
    Recovery metric: does the model re-attempt the task or stall?
    """
    url = "http://localhost:11434/api/chat"
    history = []
    ERROR_TURN = 2  # 0-indexed, i.e. the third prompt
    recovery_scores = []
    for i, prompt in enumerate(prompts):
        history.append({"role": "user", "content": prompt})
        if i == ERROR_TURN:
            # Inject the error as a fake assistant turn, then ask for a retry
            history.append({
                "role": "assistant",
                "content": "I'll look that up now. [TOOL_ERROR: timeout on service-health check]",
            })
            history.append({
                "role": "user",
                "content": "The tool returned an error. Please retry.",
            })
        payload = {"model": model, "messages": history, "stream": False}
        resp = requests.post(url, json=payload, timeout=120)
        resp.raise_for_status()
        data = resp.json()
        content = data["message"]["content"]
        history.append({"role": "assistant", "content": content})
        # Score recovery on turns after the error: did the model stay on task?
        if i > ERROR_TURN:
            task_keywords = ["vulnerability", "ATLAS", "CVE", "technique", "attack"]
            recovered = any(k.lower() in content.lower() for k in task_keywords)
            recovery_scores.append(int(recovered))
        time.sleep(0.5)
    return {
        "model": model,
        "recovery_scores": recovery_scores,
        "recovery_rate": sum(recovery_scores) / len(recovery_scores) if recovery_scores else 0.0,
    }


if __name__ == "__main__":
    from eval_prompts import PROMPTS

    models_to_test = ["llama3.2:3b", "qwen2.5:7b"]
    for model in models_to_test:
        result = run_error_recovery_session(model, PROMPTS)
        print(f"{model}: recovery_rate={result['recovery_rate']:.0%} "
              f"({result['recovery_scores']})")
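For students who choose cold-start behavior instead, one possible starting point is the timing metadata that Ollama includes in non-streaming responses. The sketch below assumes the response carries `total_duration`, `load_duration`, `prompt_eval_duration`, and `eval_duration` in nanoseconds; `timing_ms`, `cold_start_penalty_ms`, and the sample values are hypothetical.

```python
NS_PER_MS = 1_000_000

TIMING_FIELDS = (
    "total_duration",
    "load_duration",
    "prompt_eval_duration",
    "eval_duration",
)


def timing_ms(data: dict) -> dict:
    """Convert the timing fields of one response from nanoseconds to ms."""
    # Missing fields default to 0 so partial responses do not raise.
    return {name: data.get(name, 0) / NS_PER_MS for name in TIMING_FIELDS}


def cold_start_penalty_ms(cold: dict, warm: dict) -> float:
    """Extra model-load time of a cold request relative to a warm one."""
    return timing_ms(cold)["load_duration"] - timing_ms(warm)["load_duration"]


# Made-up responses: a cold request that loads the model, then a warm one.
cold_response = {"total_duration": 5_000_000_000, "load_duration": 3_000_000_000}
warm_response = {"total_duration": 1_200_000_000, "load_duration": 20_000_000}

penalty = cold_start_penalty_ms(cold_response, warm_response)
print(f"cold-start penalty: {penalty:.0f} ms")
```

To force a cold request, one option is to send `"keep_alive": 0` in the previous request payload so the model is unloaded after responding; verify this against your Ollama version before relying on it.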
Grading
| Component | Points |
|---|---|
| Part A: OL/PR/W operationalization correct; methodology questions answered | 6 |
| Part B: evaluation runs 3 sessions per model; scorecard.json present | 7 |
| Part C: scorecard table; deployment recommendation uses all 3 axes | 7 |
| Part D: extended dimension implemented and run; results reported | 5 |