Duration: 3 hr lecture + 4 hr lab + 6 hr independent
Lab: Lab 11 (Multi-model D8 evaluation: 3 models, full methodology)
MITRE ATLAS tactics: (evaluation and defense framework)
Foundational weave: Mitchell Ch 8 (benchmarks); D8 primary source — "Nine Ollama-Cloud LLMs Walk Into A Signal Group"
11.1 Why Benchmarks Fail Production Operators
Module 2 introduced the D8 evaluation methodology through a single-baseline measurement. Module 11 returns to it at full depth, because understanding what D8 measures -- and why it measures those things instead of MMLU, HellaSwag, or HumanEval -- is the prerequisite for designing production-grade model evaluation strategies.
The standard benchmark critique is easy to state: leaderboard rankings don't translate to production performance. The harder question is why. Mitchell Chapter 8 covers the limits of behavioral testing: a model that produces correct answers may be doing so for the wrong reasons, and a model that answers benchmark items incorrectly may nonetheless behave correctly on the actual deployment task.
The D8 study provided a concrete empirical answer. Nine models were evaluated on a real operational task (running a managed service provider operator shift via Signal) using real operational prompts (service-health triage, incident cross-reference, API key rotation recommendation). The session traces are 47 JSONL files. The results are public. The study found that benchmark rank and operational reliability are not correlated in ways that help MSP operators make deployment decisions.
11.2 The D8 Study Setup
The D8 study (published on sandhillscto.com/insights as "Nine Ollama-Cloud LLMs Walk Into A Signal Group") was run in April 2026 by the whitebox-ops operator (Claude Opus 4.7) with Jon Munson editing. The setup:
The nine models evaluated (all from Ollama Cloud, filtered for declared cloud, tools, and thinking capabilities as of 2026-04-18):
| # | Model | Context window | Verdict |
|---|---|---|---|
| 1 | glm-5.1:cloud | 200,000 | PASS |
| 2 | minimax-m2.7:cloud | 200,000 | STRONG PASS |
| 3 | gemma4:31b-cloud | 131,072 | PARTIAL |
| 4 | nemotron-3-super:cloud | 262,144 | PASS |
| 5 | qwen3.5:397b-cloud | 262,144 | STRONG PASS |
| 6 | kimi-k2.5:cloud | 262,144 | PASS |
| 7 | gemini-3-flash-preview:cloud | 1,000,000 | STRONG PASS |
| 8 | deepseek-v3.2:cloud | 163,840 | FAIL |
| 9 | gpt-oss:120b-cloud | 131,072 | PASS |
The evaluation harness: Each model was wrapped in Claude Code CLI, registered in a Signal group as the active operator, and handed six operational prompts (P0-P5) plus a P6 autonomous-handoff trigger. Prompts were injected via terminalcp stdin using \r as the submit character. Three retest passes per model for parity. 47 JSONL session traces total.
The mock-ops fixtures: Three synthetic files (TODO.md, service-health.json, INCIDENTS.md) backed by a purpose-built MCP server (mock-ops-mcp) exposing eight tools (mock_todo_list, mock_service_health, mock_incidents_list, plus action tools). Using MCP tools rather than raw file reads isolates tool-picking fidelity as the measured axis.
11.3 The Three Evaluation Axes
The D8 study operationalized three axes chosen specifically for their relevance to sustained agentic operation:
Axis A: Operational Lifetime (OL)
What it measures: How many operational turns fit inside the model's context window before it must hand off to a successor generation.
Operationalization: Measure avg_input_tokens_growth_per_turn from real session JSONL traces. Project 80% of the advertised context window as the handoff trigger. Compute OL = (0.80 * context_window) / avg_growth_per_turn.
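A minimal sketch of this computation, assuming each JSONL line is one turn carrying a cumulative input-token count under a field named input_tokens. That field name and both function names are hypothetical; the D8 trace schema is not specified here.

```python
import json
from statistics import mean

HANDOFF_FRACTION = 0.80  # handoff trigger from the text: 80% of the advertised window


def avg_growth_per_turn(jsonl_lines, token_field="input_tokens"):
    """Mean increase in input tokens between consecutive turns of one session.

    token_field is an assumed field name, not the documented D8 schema.
    """
    totals = [json.loads(line)[token_field] for line in jsonl_lines if line.strip()]
    if len(totals) < 2:
        raise ValueError("need at least two turns to measure growth")
    return mean(later - earlier for earlier, later in zip(totals, totals[1:]))


def operational_lifetime(context_window, growth_per_turn):
    """OL = (0.80 * context_window) / avg_growth_per_turn, in turns."""
    if growth_per_turn <= 0:
        raise ValueError("growth_per_turn must be positive")
    return HANDOFF_FRACTION * context_window / growth_per_turn


# Hypothetical three-turn trace in which input tokens grow by 2,500 per turn.
trace = [
    '{"turn": 1, "input_tokens": 12000}',
    '{"turn": 2, "input_tokens": 14500}',
    '{"turn": 3, "input_tokens": 17000}',
]
growth = avg_growth_per_turn(trace)
print(round(operational_lifetime(200_000, growth)))  # 64
```

With a 200,000-token window and 2,500 tokens of growth per turn, the sketch yields 64 turns, the figure reported for glm-5.1.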
Results:
| Model | Context | Avg growth/turn | OL (turns) |
|---|---|---|---|
| gemini-3-flash-preview | 1,000,000 | ~2,300 | ~351 |
| glm-5.1 | 200,000 | ~2,500 | 64 |
| minimax-m2.7 | 200,000 | ~2,600 | 61 |
| kimi-k2.5 | 262,144 | ~3,400 | 61 |
| nemotron-3-super | 262,144 | ~3,300 | 62 |
| qwen3.5:397b | 262,144 | ~4,600 | 45 |
| gpt-oss:120b | 131,072 | ~2,900 | 36 |
| gemma4:31b | 131,072 | ~3,300 | 32 |
Key finding: Gemini-3-flash's 1M context gives it ~5x the operational lifetime of 200k-class competitors. A shift that requires a handoff every 2 hours on minimax runs for the better part of a day on gemini without a handoff. This is a staffing advantage, not just a latency advantage.
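The wall-clock comparison above is a proportional scaling of operational lifetimes. A small sketch, assuming both models complete turns at the same rate and taking the 2-hour minimax interval as the reference point (the helper name is illustrative):

```python
def hours_between_handoffs(ol_turns, reference_turns, reference_hours):
    """Scale a reference handoff interval by the ratio of operational lifetimes.

    Assumes both models complete turns at the same wall-clock rate.
    """
    if reference_turns <= 0:
        raise ValueError("reference_turns must be positive")
    return reference_hours * ol_turns / reference_turns


# minimax-m2.7: OL of 61 turns, handoff every 2 hours (the figure used above).
# gemini-3-flash-preview: OL of roughly 351 turns.
print(round(hours_between_handoffs(351, 61, 2.0), 1))  # 11.5
```

Under that assumption gemini runs about 11.5 hours between handoffs. Differences in per-turn latency or tool-call volume would shift the estimate.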
Axis B: Productive Ratio (PR)
What it measures: The fraction of tool calls that are "productive" (advancing the operator's task) vs. "probe" (environment discovery, file re-reading, context checks that consume tokens without advancing the queue).
Operationalization: Classify each tool call by name. Productive: mock-ops MCP tools + signal_send. Probe: Bash, Read, Grep, Glob, Skill.
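The name-based classification can be sketched as follows. The productive set contains only the three mock-ops tools named in 11.2 plus signal_send; the unnamed action tools would need to be added. Reporting unrecognized tool names separately and leaving them out of the ratio is an assumption of this sketch, not a documented D8 rule.

```python
from collections import Counter

# Probe tools exactly as listed in the operationalization.
PROBE_TOOLS = {"Bash", "Read", "Grep", "Glob", "Skill"}
# Productive tools: the mock-ops tools named in the text plus signal_send.
# The remaining action tools are not named, so this set is incomplete.
PRODUCTIVE_TOOLS = {
    "mock_todo_list",
    "mock_service_health",
    "mock_incidents_list",
    "signal_send",
}


def classify(tool_name: str) -> str:
    """Label a tool call as 'productive', 'probe', or 'unclassified' by name."""
    if tool_name in PRODUCTIVE_TOOLS:
        return "productive"
    if tool_name in PROBE_TOOLS:
        return "probe"
    return "unclassified"


def productive_ratio(tool_names: list[str]) -> dict:
    """Return per-class counts and PR = productive / (productive + probe)."""
    counts = Counter(classify(name) for name in tool_names)
    classified = counts["productive"] + counts["probe"]
    return {
        "productive": counts["productive"],
        "probe": counts["probe"],
        "unclassified": counts["unclassified"],  # surfaced, not scored
        "pr": counts["productive"] / classified if classified else 0.0,
    }


# Hypothetical call sequence with the 21/13 split reported for nemotron-3-super.
calls = ["mock_service_health"] * 15 + ["signal_send"] * 6 + ["Read"] * 9 + ["Bash"] * 4
result = productive_ratio(calls)
print(f"{result['pr']:.1%}")  # 61.8%
```

Running the same function per session rather than on the pooled calls exposes cross-session variance that a single aggregate PR hides.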
Results:
| Model | Total calls | Productive | Probe | PR |
|---|---|---|---|---|
| nemotron-3-super | 34 | 21 | 13 | 61.8% |
| qwen3.5:397b | 110 | 39 | 71 | 35.5% |
| gemini-3-flash-preview | 125 | 44 | 81 | 35.2% |
| kimi-k2.5 | 112 | 39 | 73 | 34.8% |
| gemma4:31b | 106 | 35 | 71 | 33.0% |
| minimax-m2.7 | 149 | 48 | 101 | 32.2% |
| glm-5.1 | 114 | 35 | 79 | 30.7% |
| gpt-oss:120b | 98 | 17 | 81 | 17.3% |
Key finding: gpt-oss:120b is the outlier at 17.3% PR -- 83% of its tool calls are probes, repeatedly re-reading files already in context. This burns context window faster (contributing to its short OL of 36 turns) and adds a latency tax. Nemotron-3-super leads at 61.8% but with high variance across sessions (43-80% range), indicating inconsistency in when it chooses to probe vs. act.
Axis C: Wordiness (W)
What it measures: The length and emoji density of the model's outbound operator communications (signal_send payloads).
Operationalization: Measure average characters per signal_send call; count emojis per 1,000 characters.
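A minimal sketch of both measurements. Emoji detection here is an approximation: it counts code points in the Unicode "Symbol, other" (So) category, which covers most pictographic emojis but also some non-emoji symbols, and it counts multi-code-point emoji sequences more than once. The counting rule D8 used is not stated, so treat this as one plausible choice; the function names are illustrative.

```python
import unicodedata


def is_emoji(ch: str) -> bool:
    """Approximation: treat Unicode 'Symbol, other' (So) code points as emojis."""
    return unicodedata.category(ch) == "So"


def wordiness(messages: list[str]) -> dict:
    """Average characters per message and emojis per 1,000 characters."""
    total_chars = sum(len(m) for m in messages)
    total_emojis = sum(1 for m in messages for ch in m if is_emoji(ch))
    return {
        "avg_chars": total_chars / len(messages) if messages else 0.0,
        "emojis_per_1k": 1000 * total_emojis / total_chars if total_chars else 0.0,
    }


# Hypothetical signal_send payloads; \u2705 is the check-mark emoji.
sample = ["Service OK \u2705", "Disk at 91% on db-02"]
print(wordiness(sample))  # {'avg_chars': 16.0, 'emojis_per_1k': 31.25}
```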
Results:
| Model | Avg chars/signal_send | Emojis/1k chars | Style cluster |
|---|---|---|---|
| glm-5.1 | 770 | 1.02 | Analyst-operator |
| qwen3.5:397b | 539 | 1.21 | Briefing-operator |
| minimax-m2.7 | 460 | 0.58 | Briefing-operator |
| kimi-k2.5 | 453 | 1.60 | Briefing-operator |
| nemotron-3-super | 401 | 0.00 | Short-ops-operator |
| gpt-oss:120b | 385 | 0.90 | Short-ops-operator |
| gemini-3-flash-preview | 258 | 0.00 | Short-ops-operator |
| gemma4:31b | 208 | 0.00 | Short-ops-operator |
Three style clusters: short-ops-operator (about 400 chars or fewer, factual, few or no emojis -- feels like a pager alert); briefing-operator (roughly 450-550 chars, structured Markdown -- feels like a stand-up update); analyst-operator (glm-5.1 alone, averaging 770 chars, tables, multi-paragraph -- feels like a morning report).
The right cluster for a deployment depends on audience expectations, not model capability.
11.4 Tier Assignments and Deployment Recommendations
The D8 study assigns models to three tiers:
Tier 1 (ship to clients):
- minimax-m2.7:cloud -- fastest median latency, clean on all axes
- gemini-3-flash-preview:cloud -- 1M context headroom, sub-6s median latency
- qwen3.5:397b-cloud -- live-traffic-first prioritization philosophy; action-ready disposition
Tier 2 (viable with caveats):
- glm-5.1:cloud -- depth-of-reasoning leader; emergent cross-file insight not seen in Tier 1 models; 40-50s warm latency is slower than Tier 1
- kimi-k2.5:cloud -- reliable, no distinctive quirks, mid-tier on all axes
- gpt-oss:120b-cloud -- lowest PR (17.3%); reliable but works harder than necessary
- nemotron-3-super:cloud -- highest PR on aggregate, but cross-session consistency issue; handoff not yet verified
Tier 3 (do not ship):
- gemma4:31b-cloud -- cross-file synthesis weakness; ack-first work-later behavior
- deepseek-v3.2:cloud -- does not reliably complete the multi-step /standup ritual; hung on P1 and P3 in separate trials
11.5 Dimensions the D8 Study Did Not Measure
The study explicitly documents what it does not cover. Understanding the scope gaps is as important as understanding the results:
- Long-session context erosion. The 6-prompt sequence consumed 7-35% of each model's context. Real operator shifts run for days.
- Cold-start behavior. All trials used warm Ollama Cloud inference windows.
- Concurrent-load behavior. All trials were single-operator.
- Tool-call error recovery. Mock-ops tools never returned errors.
- Longer handoff chains. Each trial tested one handoff. Three-hop chains are not covered.
A complete production evaluation must include all five of these. The D8 study is a starting point, not a complete evaluation framework.
11.6 Replicating D8 for a Different Deployment Context
The D8 methodology is domain-agnostic. Any agentic deployment can be evaluated using the same three axes with domain-appropriate fixtures:
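    class D8Evaluation:
        """Template for replicating D8 methodology on a new deployment."""

        def __init__(self, model_name: str, fixture_dir: str, ops_tools: list[str]):
            self.model = model_name
            self.fixture_dir = fixture_dir
            self.productive_tools = set(ops_tools)  # domain-specific ops tools
            # Probe set from Axis B, including Skill.
            self.probe_tools = {"Bash", "Read", "Grep", "Glob", "Skill"}
            self.sessions: list[dict] = []

        def run_session(self, prompts: list[str]) -> dict:
            """Run one evaluation session; record and return per-session metrics."""
            tool_calls: list[dict] = []   # one {"name": ...} dict per tool call
            signal_sends: list[str] = []  # outbound message bodies
            input_tokens: list[int] = []  # input-token count observed at each turn
            # ... harness-specific implementation fills the three lists ...
            session = {
                "tool_calls": tool_calls,
                "signal_sends": signal_sends,
                "input_token_growth_per_turn": self._compute_token_growth(input_tokens),
            }
            self.sessions.append(session)  # the score_* methods read self.sessions
            return session

        @staticmethod
        def _compute_token_growth(input_tokens: list[int]) -> float:
            """Average input-token increase per turn (one plausible definition)."""
            if len(input_tokens) < 2:
                return 0.0
            return (input_tokens[-1] - input_tokens[0]) / (len(input_tokens) - 1)

        def score_ol(self, context_window: int) -> float:
            """Operational lifetime in turns: (0.80 * context_window) / avg growth."""
            if not self.sessions:
                raise ValueError("no sessions recorded; call run_session first")
            avg_growth = (sum(s["input_token_growth_per_turn"] for s in self.sessions)
                          / len(self.sessions))
            if avg_growth <= 0:
                raise ValueError("average token growth must be positive")
            return (0.80 * context_window) / avg_growth

        def score_pr(self) -> float:
            """Productive ratio: productive tool calls / all tool calls."""
            productive = sum(1 for s in self.sessions for tc in s["tool_calls"]
                             if tc["name"] in self.productive_tools)
            total = sum(len(s["tool_calls"]) for s in self.sessions)
            return productive / total if total else 0.0

        def score_w(self) -> dict:
            """Wordiness: average characters per outbound message."""
            all_sends = [send for s in self.sessions for send in s["signal_sends"]]
            avg_chars = sum(len(m) for m in all_sends) / len(all_sends) if all_sends else 0.0
            return {"avg_chars": avg_chars, "n_sends": len(all_sends)}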
Adapt ops_tools to match the domain: for a customer-support deployment, productive tools are the CRM API calls; for a code-review deployment, productive tools are the code analysis calls.
Reflection prompts
- Qwen's "live-traffic-first" prioritization differs from the consensus "data-loss-first" priority. In an SLA-driven MSP context, which prioritization philosophy is more aligned with client expectations? Is there a deployment context where qwen's philosophy would be strictly correct?
- The D8 study measures PR across the entire session. A model with PR = 30% might have PR = 5% in the first two turns (environment discovery) and PR = 60% in the remaining turns (steady-state operations). Design a per-phase PR metric that would reveal this pattern and explain why it matters for deployment decisions.
- Gemini-3-flash's 1M context gives OL = 351 turns. At the trial's observed tool-call rate, how many days of continuous operation does 351 turns represent? What would cause the actual operational lifetime to differ from this estimate?