Module 11: The D8 Methodology in Depth — Full Comparative Evaluation · AI-201

Duration: 3 hr lecture + 4 hr lab + 6 hr independent
Lab: Lab 11 (Multi-model D8 evaluation: 3 models, full methodology)
MITRE ATLAS tactics: (evaluation and defense framework)
Foundational weave: Mitchell Ch 8 (benchmarks); D8 primary source — "Nine Ollama-Cloud LLMs Walk Into A Signal Group"

11.1 Why Benchmarks Fail Production Operators

Module 2 introduced the D8 evaluation methodology through a single-baseline measurement. Module 11 returns to it at full depth, because understanding what D8 measures -- and why it measures those things instead of MMLU, HellaSwag, or HumanEval -- is the prerequisite for designing production-grade model evaluation strategies.

The standard benchmark critique is easy: leaderboard rankings don't translate to production performance. The harder question is why. Mitchell Chapter 8 covers the limits of behavioral testing: a model that produces correct answers may be doing so for the wrong reasons, and a model that produces incorrect answers on benchmarks may nonetheless behave correctly on the actual deployment task.

The D8 study provides a concrete empirical answer. Nine models were evaluated on a real operational task (running a managed service provider operator shift via Signal) using real operational prompts (service-health triage, incident cross-reference, API key rotation recommendation). The session traces (47 JSONL files) and the results are public. The study found that benchmark rank and operational reliability are not correlated in ways that help MSP operators make deployment decisions.

11.2 The D8 Study Setup

The D8 study (published on sandhillscto.com/insights as "Nine Ollama-Cloud LLMs Walk Into A Signal Group") was run in April 2026 by the whitebox-ops operator (Claude Opus 4.7), with Jon Munson editing. The setup:

The nine models evaluated (all from Ollama Cloud, filtered for declared cloud, tools, and thinking capabilities as of 2026-04-18):

#	Model	Context window (tokens)	Verdict
1	glm-5.1:cloud	200,000	PASS
2	minimax-m2.7:cloud	200,000	STRONG PASS
3	gemma4:31b-cloud	131,072	PARTIAL
4	nemotron-3-super:cloud	262,144	PASS
5	qwen3.5:397b-cloud	262,144	STRONG PASS
6	kimi-k2.5:cloud	262,144	PASS
7	gemini-3-flash-preview:cloud	1,000,000	STRONG PASS
8	deepseek-v3.2:cloud	163,840	FAIL
9	gpt-oss:120b-cloud	131,072	PASS

The evaluation harness: Each model was wrapped in Claude Code CLI, registered in a Signal group as the active operator, and handed six operational prompts (P0-P5) plus a P6 autonomous-handoff trigger. Prompts were injected via terminalcp stdin using \r as the submit character. Each model received three retest passes for parity. The study produced 47 JSONL session traces in total.

The mock-ops fixtures: Three synthetic files (TODO.md, service-health.json, INCIDENTS.md) backed by a purpose-built MCP server (mock-ops-mcp) exposing eight tools (mock_todo_list, mock_service_health, mock_incidents_list, plus action tools). Using MCP tools rather than raw file reads isolates tool-picking fidelity as the measured axis.

11.3 The Three Evaluation Axes

The D8 study operationalized three axes chosen specifically for their relevance to sustained agentic operation:

Axis A: Operational Lifetime (OL)

What it measures: How many operational turns fit inside the model's context window before it must hand off to a successor generation.

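Operationalization: Measure avg_input_tokens_growth_per_turn from real session JSONL traces. Treat 80% of the advertised context window as the handoff trigger, then compute OL = (0.80 * context_window) / avg_growth_per_turn. For example, glm-5.1 at 200,000 tokens of context and ~2,500 tokens of growth per turn gives (0.80 * 200,000) / 2,500 = 64 turns.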
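The OL computation can be sketched in a few lines. This is a minimal illustration, not the study's own tooling: the function names and the per-turn token list are hypothetical, and it assumes that "growth per turn" means the mean turn-over-turn increase in input tokens within a session.

```python
from statistics import mean

HANDOFF_FRACTION = 0.80  # handoff trigger: 80% of the advertised context window


def avg_growth_per_turn(input_tokens_by_turn: list[int]) -> float:
    """Mean turn-over-turn increase in input tokens for one session (assumed definition)."""
    deltas = [b - a for a, b in zip(input_tokens_by_turn, input_tokens_by_turn[1:])]
    if not deltas:
        raise ValueError("need at least two turns to measure growth")
    return mean(deltas)


def operational_lifetime(context_window: int, growth_per_turn: float) -> float:
    """OL = (0.80 * context_window) / avg_growth_per_turn, in turns."""
    if growth_per_turn <= 0:
        raise ValueError("growth per turn must be positive")
    return HANDOFF_FRACTION * context_window / growth_per_turn


# Hypothetical per-turn input-token counts, chosen to grow by 2,500 tokens per turn.
turns = [12_000, 14_500, 17_000, 19_500]
growth = avg_growth_per_turn(turns)
ol = operational_lifetime(200_000, growth)
print(f"growth/turn = {growth}, OL = {ol:.0f} turns")
```

With a 200,000-token window and 2,500 tokens of growth per turn, this yields about 64 turns, matching the formula above.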

Results:

Model	Context (tokens)	Avg growth/turn (tokens)	OL (turns)
gemini-3-flash-preview	1,000,000	~2,300	~351
glm-5.1	200,000	~2,500	64
minimax-m2.7	200,000	~2,600	61
kimi-k2.5	262,144	~3,400	61
nemotron-3-super	262,144	~3,300	62
qwen3.5:397b	262,144	~4,600	45
gpt-oss:120b	131,072	~2,900	36
gemma4:31b	131,072	~3,300	32

Key finding: Gemini-3-flash's 1M context gives it ~5x the operational lifetime of 200k-class competitors. A shift that requires a handoff every 2 hours on minimax runs for the better part of a day on gemini without a handoff. This is a staffing advantage, not just a latency advantage.
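The handoff-interval claim is simple proportional arithmetic. The sketch below assumes both models process turns at the same rate, which the study does not state; the function name is hypothetical.

```python
def scaled_handoff_interval(base_hours: float, base_ol: float, other_ol: float) -> float:
    """Scale a handoff interval by the ratio of two operational lifetimes.

    Assumes both models consume turns at the same rate per hour.
    """
    if base_ol <= 0:
        raise ValueError("base_ol must be positive")
    return base_hours * other_ol / base_ol


minimax_ol, gemini_ol = 61, 351  # OL values from the results table
hours = scaled_handoff_interval(2.0, minimax_ol, gemini_ol)
print(f"{gemini_ol / minimax_ol:.1f}x lifetime, about {hours:.1f} hours between handoffs")
```

Under that equal-rate assumption, a 2-hour handoff cadence on minimax stretches to roughly 11.5 hours on gemini.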

Axis B: Productive Ratio (PR)

What it measures: The fraction of tool calls that are "productive" (advancing the operator's task) vs. "probe" (environment discovery, file re-reading, context checks that consume tokens without advancing the queue).

Operationalization: Classify each tool call by name. Productive: mock-ops MCP tools + signal_send. Probe: Bash, Read, Grep, Glob, Skill.
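The classification rule can be sketched as a lookup by tool name. This is an illustration under stated assumptions: only the three mock-ops tools named in Section 11.2 are listed (the action tools are not named in this module, so they are left out), and the choice to exclude unclassified tool names from the ratio is an assumption, not something the study specifies.

```python
from collections import Counter

# Mock-ops tool names as given in Section 11.2; the action tools are not named there.
MOCK_OPS_TOOLS = {"mock_todo_list", "mock_service_health", "mock_incidents_list"}
PRODUCTIVE = MOCK_OPS_TOOLS | {"signal_send"}
PROBE = {"Bash", "Read", "Grep", "Glob", "Skill"}


def classify(tool_name: str) -> str:
    """Label a tool call as productive, probe, or other, by name only."""
    if tool_name in PRODUCTIVE:
        return "productive"
    if tool_name in PROBE:
        return "probe"
    return "other"


def productive_ratio(tool_names: list[str]) -> float:
    """PR = productive / (productive + probe); 'other' calls are excluded (assumption)."""
    counts = Counter(classify(name) for name in tool_names)
    scored = counts["productive"] + counts["probe"]
    return counts["productive"] / scored if scored else 0.0


# Hypothetical call sequence for one session.
calls = ["Read", "mock_todo_list", "Bash", "signal_send", "Grep"]
print(productive_ratio(calls))
```

Classifying by name alone keeps the metric cheap to compute from traces, at the cost of treating every Bash or Read call as a probe even when it does useful work.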

Results:

Model	Total calls	Productive	Probe	PR
nemotron-3-super	34	21	13	61.8%
qwen3.5:397b	110	39	71	35.5%
gemini-3-flash-preview	125	44	81	35.2%
kimi-k2.5	112	39	73	34.8%
gemma4:31b	106	35	71	33.0%
minimax-m2.7	149	48	101	32.2%
glm-5.1	114	35	79	30.7%
gpt-oss:120b	98	17	81	17.3%

Key finding: gpt-oss:120b is the outlier at 17.3% PR -- 83% of its tool calls are probes, repeatedly re-reading files already in context. This burns context window faster (contributing to its short OL of 36 turns) and adds a latency tax. Nemotron-3-super leads at 61.8% but with high variance across sessions (43-80% range), indicating inconsistency in when it chooses to probe vs. act.
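An aggregate PR can hide a wide per-session spread, which is the nemotron pattern described above. The sketch below reports both; the session counts are hypothetical and are not taken from the study's traces.

```python
def pr_spread(sessions: list[tuple[int, int]]) -> dict:
    """Aggregate and per-session PR from (productive, probe) counts per session."""
    per_session = [p / (p + q) for p, q in sessions if p + q > 0]
    total_productive = sum(p for p, _ in sessions)
    total_calls = sum(p + q for p, q in sessions)
    return {
        "aggregate": total_productive / total_calls if total_calls else 0.0,
        "min": min(per_session) if per_session else 0.0,
        "max": max(per_session) if per_session else 0.0,
    }


# Hypothetical (productive, probe) counts for three sessions of one model.
spread = pr_spread([(4, 1), (3, 4), (6, 6)])
print(spread)
```

Here the aggregate sits near 54% while individual sessions range from about 43% to 80%, so reporting the range alongside the aggregate gives a fairer picture of consistency.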

Axis C: Wordiness (W)

What it measures: The length and emoji density of the model's outbound operator communications (signal_send payloads).

Operationalization: Measure average characters per signal_send call; count emojis per 1,000 characters.
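Both wordiness measures can be computed directly from the signal_send payloads. In this sketch the emoji test is a rough approximation based on a few Unicode code-point ranges (an assumption; the study does not say how it counted emojis), and the sample messages are hypothetical.

```python
# Approximate emoji ranges: pictographs, misc symbols/dingbats, regional indicators.
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF), (0x1F1E6, 0x1F1FF)]


def is_emoji(ch: str) -> bool:
    """Rough emoji check by code point; not a full Unicode emoji classifier."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def wordiness(messages: list[str]) -> dict:
    """Average characters per message and emojis per 1,000 characters."""
    total_chars = sum(len(m) for m in messages)
    emojis = sum(1 for m in messages for ch in m if is_emoji(ch))
    return {
        "avg_chars": total_chars / len(messages) if messages else 0.0,
        "emojis_per_1k": 1000 * emojis / total_chars if total_chars else 0.0,
    }


# Hypothetical payloads; \u2705 is the check-mark emoji.
sample = ["db-01 healthy \u2705", "Rotating API key now"]
print(wordiness(sample))
```

Counting code points rather than grapheme clusters means multi-code-point emojis (flags, skin-tone variants) count more than once, which is acceptable for a coarse density comparison between models.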

Results:

Model	Avg chars/signal_send	Emojis/1k chars	Style cluster
glm-5.1	770	1.02	Analyst-operator
qwen3.5:397b	539	1.21	Briefing-operator
minimax-m2.7	460	0.58	Briefing-operator
kimi-k2.5	453	1.60	Briefing-operator
nemotron-3-super	401	0.00	Short-ops-operator
gpt-oss:120b	385	0.90	Short-ops-operator
gemini-3-flash-preview	258	0.00	Short-ops-operator
gemma4:31b	208	0.00	Short-ops-operator

Three style clusters: short-ops-operator (roughly 400 chars or fewer, factual, few or no emojis -- feels like a pager alert); briefing-operator (roughly 450-550 chars, structured Markdown -- feels like a stand-up update); analyst-operator (glm-5.1 alone, about 770 chars, tables, multi-paragraph -- feels like a morning report).

The right cluster for a deployment depends on audience expectations, not model capability.

11.4 Tier Assignments and Deployment Recommendations

The D8 study assigns models to three tiers:

Tier 1 (ship to clients):

minimax-m2.7:cloud -- fastest median latency, clean on all axes
gemini-3-flash-preview:cloud -- 1M context headroom, sub-6s median latency
qwen3.5:397b-cloud -- live-traffic-first prioritization philosophy; action-ready disposition

Tier 2 (viable with caveats):

glm-5.1:cloud -- depth-of-reasoning leader; emergent cross-file insight not seen in Tier 1 models; 40-50s warm latency is slower than Tier 1
kimi-k2.5:cloud -- reliable, no distinctive quirks, mid-tier on all axes
gpt-oss:120b-cloud -- lowest PR (17.3%); reliable but works harder than necessary
nemotron-3-super:cloud -- highest PR in aggregate, but inconsistent across sessions; handoff not yet verified

Tier 3 (do not ship):

gemma4:31b-cloud -- cross-file synthesis weakness; ack-first work-later behavior
deepseek-v3.2:cloud -- reliably fails the /standup multi-step ritual; hung on P1 and P3 in separate trials

11.5 Dimensions the D8 Study Did Not Measure

The study explicitly documents what it does not cover. Understanding the scope gaps is as important as understanding the results:

Long-session context erosion. The 6-prompt sequence consumed 7-35% of each model's context. Real operator shifts run for days.
Cold-start behavior. All trials used warm Ollama Cloud inference windows.
Concurrent-load behavior. All trials were single-operator.
Tool-call error recovery. Mock-ops tools never returned errors.
Longer handoff chains. Each trial tested one handoff. Three-hop chains are not covered.

A complete production evaluation must cover all five of these gaps. The D8 study is a starting point, not a complete evaluation framework.

11.6 Replicating D8 for a Different Deployment Context

The D8 methodology is domain-agnostic. Any agentic deployment can be evaluated using the same three axes with domain-appropriate fixtures:

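class D8Evaluation:
    """Template for replicating D8 methodology on a new deployment."""

    def __init__(self, model_name: str, fixture_dir: str, ops_tools: list[str]):
        self.model = model_name
        self.fixture_dir = fixture_dir
        self.productive_tools = set(ops_tools)   # domain-specific ops tools
        # Probe tools as classified under Axis B in Section 11.3.
        self.probe_tools = {"Bash", "Read", "Grep", "Glob", "Skill"}
        self.sessions: list[dict] = []

    def run_session(self, prompts: list[str]) -> dict:
        """Run one evaluation session; record and return per-session metrics."""
        tool_calls: list[dict] = []     # one {"name": ...} dict per tool call
        signal_sends: list[str] = []    # outbound message text, one entry per send
        input_tokens: list[int] = []    # input-token count observed at each turn
        # ... harness-specific implementation ...
        session = {
            "tool_calls": tool_calls,
            "signal_sends": signal_sends,
            "input_token_growth_per_turn": self._compute_token_growth(input_tokens),
        }
        self.sessions.append(session)   # the score_* methods read from self.sessions
        return session

    @staticmethod
    def _compute_token_growth(input_tokens: list[int]) -> float:
        """Average turn-over-turn input-token growth (assumed definition; 0.0 if under two turns)."""
        if len(input_tokens) < 2:
            return 0.0
        return (input_tokens[-1] - input_tokens[0]) / (len(input_tokens) - 1)

    def score_ol(self, context_window: int) -> float:
        """Operational lifetime in turns, using the 80% handoff trigger."""
        growths = [s["input_token_growth_per_turn"] for s in self.sessions]
        avg_growth = sum(growths) / len(growths) if growths else 0.0
        if avg_growth <= 0:
            raise ValueError("need at least one session with positive token growth")
        return (0.80 * context_window) / avg_growth

    def score_pr(self) -> float:
        """Productive ratio: productive tool calls over all tool calls."""
        productive = sum(1 for s in self.sessions for tc in s["tool_calls"]
                         if tc["name"] in self.productive_tools)
        total = sum(len(s["tool_calls"]) for s in self.sessions)
        return productive / total if total else 0.0

    def score_w(self) -> dict:
        """Wordiness: average characters per send (emoji density is not computed here)."""
        all_sends = [send for s in self.sessions for send in s["signal_sends"]]
        avg_chars = sum(len(m) for m in all_sends) / len(all_sends) if all_sends else 0.0
        return {"avg_chars": avg_chars, "n_sends": len(all_sends)}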

Adapt ops_tools to match the domain: for a customer-support deployment, productive tools are the CRM API calls; for a code-review deployment, productive tools are the code analysis calls.

Reflection prompts

Qwen's "live-traffic-first" prioritization differs from the consensus "data-loss-first" priority. In an SLA-driven MSP context, which prioritization philosophy is more aligned with client expectations? Is there a deployment context where qwen's philosophy would be strictly correct?
The D8 study measures PR across the entire session. A model with PR = 30% might have PR = 5% in the first two turns (environment discovery) and PR = 60% in the remaining turns (steady-state operations). Design a per-phase PR metric that would reveal this pattern and explain why it matters for deployment decisions.
Gemini-3-flash's 1M context gives OL = 351 turns. At the trial's observed tool-call rate, how many days of continuous operation does 351 turns represent? What would cause the actual operational lifetime to differ from this estimate?