
Module 11: The D8 Methodology in Depth — Full Comparative Evaluation


Duration: 3 hr lecture + 4 hr lab + 6 hr independent
Lab: Lab 11 (Multi-model D8 evaluation: 3 models, full methodology)
MITRE ATLAS tactics: (evaluation and defense framework)
Foundational weave: Mitchell Ch 8 (benchmarks); D8 primary source — "Nine Ollama-Cloud LLMs Walk Into A Signal Group"


11.1 Why Benchmarks Fail Production Operators

Module 2 introduced the D8 evaluation methodology through a single-baseline measurement. Module 11 returns to it at full depth, because understanding what D8 measures -- and why it measures those things instead of MMLU, HellaSwag, or HumanEval -- is the prerequisite for designing production-grade model evaluation strategies.

The standard benchmark critique is easy: leaderboard rank doesn't translate to production performance. The harder question is why. Mitchell Chapter 8 covers the limits of behavioral testing: a model that produces correct answers may be doing so for the wrong reasons, and a model that produces incorrect answers on benchmarks may nonetheless behave correctly on the actual deployment task.

The D8 study provided a concrete empirical answer. Nine models were evaluated on a real operational task (running a managed service provider operator shift via Signal) using real operational prompts (service-health triage, incident cross-reference, API key rotation recommendation). The session traces are 47 JSONL files. The results are public. The study found that benchmark rank and operational reliability are not correlated in ways that help MSP operators make deployment decisions.


11.2 The D8 Study Setup

The D8 study (published on sandhillscto.com/insights as "Nine Ollama-Cloud LLMs Walk Into A Signal Group") was run in April 2026 by the whitebox-ops operator (Claude Opus 4.7), with Jon Munson editing. The setup:

The nine models evaluated (all from Ollama Cloud, filtered for declared cloud, tools, and thinking capabilities as of 2026-04-18):

#  Model                         Context window (tokens)  Verdict
1  glm-5.1:cloud                 200,000                  PASS
2  minimax-m2.7:cloud            200,000                  STRONG PASS
3  gemma4:31b-cloud              131,072                  PARTIAL
4  nemotron-3-super:cloud        262,144                  PASS
5  qwen3.5:397b-cloud            262,144                  STRONG PASS
6  kimi-k2.5:cloud               262,144                  PASS
7  gemini-3-flash-preview:cloud  1,000,000                STRONG PASS
8  deepseek-v3.2:cloud           163,840                  FAIL
9  gpt-oss:120b-cloud            131,072                  PASS

The evaluation harness: Each model was wrapped in Claude Code CLI, registered in a Signal group as the active operator, and handed six operational prompts (P0-P5) plus a P6 autonomous-handoff trigger. Prompts were injected via terminalcp stdin using \r as the submit character. Three retest passes per model for parity. 47 JSONL session traces total.

The mock-ops fixtures: Three synthetic files (TODO.md, service-health.json, INCIDENTS.md) backed by a purpose-built MCP server (mock-ops-mcp) exposing eight tools (mock_todo_list, mock_service_health, mock_incidents_list, plus action tools). Using MCP tools rather than raw file reads isolates tool-picking fidelity as the measured axis.


11.3 The Three Evaluation Axes

The D8 study operationalized three axes chosen specifically for their relevance to sustained agentic operation:

Axis A: Operational Lifetime (OL)

What it measures: How many operational turns fit inside the model's context window before it must hand off to a successor generation.

Operationalization: Measure avg_input_tokens_growth_per_turn from real session JSONL traces. Project 80% of the advertised context window as the handoff trigger. Compute OL = (0.80 * context_window) / avg_growth_per_turn.
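The projection above can be sketched in a few lines of Python. This is a minimal sketch, not the study's harness: the function names are illustrative, and it assumes each turn's input-token count has already been extracted from a session trace into a plain list (the trace schema is not described here).

```python
def avg_growth_per_turn(input_tokens: list[int]) -> float:
    """Average growth in input tokens between consecutive turns of one session."""
    if len(input_tokens) < 2:
        raise ValueError("need at least two turns to measure growth")
    # The mean of consecutive differences telescopes to (last - first) / (n - 1).
    return (input_tokens[-1] - input_tokens[0]) / (len(input_tokens) - 1)


def operational_lifetime(context_window: int, growth_per_turn: float,
                         handoff_fraction: float = 0.80) -> float:
    """Turns that fit before the handoff trigger (80% of the window by default)."""
    if growth_per_turn <= 0:
        raise ValueError("growth_per_turn must be positive")
    return (handoff_fraction * context_window) / growth_per_turn


# Hypothetical per-turn input-token counts for one session (not D8 data).
example_turns = [12_000, 14_400, 17_100, 19_500]
example_growth = avg_growth_per_turn(example_turns)          # 7,500 / 3 turns
example_ol = operational_lifetime(200_000, example_growth)   # 160,000 / growth
```

With a 200,000-token window and 2,500 tokens of growth per turn, the sketch returns 64 turns, which agrees with the glm-5.1 figures reported in this section.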

Results:

Model                   Context (tokens)  Avg growth/turn (tokens)  OL (turns)
gemini-3-flash-preview  1,000,000         ~2,300                    ~351
glm-5.1                 200,000           ~2,500                    64
nemotron-3-super        262,144           ~3,300                    62
minimax-m2.7            200,000           ~2,600                    61
kimi-k2.5               262,144           ~3,400                    61
qwen3.5:397b            262,144           ~4,600                    45
gpt-oss:120b            131,072           ~2,900                    36
gemma4:31b              131,072           ~3,300                    32

Key finding: Gemini-3-flash's 1M context gives it ~5x the operational lifetime of 200k-class competitors. A shift that requires a handoff every 2 hours on minimax runs for the better part of a day on gemini without a handoff. This is a staffing advantage, not just a latency advantage.
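The handoff comparison is proportional scaling, and a short check makes the arithmetic explicit. The sketch below assumes both models take turns at the same rate, which is an assumption of this example and not something the study reports; the function name is illustrative.

```python
def scaled_handoff_hours(baseline_hours: float, baseline_ol: float,
                         target_ol: float) -> float:
    """Scale a handoff interval by the ratio of two operational lifetimes.

    Assumes both models take turns at the same rate (an assumption of this
    sketch, not a measured result).
    """
    if baseline_ol <= 0:
        raise ValueError("baseline_ol must be positive")
    return baseline_hours * (target_ol / baseline_ol)


# OL values from the results: minimax-m2.7 = 61 turns, gemini-3-flash-preview = ~351.
gemini_hours = scaled_handoff_hours(2.0, baseline_ol=61, target_ol=351)
print(f"{gemini_hours:.1f} hours between handoffs")
```

Under that equal-turn-rate assumption, a 2-hour handoff interval on minimax stretches to roughly 11.5 hours on gemini, which is what "the better part of a day" refers to.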

Axis B: Productive Ratio (PR)

What it measures: The fraction of tool calls that are "productive" (advancing the operator's task) vs. "probe" (environment discovery, file re-reading, context checks that consume tokens without advancing the queue).

Operationalization: Classify each tool call by name. Productive: mock-ops MCP tools + signal_send. Probe: Bash, Read, Grep, Glob, Skill.
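The name-based classification can be sketched as follows. This is a hedged illustration: the productive set lists only the tool names mentioned in this module (the action tools are not named, so the set is incomplete), and the choice to leave unrecognized tool names out of the denominator is an assumption of this sketch.

```python
from collections import Counter

# Tool names taken from this module's text. The mock-ops action tools are not
# named there, so this set is incomplete and would need extending.
PRODUCTIVE_TOOLS = {"mock_todo_list", "mock_service_health",
                    "mock_incidents_list", "signal_send"}
PROBE_TOOLS = {"Bash", "Read", "Grep", "Glob", "Skill"}


def classify_tool_call(name: str) -> str:
    """Label one tool call as productive, probe, or unclassified by its name."""
    if name in PRODUCTIVE_TOOLS:
        return "productive"
    if name in PROBE_TOOLS:
        return "probe"
    return "unclassified"


def productive_ratio(tool_names: list[str]) -> float:
    """PR = productive / (productive + probe); unclassified calls are ignored."""
    counts = Counter(classify_tool_call(n) for n in tool_names)
    total = counts["productive"] + counts["probe"]
    return counts["productive"] / total if total else 0.0


# Hypothetical call sequence for illustration (not taken from the D8 traces).
example_calls = ["Read", "mock_service_health", "Bash",
                 "mock_incidents_list", "signal_send"]
example_pr = productive_ratio(example_calls)   # 3 productive, 2 probe
```

Because the label depends only on the tool name, a Read that genuinely advances the task still counts as a probe; that coarseness is a property of the metric as defined, not of this sketch.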

Results:

Model                   Total calls  Productive  Probe  PR
nemotron-3-super        34           21          13     61.8%
qwen3.5:397b            110          39          71     35.5%
gemini-3-flash-preview  125          44          81     35.2%
kimi-k2.5               112          39          73     34.8%
gemma4:31b              106          35          71     33.0%
minimax-m2.7            149          48          101    32.2%
glm-5.1                 114          35          79     30.7%
gpt-oss:120b            98           17          81     17.3%

Key finding: gpt-oss:120b is the outlier at 17.3% PR -- 83% of its tool calls are probes, repeatedly re-reading files already in context. This burns context window faster (contributing to its short OL of 36 turns) and adds a latency tax. Nemotron-3-super leads at 61.8% in aggregate, but its per-session PR varies widely (43-80% range), indicating inconsistency in when it chooses to probe vs. act.
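The gap between an aggregate PR and its per-session spread can be made concrete with a small sketch. The per-session counts below are hypothetical and are not taken from the D8 traces; the function name and the (productive, total) tuple layout are also assumptions of this example.

```python
def pr_summary(sessions: list[tuple[int, int]]) -> dict:
    """Summarize PR for sessions given as (productive_calls, total_calls) pairs."""
    if not sessions:
        raise ValueError("need at least one session")
    per_session = [p / t if t else 0.0 for p, t in sessions]
    pooled_productive = sum(p for p, _ in sessions)
    pooled_total = sum(t for _, t in sessions)
    return {
        # Pooled PR: all calls from all sessions counted together.
        "pooled": pooled_productive / pooled_total if pooled_total else 0.0,
        # Spread: lowest and highest single-session PR.
        "min": min(per_session),
        "max": max(per_session),
    }


# Hypothetical sessions for illustration only (not D8 data).
example_summary = pr_summary([(8, 10), (6, 12), (3, 8)])
```

Here the pooled PR is about 57%, while individual sessions range from 37.5% to 80%. Reporting the range next to the aggregate shows whether a headline PR is representative of any single shift.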

Axis C: Wordiness (W)

What it measures: The length and emoji density of the model's outbound operator communications (signal_send payloads).

Operationalization: Measure average characters per signal_send call; count emojis per 1,000 characters.
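Both measurements can be sketched with the standard library alone. The emoji detection below is a heuristic based on a few Unicode code-point ranges, which is an assumption of this sketch; the study's exact emoji definition is not given here, and multi-code-point emoji (flags, skin-tone sequences) would be counted more than once.

```python
# Heuristic emoji ranges (an assumption of this sketch, not the study's definition).
EMOJI_RANGES = (
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0x1F1E6, 0x1F1FF),  # regional indicator symbols (flag halves)
)


def count_emojis(text: str) -> int:
    """Count code points that fall inside the heuristic emoji ranges."""
    return sum(1 for ch in text
               if any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES))


def wordiness(messages: list[str]) -> dict:
    """Average characters per message and emojis per 1,000 characters."""
    if not messages:
        return {"avg_chars": 0.0, "emojis_per_1k": 0.0}
    total_chars = sum(len(m) for m in messages)   # len() counts code points
    total_emojis = sum(count_emojis(m) for m in messages)
    return {
        "avg_chars": total_chars / len(messages),
        "emojis_per_1k": 1000 * total_emojis / total_chars if total_chars else 0.0,
    }


# Hypothetical payloads for illustration (not D8 data); \u2705 is a check-mark emoji.
example_w = wordiness(["Disk at 91% \u2705", "ack"])
```

The two example payloads total 16 characters with one emoji, so the sketch reports 8.0 characters per message and 62.5 emojis per 1,000 characters.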

Results:

Model                   Avg chars/signal_send  Emojis/1k chars  Style cluster
glm-5.1                 770                    1.02             Analyst-operator
qwen3.5:397b            539                    1.21             Briefing-operator
minimax-m2.7            460                    0.58             Briefing-operator
kimi-k2.5               453                    1.60             Briefing-operator
nemotron-3-super        401                    0.00             Short-ops-operator
gpt-oss:120b            385                    0.90             Short-ops-operator
gemini-3-flash-preview  258                    0.00             Short-ops-operator
gemma4:31b              208                    0.00             Short-ops-operator

Three style clusters: short-ops-operator (roughly 400 chars or fewer, factual, few or no emojis -- feels like a pager alert); briefing-operator (roughly 450-550 chars, structured Markdown -- feels like a stand-up update); analyst-operator (glm-5.1 alone, averaging about 770 chars, with tables and multi-paragraph messages -- feels like a morning report).

The right cluster for a deployment depends on audience expectations, not model capability.


11.4 Tier Assignments and Deployment Recommendations

The D8 study assigns models to three tiers:

Tier 1 (ship to clients):

  • minimax-m2.7:cloud -- fastest median latency, clean on all axes
  • gemini-3-flash-preview:cloud -- 1M context headroom, sub-6s median latency
  • qwen3.5:397b-cloud -- live-traffic-first prioritization philosophy; action-ready disposition

Tier 2 (viable with caveats):

  • glm-5.1:cloud -- depth-of-reasoning leader; emergent cross-file insight not seen in Tier 1 models; 40-50s warm latency is slower than Tier 1
  • kimi-k2.5:cloud -- reliable, no distinctive quirks, mid-tier on all axes
  • gpt-oss:120b-cloud -- lowest PR (17.3%); reliable but works harder than necessary
  • nemotron-3-super:cloud -- highest PR on aggregate, but cross-session consistency issue; handoff not yet verified

Tier 3 (do not ship):

  • gemma4:31b-cloud -- cross-file synthesis weakness; ack-first work-later behavior
  • deepseek-v3.2:cloud -- reliably fails the /standup multi-step ritual; hung on P1 and P3 in separate trials

11.5 Dimensions the D8 Study Did Not Measure

The study explicitly documents what it does not cover. Understanding the scope gaps is as important as understanding the results:

  • Long-session context erosion. The 6-prompt sequence consumed 7-35% of each model's context. Real operator shifts run for days.
  • Cold-start behavior. All trials used warm Ollama Cloud inference windows.
  • Concurrent-load behavior. All trials were single-operator.
  • Tool-call error recovery. Mock-ops tools never returned errors.
  • Longer handoff chains. Each trial tested one handoff. Three-hop chains are not covered.

A complete production evaluation must include all five of these. The D8 study is a starting point, not a complete evaluation framework.


11.6 Replicating D8 for a Different Deployment Context

The D8 methodology is domain-agnostic. Any agentic deployment can be evaluated using the same three axes with domain-appropriate fixtures:

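class D8Evaluation:
    """Template for replicating D8 methodology on a new deployment."""

    def __init__(self, model_name: str, fixture_dir: str, ops_tools: list[str]):
        self.model = model_name
        self.fixture_dir = fixture_dir
        self.productive_tools = set(ops_tools)   # domain-specific ops tools
        # Probe tools as listed under Axis B; kept for reporting. PR itself
        # only needs the productive set and the total call count.
        self.probe_tools = {"Bash", "Read", "Grep", "Glob", "Skill"}
        self.sessions: list[dict] = []

    def run_session(self, prompts: list[str]) -> dict:
        """Run one evaluation session; store and return per-session metrics."""
        tool_calls: list[dict] = []     # one {"name": ...} dict per tool call
        signal_sends: list[str] = []    # outbound message payloads
        input_tokens: list[int] = []    # input-token count per turn (assumed harness output)
        # ... harness-specific implementation fills the three lists ...
        session = {
            "tool_calls": tool_calls,
            "signal_sends": signal_sends,
            "input_token_growth_per_turn": self._compute_token_growth(input_tokens),
        }
        self.sessions.append(session)   # the score_* methods read self.sessions
        return session

    @staticmethod
    def _compute_token_growth(input_tokens: list[int]) -> float:
        """Average input-token growth per turn; 0.0 with fewer than two turns."""
        if len(input_tokens) < 2:
            return 0.0
        return (input_tokens[-1] - input_tokens[0]) / (len(input_tokens) - 1)

    def score_ol(self, context_window: int) -> float:
        """Operational Lifetime: turns before 80% of the context window is used."""
        growths = [s["input_token_growth_per_turn"] for s in self.sessions]
        avg_growth = sum(growths) / len(growths) if growths else 0.0
        if avg_growth <= 0:
            raise ValueError("score_ol needs sessions with positive token growth")
        return (0.80 * context_window) / avg_growth

    def score_pr(self) -> float:
        """Productive Ratio: productive tool calls over all tool calls."""
        productive = sum(1 for s in self.sessions for tc in s["tool_calls"]
                         if tc["name"] in self.productive_tools)
        total = sum(len(s["tool_calls"]) for s in self.sessions)
        return productive / total if total else 0.0

    def score_w(self) -> dict:
        """Wordiness: average characters per outbound message."""
        all_sends = [send for s in self.sessions for send in s["signal_sends"]]
        avg_chars = sum(len(m) for m in all_sends) / len(all_sends) if all_sends else 0.0
        return {"avg_chars": avg_chars, "n_sends": len(all_sends)}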

Adapt ops_tools to match the domain: for a customer-support deployment, productive tools are the CRM API calls; for a code-review deployment, productive tools are the code analysis calls.


Reflection prompts

  1. Qwen's "live-traffic-first" prioritization differs from the consensus "data-loss-first" priority. In an SLA-driven MSP context, which prioritization philosophy is more aligned with client expectations? Is there a deployment context where qwen's philosophy would be strictly correct?
  2. The D8 study measures PR across the entire session. A model with PR = 30% might have PR = 5% in the first two turns (environment discovery) and PR = 60% in the remaining turns (steady-state operations). Design a per-phase PR metric that would reveal this pattern and explain why it matters for deployment decisions.
  3. Gemini-3-flash's 1M context gives OL = 351 turns. At the trial's observed tool-call rate, how many days of continuous operation does 351 turns represent? What would cause the actual operational lifetime to differ from this estimate?