Lab 11: Multi-Model D8 Evaluation · AI-201 · Virtus Cyber Academy Classroom

Module: 11 — The D8 Methodology in Depth
Points: 25
Time estimate: 4 hr lab + 6 hr independent
Deliverable: lab-11-report.md + lab11/ directory

Objectives

Replicate the D8 methodology operationalization from the primary source article.
Run a 3-model comparative evaluation using promptfoo and Ollama.
Produce a D8-format scorecard with a deployment recommendation.
Identify one evaluation dimension the D8 study did not measure and design the test for it.

Setup

pip install anthropic requests
npm install -g promptfoo
ollama pull llama3.2:3b
ollama pull qwen2.5:7b
ollama pull gemma3:4b

mkdir lab11 && cd lab11

Part A: D8 Methodology Replication (60 min)

The D8 study evaluated 9 Ollama Cloud models across 47 sessions along three evaluation axes: Operational Lifetime (OL), Productive Ratio (PR), and Wordiness (W). Your first task is to operationalize these axes in code so you can measure them on any model.

A.1: OL Operationalization

Write lab11/metrics.py:

import json
from pathlib import Path
from typing import Any

def compute_ol(session_traces: list[list[dict]], context_window: int) -> float:
    """
    Operational Lifetime: projected turns before context window is 80% full.

    session_traces: list of sessions, each a list of turn dicts with 'input_tokens' key.
    context_window: advertised context window for the model.
    """
    per_session_growth = []
    for session in session_traces:
        turns = [t for t in session if "input_tokens" in t]
        if len(turns) < 2:
            continue
        token_deltas = [
            turns[i]["input_tokens"] - turns[i-1]["input_tokens"]
            for i in range(1, len(turns))
        ]
        avg_growth = sum(token_deltas) / len(token_deltas)
        per_session_growth.append(avg_growth)

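    if not per_session_growth:
        return 0.0

    avg_growth = sum(per_session_growth) / len(per_session_growth)
    if avg_growth <= 0:
        # Flat or shrinking token counts cannot be projected forward;
        # fail loudly instead of dividing by zero or returning a negative OL.
        raise ValueError("average token growth per turn must be positive")
    return (0.80 * context_window) / avg_growth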


def compute_pr(session_traces: list[list[dict]], productive_tools: set[str]) -> float:
    """
    Productive Ratio: fraction of tool calls that advance the task.

    productive_tools: set of tool names considered productive (domain-specific).
    """
    productive = 0
    total = 0
    for session in session_traces:
        for turn in session:
            for call in turn.get("tool_calls", []):
                total += 1
                if call["name"] in productive_tools:
                    productive += 1
    return productive / total if total > 0 else 0.0


def compute_w(session_traces: list[list[dict]], output_tool: str = "signal_send") -> dict:
    """
    Wordiness: average characters per outbound communication call.
    """
    all_payloads = []
    for session in session_traces:
        for turn in session:
            for call in turn.get("tool_calls", []):
                if call["name"] == output_tool:
                    payload = call.get("args", {}).get("message", "")
                    all_payloads.append(payload)

    if not all_payloads:
        return {"avg_chars": 0, "n_sends": 0, "style_cluster": "unknown"}

    avg_chars = sum(len(p) for p in all_payloads) / len(all_payloads)

    if avg_chars < 400:
        cluster = "short-ops-operator"
    elif avg_chars < 550:
        cluster = "briefing-operator"
    else:
        cluster = "analyst-operator"

    return {
        "avg_chars": round(avg_chars, 1),
        "n_sends": len(all_payloads),
        "style_cluster": cluster,
    }


def score_verdict(ol: float, pr: float) -> str:
    """
    D8 PASS/PARTIAL/FAIL assignment.

    STRONG PASS: OL >= 45 AND PR >= 35%
    PASS:        OL >= 30 AND PR >= 25%
    PARTIAL:     OL >= 20 AND PR >= 20%
    FAIL:        anything else
    """
    if ol >= 45 and pr >= 0.35:
        return "STRONG PASS"
    if ol >= 30 and pr >= 0.25:
        return "PASS"
    if ol >= 20 and pr >= 0.20:
        return "PARTIAL"
    return "FAIL"
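To make the trace shape and the OL and PR arithmetic concrete, here is a minimal sketch on hand-made data. Every number and every `demo_` name is invented for illustration and does not come from the D8 study; the calculation mirrors the formulas in metrics.py for a single session.

```python
# Hand-made example: one session of four turns. All numbers are invented.
demo_session = [
    {"input_tokens": 500,  "tool_calls": [{"name": "code_block"}]},
    {"input_tokens": 1500, "tool_calls": []},
    {"input_tokens": 2500, "tool_calls": [{"name": "json_output", "args": {"message": "{}"}}]},
    {"input_tokens": 3500, "tool_calls": [{"name": "web_search"}]},
]
demo_traces = [demo_session]       # a list of sessions, each a list of turn dicts
demo_context_window = 100_000      # hypothetical context window, in tokens

# Average per-turn growth in input tokens for the session.
tokens = [turn["input_tokens"] for turn in demo_session]
deltas = [later - earlier for earlier, later in zip(tokens, tokens[1:])]
avg_growth = sum(deltas) / len(deltas)

# OL = (0.80 * context_window) / avg_growth
demo_ol = (0.80 * demo_context_window) / avg_growth

# PR = productive calls / total calls, using a hypothetical productive set.
demo_productive = {"code_block", "json_output"}
calls = [call for turn in demo_session for call in turn["tool_calls"]]
demo_pr = sum(call["name"] in demo_productive for call in calls) / len(calls)

print(f"avg_growth={avg_growth:.0f} OL={demo_ol:.1f} PR={demo_pr:.3f}")
```

Here the session grows by 1,000 tokens per turn, so 80% of the hypothetical window is reached after about 80 turns, and two of the three tool calls count as productive.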

A.2: Methodology Documentation

In lab-11-report.md, answer:

The D8 study formula for OL is (0.80 × context_window) / avg_growth_per_turn. Why 80% and not 100%? What operational consideration does the 20% headroom represent?
The PR metric distinguishes "productive" from "probe" tool calls. Give two examples of probe tool calls from the D8 study and explain why they are classified as non-productive despite being necessary for the agent to function.
The D8 study found three style clusters for Wordiness. Name them, give the character-count range for each, and describe the communication context where each cluster is most appropriate.

Part B: 3-Model Comparative Evaluation (90 min)

B.1: Session Harness

Write lab11/harness.py — a lightweight session runner that feeds prompts to an Ollama model and logs token growth and tool calls:

import json, time, requests
from pathlib import Path
from dataclasses import dataclass, field

@dataclass
class TurnRecord:
    turn:          int
    prompt:        str
    response:      str
    input_tokens:  int
    output_tokens: int
    tool_calls:    list[dict] = field(default_factory=list)


def run_session(
    model: str,
    prompts: list[str],
    system_prompt: str = "",
) -> list[dict]:
    """
    Run a sequence of prompts through an Ollama model.
    Returns a list of TurnRecord dicts.
    """
    url = "http://localhost:11434/api/chat"
    history = []
    if system_prompt:
        history.append({"role": "system", "content": system_prompt})

    records = []
    for i, prompt in enumerate(prompts):
        history.append({"role": "user", "content": prompt})
        payload = {
            "model": model,
            "messages": history,
            "stream": False,
        }
        resp = requests.post(url, json=payload, timeout=120)
        resp.raise_for_status()
        data = resp.json()

        content   = data["message"]["content"]
        in_tokens = data.get("prompt_eval_count", 0)
        out_tokens = data.get("eval_count", 0)

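        # Detect simple tool-call patterns in the response for PR and W scoring.
        # The full text is stored under args.message so compute_w can measure it.
        tool_calls = []
        if "```bash" in content or "```python" in content:
            tool_calls.append({"name": "code_block", "args": {"message": content}})
        if content.strip().startswith(("{", "[")):
            tool_calls.append({"name": "json_output", "args": {"message": content}})
        if not tool_calls:
            # "prose_only" is a label introduced here for turns with no structured
            # output; without it every detected call is productive and PR can
            # only be 0% or 100%.
            tool_calls.append({"name": "prose_only", "args": {"message": content}})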

        records.append({
            "turn":         i,
            "prompt":       prompt[:100],
            "response":     content[:200],
            "input_tokens": in_tokens,
            "output_tokens": out_tokens,
            "tool_calls":   tool_calls,
        })
        history.append({"role": "assistant", "content": content})
        time.sleep(0.5)

    return records
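One practical caveat when measuring token growth: Ollama applies its own default context length to each request unless told otherwise, and that default can be much smaller than the model's advertised window, so a long history may be truncated before it reaches the model. The sketch below builds a request body that sets the window explicitly through the `options.num_ctx` field of the chat API. The helper name `build_chat_payload` and the 8192 default are assumptions for illustration, not part of the required lab code.

```python
def build_chat_payload(model: str, messages: list[dict], num_ctx: int = 8192) -> dict:
    """Build a non-streaming /api/chat request body with an explicit context length."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        # "options" carries per-request model parameters;
        # num_ctx is the context length in tokens.
        "options": {"num_ctx": num_ctx},
    }


payload = build_chat_payload("llama3.2:3b", [{"role": "user", "content": "ping"}])
print(payload["options"])
```

If you adopt something like this in run_session, record the `num_ctx` value you used in your report, since it bounds how far `input_tokens` can grow.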

B.2: Evaluation Prompts

Write lab11/eval_prompts.py — 6 prompts designed to exercise agentic reasoning in a security analysis context:

SYSTEM_PROMPT = """You are an AI security analyst assistant. 
You help analyze vulnerabilities, map attack techniques to MITRE ATLAS, 
and produce structured security assessments.
Be concise. Produce structured output (JSON, tables, or numbered lists) 
when summarizing findings."""

PROMPTS = [
    # P0: Tool triage
    "List the 3 most important tools you would use to analyze a suspicious Python "
    "package before installing it. For each tool, give its name and a one-sentence "
    "description of what it detects.",

    # P1: CVE classification
    "A CVE description reads: 'Remote code execution via pickle deserialization in "
    "model loading endpoint. CVSS 9.3.' Classify this CVE by MITRE ATLAS tactic "
    "and technique. Return JSON with keys: tactic_id, tactic_name, technique_id, "
    "technique_name, rationale.",

    # P2: Threat modeling
    "A LangChain application uses a vector store populated from customer-uploaded "
    "PDFs. Name the top 3 ATLAS techniques an attacker could use against this "
    "architecture. Return a table with columns: Technique ID, Name, Attack Path.",

    # P3: Cross-reference
    "Cross-reference the following two findings: (1) AML.T0020 Poison Training Data "
    "via upstream dependency; (2) AML.T0048 Backdoor ML Model via model file upload. "
    "What kill-chain step connects them? Return a numbered list of steps.",

    # P4: Remediation
    "A security audit found that a production LLM pipeline fetches URLs from "
    "LLM-generated content without validation. Write the Python allow-list function "
    "that would block this attack. Include DNS rebinding protection.",

    # P5: Briefing
    "Summarize the 5 highest-impact ATLAS techniques for a team deploying a RAG-based "
    "customer support chatbot. Write the summary for a non-technical CISO. Under 150 "
    "words.",
]

B.3: Run Three Models

Write lab11/run_evaluation.py:

import json
from pathlib import Path
from harness import run_session
from eval_prompts import SYSTEM_PROMPT, PROMPTS
from metrics import compute_ol, compute_pr, compute_w, score_verdict

MODELS = [
    ("llama3.2:3b",  131_072),
    ("qwen2.5:7b",   131_072),
    ("gemma3:4b",    128_000),
]

# Tools we consider "productive" in this security-analyst context
PRODUCTIVE_TOOLS = {"code_block", "json_output"}

results = {}

for model_name, context_window in MODELS:
    print(f"\n=== Running {model_name} ===")
    # 3 sessions per model (D8 uses 3 retest passes)
    sessions = []
    for session_num in range(3):
        print(f"  Session {session_num + 1}/3...")
        records = run_session(model_name, PROMPTS, system_prompt=SYSTEM_PROMPT)
        sessions.append(records)

    ol = compute_ol(sessions, context_window)
    pr = compute_pr(sessions, PRODUCTIVE_TOOLS)
    w  = compute_w(sessions, output_tool="json_output")

    results[model_name] = {
        "context_window": context_window,
        "ol":      round(ol, 1),
        "pr":      round(pr, 3),
        "w":       w,
        "verdict": score_verdict(ol, pr),
    }
    print(f"  OL={ol:.1f}  PR={pr:.1%}  W={w['avg_chars']:.0f}  -> {results[model_name]['verdict']}")

Path("scorecard.json").write_text(json.dumps(results, indent=2))
print("\nScorecard written to scorecard.json")

Run it:

cd lab11
python3 run_evaluation.py

Expected runtime: 20-40 minutes on CPU depending on hardware.

Part C: D8 Scorecard and Deployment Recommendation (45 min)

Write lab11/print_scorecard.py:

import json
from pathlib import Path

data = json.loads(Path("scorecard.json").read_text())

print("=" * 72)
print(f"{'Model':<25} {'Context':>10} {'OL':>8} {'PR':>8} {'W (chars)':>10} {'Verdict'}")
print("=" * 72)

for model, r in sorted(data.items(), key=lambda x: -x[1]["ol"]):
    print(
        f"{model:<25} {r['context_window']:>10,} "
        f"{r['ol']:>8.1f} {r['pr']:>7.1%} "
        f"{r['w']['avg_chars']:>10.0f} "
        f"  {r['verdict']}"
    )

print("=" * 72)
print(f"\nStyle clusters:")
for model, r in data.items():
    print(f"  {model}: {r['w']['style_cluster']}")

In lab-11-report.md:

Paste your scorecard table (OL / PR / W / Verdict for all 3 models).
Write a deployment recommendation for the following use case: An MSP running overnight shift coverage; the agent processes 30 service-health alerts per shift; sessions are expected to run 4-6 hours without handoff; communications go to a human on-call who prefers concise pager-style alerts. Which model do you select? Justify using all three D8 axes.
The D8 study's Tier 3 conclusion for deepseek-v3.2 was FAIL because it "hangs on multi-step P1 and P3." Your evaluation did not include a multi-step hang test. Does the absence of this test affect the validity of your scorecard? Explain.
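When weighing OL against the MSP use case, it can help to convert the shift workload into a rough turn budget. The sketch below does this under a loudly hypothetical assumption, `turns_per_alert`, which the assignment does not specify; choose and justify your own value. The OL figures passed in are placeholders, not measured results.

```python
def covers_shift(ol: float, alerts_per_shift: int, turns_per_alert: float) -> bool:
    """Return True if the projected OL meets or exceeds the estimated turns in one shift."""
    turns_needed = alerts_per_shift * turns_per_alert
    return ol >= turns_needed


# 30 alerts per shift comes from the use case; 2 turns per alert is an assumption.
for placeholder_ol in (45.0, 90.0):
    ok = covers_shift(placeholder_ol, alerts_per_shift=30, turns_per_alert=2)
    print(f"OL={placeholder_ol:.0f} covers shift: {ok}")
```

A check like this only addresses OL; the recommendation still has to account for PR and W separately.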

Part D: Unmeasured Dimension (45 min)

The D8 study explicitly lists five dimensions it did not measure:

Long-session context erosion (>7-35% of context consumed in early trial)
Cold-start behavior (all trials used warm inference windows)
Concurrent-load behavior (all trials were single-operator)
Tool-call error recovery (mock tools never returned errors)
Longer handoff chains (only single-hop handoffs tested)

Choose one of the five unmeasured dimensions. Design and implement a test for it.

Required deliverables:

lab11/extended_test.py — Implementation of your extended test
A section in lab-11-report.md (under Part D: Extended Dimension):
- Which dimension you chose and why it matters for the MSP deployment context from Part C
- Your test design: what you inject or vary, what you measure, how you score the result
- Your results on at least 2 of the 3 models from Part B

Starter template for error-recovery testing:

# lab11/extended_test.py — Error Recovery Extension
# Modify run_session() to inject tool errors and measure recovery behavior

import json, time, requests
from pathlib import Path

def run_error_recovery_session(model: str, prompts: list[str]) -> dict:
    """
    Inject a simulated tool error at turn 3 (index 2) and measure whether the
    model returns to the task in the turns that follow.

    Error injection: at the error turn, add a fake assistant message containing
    "[TOOL_ERROR: timeout on service-health check]", followed by a user retry request.
    Recovery metric: does the model re-attempt the task or stall?
    """
    url = "http://localhost:11434/api/chat"
    history = []
    ERROR_TURN = 2   # 0-indexed

    recovery_scores = []

    for i, prompt in enumerate(prompts):
        history.append({"role": "user", "content": prompt})

        if i == ERROR_TURN:
            # Inject the error as a fake assistant turn plus a user retry request
            history.append({
                "role":    "assistant",
                "content": "I'll look that up now. [TOOL_ERROR: timeout on service-health check]",
            })
            history.append({
                "role":    "user",
                "content": "The tool returned an error. Please retry.",
            })

        payload = {"model": model, "messages": history, "stream": False}
        resp = requests.post(url, json=payload, timeout=120)
        resp.raise_for_status()  # fail fast on HTTP errors, as in harness.py
        data = resp.json()
        content = data["message"]["content"]
        history.append({"role": "assistant", "content": content})

        # Score recovery: did the model re-attempt or give up?
        if i > ERROR_TURN:
            task_keywords = ["vulnerability", "ATLAS", "CVE", "technique", "attack"]
            recovered = any(k.lower() in content.lower() for k in task_keywords)
            recovery_scores.append(int(recovered))

        time.sleep(0.5)

    return {
        "model":           model,
        "recovery_scores": recovery_scores,
        "recovery_rate":   sum(recovery_scores) / len(recovery_scores) if recovery_scores else 0.0,
    }


if __name__ == "__main__":
    from eval_prompts import PROMPTS

    models_to_test = ["llama3.2:3b", "qwen2.5:7b"]
    for model in models_to_test:
        result = run_error_recovery_session(model, PROMPTS)
        print(f"{model}: recovery_rate={result['recovery_rate']:.0%} "
              f"({result['recovery_scores']})")
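If you choose cold-start behavior instead, the non-streaming chat response already carries timing fields you can use. This is a minimal sketch, assuming the response includes Ollama's `load_duration` and `total_duration` fields, which are reported in nanoseconds. The helper name `timing_summary` and the sample numbers are made up for illustration. To force a cold start between runs you would also need to unload the model, for example through the API's `keep_alive` parameter, which is not shown here.

```python
NS_PER_SECOND = 1_000_000_000


def timing_summary(response: dict) -> dict:
    """Convert nanosecond timing fields from a chat response into seconds."""
    load_s = response.get("load_duration", 0) / NS_PER_SECOND
    total_s = response.get("total_duration", 0) / NS_PER_SECOND
    return {
        "load_s": round(load_s, 3),
        "total_s": round(total_s, 3),
        # Share of wall time spent loading the model; 0.0 if total is missing.
        "load_share": round(load_s / total_s, 3) if total_s else 0.0,
    }


# Made-up response fragment, not a real measurement.
sample = {"load_duration": 2_500_000_000, "total_duration": 10_000_000_000}
print(timing_summary(sample))
```

Comparing `load_share` on the first request after an unload against later requests gives one possible cold-start score; how you threshold it is a design decision to document in your report.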

Grading

Component	Points
Part A: OL/PR/W operationalization correct; methodology questions answered	6
Part B: evaluation runs 3 sessions per model; scorecard.json present	7
Part C: scorecard table; deployment recommendation uses all 3 axes	7
Part D: extended dimension implemented and run; results reported	5