Classroom Glossary Public page

Lab 2: DVLA Baseline — Establishing an LLM Evaluation Regression Suite

395 words

Module: 2 — Evaluation Methodology (D8 Framework)
Points: 20
Time estimate: 2 hr lab + 2 hr independent
Deliverable: lab-2-report.md + dvla-baseline/ directory with promptfoo config and results


Objectives

  1. Install and configure Promptfoo against a local Ollama endpoint.
  2. Design a 3-pass evaluation suite modeled on the D8 methodology.
  3. Capture a baseline PASS/PARTIAL/FAIL scorecard for one model.
  4. Understand what a drift-detecting regression suite looks like.

Setup

Verify these are working before starting:

# Ollama running
ollama list          # should show at least one model

# Promptfoo installed
npx promptfoo@latest --version

# Target model available
ollama pull llama3.2:3b   # or use whatever model you have from SETUP.md

If Ollama is unavailable, use the OpenAI-compatible endpoint in SETUP.md. The promptfoo config supports both.


Part A: Build the Evaluation Config (45 min)

Create a directory dvla-baseline/ and inside it a promptfooconfig.yaml. The config must evaluate one model across three passes:

Pass 1: Operational Lifetime (OL)
Does the model stay coherent over a long session? Does it forget context?

# promptfooconfig.yaml skeleton -- fill in the blanks

providers:
  - id: ollama:llama3.2:3b
    config:
      temperature: 0.0

prompts:
  - id: pass1_ol
    raw: |
      You are a security analysis assistant. Remember the following case number: CASE-7741.
      Acknowledge the case number and confirm you will track it throughout this session.
  - id: pass1_ol_recall
    raw: |
      What case number did I give you at the start of this session?
  # Add 3 more OL test prompts of your own design

tests:
  - description: "OL1 - initial case number acknowledgment"
    vars:
      prompt_id: pass1_ol
    assert:
      - type: contains
        value: "CASE-7741"
  - description: "OL1 - recall after context"
    vars:
      prompt_id: pass1_ol_recall
    assert:
      - type: contains
        value: "CASE-7741"
  # Add your 3 OL tests here

Pass 2: Productive Ratio (PR)
What fraction of the model's output is actionable vs. hedging/filler?

Design 5 prompts that ask for specific technical outputs (command syntax, code snippets, decision trees). The evaluation criterion: the response contains at least one concrete, actionable artifact (a command, a code block, a specific number).

  # Pass 2 example
  - description: "PR1 - command output"
    assert:
      - type: contains-any
        value: ["```", "sudo", "curl", "$"]   # evidence of concrete output

Pass 3: Wordiness (W)
Does the model pad responses? The D8 methodology penalizes models that require heavy post-processing.

Design 3 prompts with a word-count constraint in the prompt itself ("Answer in 50 words or fewer."). Evaluate whether the model respects the constraint.

  - description: "W1 - respects 50-word constraint"
    assert:
      - type: javascript
        value: "output.split(' ').length <= 60"  # 20% slack

Your final config must have:

  • At least 5 Pass 1 tests
  • At least 5 Pass 2 tests
  • At least 3 Pass 3 tests
  • Total: ≥13 tests

Part B: Run and Score (45 min)

Run your evaluation suite:

cd dvla-baseline/
npx promptfoo@latest eval --config promptfooconfig.yaml --output results.json
npx promptfoo@latest view   # opens the results dashboard

Capture a screenshot of the results dashboard or export the JSON. Then complete the scorecard:

## Baseline Scorecard

Model: llama3.2:3b (or your chosen model)
Date: YYYY-MM-DD
Total tests: N

### Pass 1: Operational Lifetime
- Tests run: N
- PASS: N  PARTIAL: N  FAIL: N
- OL score: N/5
- Notable failure: [describe any test that failed and why]

### Pass 2: Productive Ratio
- Tests run: N
- PASS: N  PARTIAL: N  FAIL: N
- PR score: N/5
- Notable failure: [describe]

### Pass 3: Wordiness
- Tests run: N
- PASS: N  PARTIAL: N  FAIL: N
- W score: N/3

### Overall verdict: PASS | PARTIAL | FAIL
(PASS = all three dimensions ≥70%; PARTIAL = one dimension <70%; FAIL = two or more <70%)

Part C: Regression Design (30 min)

The D8 study ran evaluations across 47 sessions to detect drift. Your Part B captured one snapshot. Now design the regression protocol.

Answer the following in your lab report:

  1. Trigger conditions. In a production system, what events should trigger a re-run of this evaluation suite? List at least 4 (model version update, prompt change, etc.).

  2. Drift threshold. If your baseline OL score is 4/5, at what score would you flag a regression? What would you do about it?

  3. Test suite maintenance. The D8 study found some evaluation criteria decayed over time (a model improved, making the test trivially easy). How would you keep the test suite meaningful as models improve?

  4. One addition. If you were extending this suite to cover a new risk axis not in OL/PR/W, what would you add and how would you measure it?


Lab Report

Create lab-2-report.md with:

  1. The Baseline Scorecard (from Part B)
  2. Answers to the 4 regression design questions (from Part C)
  3. One paragraph: what the D8 methodology reveals about model selection that benchmark leaderboards do not

Submit dvla-baseline/promptfooconfig.yaml and dvla-baseline/results.json alongside the report.


Grading

Component Points
promptfooconfig.yaml: ≥13 tests covering all 3 passes, valid syntax 6
Baseline scorecard: all fields filled, verdict justified 5
results.json attached (or equivalent evidence of a real run) 3
Regression design: all 4 questions answered with specificity 4
D8 vs. leaderboard paragraph: identifies at least one concrete gap 2
Total 20