Issue №05 · Evaluation & Testing

How to set up prompt regression testing

Jonathan Lasley · 16 min read
Specimen plate listing seven steps as PICK, DEFINE, BUILD, WRITE, RUN, AUTOMATE, JUDGE under the title The Procedure.

Prompt regression testing means running a fixed set of test cases against a prompt before each change ships, scoring each result against a rubric, and only deploying when the suite passes. It catches silent breakage from prompt edits and from upstream model updates the same way unit tests catch code regressions: a documented expectation that a change cannot violate.


What prompt regression testing is (and what it isn't)

A regression in production code is when a change you shipped broke something that used to work. Prompt regression testing is the same idea applied to system prompts. You maintain a fixed set of inputs (your golden dataset) and expected criteria (your rubric), and before any prompt change ships you re-run the dataset and confirm nothing got worse.

The discipline gets confused with drift detection often enough to be worth separating up front. Regression testing is a pre-deploy gate on a fixed dataset. Drift detection is post-deploy monitoring of live behavior over time. Both are useful. They answer different questions ("did my edit break anything?" vs. "did the model start behaving differently in production?") and the rest of this guide is about the first one.

The seven steps below take you from "I have one important prompt and no testing discipline" to "I gate every change against a versioned suite." Steps 1 through 5 work with any prompt tool on any tier, including the free tier here. Step 6 is where automation enters the picture and where Prompt Assay's Team tier ($99 per seat per month) starts to earn its price. Step 7 is about fitting the discipline into your release process.

  1. Pick the prompt that matters most.
  2. Define what "good" looks like.
  3. Build the golden dataset.
  4. Write a scoring rubric.
  5. Run your first manual test.
  6. Move to eval suites when manual stops scaling.
  7. Add LLM-as-judge for subjective quality.

Step 1: Pick the prompt that matters most

Don't try to test every prompt in your codebase on day one. Pick the one whose failure would cost you the most: the customer-facing classifier, the conversion-funnel rewrite, the revenue-attribution router. If you have ten prompts and zero test discipline, the right move is one prompt with a real suite, not ten prompts with one test case each.

The shape of "matters most" varies. For a customer-support team it's usually the triage classifier that decides whether a ticket goes to billing, technical, or refunds. For a content team it's the draft rewriter that touches every piece before publish. For an internal-tools team it's whichever prompt sits in the synchronous request path of a user-facing feature.

Pick one. The discipline you build around it generalizes. The time you spend on it doesn't compete with feature work the way "test everything" does. Teams that build a hardened test suite around one prompt before a migration off Humanloop often credit that suite with making the migration tractable.

Step 2: Define what "good" looks like

Before you write a single test case, write down what a good output looks like. This is the part teams skip and then regret three weeks later when their tests pass and the production behavior keeps drifting anyway.

Pull out a markdown file and answer four questions about your chosen prompt:

  • What does the output need to be (a JSON object, a paragraph of prose, a numeric score, a tool call)?
  • What does it need to contain (specific fields, certain keywords, references to the input)?
  • What must it avoid (claims the product doesn't support, refusals on legitimate requests, outputs that change the JSON shape)?
  • What does the edge case look like (the angry refund request, the ambiguous classification, the user input that's mostly emoji)?

If you've used a structured critique tool, you have a starting vocabulary. Prompt Assay's six-dimension critique uses Clarity, Completeness, Structure, Technique Usage, Robustness, and Efficiency, and any of those can become a rubric criterion. Other useful starting vocabularies live in the 60-technique field guide under the families that match your prompt's job (classification, rewriting, extraction, agentic). The point isn't the framework you pick. The point is that "good" is concrete and written down before the test cases are.
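The four answers above can be captured as plain data on day one. Here's a minimal sketch for a support-triage prompt; the dict shape and field names are illustrative, not any tool's schema:

```python
# Hypothetical "what good looks like" spec for a support-triage prompt.
# Field names are illustrative, not a Prompt Assay schema.
good_output_spec = {
    "shape": "JSON object with 'category', 'priority', and 'reply' fields",
    "must_contain": ["category drawn from: billing, technical, refunds"],
    "must_avoid": [
        "promises of features the product doesn't support",
        "refusals on legitimate requests",
    ],
    "edge_cases": [
        "angry refund request",
        "ambiguous classification",
        "input that is mostly emoji",
    ],
}

# Each answer becomes one or more rubric criteria in Step 4.
for question, answer in good_output_spec.items():
    print(f"{question}: {answer}")
```

The structure matters more than the format: each entry here becomes a rubric criterion later, so writing it down as data (rather than prose) saves a translation step.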

Step 3: Build the golden dataset

Now write the test cases. Aim for 8 to 20 to start. You can always add more, and you'll add them anyway as you discover edge cases in production.

Each case is two things: an input (the user turn for a chat-style prompt, or the variable bindings for a template prompt) and a set of expected criteria (the rubric items from Step 2, attached to this specific case). Cover three families of cases:

  1. The happy path: typical inputs the prompt was designed for. Three or four cases.
  2. The edge cases: unusual inputs that have specific desired behavior. Five to ten cases. These are where regressions hide.
  3. The adversarial cases: inputs designed to make the prompt fail (jailbreak attempts, prompt-injection probes, intentionally ambiguous queries). Two or three cases minimum.

Evidently AI's LLM testing guide makes the right point about scale: a dozen test cases is dramatically better than zero, and you should not let "I need 500 cases first" stop you from shipping the first dozen. As you grow toward production scale, Maxim AI's golden-dataset guide derives roughly 246 samples per scenario from a 95%-confidence formula for production-critical evaluation. Treat both numbers as bounds, not requirements. Start at twelve.

Store the cases somewhere you can re-run them. A markdown file with a JSON code block per case works on day one. A spreadsheet works if your team prefers that. By Step 6 you'll move them into a real suite. Until then, any storage that survives a git pull is fine.
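For the day-one storage, a JSON file checked into the repo is enough. A minimal sketch, with two hypothetical cases following the input-plus-criteria shape described above (case names and file name are illustrative):

```python
import json

# Hypothetical golden dataset: one dict per case, matching the
# input + expected_criteria shape from Step 3. Names are illustrative.
GOLDEN_CASES = [
    {
        "name": "happy path: simple billing question",
        "input": "Why was I charged $12 this month?",
        "expected_criteria": [
            {"description": "Classifies as 'billing'", "weight": 2},
            {"description": "Returns valid JSON matching the schema", "weight": 3},
        ],
    },
    {
        "name": "edge case: mostly-emoji input",
        "input": "\U0001F621\U0001F621 \U0001F4B3 ???",
        "expected_criteria": [
            {"description": "Asks a clarifying question instead of guessing", "weight": 2},
        ],
    },
]

# Day-one storage: a JSON file that survives a git pull.
with open("golden_dataset.json", "w") as f:
    json.dump(GOLDEN_CASES, f, indent=2)

# Re-loading is symmetric, so any runner can consume the same file.
with open("golden_dataset.json") as f:
    cases = json.load(f)
print(len(cases), "cases loaded")
```

When you move to a real suite in Step 6, a file in this shape maps cleanly onto a bulk import.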

Step 4: Write a scoring rubric

With cases in hand, you need a way to decide whether each result passes or fails. The most useful pattern for prompt outputs is a weighted criterion rubric scored on a 0-to-1 scale, with a pass threshold you set per suite.

Here's what one criterion looks like, written badly and then well:

  • Bad: { "description": "Good answer", "weight": 1 }
  • Good: { "description": "Identifies the root cause of the user's issue in the first paragraph", "weight": 2 }

The bad version makes the judge invent its own interpretation, which means the same output scores differently across runs. The good version is a yes/no check a human reader could perform in five seconds. That's the bar.

Write four to eight criteria per case. Mix positive checks ("returns valid JSON matching the schema") with negative checks ("does NOT promise behavior the product doesn't support"). The negatives catch the regression class that pure-positive rubrics miss: a model that starts hallucinating new capabilities scores fine on every "must include" criterion and still ships you a customer-trust problem.

Compute the overall score with a weighted average:

overall_score = sum(criterion.score × criterion.weight) / sum(criterion.weight)

A case passes when overall_score >= threshold. A pass threshold of 0.7 is a reasonable default for most teams. Crank it to 0.85 for customer-facing prompts where a near-miss is still a problem. Relax it to 0.5 for exploratory work where you want anything-not-obviously-broken to pass while you iterate. Move it to 0.95 for safety-critical prompts where any criterion missing is a blocker; almost nothing will pass, and that's the intended behavior.
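The weighted-average formula and the threshold check fit in a few lines. A minimal sketch, using a plain list of score/weight dicts (the dict keys are illustrative):

```python
def overall_score(criteria):
    """Weighted average of per-criterion scores, each on a 0-to-1 scale.

    `criteria` is a list of dicts with 'score' and 'weight' keys
    (keys are illustrative, not any tool's schema).
    """
    total_weight = sum(c["weight"] for c in criteria)
    return sum(c["score"] * c["weight"] for c in criteria) / total_weight

def passes(criteria, threshold=0.7):
    """Pass/fail against the suite's threshold (0.7 is the default above)."""
    return overall_score(criteria) >= threshold

# Example: weights 2, 2, 1, 3, with one criterion scoring 0.6.
example = [
    {"score": 1.0, "weight": 2},
    {"score": 1.0, "weight": 2},
    {"score": 0.6, "weight": 1},
    {"score": 1.0, "weight": 3},
]
print(round(overall_score(example), 2))  # 0.95
```

The same four-line core works whether a human assigns the per-criterion scores (Step 5) or a judge model does (Step 7); only the source of `score` changes.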

Step 5: Run your first manual test

Now run the prompt against each case and score the result. Manually, the first time. The point of running it manually is that you'll catch ambiguities in your rubric you didn't notice when you wrote it.

For each case:

  1. Send the input to the prompt against the model you ship to production. Not a cheaper model, not a different provider; you're testing the actual production behavior.
  2. Score each criterion 0 to 1 by reading the output and asking the rubric question.
  3. Compute the overall score with the formula above.
  4. Mark pass or fail against your threshold.
  5. If a case scores ambiguously, fix the rubric, not the score. Vague criteria are the failure mode to fix first.

Here's a worked test case for a support-triage classifier, the kind of thing you might already have running in production:

{
  "name": "Angry refund request → billing / high",
  "input": "I want my money back NOW. You charged me twice last month and nobody is answering.",
  "expected_criteria": [
    { "description": "Classifies as 'billing' category", "weight": 2 },
    { "description": "Priority is 'high' or 'urgent'", "weight": 2 },
    { "description": "Acknowledges the customer's frustration", "weight": 1 },
    { "description": "Returns valid JSON matching the schema", "weight": 3 }
  ],
  "pass_keywords": ["billing"],
  "fail_keywords": ["I don't know", "cannot help"]
}

A run against this case fires the prompt with the input as the user turn, captures the JSON output, checks that "billing" appears and that the failure phrases don't, then reads each rubric criterion and assigns a 0-to-1 score. With weights of 2, 2, 1, and 3, a result that scores 1.0 on each except 0.6 on "acknowledges frustration" produces (2 + 2 + 0.6 + 3) / 8 = 0.95, which clears a 0.7 threshold cleanly.
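The keyword checks on the case above are mechanical enough to script even while the rubric scoring stays manual. A minimal sketch; the `output` string is a stand-in for a captured model response, not a live call:

```python
def keyword_gate(output, pass_keywords, fail_keywords):
    """Cheap pre-check: every pass keyword present, no fail keyword present."""
    text = output.lower()
    if any(kw.lower() in text for kw in fail_keywords):
        return False
    return all(kw.lower() in text for kw in pass_keywords)

# Stand-in for the captured JSON output on the angry-refund case.
output = '{"category": "billing", "priority": "high", "reply": "I understand the frustration..."}'

ok = keyword_gate(
    output,
    pass_keywords=["billing"],
    fail_keywords=["I don't know", "cannot help"],
)
print("pre-filter:", "pass" if ok else "fail")  # pre-filter: pass
```

A case that fails the keyword gate doesn't need rubric reading at all, which is what makes the manual cadence sustainable longer than you'd expect.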

Re-run on every prompt change. Re-run again whenever your provider ships a new model version, even on a point release; silent behavior shifts on minor model updates are a documented production failure mode and the most common surprise once a suite is in place. Manual cadence is fine. The point of regression testing is the gate, not the speed of the gate. If you only push prompt changes once a week, a fifteen-minute manual run is a complete regression workflow.

Step 6: When manual stops scaling, eval suites (Team tier · $99 per seat per month)

Manual scoring works until it doesn't. The breakpoint is somewhere around three or four prompts with twenty cases each: roughly 240 rubric checks per release, which is when you start either skipping the test or skipping the deploy. That's where automated eval suites earn their keep.

In Prompt Assay this is the Team tier at $99 per seat per month. We name the price directly because the alternative is the worse experience of finding the paywall at the moment you're trying to use the feature. New eval-suite creation requires Team. Viewing, editing, and running existing suites stays open on every tier; if you've already built suites under a Team subscription and downgraded, your suites still work.

What you get when you move from a markdown notebook to a stored suite:

  • Versioned cases: input, expected criteria, pass and fail keywords, and weights, all stored alongside the prompt and the prompt's version history.
  • Dual-gate scoring: a fast keyword pre-filter that catches obviously-wrong outputs without an LLM call, followed by a judge call against the rubric for the actual reasoning gate. The pre-filter is free; the judge call uses your BYOK key.
  • One-click rerun: change the prompt, hit Run All, see which cases regressed in seconds rather than minutes.
  • Bulk add: paste a TSV or CSV where each row becomes one test case, useful when you're importing a regression set from a spreadsheet you've been keeping.
  • Shared across the workspace: suites are owned by the org, not the engineer. Every team member with the right role can read, run, and edit the suite under the same role model as the rest of the prompt library; you're not maintaining a per-engineer registry by hand.
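The dual-gate flow above can be sketched in a few lines: the free keyword pre-filter runs first, and the judge call is only spent when it passes. `judge_score` here is a hypothetical callable standing in for the BYOK judge call, and the dict keys follow the case shape from Step 5; none of this is Prompt Assay's actual internals:

```python
def keyword_prefilter(output: str, case: dict) -> bool:
    """Gate 1: free string checks; no LLM call."""
    text = output.lower()
    if any(kw.lower() in text for kw in case.get("fail_keywords", [])):
        return False
    return all(kw.lower() in text for kw in case.get("pass_keywords", []))

def run_case(output: str, case: dict, judge_score, threshold: float = 0.7) -> bool:
    """Gate 2 fires only when gate 1 passes. `judge_score(output, criterion)`
    is a hypothetical callable returning a 0-to-1 score per criterion."""
    if not keyword_prefilter(output, case):
        return False  # no judge tokens spent on obviously-wrong outputs
    criteria = case["expected_criteria"]
    scores = [judge_score(output, c["description"]) for c in criteria]
    weights = [c["weight"] for c in criteria]
    overall = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    return overall >= threshold

# Demo with a stub judge that scores every criterion 1.0.
case = {
    "pass_keywords": ["billing"],
    "fail_keywords": ["cannot help"],
    "expected_criteria": [{"description": "Classifies as 'billing'", "weight": 2}],
}
print(run_case('{"category": "billing"}', case, lambda out, crit: 1.0))  # True
```

The design choice worth copying even if you build your own runner: the cheap deterministic gate in front of the expensive probabilistic one.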

Other tools that automate the same workflow include Promptfoo, DeepEval, LangSmith, and Anthropic's Console Evaluation. Step 7 covers the comparison. The point isn't that Prompt Assay is the only way; it's that some piece of automation is what you want at this scale, and "manual notebook" stops being honest at twelve cases per prompt across three prompts.

Step 7: Add LLM-as-judge for subjective quality (and pick the right tool)

Most of what we've described scores well with a keyword check or a binary "did the JSON validate" criterion. But some criteria are genuinely subjective: "tone matches our brand voice," "explanation is appropriate for a non-technical reader," "doesn't sound like AI-generated copy." Those are the criteria where you need a judge.

LLM-as-judge means handing the output and the rubric to a second model and asking it to score. It works, with caveats:

  • Position bias: judges tend to prefer outputs in the position they're shown first. Mitigate by randomizing case order in the suite, or by running the judge twice with the order swapped and averaging.
  • Self-preference bias: a judge of model X tends to prefer outputs from model X. If the prompt under test runs on Claude, run the judge on a different model family (GPT-4.1 or Gemini) for a less correlated read.
  • Rubric variance: judge scores vary across runs on subjective criteria, even with stable models. The technique is well-calibrated overall; Zheng et al.'s MT-Bench paper showed strong LLM judges agree with human evaluators over 80% of the time on subjective benchmarks. Trust trends across many cases, not single-run scores.
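The order-swap mitigation for position bias can be sketched as follows. `judge_compare(first, second)` is a hypothetical callable returning a 0-to-1 score for how much the judge prefers whichever output it sees first; running both orders and averaging cancels out a preference that comes from position alone:

```python
def debiased_preference(output_a, output_b, judge_compare):
    """Score for output_a after averaging both presentation orders.

    `judge_compare(first, second)` is a hypothetical judge callable
    returning a 0-to-1 preference for `first`. A judge that prefers
    whatever it sees first gets that bias cancelled by the swap.
    """
    score_ab = judge_compare(output_a, output_b)        # a shown first
    score_ba = 1.0 - judge_compare(output_b, output_a)  # b shown first, inverted
    return (score_ab + score_ba) / 2

# Demo: a stub judge with pure position bias (always 0.6 for position one)
# averages out to exactly 0.5, i.e. no real preference.
stub = lambda first, second: 0.6
print(debiased_preference("draft A", "draft B", stub))  # 0.5
```

The same trick doubles the judge-token cost per comparison, which is why randomizing case order across the suite is the cheaper default and the swap is reserved for close calls.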

In Prompt Assay the judge runs on whichever Workbench Model you've selected, at every tier (Free, Solo, Team, Enterprise), so judge results stay reproducible against the model you ship to production. You're billed by your BYOK provider directly. We don't mark up tokens; that's the whole BYOK economics argument.

A few words on the tools landscape, because the brief promised a step on it. Five reasonable options:

  • Prompt Assay is the workbench-integrated option: eval suites live next to the prompt versions, the AI pair, and the BYOK billing. Team tier and up. We do not ship a native GitHub Action today; the public REST API and TypeScript SDK let you trigger eval-suite runs from any pipeline that can hit a REST endpoint, but if a first-class CI integration is your top requirement, Promptfoo and DeepEval ship that path and we do not.
  • Promptfoo (now part of OpenAI as of March 2026) is the open-source CLI: declarative YAML config, strong red-team suite, native CI integration with GitHub Actions and others, MIT-licensed, around 20.5k GitHub stars. Right pick if your team lives in the repo.
  • DeepEval is the pytest-native option: a wide variety of LLM eval metrics, runs inside your existing test runner, integrates with any CI/CD environment. Right pick if your team thinks of evals as unit tests.
  • LangSmith is the offline-eval hub from the LangChain ecosystem. Right pick if you're already on LangChain and want eval results joined with your traces.
  • Anthropic Console Evaluation is the lightweight built-in: paste a CSV, get side-by-side comparisons with manual 5-point grading. Right pick for a fast read on a single prompt without committing to a tool.

Honest take: pick whichever fits your team's existing surface area. If you ship prompts from a workbench UI, the workbench-integrated option (us) means less context-switching between tools. If you ship from a repo with strong CI discipline, Promptfoo or DeepEval is the more natural fit. The discipline matters more than the tool. Most teams who fail at regression testing fail because they never built the rubric, not because they picked the wrong vendor.

Where to go from here

The discipline is what matters. Pick a prompt, write a rubric, run a dozen cases by hand, and re-run before your next prompt change. That's a complete regression workflow you can ship this week.

When the manual cadence stops scaling, open the editor and try the eval-suite UI on the Team tier, or read the eval-suite docs for the technical detail of how the dual-gate scoring and the judge call assemble.


Ship your next prompt in the workbench.

Prompt Assay is the workbench for shipping production LLM prompts. Version every change. Critique, improve, and compare across GPT, Claude, and Gemini. Bring your own keys. No demo call. No card. No sales gate.

Open the editor · Read the docs

Issue №05 · Published APRIL 24, 2026 · Prompt Assay