Does Prompt Assay support regression testing on prompt changes?

Yes. Regression testing reuses your existing eval suite to score version N vs N+1 of a prompt side-by-side, with manual or auto-on-version triggers.

Updated 2026-04-28 · By Jon Lasley

Yes. Prompt Assay ships built-in regression testing on the prompt detail page. It reuses your existing eval suite to score two versions of the same prompt side-by-side, then reports a per-case verdict (improved, regressed, unchanged, new, or failed) plus an aggregate summary.

How it works

  1. Create an eval suite with test cases that have an evaluation_criteria rubric (the same suites that power judge-based scoring).
  2. Open the prompt's Regression tab. Pick a suite, optionally pick a baseline version other than the immediate predecessor, and click Run regression.
  3. Each case is generated and judged at both versions. A case's verdict is improved if its score rose by more than 0.5, regressed if it fell by more than 0.5, and unchanged otherwise.
  4. Optional: turn on Auto-run on every new version for that prompt. Every accepted AI rewrite, improve suggestion, or lint fix automatically fires a regression sweep against the prior version.
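
The verdict rule in step 3 can be sketched in a few lines of Python. This is an illustration only, not Prompt Assay's implementation: the function names are hypothetical, and the new and failed verdicts are left out because this page does not define when they apply.

```python
from collections import Counter
from typing import Dict, List, Tuple

THRESHOLD = 0.5  # from step 3: the score must move by MORE than 0.5


def classify(baseline_score: float, current_score: float,
             threshold: float = THRESHOLD) -> str:
    """Return the per-case verdict for one test case (hypothetical helper)."""
    delta = current_score - baseline_score
    if delta > threshold:
        return "improved"
    if delta < -threshold:
        return "regressed"
    return "unchanged"  # includes a change of exactly 0.5


def summarize(score_pairs: List[Tuple[float, float]]) -> Dict[str, int]:
    """Aggregate per-case verdicts into counts (hypothetical helper)."""
    return dict(Counter(classify(b, c) for b, c in score_pairs))


# Example: (baseline score, current score) for three cases.
summary = summarize([(3.0, 4.0), (4.0, 3.0), (3.0, 3.5)])
```

Because the comparison is strict, a case whose score moves by exactly 0.5 stays unchanged in this sketch.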

What's different from competitors

Promptfoo and Braintrust support regression but require CI wiring, a separate repo, or a YAML config. LangSmith gates regression behind per-trace billing that scales linearly with traffic. Prompt Assay runs the whole loop inside the workbench, scored by the same judge that powers your eval suite, with no CI step and no per-trace charge. You pay your provider directly via BYOK; we don't mark up inference.

Tier requirement

As of 2026-05-07, evaluation suites and auto-regression are open on every tier including Free. Each judge call counts against the Free tier's 250-call monthly cap, so a 20-case regression sweep consumes up to 40 calls (one per side per case). The Regression page only surfaces a 'disabled by admin' card when a platform-admin custom override has explicitly disabled eval_suites_enabled for the workspace.
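
To make the Free-tier arithmetic concrete, here is a small Python sketch. The constant and function names are hypothetical; the numbers (a 250-call monthly cap and one judge call per side per case) come from the paragraph above, and the calculation assumes the worst case where no scores are cached.

```python
FREE_TIER_MONTHLY_JUDGE_CALLS = 250  # cap stated above
JUDGE_CALLS_PER_CASE = 2             # one per side (baseline and current)


def judge_calls_for_sweep(cases: int) -> int:
    """Worst-case judge calls for one regression sweep (nothing cached)."""
    return cases * JUDGE_CALLS_PER_CASE


def full_sweeps_per_month(cases: int,
                          cap: int = FREE_TIER_MONTHLY_JUDGE_CALLS) -> int:
    """How many complete worst-case sweeps fit under the monthly cap."""
    return cap // judge_calls_for_sweep(cases)


calls = judge_calls_for_sweep(20)    # 40 judge calls
sweeps = full_sweeps_per_month(20)   # 6 full sweeps, 10 calls left over
```

Under these assumptions, a Free workspace can run six full 20-case sweeps in a month before hitting the cap; cached scores stretch that further.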

Limits

  • 20 cases per sweep (combined test cases + starred Playground runs).
  • Each case costs 4 LLM calls in the worst case (generate + judge at the baseline version, then generate + judge at the current version). A cached baseline drops this to 2.
  • Scores are cached per test case + prompt version pair so subsequent regressions against the same baseline reuse cached scores at no cost. Cache writes happen on every Run-now and Run-all in Evaluate too, so regression often finds baselines already cached the first time you trigger it.
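
The caching behavior described in the list above can be pictured with a minimal Python sketch. Everything here is an assumption for illustration: the cache is a plain dictionary keyed by (test case id, prompt version), and fake_scorer stands in for the real generate-then-judge step. It is not how Prompt Assay stores scores internally.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, int]             # (test case id, prompt version)
Scorer = Callable[[str, int], float]   # hypothetical: generate output, then judge it

CALLS_PER_UNCACHED_SCORE = 2  # one generate call + one judge call


def cached_score(cache: Dict[CacheKey, float], case_id: str, version: int,
                 scorer: Scorer) -> Tuple[float, int]:
    """Return (score, LLM calls spent); a cache hit costs no calls."""
    key = (case_id, version)
    if key in cache:
        return cache[key], 0
    cache[key] = scorer(case_id, version)
    return cache[key], CALLS_PER_UNCACHED_SCORE


def regress_case(cache: Dict[CacheKey, float], case_id: str,
                 baseline: int, current: int, scorer: Scorer) -> int:
    """Score one case at both versions and return the LLM calls spent."""
    _, baseline_calls = cached_score(cache, case_id, baseline, scorer)
    _, current_calls = cached_score(cache, case_id, current, scorer)
    return baseline_calls + current_calls


def fake_scorer(case_id: str, version: int) -> float:
    """Stand-in for the real generate + judge round trip."""
    return float(version)


cache: Dict[CacheKey, float] = {}
cold = regress_case(cache, "case-1", 1, 2, fake_scorer)  # nothing cached: 4 calls
warm = regress_case(cache, "case-1", 2, 3, fake_scorer)  # v2 already cached: 2 calls
```

In the second call, version 2 is the baseline and its score was written during the first call, so only the new version costs a generate and a judge call.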

Roadmap follow-ups

Larger suites need a Batches-API path (async, with the 50% provider-discount window). Email or Slack notifications on detected regressions are not currently shipped.