Does Prompt Assay support regression testing on prompt changes?
Yes. Prompt Assay ships built-in regression testing on the prompt detail page. It reuses your existing eval suite to score version N and version N+1 of a prompt side-by-side, triggered manually or automatically on every new version, and reports per-case verdicts (improved, regressed, unchanged, new, failed) plus an aggregate summary.
How it works
- Create an eval suite with test cases that have an `evaluation_criteria` rubric (the same suites that power judge-based scoring).
- Open the prompt's Regression tab. Pick a suite, optionally pick a baseline version other than the immediate predecessor, and click Run regression.
- Each case is generated and judged at both versions. The per-case verdict is improved if the score rose by more than 0.5, regressed if it fell by more than 0.5, and unchanged otherwise (see the sketch after this list).
- Optional: turn on Auto-run on every new version for that prompt. Every accepted AI rewrite, improve suggestion, or lint fix automatically fires a regression sweep against the prior version.
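A minimal sketch of that verdict rule, with the new and failed verdicts from the summary above folded in. The names (`Verdict`, `scoreVerdict`) and the null conventions are illustrative assumptions, not Prompt Assay's API:

```typescript
// Illustrative sketch of the per-case verdict rule: a judge-score delta of
// more than 0.5 decides improved/regressed; anything inside that band is
// unchanged. How "new" and "failed" are detected is an assumption here.
type Verdict = "improved" | "regressed" | "unchanged" | "new" | "failed";

function scoreVerdict(baseline: number | null, current: number | null): Verdict {
  if (baseline === null) return "new";    // case has no baseline score yet
  if (current === null) return "failed";  // generation or judging errored
  const delta = current - baseline;
  if (delta > 0.5) return "improved";
  if (delta < -0.5) return "regressed";
  return "unchanged";
}
```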
What's different from competitors
Promptfoo and Braintrust support regression but require CI wiring, a separate repo, or a YAML config. LangSmith gates regression behind per-trace billing that scales linearly with traffic. Prompt Assay runs the whole loop inside the workbench, scored by the same judge that powers your eval suite, with no CI step and no per-trace charge. You pay your provider directly via BYOK; we don't mark up inference.
Tier requirement
Creating evaluation suites and enabling auto-regression for the first time on a prompt require the Team plan or higher. Once you have a suite, manually running regression and already-enabled auto-regression triggers are not tier-gated, matching the eval-suite run pattern: a Team workspace that downgrades to Solo keeps using its grandfathered suites and any prompts it had auto-regression set on. The Regression page surfaces an upgrade card only when you try to enable auto-regression on a workspace whose tier doesn't include eval suites.
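A hedged sketch of how this gating could be expressed. Only the policy mirrors the prose above; the field and function names are hypothetical, not Prompt Assay's code:

```typescript
// Illustrative gating predicates for the tier rules described above.
interface Workspace {
  tierIncludesEvalSuites: boolean;     // true on the Team plan or higher
  hasGrandfatheredSuite: boolean;      // suite created while the tier allowed it
  autoRegressionEnabled: Set<string>;  // prompt IDs with auto-regression already on
}

function canRunManualRegression(ws: Workspace): boolean {
  // Manual runs are never tier-gated once a suite exists.
  return ws.hasGrandfatheredSuite || ws.tierIncludesEvalSuites;
}

function canEnableAutoRegression(ws: Workspace, promptId: string): boolean {
  // Existing triggers keep firing after a downgrade; enabling a new one
  // requires a tier that includes eval suites (otherwise: upgrade card).
  return ws.autoRegressionEnabled.has(promptId) || ws.tierIncludesEvalSuites;
}
```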
Limits
- 20 cases per sweep (combined test cases + starred Playground runs).
- Each case costs up to four LLM calls in the worst case (baseline generate + judge + current generate + judge); a cached baseline drops this to two.
- Scores are cached per test case + prompt version pair, so subsequent regressions against the same baseline reuse cached scores at no cost. Run-now and Run-all in Evaluate write to the same cache, so regression often finds baselines already cached the first time you trigger it (see the sketch after this list).
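A minimal sketch of the caching and per-case call accounting described above. The key shape and names (`scoreCache`, `cacheKey`, `callsForCase`) are assumptions for illustration, not Prompt Assay internals:

```typescript
// Illustrative: scores keyed by (test case, prompt version), so a version
// already scored — by a prior regression or by Run-now/Run-all in Evaluate —
// is never re-generated or re-judged.
const scoreCache = new Map<string, number>();

const cacheKey = (testCaseId: string, promptVersion: number): string =>
  `${testCaseId}@v${promptVersion}`;

// LLM calls needed for one case in a sweep comparing baseline vs current:
// 2 (generate + judge) per uncached version, 0 per cached one.
function callsForCase(testCaseId: string, baselineV: number, currentV: number): number {
  return [baselineV, currentV].reduce(
    (calls, v) => calls + (scoreCache.has(cacheKey(testCaseId, v)) ? 0 : 2),
    0,
  ); // worst case 4, cached baseline 2, fully cached 0
}
```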