WEEK OF APR 27 · Feature

Demo mode, cost drill-down, the GPT-5 lineage, and a much sharper AI pair

A big week of ships: free demo runs for new accounts, per-run cost receipts, GPT-5 and Gemini 2.5 Flash-Lite, shareable Critique and Compare, plus cross-model testing in the Playground.

Five things you'll notice as a new user, plus model-lineage, AI-pair-polish, and shareability work that lets a paying user trust the bill and show their work.

Demo mode for new accounts

A new account without a connected provider key now gets a seven-call lifetime budget on Critique, Improve, and Rewrite, powered by a Prompt Assay-funded Anthropic key on Claude Sonnet 4.6. Author your first prompt and watch the AI pair work on it before connecting anything of your own.

The budget is enforced atomically at the database layer. The moment you connect a real BYOK key, the budget is permanently disabled, and deleting the BYOK key afterwards does not revive it. The BYOK posture stays honest: your provider's bill, your provider's account, no platform middleman.

Cost drill-down per prompt

Every AI run, on every panel, now shows what the call cost when it finishes. The five panels (Brainstorm, Critique, Improve, Rewrite, Compare) each carry a Cost receipt with input tokens, output tokens, cached-input tokens, and total in dollars. Totals are computed to cent resolution against the published per-1M-token rates for the model that actually ran.
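The arithmetic behind a receipt can be sketched as follows. The model name and rates below are made up for illustration, and the sketch assumes `input_tokens` excludes cached tokens; providers differ on that convention, so treat it as an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical per-1M-token rates in USD. Real rates come from the provider's price list.
RATES = {
    "example-model": {
        "input": Decimal("3.00"),
        "cached_input": Decimal("0.30"),
        "output": Decimal("15.00"),
    }
}


def cost_receipt(model, input_tokens, cached_input_tokens, output_tokens):
    """Build a receipt with a total rounded to the nearest cent."""
    rate = RATES[model]
    total = (
        Decimal(input_tokens) * rate["input"]
        + Decimal(cached_input_tokens) * rate["cached_input"]
        + Decimal(output_tokens) * rate["output"]
    ) / Decimal(1_000_000)
    return {
        "model": model,
        "input_tokens": input_tokens,
        "cached_input_tokens": cached_input_tokens,
        "output_tokens": output_tokens,
        "total_usd": total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP),
    }
```

Using `Decimal` rather than floats keeps the rounding to cents predictable.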

The usage dashboard drills down by prompt and by version, so you can answer questions like "which prompts are eating the budget this month?" and "did our last revision get cheaper or more expensive?"
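Conceptually, the drill-down is a pair of group-by totals over run records. A rough sketch, assuming each run is a hypothetical `(prompt_id, version, cost_usd)` tuple:

```python
from collections import defaultdict
from decimal import Decimal


def drill_down(runs):
    """Total cost per prompt and per (prompt, version) pair."""
    by_prompt = defaultdict(Decimal)   # Decimal() starts each total at 0
    by_version = defaultdict(Decimal)
    for prompt_id, version, cost_usd in runs:
        by_prompt[prompt_id] += cost_usd
        by_version[(prompt_id, version)] += cost_usd
    return dict(by_prompt), dict(by_version)
```

Comparing two entries of the per-version totals for the same prompt answers the "cheaper or more expensive" question.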

GPT-5 lineage and Gemini 2.5 Flash-Lite

The model registry refresh added the full GPT-5.x family (5.5, 5.5 Pro, 5.4, 5.4 Pro, 5.4 Mini, 5.4 Nano, 5.3 Codex) and Gemini 2.5 Flash-Lite. Pricing, context window, prompt-caching support, reasoning-effort levels, and thinking-budget ranges are all current. Limited-access models surface a Limited access badge in the dropdown so you can see access state before you pick.

Gemini 2.5 Flash-Lite is now the default for classifier-tier tasks like Generate Intent and the Workbench Model fallback when no Workbench Model is set.

A sharper AI pair

The five AI-pair panels (Brainstorm, Critique, Improve, Rewrite, Compare) got a polish pass this week.

One-click re-run on staleness banners. When a Critique goes stale because you've kept editing, the banner on Improve and Rewrite picks up the workbench's editorial register, with a Re-run Critique button on the right. Same treatment across every staleness banner in the AI pair.

Thinking heartbeat for OpenAI and Google reasoning models. OpenAI's o-series and Gemini 2.5 Pro do their reasoning silently before delivering output, which can take a moment on a deep prompt. A new Thinking… indicator surfaces during that phase so you see the model is working, then disappears the instant text starts streaming. Matches the existing Anthropic extended-thinking treatment. Anthropic streams are unchanged.

Cleaner Brainstorm parsing. Brainstorm's Incomplete edit warning fires only on real incomplete edits.

Unified Callout treatment across the AI pair. Inline notes ("BYOK key required," "tier-gated," "no actionable improvements") share a single visual treatment across all five AI-pair panels.

Shareable Critique and Compare

You can now publish any Critique or Compare result as a frozen, public-link snapshot. Click Share on a Critique or a Compare panel and you get a /share/critique/<id> or /share/compare/<id> URL that anyone can open without an account. Viewers never re-run the LLM; the public route only reads the persisted snapshot.

Defaults are conservative: shares are noindex by default, the prompt body is NOT included on Critique shares unless you opt in, and you can revoke at any time from the new /settings/shares management page (linked from the user menu). Discoverable opt-in is per-share. Prompt-body publishing is gated to the original prompt author.

Per-share OG images render so X and LinkedIn unfurl with a real preview rather than a generic page card. The dialog also gives you pre-formatted X and LinkedIn copy and one-click copy-to-clipboard. Use it to show your work in a thread or a post without screenshotting.

This is the single user-initiated exception to our otherwise strict no-server-side-AI-cache rule: explicit per-result publish, owner-revocable, snapshot-only, authorship-gated where it matters.
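The share defaults can be pictured as a small record with conservative initial values. The sketch below is illustrative only: the class, field, and function names are assumptions, and it applies the prompt-body rule to every share kind for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Share:
    kind: str                      # "critique" or "compare"
    owner_id: str                  # who created the share
    prompt_author_id: str          # who wrote the underlying prompt
    snapshot: dict = field(default_factory=dict)
    noindex: bool = True           # not discoverable unless opted in per share
    include_prompt_body: bool = False
    revoked: bool = False


def publish_prompt_body(share, requester_id):
    """Only the original prompt author may opt in to publishing the body."""
    if requester_id != share.prompt_author_id:
        raise PermissionError("only the prompt author can publish the prompt body")
    share.include_prompt_body = True


def revoke(share, requester_id):
    if requester_id != share.owner_id:
        raise PermissionError("only the share owner can revoke")
    share.revoked = True


def read_public(share):
    """Public view: reads the stored snapshot and never calls a model."""
    if share.revoked:
        return None
    data = dict(share.snapshot)
    if not share.include_prompt_body:
        data.pop("prompt_body", None)
    return data
```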

Regression testing on version change

A new /prompts/[id]/regression surface scores any two versions of a prompt side-by-side against your eval suite. Save a new version, run regression, see per-case improved / unchanged / regressed counts plus per-criterion deltas. Score history persists per version so subsequent runs reuse prior baselines.
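To make the improved / unchanged / regressed counts concrete, here is a rough sketch of the comparison. It assumes scores arrive as `{case_id: {criterion: score}}` and classifies each case by its mean delta across criteria; both the data shape and that classification rule are assumptions, not a description of the actual scorer.

```python
from collections import Counter


def compare_versions(baseline, candidate, tolerance=0.0):
    """Return (per-case counts, mean per-criterion delta) for two score sets."""
    counts = Counter(improved=0, unchanged=0, regressed=0)
    deltas = {}
    # Only cases and criteria present in both versions are compared.
    for case_id in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[case_id], candidate[case_id]
        criteria = old.keys() & new.keys()
        if not criteria:
            continue
        case_delta = sum(new[c] - old[c] for c in criteria) / len(criteria)
        if case_delta > tolerance:
            counts["improved"] += 1
        elif case_delta < -tolerance:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
        for c in criteria:
            deltas.setdefault(c, []).append(new[c] - old[c])
    per_criterion = {c: sum(d) / len(d) for c, d in deltas.items()}
    return dict(counts), per_criterion
```

A non-zero `tolerance` would keep small score noise from registering as a regression.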

A per-prompt auto-regression toggle fires the run after you accept an AI rewrite, so the score delta is waiting for you when you next open the prompt. Same tier gate as eval suites, same BYOK posture, same impersonation block.

(Updated 2026-05-07: migration 054 opened eval suites and auto-regression on every tier including Free. The Team/Enterprise gate this entry described is no longer in effect; the only remaining block path is a platform-admin custom override on a specific workspace.)

Granular reasoning effort for GPT-5.x

GPT-5.x reasoning models now expose the full effort range (none / low / medium / high / xhigh, varying by model) via a new effort dropdown on every AI-pair panel. Other models (Claude, Gemini, GPT-4.1) keep the existing Deep analysis checkbox.

The selection persists as a workbench-level preference so it carries across panels and prompts. Switching workbench models clamps the saved preference to the new model's accepted set with a hint so you know what happened. The cost-preview modal scales the output-token estimate by the selected effort level so the cost number reflects the chosen depth of reasoning.
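A sketch of the clamp-and-scale idea. The clamping rule (nearest accepted level at or below the saved one, else the lowest accepted) and the multipliers are illustrative assumptions; the entry does not specify either.

```python
EFFORT_ORDER = ["none", "low", "medium", "high", "xhigh"]

# Hypothetical output-token multipliers per effort level, for illustration only.
EFFORT_MULTIPLIER = {"none": 1.0, "low": 1.5, "medium": 2.0, "high": 3.0, "xhigh": 4.0}


def clamp_effort(saved, accepted):
    """Return (effort, hint). hint is None when the saved value is still valid."""
    if saved in accepted:
        return saved, None
    rank = EFFORT_ORDER.index(saved)
    lower = [e for e in EFFORT_ORDER[:rank] if e in accepted]
    chosen = lower[-1] if lower else min(accepted, key=EFFORT_ORDER.index)
    hint = f"Effort changed from {saved} to {chosen}: this model does not accept {saved}."
    return chosen, hint


def scaled_output_estimate(base_output_tokens, effort):
    """Scale the output-token estimate used by a cost preview."""
    return round(base_output_tokens * EFFORT_MULTIPLIER[effort])
```

For example, `clamp_effort("xhigh", {"low", "medium", "high"})` falls back to `"high"` and returns a hint string to show the user.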

Cross-model comparison testing in the Playground

Run the same prompt across multiple models in parallel. Open a prompt, click Compare models in the playground toolbar, pick 2 to 5 engine models from any provider you have a BYOK key for, and run.

Each model gets its own column. Output streams in live, side-by-side. Tokens, latency, and cost surface beneath each output as it completes. If one provider rate-limits or your BYOK key for that provider is missing, that column shows the precise error and the rest of the comparison keeps running. Closing the tab aborts every in-flight provider call immediately, so you're never billed for tokens past the cancel point.
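The per-column error isolation described above can be sketched with `asyncio`. The `call_model` argument is a hypothetical async function standing in for a provider client; nothing here reflects the real provider integrations.

```python
import asyncio


async def run_column(model, prompt, call_model):
    """Run one model. A failure becomes that column's error, not a crash."""
    try:
        output = await call_model(model, prompt)
        return {"model": model, "ok": True, "output": output}
    except Exception as exc:  # cancellation is a BaseException and still propagates
        return {"model": model, "ok": False, "error": str(exc)}


async def compare_models(models, prompt, call_model):
    """Fan the same prompt out to 2 to 5 models concurrently."""
    if not 2 <= len(models) <= 5:
        raise ValueError("pick 2 to 5 models")
    # Results come back in the same order as `models`. If this coroutine is
    # cancelled, gather cancels every call that has not finished yet.
    return await asyncio.gather(
        *(run_column(m, prompt, call_model) for m in models)
    )
```

Because each column catches its own exception, one rate-limited provider does not stop the others from completing.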

After the run, an optional LLM-as-Judge pass scores every model's output against a shared rubric you write (one criterion per line, up to 12). The judge picks the best model per criterion and surfaces a BEST · N badge on the column of any model that won at least once. The judge runs once per report; clicking Judge all again returns the existing winners without spending another token.
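A small sketch of the winner and badge logic, plus the run-once behavior. The `judge` callable and the score shape `{criterion: {model: score}}` are assumptions made for illustration, and ties here go to the first model listed, which is an arbitrary choice.

```python
from collections import Counter

MAX_CRITERIA = 12  # rubric cap stated in this entry


def parse_rubric(text):
    """One criterion per non-empty line, capped at MAX_CRITERIA."""
    criteria = [line.strip() for line in text.splitlines() if line.strip()]
    if len(criteria) > MAX_CRITERIA:
        raise ValueError(f"rubric has {len(criteria)} criteria; the limit is {MAX_CRITERIA}")
    return criteria


def pick_winners(scores):
    """Map each criterion to its highest-scoring model."""
    return {c: max(by_model, key=by_model.get) for c, by_model in scores.items()}


def badges(winners):
    """Models that won at least one criterion get a BEST · N badge."""
    return {model: f"BEST · {n}" for model, n in Counter(winners.values()).items()}


class ModelReport:
    def __init__(self, judge):
        self._judge = judge      # hypothetical callable: (outputs, criteria) -> scores
        self._winners = None

    def judge_all(self, outputs, rubric_text):
        # The judge runs only on the first call; later calls reuse the stored winners.
        if self._winners is None:
            scores = self._judge(outputs, parse_rubric(rubric_text))
            self._winners = pick_winners(scores)
        return self._winners
```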

Click Save report and the comparison becomes a public-link snapshot at /share/model-report/<id>. Same noindex-by-default behavior as the Critique and Compare shares. Viewers never re-run any model; the snapshot is frozen.

Multi-model is BYOK-only. The new-account demo budget runs single-model on Anthropic, and multi-model fan-out across N providers requires keys for those providers in your workspace.

Quietly underneath

Improvements & fixes
  • Limited access badge renders inline in model dropdowns. GPT-5.5 Pro and GPT-5.4 Pro carry an explicit visual badge so you don't pick a model and bounce off a 403.
  • Brainstorm preserves <suggested_edit> XML on OpenAI and Google. Apply-block formatting holds across providers.

What's next

Per-criterion winner highlighting in shareable form for Model Reports. Possibly a leaderboard view across saved reports for the same prompt. Tell us if either is useful before we build.

promptassay.ai · Changelog · Week of APR 27 · 2026 · By Jon Lasley