MAY 07 · 2026 · WEEK OF MAY 04

Improvement

Evaluation suites open on every tier

Evaluation suites, regression testing, and Behavioral Eval are now open on every tier. Free workspaces can author, run, and judge end-to-end.

You can now author and run evaluation suites on every tier, including Free. The same goes for regression testing on every new version and for the multi-provider Behavioral Eval on Skills. The "New Test Suite" button is no longer disabled. The 250-call monthly AI cap on Free is the natural cost lever; if you actually use the feature, you'll likely upgrade to Solo for unlimited calls. Gating creation behind Team was a friction point that hurt the developers we most want to convert without meaningfully protecting cost (BYOK already routes provider cost straight to your account, not ours).

What's open now on Free

You get the full evaluation pipeline. Author a test suite for any prompt. Add weighted criteria. Run synchronously, or via the Anthropic Batches API for the 50% Batches discount. Score with LLM-as-a-judge against your six-dimension rubric. The judge call honors your Workbench Model at every tier, same as before; the cost-warning banner still surfaces if you pick an expensive Workbench Model and the suite would land a surprise bill.

Regression testing follows the same posture. Run version N versus N+1 against your existing suite, get per-case verdicts (improved / unchanged / regressed / new / failed), and toggle auto-on-version so a regression sweep fires every time you accept an AI rewrite. None of that needs a paid plan now.
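To make the five verdicts concrete, here is a minimal sketch of how a per-case comparison between version N and N+1 could be classified. The verdict names come from the list above; everything else (the `CaseResult` shape, the `EPSILON` tolerance, treating a failed baseline as "new") is an assumption for illustration, not the product's actual logic.

```typescript
// The five verdicts named in the changelog.
type Verdict = "improved" | "unchanged" | "regressed" | "new" | "failed";

// Hypothetical shape: a judged score per test case, or null if the run errored.
interface CaseResult {
  score: number | null;
}

// Assumed tolerance: score changes smaller than this count as noise.
const EPSILON = 0.01;

function classifyCase(
  previous: CaseResult | undefined, // undefined: the case did not exist for version N
  current: CaseResult,
): Verdict {
  // A case that produced no score on version N+1 is a failure, not a regression.
  if (current.score === null) return "failed";
  // No usable baseline (assumption: a failed baseline is treated like a new case).
  if (previous === undefined || previous.score === null) return "new";

  const delta = current.score - previous.score;
  if (delta > EPSILON) return "improved";
  if (delta < -EPSILON) return "regressed";
  return "unchanged";
}
```

Checking for failure first keeps an errored run from being reported as a regression, which would otherwise point you at the prompt rather than the run.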

What changes for Behavioral Eval

The per-run cap on Behavioral Eval is now uniform across every tier: up to 5 models, up to 10 trigger probes, and up to 6 non-trigger probes per run. Previously Free was capped at 1 × 4 × 2 and Solo at 3 × 6 × 4 (models × trigger probes × non-trigger probes). Cross-provider testing was the use case those caps blocked, and that's the use case Behavioral Eval is for. The $5 cost-preflight gate, the user-grain rate limit (6 runs / minute), and the Free tier's monthly call cap stay in place as the real cost levers.

The Free monthly cap is still load-bearing. A max-config Behavioral Eval run is 5 models × 16 probes = 80 cells, and at 2 calls per cell that comes to 160 LLM calls. That's 64% of a Free user's 250-call monthly budget in a single run. To keep that visible, the panel now shows an inline "Monthly usage · X of 250 calls" banner on Free workspaces during Behavioral Eval setup. Color escalates at 70% (warning) and 90% (error). The banner shows the projected total after the configured run so you can see whether you'll overrun before you click Run.
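The arithmetic behind the banner can be sketched in a few lines. The 250-call cap, the 2 calls per cell, and the 70% / 90% thresholds come from this entry; the function names, the `RunConfig` shape, and the choice to color by the projected total rather than current usage are assumptions.

```typescript
const FREE_MONTHLY_CAP = 250; // Free tier monthly call cap (from the changelog)
const CALLS_PER_CELL = 2;     // calls per model × probe cell (from the changelog)

type Severity = "ok" | "warning" | "error";

// Hypothetical shape of a Behavioral Eval run configuration.
interface RunConfig {
  models: number;
  triggerProbes: number;
  nonTriggerProbes: number;
}

// One cell per (model, probe) pair, CALLS_PER_CELL calls per cell.
function callsForRun(config: RunConfig): number {
  const cells = config.models * (config.triggerProbes + config.nonTriggerProbes);
  return cells * CALLS_PER_CELL;
}

// Assumption: severity is driven by the projected total after the run.
function projectUsage(usedSoFar: number, config: RunConfig) {
  const projected = usedSoFar + callsForRun(config);
  const ratio = projected / FREE_MONTHLY_CAP;
  const severity: Severity =
    ratio >= 0.9 ? "error" : ratio >= 0.7 ? "warning" : "ok";
  return { projected, ratio, severity, overruns: projected > FREE_MONTHLY_CAP };
}

const maxConfig: RunConfig = { models: 5, triggerProbes: 10, nonTriggerProbes: 6 };
console.log(callsForRun(maxConfig));      // 160
console.log(projectUsage(20, maxConfig)); // projected 180 of 250, severity "warning"
```

With 100 calls already used, the same max-config run projects to 260 calls, so `overruns` is true before anything is sent.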

Under the hood

A new Postgres migration flips tier_limits.eval_suites_enabled to true for Free and Solo (was true only for Team and Enterprise). The flag stays in the table because a platform admin can still set a per-workspace custom_overrides.eval_suites_enabled = false as an abuse escape hatch.
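As a rough illustration of why the flag stays in the table, the effective setting can be read as "workspace override if present, otherwise tier default". The field name `eval_suites_enabled` is from this entry; the interfaces and the `evalSuitesEnabled` helper are hypothetical, and the real resolution may happen in SQL rather than in application code.

```typescript
// Hypothetical row shapes mirroring the columns named in the changelog.
interface TierLimits {
  eval_suites_enabled: boolean;
}

interface CustomOverrides {
  eval_suites_enabled?: boolean; // absent means "no override for this workspace"
}

// Assumed precedence: an explicit per-workspace override beats the tier default.
function evalSuitesEnabled(
  tier: TierLimits,
  overrides?: CustomOverrides | null,
): boolean {
  // `??` only falls through on null/undefined, so an explicit `false` is honored.
  return overrides?.eval_suites_enabled ?? tier.eval_suites_enabled;
}

const freeTier: TierLimits = { eval_suites_enabled: true };
console.log(evalSuitesEnabled(freeTier));                                 // true
console.log(evalSuitesEnabled(freeTier, { eval_suites_enabled: false })); // false
```

Using `??` rather than `||` matters here: with `||`, an admin's explicit `false` would be ignored and the tier default would win.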

The app/api/skills/evaluate route now checks BYOK presence before the tier cap. New users without a connected provider key see the actionable "connect a BYOK key" error rather than a tier-limit message that wouldn't tell them what to do. The tier cap branch is retained as a structural ceiling against an over-bumped client cap.
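The ordering change can be sketched as a small preflight function. The caps (5 models, 10 trigger probes, 6 non-trigger probes) are from this entry; the request shape, the error codes, and the messages are hypothetical and are not the route's actual contract.

```typescript
// Hypothetical request shape for a Behavioral Eval run.
interface EvalRequest {
  hasByokKey: boolean;
  models: number;
  triggerProbes: number;
  nonTriggerProbes: number;
}

// Uniform per-run caps from the changelog.
const CAPS = { models: 5, triggerProbes: 10, nonTriggerProbes: 6 };

type Preflight =
  | { ok: true }
  | { ok: false; code: "byok_required" | "cap_exceeded"; message: string };

function preflight(req: EvalRequest): Preflight {
  // 1. BYOK first: a user with no provider key gets an error they can act on.
  if (!req.hasByokKey) {
    return { ok: false, code: "byok_required", message: "Connect a BYOK key to run evaluations." };
  }
  // 2. The cap check stays as a server-side ceiling, whatever the client allows.
  if (
    req.models > CAPS.models ||
    req.triggerProbes > CAPS.triggerProbes ||
    req.nonTriggerProbes > CAPS.nonTriggerProbes
  ) {
    return { ok: false, code: "cap_exceeded", message: "Run exceeds the per-run cap." };
  }
  return { ok: true };
}
```

A request that is both missing a key and over the cap returns `byok_required`, because that is the problem the user has to fix first.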

What's next

The monthly-usage banner pattern transfers to other panels next. The same shape on Critique, Compare, and Playground would tell Free users where they are in their budget without forcing them to cross-reference the Billing page. We're picking the order based on which panels burn the most calls on Free.

MAY 06 · 2026 · WEEK OF MAY 04

Feature · skills · ai-pair · evaluations

Skills authoring + multi-provider Behavioral Eval (L1A)

Skills workbench: author SKILL.md bundles, lint and critique on six dimensions, and run a Behavioral Eval that scores activation across Claude, GPT, and Gemini.

APR 30 · 2026 · WEEK OF APR 27

Feature · byok · billing · models

Demo mode, cost drill-down, the GPT-5 lineage, and a much sharper AI pair

A big week of ships: free demo runs for new accounts, per-run cost receipts, GPT-5 and Gemini 2.5 Flash-Lite, shareable Critique and Compare, plus cross-model testing in the Playground.

APR 24 · 2026 · WEEK OF APR 20

Feature · feedback · billing · ai-pair

Vote on what we ship next

A two-board feedback system where you can file bugs, request features, and vote on what we ship next. Plus a brainstorm timeout fix you may have hit on long Opus runs.