Tiers, cost, and BYOK for Skills
Per-tier caps for Behavioral Eval, what each Skill instrument bills your provider, the $5 cost-preflight gate, and how to read the per-skill cost dialog.
Skills are BYOK at every tier. Every LLM call from a Skill instrument (Critique, Improve, Rewrite, Brainstorm, Behavioral Eval, judge) bills your provider account directly. Prompt Assay never marks up inference, never proxies your traffic, and never takes a percentage. Tier price covers the workbench, not the model calls.
Tier caps
Skill count is unlimited at every tier. As of 2026-05-07, the per-run Behavioral Eval cap is uniform across every tier: up to 5 models, 10 trigger probes, and 6 non-trigger probes per run. Cost is gated by the $5 cost-preflight estimate, the Free tier's 250 `monthly_ai_calls` cap, and the per-user rate limit (6 BE runs per minute).
| Tier | Price | Skills | BE models / run | BE trigger probes | BE non-trigger probes | Publish Skill Report | GitHub Action |
|---|---|---|---|---|---|---|---|
| Free | $0 | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Solo | $49 / mo | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Team | $99 / seat / mo | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Enterprise | Custom | Unlimited | 5 | 10 | 6 | Yes | Yes (with SSO + audit) |
What gets billed per action
| Action | LLM calls | Notes |
|---|---|---|
| Critique | 1 call to your Workbench Model | Six-dimension scorecard. Streams in. Falls back to the cheapest available BYOK model if no Workbench Model is set. |
| Improve | 1 call to your Workbench Model + 1 cheap classifier | The classifier picks the dimensions to focus on; the main call generates targeted edit suggestions for every dimension < 8.5. |
| Rewrite | 1 call to your Workbench Model | Full-bundle rewrite. Chain-runs after Critique by default, so you'll typically see Critique + Rewrite = 2 calls. |
| Brainstorm | N calls to your Workbench Model | One call per turn in the conversation. The full message history is re-sent each turn (standard chat semantics). |
| Behavioral Eval | (probes × models) main calls + (probes × models) judge calls | A run with 6 trigger probes and 2 non-trigger probes across 3 models makes (6 + 2) × 3 = 24 main calls plus 24 judge calls, 48 total. The cost preflight estimates this before you click Run. |
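The call-count arithmetic in the Behavioral Eval row can be sketched as a small helper (a minimal sketch; the function name is illustrative, not part of any real Prompt Assay API):

```python
def behavioral_eval_calls(models: int, trigger_probes: int, non_trigger_probes: int) -> int:
    """Total LLM calls for one Behavioral Eval run.

    Each (probe, model) cell costs one main call plus one judge call.
    """
    cells = (trigger_probes + non_trigger_probes) * models
    return cells * 2  # main calls + judge calls

# Worked example from the table: 6 trigger + 2 non-trigger probes on 3 models
assert behavioral_eval_calls(models=3, trigger_probes=6, non_trigger_probes=2) == 48

# The structural per-run ceiling: 5 models, 10 trigger probes, 6 non-trigger probes
assert behavioral_eval_calls(models=5, trigger_probes=10, non_trigger_probes=6) == 160
```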
The $5 cost-preflight gate
Before a Behavioral Eval run starts, the panel computes an estimated cost based on the selected models, the probe count, and rough token estimates from the SKILL.md size. The estimate appears inline next to the Run button.
- Under $5: Run starts immediately when you click.
- Over $5: A confirmation dialog appears with the breakdown (per-model cost, judge cost, total). You confirm with an explicit click before the run begins.
- Per-run cap: the structural ceiling is 5 models × (10 + 6) probes = 80 cells × 2 = 160 LLM calls per run, applied uniformly across every tier, which keeps the worst case bounded even before the $5 gate.
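The gate logic above can be sketched as follows (a hedged sketch, assuming the dialog works on a per-model breakdown plus a judge line item; the `Estimate` type and field names are hypothetical):

```python
from dataclasses import dataclass, field

PREFLIGHT_THRESHOLD_USD = 5.00  # the $5 cost-preflight gate

@dataclass
class Estimate:
    """Cost estimate computed before a Behavioral Eval run starts."""
    per_model_cost: dict = field(default_factory=dict)  # model name -> estimated USD
    judge_cost: float = 0.0                             # estimated USD for judge calls

    @property
    def total(self) -> float:
        return sum(self.per_model_cost.values()) + self.judge_cost

def needs_confirmation(estimate: Estimate) -> bool:
    """Under $5 the run starts immediately; over $5 a breakdown dialog
    requires an explicit confirm click before the run begins."""
    return estimate.total > PREFLIGHT_THRESHOLD_USD
```

Note the breakdown dialog shows exactly the fields the estimate carries: per-model cost, judge cost, and the total.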
Cost monitoring
Two surfaces show what each Skill is costing you:
- Workspace dashboard at `/usage` shows org-wide cost broken down by action, model, and provider. Filter by date range.
- Per-skill cost dialog opens from the cost icon in the Skill workbench header. Shows the same `usage_logs` data but filtered to one skill, broken down by action, model, and version. Three time-window toggles (Last 24h / This Week / This Month).
Both surfaces pull from the same `usage_logs` table that's been recording every LLM call since you signed up. There's no separate "skills cost" meter; it's just the same usage logs filtered by `metadata.skill_id`.
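Conceptually, the per-skill dialog is just a filter over the shared log. A minimal sketch (the row fields here are illustrative, not the real `usage_logs` schema):

```python
def skill_cost(usage_logs: list, skill_id: str) -> float:
    """Sum the cost of every logged LLM call tagged with this skill's id."""
    return sum(
        row["cost_usd"]
        for row in usage_logs
        if row.get("metadata", {}).get("skill_id") == skill_id
    )

# Illustrative rows: every Skill action lands in the same log with a skill_id tag
usage_logs = [
    {"action": "critique", "cost_usd": 0.04, "metadata": {"skill_id": "sk_demo"}},
    {"action": "rewrite",  "cost_usd": 0.09, "metadata": {"skill_id": "sk_demo"}},
    {"action": "critique", "cost_usd": 0.03, "metadata": {"skill_id": "sk_other"}},
]
```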
Anthropic Skill Creator vs Prompt Assay
Anthropic ships a Skill Creator that walks you through scaffolding a SKILL.md inside Claude. It's a great starting point. But it doesn't tell you how the Skill behaves on GPT or Gemini, doesn't score discovery against non-trigger probes, and doesn't give you a public report you can link from a README. Use both:
| Use the Anthropic Skill Creator when... | Use Prompt Assay when... |
|---|---|
| You're scaffolding a brand-new Skill from a natural-language description and want Claude to draft the frontmatter and body. | You have a SKILL.md already (drafted, imported, or downloaded from the Anthropic skills repo) and need to know how reliably it activates and adheres across providers. |
| You only target Claude and you trust Claude's own discovery routing. | You target two or more of Claude / GPT / Gemini and need cross-provider consistency scoring. |
| You want a single conversational tool to draft and iterate without leaving Claude. | You want versioning + diff + restore, a six-dimension scorecard, a 19-rule linter with security tier, and a public Skill Report with a README badge. |
Most teams import the Anthropic Skill Creator's output into Prompt Assay, run a Behavioral Eval to confirm cross-provider behavior, then iterate on the dimensions that score below 8. The two tools are complementary, not competitive.