
Tiers, cost, and BYOK for Skills

Per-tier caps for Behavioral Eval, what each Skill instrument bills your provider, the $5 cost-preflight gate, and how to read the per-skill cost dialog.

Updated 2026-05-06 · By Jon Lasley

Skills are BYOK at every tier. Every LLM call from a Skill instrument (Critique, Improve, Rewrite, Brainstorm, Behavioral Eval, judge) bills your provider account directly. Prompt Assay never marks up inference, never proxies your traffic, and never takes a percentage. Tier price covers the workbench, not the model calls.

Demo budget does NOT cover Skills
New accounts get a 7-call platform-funded demo budget on prompt Critique / Improve / Rewrite (Sonnet 4.6, thinking off). Skills are not on the demo allowlist; running any Skill instrument requires a connected BYOK key from the start. This is intentional: Behavioral Eval can fan out 50+ calls in a single run, and we cap exposure at the action level, not the token level.

Tier caps

Skill count is unlimited at every tier. As of 2026-05-07, the per-run Behavioral Eval cap is uniform across every tier: up to 5 models, 10 trigger probes, and 6 non-trigger probes per run. Cost is gated by the $5 cost-preflight estimate, the Free tier's 250-call `monthly_ai_calls` cap, and the per-user rate limit (6 BE runs per minute).

| Tier | Price | Skills | BE models / run | BE trigger probes | BE non-trigger probes | Publish Skill Report | GitHub Action |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Free | $0 | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Solo | $49 / mo | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Team | $99 / seat / mo | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Enterprise | Custom | Unlimited | 5 | 10 | 6 | Yes | Yes (with SSO + audit) |
Why uniform caps across tiers
BYOK already routes provider cost straight to your account. The 5 × 10 × 6 ceiling is a structural safety net: 5 models × (10 + 6) probes = 80 cells, each with a main call and a judge call, for a worst case of 160 LLM calls per run. The real cost levers are the cost-preflight gate ($5 threshold), the Free monthly cap (250 calls), and the rate limiter (6/min). Gating per-run breadth at lower tiers throttled the personas we most want to convert without meaningfully reducing cost exposure.

What gets billed per action

| Action | LLM calls | Notes |
| --- | --- | --- |
| Critique | 1 call to your Workbench Model | Six-dimension scorecard. Streams in. Falls back to the cheapest available BYOK model if no Workbench Model is set. |
| Improve | 1 call to your Workbench Model + 1 cheap classifier call | The classifier picks the dimensions to focus on; the main call generates targeted edit suggestions for every dimension scoring below 8.5. |
| Rewrite | 1 call to your Workbench Model | Full-bundle rewrite. Chain-runs after Critique by default, so you'll typically see Critique + Rewrite = 2 calls. |
| Brainstorm | N calls to your Workbench Model | One call per turn in the conversation. The full message history is re-sent each turn (standard chat semantics). |
| Behavioral Eval | (probes × models) main calls + (probes × models) judge calls | A run with 6 trigger probes, 2 non-trigger probes, and 3 models makes (6 + 2) × 3 = 24 main calls plus 24 judge calls, 48 total. The cost preflight estimates this before you click Run. |
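
The Behavioral Eval arithmetic is worth sanity-checking before a big run. Below is a minimal TypeScript sketch of that arithmetic, using illustrative names (`BehavioralEvalPlan` and `llmCallsForRun` are not Prompt Assay APIs); it simply reproduces the table's math.

```typescript
// Illustrative helper mirroring the billing table's arithmetic; not Prompt Assay internals.
interface BehavioralEvalPlan {
  models: number;           // capped at 5 per run
  triggerProbes: number;    // capped at 10 per run
  nonTriggerProbes: number; // capped at 6 per run
}

function llmCallsForRun(plan: BehavioralEvalPlan): number {
  const cells = (plan.triggerProbes + plan.nonTriggerProbes) * plan.models;
  return cells * 2; // one main call + one judge call per cell
}

// The example from the table: 6 trigger + 2 non-trigger probes across 3 models.
console.log(llmCallsForRun({ models: 3, triggerProbes: 6, nonTriggerProbes: 2 })); // 48

// The structural ceiling at the per-run caps: 5 × (10 + 6) = 80 cells, so 160 calls.
console.log(llmCallsForRun({ models: 5, triggerProbes: 10, nonTriggerProbes: 6 })); // 160
```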

The $5 cost-preflight gate

Before a Behavioral Eval run starts, the panel computes an estimated cost based on the selected models, the probe count, and rough token estimates from the SKILL.md size. The estimate appears inline next to the Run button.

  • Under $5: Run starts immediately when you click.
  • Over $5: A confirmation dialog appears with the breakdown (per-model cost, judge cost, total). You confirm with an explicit click before the run begins.
  • Per-run cap: the structural ceiling is 5 models × (10 + 6) probes = 80 cells, each with a main call and a judge call, so 160 LLM calls per run, applied uniformly across every tier. That keeps the worst case bounded even before the $5 gate.
Preflight is an estimate, not a meter
Actual cost depends on output token counts, which can't be predicted exactly before the model runs. Real cost lands in your usage logs immediately after the run completes; the per-skill cost dialog (see below) reflects it.
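
To make the gate concrete, here is a hedged TypeScript sketch of how a preflight estimate like this could be computed: rough per-call token counts times per-token prices, times the call count, compared against the $5 threshold. The prices, token estimates, and names are assumptions for illustration, not Prompt Assay's actual pricing data or code.

```typescript
// Illustrative preflight sketch; prices and token counts are assumed, not real pricing tables.
const CONFIRM_THRESHOLD_USD = 5;

interface ModelPricing {
  inputPerMTok: number;  // USD per million input tokens (assumed)
  outputPerMTok: number; // USD per million output tokens (assumed)
}

function estimateRunCostUsd(
  models: ModelPricing[],
  probeCount: number,      // trigger + non-trigger probes
  estInputTokens: number,  // rough estimate from SKILL.md size plus probe text
  estOutputTokens: number, // rough guess; real output length isn't knowable upfront
): number {
  let total = 0;
  for (const m of models) {
    const perCall =
      (estInputTokens / 1e6) * m.inputPerMTok +
      (estOutputTokens / 1e6) * m.outputPerMTok;
    // Each probe produces one main call and one judge call; the judge is assumed
    // to run on the same model here purely to keep the sketch short.
    total += perCall * probeCount * 2;
  }
  return total;
}

const estimate = estimateRunCostUsd(
  [{ inputPerMTok: 3, outputPerMTok: 15 }], // one model, assumed prices
  16,    // 10 trigger + 6 non-trigger probes
  4_000, // assumed input tokens per call
  800,   // assumed output tokens per call
);

const needsConfirmation = estimate > CONFIRM_THRESHOLD_USD;
console.log({ estimateUsd: estimate.toFixed(2), needsConfirmation });
```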

Cost monitoring

Two surfaces show what each Skill is costing you:

  • Workspace dashboard at `/usage` shows org-wide cost broken down by action, model, and provider. Filter by date range.
  • Per-skill cost dialog opens from the cost icon in the Skill workbench header. Shows the same usage_logs data but filtered to one skill, broken down by action, model, and version. Three time-window toggles (Last 24h / This Week / This Month).

Both surfaces pull from the same `usage_logs` table that's been recording every LLM call since you signed up. There's no separate "skills cost" meter; it's just the same usage logs filtered by `metadata.skill_id`.
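
If you want to roll up costs yourself (say, in an internal dashboard), the grouping is straightforward. The sketch below assumes a plausible row shape for exported `usage_logs` data; the real column names and action values may differ.

```typescript
// Assumed row shape for exported usage_logs data; real columns may differ.
interface UsageLogRow {
  created_at: string;
  action: string;   // e.g. "critique", "rewrite", "behavioral_eval" (values assumed)
  model: string;
  cost_usd: number;
  metadata: { skill_id?: string };
}

// Roll up cost per action/model for one skill: the same grouping the cost dialog shows.
function costForSkill(rows: UsageLogRow[], skillId: string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    if (row.metadata.skill_id !== skillId) continue;
    const key = `${row.action} / ${row.model}`;
    totals[key] = (totals[key] ?? 0) + row.cost_usd;
  }
  return totals;
}
```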

Anthropic Skill Creator vs Prompt Assay

Anthropic ships a Skill Creator that walks you through scaffolding a SKILL.md inside Claude. It's a great starting point. But it doesn't tell you how the Skill behaves on GPT or Gemini, doesn't score discovery against non-trigger probes, and doesn't give you a public report you can link from a README. Use both:

| Use the Anthropic Skill Creator when... | Use Prompt Assay when... |
| --- | --- |
| You're scaffolding a brand-new Skill from a natural-language description and want Claude to draft the frontmatter and body. | You have a SKILL.md already (drafted, imported, or downloaded from the Anthropic skills repo) and need to know how reliably it activates and adheres across providers. |
| You only target Claude and you trust Claude's own discovery routing. | You target two or more of Claude / GPT / Gemini and need cross-provider consistency scoring. |
| You want a single conversational tool to draft and iterate without leaving Claude. | You want versioning + diff + restore, a six-dimension scorecard, a 19-rule linter with a security tier, and a public Skill Report with a README badge. |

Most teams import the Anthropic Skill Creator's output into Prompt Assay, run a Behavioral Eval to confirm cross-provider behavior, then iterate on the dimensions that score below 8. The two tools are complementary, not competitive.