
Tiers, cost, and BYOK for Skills

Per-tier caps for Behavioral Eval, what each Skill instrument bills your provider, the $5 cost-preflight gate, and how to read the per-skill cost dialog.

Updated 2026-05-06 · By Jon Lasley

Skills are BYOK at every tier. Every LLM call from a Skill instrument (Critique, Improve, Rewrite, Brainstorm, Behavioral Eval, judge) bills your provider account directly. Prompt Assay never marks up inference, never proxies your traffic, and never takes a percentage. Tier price covers the workbench, not the model calls.

Demo budget does NOT cover Skills
New accounts get a 7-call platform-funded demo budget on prompt Critique / Improve / Rewrite (Sonnet 4.6, thinking off). Skills are not on the demo allowlist; running any Skill instrument requires a connected BYOK key from the start. This is intentional: Behavioral Eval can fan out 50+ calls in a single run, and we cap exposure at the action level, not the token level.

Tier caps

Skill count is unlimited at every tier. As of 2026-05-07, the per-run Behavioral Eval cap is uniform across every tier: up to 5 models, 10 trigger probes, and 6 non-trigger probes per run. Cost is gated by the $5 cost-preflight estimate, the Free tier's 250-call `monthly_ai_calls` cap, and the per-user rate limit (6 BE runs per minute).

| Tier | Price | Skills | BE models / run | BE trigger probes | BE non-trigger probes | Publish Skill Report | GitHub Action |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Free | $0 | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Solo | $49 / mo | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Team | $99 / seat / mo | Unlimited | 5 | 10 | 6 | Yes | Yes |
| Enterprise | Custom | Unlimited | 5 | 10 | 6 | Yes | Yes (with SSO + audit) |
Why uniform caps across tiers
BYOK already routes provider cost straight to your account. The 5 × 10 × 6 ceiling is a structural safety net: 5 models × (10 + 6) probes = 80 cells, each with a main call and a judge call, for a worst case of 160 LLM calls per run. The real cost levers are the cost-preflight gate ($5 threshold), the Free monthly cap (250 calls), and the rate limiter (6/min). Gating per-run breadth at lower tiers throttled the personas we most want to convert without meaningfully reducing cost exposure.

What gets billed per action

| Action | LLM calls | Notes |
| --- | --- | --- |
| Critique | 1 call to your Workbench Model | Six-dimension scorecard. Streams in. Falls back to the cheapest available BYOK model if no Workbench Model is set. |
| Improve | 1 call to your Workbench Model + 1 cheap classifier call | The classifier picks the dimensions to focus on; the main call generates targeted edit suggestions for every dimension scoring below 8.5. |
| Rewrite | 1 call to your Workbench Model | Full-bundle rewrite. Chain-runs after Critique by default, so you'll typically see Critique + Rewrite = 2 calls. |
| Brainstorm | N calls to your Workbench Model | One call per turn in the conversation. The full message history is re-sent each turn (standard chat semantics). |
| Behavioral Eval | (probes × models) main calls + (probes × models) judge calls | A run with 6 trigger probes, 2 non-trigger probes, and 3 models makes (6 + 2) × 3 = 24 main calls plus 24 judge calls, 48 total. The cost preflight estimates this before you click Run. |
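
The Behavioral Eval arithmetic is worth sanity-checking before a big run. Below is a minimal TypeScript sketch of that arithmetic, using illustrative names (`BehavioralEvalPlan` and `llmCallsForRun` are not Prompt Assay APIs); it simply reproduces the table's math.

```typescript
// Illustrative helper mirroring the billing table's arithmetic; not Prompt Assay internals.
interface BehavioralEvalPlan {
  models: number;           // capped at 5 per run
  triggerProbes: number;    // capped at 10 per run
  nonTriggerProbes: number; // capped at 6 per run
}

function llmCallsForRun(plan: BehavioralEvalPlan): number {
  const cells = (plan.triggerProbes + plan.nonTriggerProbes) * plan.models;
  return cells * 2; // one main call + one judge call per cell
}

// The example from the table: 6 trigger + 2 non-trigger probes across 3 models.
console.log(llmCallsForRun({ models: 3, triggerProbes: 6, nonTriggerProbes: 2 })); // 48

// The structural ceiling at the per-run caps: 5 × (10 + 6) = 80 cells, so 160 calls.
console.log(llmCallsForRun({ models: 5, triggerProbes: 10, nonTriggerProbes: 6 })); // 160
```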

The $5 cost-preflight gate

Before a Behavioral Eval run starts, the panel computes an estimated cost based on the selected models, the probe count, and rough token estimates from the SKILL.md size. The estimate appears inline next to the Run button.

  • Under $5: Run starts immediately when you click.
  • Over $5: A confirmation dialog appears with the breakdown (per-model cost, judge cost, total). You confirm with an explicit click before the run begins.
  • Per-run cap: the structural ceiling is 5 models × (10 + 6) probes = 80 cells, each with a main call and a judge call, so 160 LLM calls per run, applied uniformly across every tier. That keeps the worst case bounded even before the $5 gate.
Preflight is an estimate, not a meter
Actual cost depends on output token counts, which can't be predicted exactly before the model runs. Real cost lands in your usage logs immediately after the run completes; the per-skill cost dialog (see below) reflects it.
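
To make the gate concrete, here is a hedged TypeScript sketch of how a preflight estimate like this could be computed: rough per-call token counts times per-token prices, times the call count, compared against the $5 threshold. The prices, token estimates, and names are assumptions for illustration, not Prompt Assay's actual pricing data or code.

```typescript
// Illustrative preflight sketch; prices and token counts are assumed, not real pricing tables.
const CONFIRM_THRESHOLD_USD = 5;

interface ModelPricing {
  inputPerMTok: number;  // USD per million input tokens (assumed)
  outputPerMTok: number; // USD per million output tokens (assumed)
}

function estimateRunCostUsd(
  models: ModelPricing[],
  probeCount: number,      // trigger + non-trigger probes
  estInputTokens: number,  // rough estimate from SKILL.md size plus probe text
  estOutputTokens: number, // rough guess; real output length isn't knowable upfront
): number {
  let total = 0;
  for (const m of models) {
    const perCall =
      (estInputTokens / 1e6) * m.inputPerMTok +
      (estOutputTokens / 1e6) * m.outputPerMTok;
    // Each probe produces one main call and one judge call; the judge is assumed
    // to run on the same model here purely to keep the sketch short.
    total += perCall * probeCount * 2;
  }
  return total;
}

const estimate = estimateRunCostUsd(
  [{ inputPerMTok: 3, outputPerMTok: 15 }], // one model, assumed prices
  16,    // 10 trigger + 6 non-trigger probes
  4_000, // assumed input tokens per call
  800,   // assumed output tokens per call
);

const needsConfirmation = estimate > CONFIRM_THRESHOLD_USD;
console.log({ estimateUsd: estimate.toFixed(2), needsConfirmation });
```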

Cost monitoring

Two surfaces show what each Skill is costing you:

  • Workspace dashboard at `/usage` shows org-wide cost broken down by action, model, and provider. Filter by date range.
  • Per-skill cost dialog opens from the cost icon in the Skill workbench header. Shows the same usage_logs data but filtered to one skill, broken down by action, model, and version. Three time-window toggles (Last 24h / This Week / This Month).

Both surfaces pull from the same `usage_logs` table that's been recording every LLM call since you signed up. There's no separate "skills cost" meter; it's just the same usage logs filtered by `metadata.skill_id`.
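
If you want to roll up costs yourself (say, in an internal dashboard), the grouping is straightforward. The sketch below assumes a plausible row shape for exported `usage_logs` data; the real column names and action values may differ.

```typescript
// Assumed row shape for exported usage_logs data; real columns may differ.
interface UsageLogRow {
  created_at: string;
  action: string;   // e.g. "critique", "rewrite", "behavioral_eval" (values assumed)
  model: string;
  cost_usd: number;
  metadata: { skill_id?: string };
}

// Roll up cost per action/model for one skill: the same grouping the cost dialog shows.
function costForSkill(rows: UsageLogRow[], skillId: string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    if (row.metadata.skill_id !== skillId) continue;
    const key = `${row.action} / ${row.model}`;
    totals[key] = (totals[key] ?? 0) + row.cost_usd;
  }
  return totals;
}
```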

Anthropic Skill Creator vs Prompt Assay

Anthropic ships a Skill Creator that walks you through scaffolding a SKILL.md inside Claude. It's a great starting point. But it doesn't tell you how the Skill behaves on GPT or Gemini, doesn't score discovery against non-trigger probes, and doesn't give you a public report you can link from a README. Use both:

| Use the Anthropic Skill Creator when... | Use Prompt Assay when... |
| --- | --- |
| You're scaffolding a brand-new Skill from a natural-language description and want Claude to draft the frontmatter and body. | You have a SKILL.md already (drafted, imported, or downloaded from the Anthropic skills repo) and need to know how reliably it activates and adheres across providers. |
| You only target Claude and you trust Claude's own discovery routing. | You target two or more of Claude / GPT / Gemini and need cross-provider consistency scoring. |
| You want a single conversational tool to draft and iterate without leaving Claude. | You want versioning + diff + restore, a six-dimension scorecard, a 19-rule linter with a security tier, and a public Skill Report with a README badge. |

Most teams import the Anthropic Skill Creator's output into Prompt Assay, run a Behavioral Eval to confirm cross-provider behavior, then iterate on the dimensions that score below 8. The two tools are complementary, not competitive.