Skills workbench launches · Behavioral Eval · evaluations open on every tier
Author Agent Skills in Prompt Assay, score them with a six-dimension Critique, and run a Behavioral Eval across Claude, GPT, and Gemini. Evaluation suites now open on every tier.
A heavy week. The headline ship is Skills · the multi-file capability bundles from the agentskills.io spec, now a first-class artifact type in Prompt Assay next to Prompts. Three companion ships round out the week: Improve learned to extract sections of SKILL.md into separate files, you can download a skill as a zip or pull it via CLI, and evaluation suites + Behavioral Eval are now open on every tier.
In this entry
- Skills workbench + multi-provider Behavioral Eval · the headline launch
- Improve can extract sections of `SKILL.md` into new files
- Skill bundle export · Download button + `assay pull` CLI
- Evaluation suites and Behavioral Eval open on every tier
1 · Skills workbench + multi-provider Behavioral Eval
You can now author Agent Skills in Prompt Assay. The new Skills entry sits next to Prompts in the sidebar as a peer artifact type. A skill is a SKILL.md plus optional scripts/ and references/ files · the same multi-file format Anthropic, OpenAI Codex, Cursor, VS Code, GitHub Copilot, and Gemini CLI all consume.
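For orientation, a minimal bundle might look like this · the YAML frontmatter fields shown (`name`, `description`) are the ones the spec centers on, and the body is purely illustrative:

```markdown
---
name: pdf-extractor
description: Extracts text from PDF documents the user uploads.
---

# PDF Extractor

When the user uploads a PDF and asks for its text, run
`scripts/extract.py` and return the plain-text output. See
[references/edge-cases.md](references/edge-cases.md) for scanned-image handling.
```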
Six-dimension Skills Critique
Same scorecard shape as the prompt-side Critique, with skill-specific definitions (a hypothetical payload shape is sketched after the list):
- D1 Discovery Fidelity · will the agent reliably find this skill when the user's request matches its intent?
- D2 Instruction Quality · clear, imperative, actionable?
- D3 Example Coverage · enough worked examples for the most common invocation paths?
- D4 Cross-Provider Portability · will Claude, GPT, and Gemini all read it the same way?
- D5 Token Efficiency · reasonable bundle size, no redundant context?
- D6 Security & Safety Posture · secrets handled, destructive operations gated?
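If it helps to picture the result, here is a hypothetical TypeScript shape for the scorecard · the field names are illustrative assumptions, not the documented API contract:

```typescript
// Hypothetical shape of a Skills Critique result · field names are
// illustrative assumptions, not the documented API contract.
interface SkillCritique {
  overall: number; // 0-10, same scale as the prompt-side Critique
  dimensions: {
    id: "D1" | "D2" | "D3" | "D4" | "D5" | "D6";
    name: string;       // e.g. "Discovery Fidelity"
    score: number;      // 0-10
    findings: string[]; // per-dimension remarks
  }[];
}
```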
Multi-provider Behavioral Eval
The load-bearing differentiator. Pick up to 5 BYOK models, author trigger probes (requests where the skill should activate) and non-trigger probes (requests where it should stay dormant), then run. The grid streams every (probe × model) cell live and reports three aggregate scores when finished (the first is sketched in code after the list):
- Discovery Fidelity · % of probes where activation matched expectation (catches false-negatives and false-positives)
- Instruction Adherence · average adherence across activated trigger probes
- Cross-Provider Consistency · how tightly Claude, GPT, and Gemini agree
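As a sketch of how the first aggregate falls out of the grid · the cell shape and field names are assumptions for illustration, not Prompt Assay internals:

```typescript
// Minimal sketch of the Discovery Fidelity aggregate. The cell shape is
// an assumption for illustration, not Prompt Assay's internal type.
interface EvalCell {
  probeId: string;
  model: string;
  expectedActivation: boolean; // true for trigger probes, false for non-trigger
  observedActivation: boolean; // did the skill actually fire?
}

function discoveryFidelity(cells: EvalCell[]): number {
  if (cells.length === 0) return 0;
  // % of (probe × model) cells where activation matched expectation.
  // A dormant trigger probe (false-negative) and an activated
  // non-trigger probe (false-positive) both count against the score.
  const matched = cells.filter(
    (c) => c.expectedActivation === c.observedActivation
  ).length;
  return (matched / cells.length) * 100;
}
```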
Per-run caps are uniform across every tier (5 models × 10 trigger × 6 non-trigger probes). An inline pre-run cost estimate is always visible; runs that would exceed $5 require a confirmation tap. Probe scripts are NOT executed · Prompt Assay does not operate a sandbox.
Skills linter with a security tier
The linter ships with a security tier (Shield glyph in the lint panel, routed through the AI Fix-Lint flow rather than autofixed). Rules cover hardcoded secrets in body or scripts, untrusted-fetch URLs, curl ... | bash install patterns, undocumented shell scripts, and missing destructive-action gates. Same no-secret-echo invariant the prompt-side security tier carries.
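For flavor, here is a simplified sketch of two of those rules · the patterns are deliberately crude illustrations, not the shipped rule set:

```typescript
// Simplified illustrations of two security-tier rules · the shipped
// linter is more thorough than these regexes.
const SECURITY_RULES = [
  {
    id: "sec/curl-pipe-bash",
    // flags `curl ... | bash`-style install patterns in scripts
    pattern: /curl\s+[^\n|]*\|\s*(ba|z)?sh\b/,
    message: "Piping a remote fetch straight into a shell is unauditable.",
  },
  {
    id: "sec/hardcoded-secret",
    // crude sniff for key-like assignments in body or scripts
    pattern: /(api[_-]?key|secret|token)\s*[:=]\s*["'][A-Za-z0-9_-]{16,}["']/i,
    message: "Possible hardcoded secret · move it to an environment variable.",
  },
];

const scan = (body: string) =>
  SECURITY_RULES.filter((rule) => rule.pattern.test(body));
```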
Skill Reports + Shields.io badge
Save a Behavioral Eval run as a public Skill Report at /share/skill-report/<id>. Same conservative defaults as Critique and Compare shares: noindex by default, body NOT published until the original author opts in, owner-revocable at any time. Per-share OG image renders for X and LinkedIn unfurls.
The same artifact backs a Shields.io README badge:
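The README line is roughly this shape · the endpoint URL below is an illustrative assumption (copy the exact snippet from the share page):

```markdown
[![Skill Report](https://img.shields.io/endpoint?url=https://promptassay.example/api/v1/skill-reports/<id>/badge)](https://promptassay.example/share/skill-report/<id>)
```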
Color band: copper at ≥8.0 (production-ready), yellow at 6.0-7.9, red below 6.0. Revoking the share kills the badge on the next request.
Importers · paste, drop, or browse
Three formats accepted from the Import button on the Skills list page: canonical SKILL.md (with frontmatter), OpenAI Custom GPT JSON export, and Gemini Gem JSON export. Drop a real Anthropic SKILL.md zip and the bundle is extracted client-side; plain SKILL.md and JSON files load straight in. The skill's source format is recorded so the D4 Cross-Provider Portability score can flag provider-specific syntax that won't translate cleanly. 5 MB upload cap.
Tags
Skills share the workspace tag namespace with Prompts. Add tags from the meta sidebar's chip editor; tag chips render on each card; the /skills list page has a tag-filter dropdown. Deleting a workspace tag removes it from every prompt and skill that used it.
Eval-pass badge
When the most-recent Behavioral Eval on a skill comes back green (all three aggregate scores ≥ 7.5), an EVAL OK chip appears on the skill's card and a "Most recent eval · Passing" row appears in its meta sidebar. Hover the chip for the actual score line.
Public REST API + SDK
The full Skills surface is on /api/v1 and in the TypeScript SDK from day one. List, create, fetch, update, delete, version history, byte-exact version snapshot, kick off a Behavioral Eval, and read a completed report · all Bearer-auth, same shape as the Prompts endpoints.
```typescript
import { Assay } from "@promptassay/sdk";

const assay = new Assay({ apiKey: process.env.PROMPTASSAY_API_KEY });

// List + create
await assay.skills.list({ per_page: 25 });
const { data: skill } = await assay.skills.create({
  name: "pdf-extractor",
  description: "Extracts text from PDF documents the user uploads.",
});

// Pin a version + run evals from CI
const { data: versions } = await assay.skills.listVersions(skill.id);
const { data: report } = await assay.skills.evaluate(skill.id, {
  versionId: versions[0].id,
  versionNumber: versions[0].version_number,
  models: ["claude-opus-4-7", "gpt-5-1", "gemini-2.5-pro"],
  triggerProbes: [{ id: "p1", prompt: "Extract text from this PDF" }],
  nonTriggerProbes: [{ id: "n1", prompt: "Write a haiku about elephants" }],
});

// Or fetch any prior report by id
await assay.skillEvals.get(report.id);
```

A GitHub Action that posts the Skill Report as a PR comment is queued next.
2 · Improve can extract sections of SKILL.md into new files
When the AI Pair tells you "this section belongs in references/auth.md," you used to copy the text, click Add file in the file tree, paste, then come back and prune SKILL.md by hand. Improve now does that whole sequence on a single Apply click.
Skill Improve suggestions in the new shape carry an "Extracts to references/..." or "Creates scripts/..." chip on the suggestion card. Click Apply and:
- The proposed path runs through the same whitelist the file tree's Add dialog uses · path traversal, leading slashes, length-over-80, and disallowed extensions all reject before anything mutates (those checks are sketched in code after this list).
- Bundle byte caps re-check. If the extraction would push you over (1 MiB per file, 4 MiB total), Apply refuses and tells you what to trim.
- The new file lands in your file tree at the proposed path. `SKILL.md` gets the pointer text (typically `See [references/auth.md](references/auth.md)`), and the editor flashes the spliced region.
- Save commits a new immutable version with the lean `SKILL.md` plus the new file. The next Critique scores the smaller bundle against D5 Token Efficiency directly.
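A minimal sketch of the whitelist checks named in the first bullet · the allowed-extension list is an assumption, not the shipped code:

```typescript
// Sketch of the Apply-time path checks · illustrative only. The
// allowed-extension list is an assumption, not the shipped whitelist.
const ALLOWED_EXTENSIONS = [".md", ".py", ".js", ".ts", ".sh"];

function rejectBundlePath(path: string): string | null {
  if (path.startsWith("/")) return "leading slash";
  if (path.split("/").includes("..")) return "path traversal";
  if (path.length > 80) return "longer than 80 characters";
  if (!ALLOWED_EXTENSIONS.some((ext) => path.endsWith(ext)))
    return "disallowed extension";
  return null; // null · the path is accepted
}
```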
When the AI authors a brand-new script, you confirm first
The riskiest shape is create-new on a .py / .js / .ts / .sh file the AI authored from scratch with no SKILL.md region anchoring it. Apply opens a confirmation modal showing the full script content, an "author-review recommended" warning, and a Cancel / Create script button pair. The linter scans every new file the next render, but linter findings are advisory · you are the gate.
This dialog only fires for scripts authored from scratch. Moving a region you already wrote skips it. So does creating a new references/ markdown file, since markdown is read-only context for the model.
Apply All ignores extract suggestions
Apply All in the Improve panel still works for in-place edits, but it skips extract suggestions deliberately. Each extraction creates a new file, and create-new scripts need a per-suggestion confirmation. Apply those one at a time so you can review what's being created.
Editor language modes
Small companion change: when you click into a .py, .js, .ts, or .md file in the bundle, the CodeMirror editor uses the matching language mode. Python keyword highlighting, JavaScript / TypeScript bracket matching, markdown header styling. Shell scripts and unknown extensions still get the plain editor.
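In CodeMirror 6 terms that mapping is roughly the following · a sketch assuming the standard language packages, not Prompt Assay's actual source:

```typescript
// Extension-to-language-mode mapping, assuming the standard CodeMirror 6
// language packages · a sketch, not Prompt Assay's actual source.
import { python } from "@codemirror/lang-python";
import { javascript } from "@codemirror/lang-javascript";
import { markdown } from "@codemirror/lang-markdown";
import type { Extension } from "@codemirror/state";

function languageFor(filename: string): Extension[] {
  if (filename.endsWith(".py")) return [python()];
  if (filename.endsWith(".js")) return [javascript()];
  if (filename.endsWith(".ts")) return [javascript({ typescript: true })];
  if (filename.endsWith(".md")) return [markdown()];
  return []; // shell scripts and unknown extensions · plain editor
}
```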
3 · Skill bundle export · Download button + assay pull CLI
A skill you author in Prompt Assay belongs in your repo · that's where Claude Code, your CI, and your teammates pick it up. Two new paths get a skill from PA into your repo cleanly.
Path 1 · Download button
A Download button now sits in the file tree's BUNDLE header. Click it, your browser downloads <skill-name>.zip, and you unzip it where the skill should live. The zip's directory layout matches the agentskills.io specification:
```
<skill-name>/
├── SKILL.md
├── references/...
└── scripts/...
```

Unzip into `~/.claude/skills/` and Claude Code finds the skill at the canonical path on next session start.
Path 2 · assay pull CLI
For CI, post-merge hooks, or any workflow where clicking a button isn't practical, the @prompt-assay/sdk package ships an assay binary with a pull subcommand:
```bash
# One-shot, no install required
export PROMPTASSAY_API_KEY=pa_live_...
npx @prompt-assay/sdk@latest assay pull <skill-uuid>

# Specify a target directory
npx @prompt-assay/sdk@latest assay pull <skill-uuid> --out ~/code/repo/.claude/skills

# Overwrite an existing skill folder
npx @prompt-assay/sdk@latest assay pull <skill-uuid> --force
```

The CLI authenticates with your Prompt Assay API key (org-bound, no GitHub scope), defaults the target to `./.claude/skills/<name>/`, and refuses to clobber an existing folder unless you pass `--force`. Drop it into a GitHub Action with a scheduled cron and the workflow refreshes your skills from PA on whatever cadence you want.
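A minimal workflow in that shape might look like this · the secret name, cadence, and skill id are assumptions to adapt, and the commit/push step is left to your own conventions:

```yaml
# Illustrative sketch of a scheduled refresh · secret name, cadence, and
# skill id are assumptions, not a shipped template. Add your own
# commit or PR step after the pull.
name: refresh-skills
on:
  schedule:
    - cron: "0 6 * * 1" # Mondays, 06:00 UTC
jobs:
  pull:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx @prompt-assay/sdk@latest assay pull <skill-uuid> --force
        env:
          PROMPTASSAY_API_KEY: ${{ secrets.PROMPTASSAY_API_KEY }}
```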
Why no GitHub OAuth integration
Deliberate. The 2026 ecosystem (Cursor, VS Code Copilot, Promptfoo, Braintrust, Anthropic's own Skills marketplace) treats your repo as source-of-truth and platform as preview. Holding write-scoped GitHub tokens for paying customers contradicts our BYOK-as-ethics posture (your keys, your bill, no middleman) and would make Prompt Assay a high-value phishing target · the September 2025 Salesloft/Drift breach hit 700+ orgs through stolen GitHub OAuth tokens. The CLI keeps every credential user-side. You authenticate against PA with your PA API key. The CLI writes locally. PA never receives a GitHub credential.
The export path uses standard zip-slip defenses: rejects path traversal, rejects absolute paths inside the archive, and verifies every extracted file resolves inside the target directory.
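The invariant is small enough to sketch · every archive entry must resolve strictly inside the extraction target (illustrative, not the shipped code):

```typescript
import { isAbsolute, resolve, sep } from "node:path";

// Sketch of the zip-slip invariant described above · every entry must
// resolve strictly inside the extraction target. Not the shipped code.
function safeEntryPath(targetDir: string, entryName: string): string {
  if (isAbsolute(entryName)) {
    throw new Error(`absolute path in archive: ${entryName}`);
  }
  const root = resolve(targetDir);
  const resolved = resolve(root, entryName);
  if (!resolved.startsWith(root + sep)) {
    throw new Error(`entry escapes target directory: ${entryName}`);
  }
  return resolved;
}
```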
4 · Evaluation suites and Behavioral Eval open on every tier
You can now author and run evaluation suites on every tier, including Free. Same for regression testing on every new version, and same for the multi-provider Behavioral Eval on Skills. The "New Test Suite" button stops being disabled. Gating creation behind Team was a friction point that hurt the developers we most want to convert without meaningfully protecting cost · BYOK already routes provider cost straight to your account, not ours. The 250-call monthly AI cap on Free is the natural cost lever; if you actually use the feature, you'll likely upgrade to Solo for unlimited calls.
What's open on Free
The full evaluation pipeline. Author a test suite for any prompt. Add weighted criteria. Run sync or via the Anthropic Batches API for the 50% Batches discount. Score with LLM-as-a-judge against your six-dimension rubric. The judge call honors your Workbench Model at every tier, same as before; the cost-warning banner still surfaces if you pick an expensive Workbench Model and the suite would land a surprise bill.
Regression testing follows the same posture. Run version N versus N+1 against your existing suite, get per-case verdicts (improved / unchanged / regressed / new / failed), and toggle auto-on-version so a regression sweep fires every time you accept an AI rewrite.
What changes for Behavioral Eval
The per-run cap on Behavioral Eval is now uniform across every tier: up to 5 models, up to 10 trigger probes, and up to 6 non-trigger probes per run. Cross-provider testing was the use case the tighter Free and Solo caps blocked, and that's the use case Behavioral Eval is for. The pre-run cost estimate and the user-grain rate limit stay in place as the real cost levers.
The Free monthly cap is still load-bearing. A max-config Behavioral Eval run with 5 models × 16 probes = 80 cells × 2 calls per cell = 160 LLM calls. That's 64% of a Free user's 250-call monthly budget in a single run. To keep that visible, the Behavioral Eval panel now shows an inline "Monthly usage · X of 250 calls" banner on Free workspaces during setup, with the projected total after the configured run so you can see whether you'll overrun before you click Run. Color escalates at 70% (warning) and 90% (error).
What's next
A GitHub Action that calls evaluate from CI and posts the Skill Report as a PR comment ships in the next batch. After that, MCP (Model Context Protocol) joins Skills as a third peer artifact in the sidebar under the same architectural pattern · author, lint, critique, evaluate, share.
Related entries
Demo mode, cost drill-down, the GPT-5 lineage, and a much sharper AI pair
A big week of ships: free demo runs for new accounts, per-run cost receipts, GPT-5 and Gemini 2.5 Flash-Lite, shareable Critique and Compare, plus cross-model testing in the Playground.
Vote on what we ship next
A two-board feedback system where you can file bugs, request features, and vote on what we ship next. Plus a brainstorm timeout fix you may have hit on long Opus runs.
Schedule eval batches without watching the timer
The new Schedule batch action queues eval-suite runs through Anthropic's Batches API so you can walk away. Results post back when the batch returns.