Skills authoring + multi-provider Behavioral Eval (L1A)
Skills workbench: author SKILL.md bundles, lint and critique on six dimensions, and run a Behavioral Eval that scores activation across Claude, GPT, and Gemini.
You can now author Skills in Prompt Assay. Agent Skills are the multi-file capability bundles from the agentskills.io spec (a SKILL.md plus optional scripts/ and references/) that Anthropic, OpenAI Codex, Cursor, VS Code, GitHub Copilot, and Gemini CLI now all consume. Prompt Assay gives you the authoring surface, the linter, the critique, the multi-provider Behavioral Eval, and a public Skill Report you can post in a README.
The new Skills entry sits next to Prompts in the sidebar as a peer artifact type. MCP joins as a third peer next.
Six-dimension Skills critique
Same scorecard pattern as the prompt-side Critique, with skill-specific definitions:
- D1 Discovery Fidelity: will the agent reliably find this skill when the user's request matches its intent?
- D2 Instruction Quality: clear, imperative, actionable?
- D3 Example Coverage: enough worked examples for the most common invocation paths?
- D4 Cross-Provider Portability: will Claude, GPT, and Gemini all read it the same way?
- D5 Token Efficiency: reasonable bundle size, no redundant context?
- D6 Security & Safety Posture: secrets handled, destructive operations gated?
The skill-side Critique honors the same Workbench Model, applies the same threshold model (≤7 MUST, 7-7.5 SHOULD, ≥8 MAY for Critique; <8.5 MUST for Improve), and streams over the same SSE pipeline as prompts.
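For concreteness, one reading of those bands as a tiny TypeScript sketch; the changelog pins the edges at 7, 7.5, and 8, so treating the unstated 7.5-8 range as SHOULD (and the function names) are assumptions, not product code:

```ts
// One reading of the threshold bands. The 7.5-8 range isn't pinned down
// above, so mapping it to SHOULD is an assumption of this sketch.
type Gate = "MUST" | "SHOULD" | "MAY";

const critiqueGate = (score: number): Gate =>
  score <= 7 ? "MUST" : score >= 8 ? "MAY" : "SHOULD";

// Improve uses a single stricter bar; the MAY fallback is also assumed.
const improveGate = (score: number): Gate =>
  score < 8.5 ? "MUST" : "MAY";
```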
Multi-provider Behavioral Eval
The load-bearing differentiator. Pick up to 5 BYOK models, author trigger probes (positive cases · requests where the skill SHOULD activate) and non-trigger probes (negative cases · requests where it should stay dormant), then run. Caps are uniform across every paid tier (5 models × 10 trigger × 6 non-trigger probes per run); the real cost lever is the same monthly AI-call cap that gates every other action (Free 250, Solo and above unlimited on your own keys).
The grid streams every (probe × model) cell live. An inner LLM-judge call per cell returns an activated: true|false verdict plus a 0-10 adherence score. When all cells finish, three aggregate scores are computed (a minimal sketch follows the list):
- Discovery Fidelity: % of probes where activation matched expectation (catches both false-negative and false-positive triggering)
- Instruction Adherence: average adherence across activated trigger probes
- Cross-Provider Consistency: 10 minus the standard deviation of adherence across providers
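Here is how those three numbers could fall out of the per-cell results; the Cell shape, field names, and the per-provider grouping are illustrative assumptions, not the actual report schema:

```ts
// Illustrative cell shape; the real schema may differ.
type Cell = {
  model: string;                  // provider/model id
  kind: "trigger" | "nonTrigger"; // which probe list the cell came from
  activated: boolean;             // LLM-judge activation verdict
  adherence: number;              // 0-10 adherence score from the judge
};

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / (xs.length || 1);

function aggregate(cells: Cell[]) {
  // Discovery Fidelity: % of cells where activation matched expectation
  // (trigger probes should activate, non-trigger probes should stay dormant).
  const matched = cells.filter((c) => c.activated === (c.kind === "trigger"));
  const discoveryFidelity = (100 * matched.length) / cells.length;

  // Instruction Adherence: mean adherence over activated trigger cells.
  const hits = cells.filter((c) => c.kind === "trigger" && c.activated);
  const instructionAdherence = mean(hits.map((c) => c.adherence));

  // Cross-Provider Consistency: 10 minus the standard deviation of
  // per-provider mean adherence.
  const byModel = new Map<string, number[]>();
  for (const c of hits) {
    const xs = byModel.get(c.model) ?? [];
    xs.push(c.adherence);
    byModel.set(c.model, xs);
  }
  const perProvider = [...byModel.values()].map(mean);
  const mu = mean(perProvider);
  const sd = Math.sqrt(mean(perProvider.map((x) => (x - mu) ** 2)));

  return {
    discoveryFidelity,
    instructionAdherence,
    crossProviderConsistency: 10 - sd,
  };
}
```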
Tier caps prevent runaway costs at the upper bound. Pre-run cost estimate is inline; runs over $5 require a confirmation tap. Probe scripts are NOT executed; Prompt Assay does not operate a sandbox.
16-rule Skills linter, six security rules
The Skills linter ships with a six-rule security tier prefixed security-skill-* (Shield glyph in the lint panel, no autofix; fixes route through the AI Fix-Lint flow). The tier inherits the same X2/R4 invariants the prompt-side security rules carry, including the contractual no-secret-echo rule, and its rules cover hardcoded secrets in body or scripts, untrusted-fetch URLs, curl ... | bash install patterns, undocumented shell scripts, and missing destructive-action gates.
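To make the shape of those rules concrete, a hedged sketch of what the curl ... | bash check might look like. Only the security-skill-* prefix comes from this release; the rule-id suffix, regex, and Finding shape are illustrative:

```ts
// Illustrative finding shape; the real lint-panel model is internal.
type Finding = { ruleId: string; line: number; excerpt: string };

// Flags pipe-to-shell install patterns like `curl https://x.sh | bash`.
const CURL_PIPE_SHELL = /\bcurl\b[^|\n]*\|\s*(?:ba|z)?sh\b/;

function lintCurlPipe(source: string): Finding[] {
  return source.split("\n").flatMap((text, i) =>
    CURL_PIPE_SHELL.test(text)
      ? [{ ruleId: "security-skill-curl-pipe", line: i + 1, excerpt: text.trim() }]
      : []
  );
}
```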
Skill Reports + Shields.io badge
Save a Behavioral Eval run as a public Skill Report at /share/skill-report/<id>. Same conservative defaults as Critique and Compare shares: noindex by default, body NOT published until the original author opts in, owner-revocable from /settings/shares. Per-share OG image renders for X and LinkedIn unfurls.
The same artifact backs a Shields.io badge you can drop in your skill's README.
Color band: copper at ≥8.0 (production-ready), yellow at 6.0-7.9, red below 6.0. Revoking the share kills the badge on the next request.
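As a one-liner, the band maps like this (thresholds from above; the function itself is just a sketch):

```ts
// Copper = production-ready (≥8.0), yellow = 6.0-7.9, red = below 6.0.
const badgeColor = (score: number): "copper" | "yellow" | "red" =>
  score >= 8.0 ? "copper" : score >= 6.0 ? "yellow" : "red";
```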
Importers (paste, drop, or browse)
Three formats accepted from the Import button on the Skills list page: canonical SKILL.md (with frontmatter), OpenAI Custom GPT JSON export, and Gemini Gem JSON export. The skill's source_format is recorded so the D4 Cross-Provider Portability score can flag provider-specific syntax that won't translate cleanly.
The dialog accepts paste, file picker, or drag-drop. Drop a real Anthropic SKILL.md zip and the bundle is extracted client-side via jszip; the first <name>/SKILL.md entry becomes the import payload (any additional scripts/ and references/ files in the zip still have to be added by paste for now). Plain SKILL.md and JSON files load straight in. Uploads are capped at 5 MB.
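A minimal sketch of that client-side path, assuming the drop handler hands over the zip's ArrayBuffer; error handling and the paste-only scripts/ and references/ entries are omitted:

```ts
import JSZip from "jszip";

// Return the first <name>/SKILL.md entry's text as the import payload,
// or null when the zip has no such entry.
async function extractSkillMd(buf: ArrayBuffer): Promise<string | null> {
  const zip = await JSZip.loadAsync(buf);
  const entry = Object.keys(zip.files).find((p) =>
    /^[^/]+\/SKILL\.md$/.test(p)
  );
  return entry ? zip.file(entry)!.async("string") : null;
}
```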
Frontmatter tags: arrays round-trip · ingested on import, re-emitted on export · so a skill moved between PA workspaces preserves its tag list.
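For illustration, the round-trip looks roughly like this with gray-matter; whether the importer actually uses that library is an assumption:

```ts
import matter from "gray-matter";

// Parse the tags array out of the frontmatter on import, then re-emit it
// unchanged on export so the list survives a workspace move.
function roundTripTags(raw: string): string {
  const { data, content } = matter(raw);
  const tags: string[] = Array.isArray(data.tags) ? data.tags : [];
  return matter.stringify(content, { ...data, tags });
}
```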
Tags
Skills now share the workspace tag namespace with Prompts. Add tags from the SkillMeta sidebar's chip editor; tag chips render on each SkillCard; the /skills list page has a tag-filter dropdown next to the folder filter. The same tag list, the same per-tag color palette as /prompts. Migration 055 adds the skill_tags junction; deleting a workspace tag cascades through both prompt_tags and skill_tags so the chip disappears everywhere it was used.
Eval-pass badge
When the most-recent Behavioral Eval on a skill comes back green (status complete AND all three aggregate scores ≥ 7.5), an EVAL OK chip appears on the skill's card and a "Most recent eval · Passing" status row appears in SkillMeta. Hover the chip for the actual score line. Skills below the bar or with no eval yet stay un-badged.
Public REST API + SDK
Nine new endpoints under /api/v1, plus two new SDK namespaces:
- GET /api/v1/skills · paginated list with folder, source-format, and archived filters
- POST /api/v1/skills · create with optional initial SKILL.md body
- GET /api/v1/skills/{id} · fetch with the current bundle (SKILL.md + scripts + references)
- PATCH /api/v1/skills/{id} · rename / move / archive / favorite
- DELETE /api/v1/skills/{id} · hard delete (cascades versions, files, eval reports)
- GET /api/v1/skills/{id}/versions · immutable version history (metadata only)
- GET /api/v1/skills/{id}/versions/{version} · byte-exact version snapshot with files
- POST /api/v1/skills/{id}/evaluate · kick off a Behavioral Eval and return the completed report (synchronous; 5-minute ceiling matches the in-product surface)
- GET /api/v1/skill-evals/{id} · read a completed report's aggregates plus every (probe × model) cell
The TypeScript SDK gets two new namespaces and a worked example for each:
```ts
import { Assay } from "@promptassay/sdk";

const assay = new Assay({ apiKey: process.env.PROMPTASSAY_API_KEY });

// List + create
await assay.skills.list({ per_page: 25 });
const { data: skill } = await assay.skills.create({
  name: "pdf-extractor",
  description: "Extracts text from PDF documents the user uploads.",
});

// Pin a version + run evals from CI
const { data: versions } = await assay.skills.listVersions(skill.id);
const { data: report } = await assay.skills.evaluate(skill.id, {
  versionId: versions[0].id,
  versionNumber: versions[0].version_number,
  models: ["claude-opus-4-7", "gpt-5-1", "gemini-2.5-pro"],
  triggerProbes: [{ id: "p1", prompt: "Extract text from this PDF" }],
  nonTriggerProbes: [{ id: "n1", prompt: "Write a haiku about elephants" }],
});

// Or fetch any prior report by id
await assay.skillEvals.get(report.id);
```

A GitHub Action (promptassay/skill-critique@v1) that calls evaluate from CI and posts the report as a PR comment ships immediately after this release.
Quietly underneath
- Schema: seven new tables (skills, skill_versions, skill_files, skill_eval_reports, skill_eval_results, shareable_skill_reports, skill_tags) across migrations 050-052 + 055. The skill_files.path column is whitelisted via DB CHECK to SKILL.md, scripts/*.{py,js,ts,sh}, or references/*.md (defense in depth against path traversal; mirrored as a sketch after this list). skill_tags is the Skills↔Tags junction; it reuses the existing workspace tags table, so the same tag rows back both Prompts and Skills.
- The generic increment_share_view(p_kind, p_id) RPC was extended with the skill-report branch; no parallel RPC was added.
- The Critique, Improve, Rewrite, and ScoreRadar components now accept labels + variant props. The same components serve both the prompt and skill surfaces; no parallel files ship.
- Cron retention adds a 90-day prune for skill_eval_results (the high-volume per-cell table). Parent skill_eval_reports rows are retained forever to back the public share artifact.
- The GDPR data-export ZIP now includes skills.json, skill_versions.json, skill_files.json, skill_eval_reports.json, skill_eval_results.json, and share_skill_reports.json.
- Behavioral Eval scripts are mentioned to the model as available but never executed. Prompt Assay does not operate a sandbox.
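As flagged in the schema bullet, the path whitelist mirrored as a TypeScript predicate for illustration; the authoritative check is the DB CHECK constraint, and this sketch assumes scripts/ entries cannot nest into subdirectories:

```ts
// Mirrors the skill_files.path CHECK: SKILL.md, scripts/*.{py,js,ts,sh},
// or references/*.md (no traversal segments, no nesting).
const SKILL_PATH =
  /^(?:SKILL\.md|scripts\/[^/]+\.(?:py|js|ts|sh)|references\/[^/]+\.md)$/;

const isValidSkillPath = (p: string): boolean => SKILL_PATH.test(p);
```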
What's next
L1A.2 (the GitHub Action promptassay/skill-critique@v1 for posting Skill Reports as PR comments) ships in the next batch. After that, L1B (MCP server trust + behavioral-quality workbench) joins Skills under the same architectural pattern.
Related entries
Demo mode, cost drill-down, the GPT-5 lineage, and a much sharper AI pair
A big week of ships: free demo runs for new accounts, per-run cost receipts, GPT-5 and Gemini 2.5 Flash-Lite, shareable Critique and Compare, plus cross-model testing in the Playground.
Vote on what we ship next
A two-board feedback system where you can file bugs, request features, and vote on what we ship next. Plus a brainstorm timeout fix you may have hit on long Opus runs.
Schedule eval batches without watching the timer
The new Schedule batch action queues eval-suite runs through Anthropic's Batches API so you can walk away. Results post back when the batch returns.