WEEK OF MAY 04 · Feature

Skills authoring + multi-provider Behavioral Eval (L1A)

Skills workbench: author SKILL.md bundles, lint and critique on six dimensions, and run a Behavioral Eval that scores activation across Claude, GPT, and Gemini.

You can now author Skills in Prompt Assay. Agent Skills are the multi-file capability bundles from the agentskills.io spec (a SKILL.md plus optional scripts/ and references/) that Anthropic, OpenAI Codex, Cursor, VS Code, GitHub Copilot, and Gemini CLI now all consume. Prompt Assay gives you the authoring surface, the linter, the critique, the multi-provider Behavioral Eval, and a public Skill Report you can post in a README.

The new Skills entry sits next to Prompts in the sidebar as a peer artifact type. MCP joins as a third peer next.

Six-dimension Skills critique

Same scorecard pattern as the prompt-side Critique, with skill-specific definitions:

  • D1 Discovery Fidelity: will the agent reliably find this skill when the user's request matches its intent?
  • D2 Instruction Quality: clear, imperative, actionable?
  • D3 Example Coverage: enough worked examples for the most common invocation paths?
  • D4 Cross-Provider Portability: will Claude, GPT, and Gemini all read it the same way?
  • D5 Token Efficiency: reasonable bundle size, no redundant context?
  • D6 Security & Safety Posture: secrets handled, destructive operations gated?

Same Workbench Model honoring, same threshold model (≤7 MUST, 7–7.5 SHOULD, ≥8 MAY for Critique; <8.5 MUST for Improve), same SSE streaming as prompts.
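As a concrete reading of that threshold model, a minimal sketch — the exact handling at the band edges is my assumption, not documented behavior:

```typescript
// Hypothetical mapping of a critique score to its severity band.
// Boundary handling between the bands is an assumption.
type Severity = "MUST" | "SHOULD" | "MAY";

function critiqueSeverity(score: number): Severity {
  if (score <= 7) return "MUST";  // at or below 7: must address
  if (score < 8) return "SHOULD"; // the 7-7.5 band (treated here as up to 8)
  return "MAY";                   // 8 and up: optional polish
}

// Improve gate: anything under 8.5 is a MUST per the Improve threshold.
const needsImprove = (score: number): boolean => score < 8.5;
```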

Multi-provider Behavioral Eval

The load-bearing differentiator. Pick up to 5 BYOK models, author trigger probes (positive cases · requests where the skill SHOULD activate) and non-trigger probes (negative cases · requests where it should stay dormant), then run. Caps are uniform across every paid tier (5 models × 10 trigger × 6 non-trigger probes per run); the real cost lever is the same monthly AI-call cap that gates every other action (Free 250, Solo and above unlimited on your own keys).

The grid streams every (probe × model) cell live. An inner LLM-judge call per cell returns activated: true|false plus a 0-10 adherence score. When all cells finish, three aggregate scores are computed:

  • Discovery Fidelity: % of probes where activation matched expectation (catches both false-negative and false-positive triggering)
  • Instruction Adherence: average adherence across activated trigger probes
  • Cross-Provider Consistency: 10 minus the standard deviation of adherence across providers
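The three aggregates above can be sketched directly from the per-cell judge results; the cell shape and field names here are assumptions, not the actual API:

```typescript
// Hypothetical cell shape; the real field names may differ.
interface EvalCell {
  provider: string;
  isTriggerProbe: boolean; // true: skill SHOULD activate; false: stay dormant
  activated: boolean;      // LLM-judge verdict for this (probe x model) cell
  adherence: number;       // 0-10 judge score, meaningful when activated
}

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
const stdDev = (xs: number[]) => {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
};

// Discovery Fidelity: % of cells where activation matched expectation.
function discoveryFidelity(cells: EvalCell[]): number {
  const matched = cells.filter((c) => c.activated === c.isTriggerProbe).length;
  return (matched / cells.length) * 100;
}

// Instruction Adherence: average adherence across activated trigger probes.
function instructionAdherence(cells: EvalCell[]): number {
  const scored = cells.filter((c) => c.isTriggerProbe && c.activated);
  return mean(scored.map((c) => c.adherence));
}

// Cross-Provider Consistency: 10 minus the std dev of per-provider adherence.
function crossProviderConsistency(cells: EvalCell[]): number {
  const byProvider = new Map<string, number[]>();
  for (const c of cells) {
    if (!c.isTriggerProbe || !c.activated) continue;
    const list = byProvider.get(c.provider) ?? [];
    list.push(c.adherence);
    byProvider.set(c.provider, list);
  }
  const perProvider = [...byProvider.values()].map(mean);
  return 10 - stdDev(perProvider);
}
```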

Tier caps prevent runaway costs at the upper bound. Pre-run cost estimate is inline; runs over $5 require a confirmation tap. Probe scripts are NOT executed; Prompt Assay does not operate a sandbox.
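The cap arithmetic and confirmation gate reduce to a small estimator. Only the 5 × (10 + 6) cell geometry and the $5 confirmation threshold come from the text above; the per-cell dollar figures are invented placeholders:

```typescript
// Cell geometry comes from the per-run caps; the per-cell dollar
// figures below are made-up placeholders, not real Prompt Assay pricing.
interface RunPlan {
  models: number;           // up to 5
  triggerProbes: number;    // up to 10
  nonTriggerProbes: number; // up to 6
}

const COMPLETION_COST_USD = 0.02; // hypothetical per-cell completion cost
const JUDGE_COST_USD = 0.01;      // hypothetical inner LLM-judge call per cell
const CONFIRM_THRESHOLD_USD = 5;  // runs over $5 require a confirmation tap

function estimateRun(plan: RunPlan) {
  const cells = plan.models * (plan.triggerProbes + plan.nonTriggerProbes);
  const estimatedCostUsd = cells * (COMPLETION_COST_USD + JUDGE_COST_USD);
  return {
    cells,
    estimatedCostUsd,
    needsConfirmation: estimatedCostUsd > CONFIRM_THRESHOLD_USD,
  };
}
```

At the caps (5 models, 16 probes) a run is 80 cells, so even the largest run stays bounded.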

16-rule Skills linter, six security rules

The Skills linter ships with a six-rule security tier prefixed security-skill-* (Shield glyph in the lint panel, no autofix; fixes route through the AI Fix-Lint flow), inheriting the same X2/R4 invariants the prompt-side security rules carry, including the contractual no-secret-echo rule. Rules cover: hardcoded secrets in body or scripts, untrusted-fetch URLs, curl ... | bash install patterns, undocumented shell scripts, and missing destructive-action gates.
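In the spirit of that security tier, a hedged sketch of two such checks — the real rule IDs, patterns, and severities are not published here:

```typescript
// Illustrative checks modeled on the security-skill-* tier; rule IDs
// and regex patterns are assumptions, not the shipped implementation.
interface LintFinding {
  rule: string;
  line: number;
  message: string;
}

const CHECKS: Array<{ rule: string; pattern: RegExp; message: string }> = [
  {
    rule: "security-skill-pipe-to-shell",
    pattern: /curl[^\n|]*\|\s*(ba)?sh/,
    message: "Piping a download straight into a shell is an unreviewable install path.",
  },
  {
    rule: "security-skill-hardcoded-secret",
    pattern: /(api[_-]?key|secret|token)\s*[:=]\s*["'][A-Za-z0-9_\-]{16,}["']/i,
    message: "Possible hardcoded credential in the skill bundle.",
  },
];

function lintSkillBody(body: string): LintFinding[] {
  const findings: LintFinding[] = [];
  body.split("\n").forEach((line, i) => {
    for (const check of CHECKS) {
      if (check.pattern.test(line)) {
        findings.push({ rule: check.rule, line: i + 1, message: check.message });
      }
    }
  });
  return findings;
}
```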

Skill Reports + Shields.io badge

Save a Behavioral Eval run as a public Skill Report at /share/skill-report/<id>. Same conservative defaults as Critique and Compare shares: noindex by default, body NOT published until the original author opts in, owner-revocable from /settings/shares. Per-share OG image renders for X and LinkedIn unfurls.

The same artifact backs a Shields.io badge. Drop this in your skill's README:

![Critiqued in PA](https://img.shields.io/endpoint?url=https%3A%2F%2Fpromptassay.ai%2Fapi%2Fbadge%2Fskill%2F<your-share-id>)

Color band: copper at ≥8.0 (production-ready), yellow at 6.0-7.9, red below 6.0. Revoking the share kills the badge on the next request.
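The color band maps onto a tiny pure function; the copper hex value here is a guess:

```typescript
// Score-to-color mapping for the badge band described above.
// The copper hex is an assumed value; Shields.io accepts named colors and hex.
function badgeColor(score: number): string {
  if (score >= 8.0) return "#B87333"; // copper: production-ready
  if (score >= 6.0) return "yellow";
  return "red";
}
```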

Importers (paste, drop, or browse)

Three formats accepted from the Import button on the Skills list page: canonical SKILL.md (with frontmatter), OpenAI Custom GPT JSON export, and Gemini Gem JSON export. The skill's source_format is recorded so the D4 Cross-Provider Portability score can flag provider-specific syntax that won't translate cleanly.

The dialog accepts paste, file picker, or drag-drop. Drop a real Anthropic SKILL.md zip and the bundle is extracted client-side via jszip; the first <name>/SKILL.md entry becomes the import payload (additional scripts/ and references/ files in the zip are paste-based for now). Plain SKILL.md and JSON files load straight in. 5 MB upload cap.
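The entry-selection step can be sketched as a pure function over the zip's path list, after something like jszip's JSZip.loadAsync has opened the bundle client-side; taking the lexicographically first match when several exist is my assumption:

```typescript
// Pick the SKILL.md entry from a dropped zip's path list. Matches exactly
// one directory deep, per the <name>/SKILL.md convention; choosing the
// lexicographically first candidate is an assumption, not documented behavior.
function firstSkillMdEntry(paths: string[]): string | undefined {
  return paths
    .filter((p) => /^[^/]+\/SKILL\.md$/.test(p))
    .sort()[0];
}
```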

Frontmatter tags: arrays round-trip · ingested on import, re-emitted on export · so a skill moved between PA workspaces preserves its tag list.

Tags

Skills now share the workspace tag namespace with Prompts. Add tags from the SkillMeta sidebar's chip editor; tag chips render on each SkillCard; the /skills list page has a tag-filter dropdown next to the folder filter. The same tag list, the same per-tag color palette as /prompts. Migration 055 adds the skill_tags junction; deleting a workspace tag cascades through both prompt_tags and skill_tags so the chip disappears everywhere it was used.

Eval-pass badge

When the most-recent Behavioral Eval on a skill comes back green (status complete AND all three aggregate scores ≥ 7.5), an EVAL OK chip appears on the skill's card and a "Most recent eval · Passing" status row appears in SkillMeta. Hover the chip for the actual score line. Skills below the bar or with no eval yet stay un-badged.
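The badge gate reduces to a small predicate; the field names, and the assumption that all three aggregates share a 0-10 scale, are mine:

```typescript
// Hypothetical report shape for the EVAL OK gate: the most-recent eval
// must be complete with all three aggregates at or above 7.5.
interface EvalAggregates {
  discoveryFidelity: number;
  instructionAdherence: number;
  crossProviderConsistency: number;
}

interface EvalReport {
  status: "running" | "complete" | "failed";
  scores: EvalAggregates;
}

function evalPassing(report: EvalReport | undefined): boolean {
  if (!report || report.status !== "complete") return false; // no eval yet, or not green
  const { discoveryFidelity, instructionAdherence, crossProviderConsistency } =
    report.scores;
  return [discoveryFidelity, instructionAdherence, crossProviderConsistency].every(
    (s) => s >= 7.5
  );
}
```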

Public REST API + SDK

Nine new endpoints under /api/v1, plus two new SDK namespaces:

  • GET /api/v1/skills · paginated list with folder, source-format, and archived filters
  • POST /api/v1/skills · create with optional initial SKILL.md body
  • GET /api/v1/skills/{id} · fetch with the current bundle (SKILL.md + scripts + references)
  • PATCH /api/v1/skills/{id} · rename / move / archive / favorite
  • DELETE /api/v1/skills/{id} · hard delete (cascades versions, files, eval reports)
  • GET /api/v1/skills/{id}/versions · immutable version history (metadata only)
  • GET /api/v1/skills/{id}/versions/{version} · byte-exact version snapshot with files
  • POST /api/v1/skills/{id}/evaluate · kick off a Behavioral Eval and return the completed report (synchronous; 5-minute ceiling matches the in-product surface)
  • GET /api/v1/skill-evals/{id} · read a completed report's aggregates plus every (probe × model) cell

The TypeScript SDK gets two new namespaces and a worked example for each:

import { Assay } from "@promptassay/sdk";
const assay = new Assay({ apiKey: process.env.PROMPTASSAY_API_KEY });
 
// List + create
await assay.skills.list({ per_page: 25 });
const { data: skill } = await assay.skills.create({
  name: "pdf-extractor",
  description: "Extracts text from PDF documents the user uploads.",
});
 
// Pin a version + run evals from CI
const { data: versions } = await assay.skills.listVersions(skill.id);
const { data: report } = await assay.skills.evaluate(skill.id, {
  versionId: versions[0].id,
  versionNumber: versions[0].version_number,
  models: ["claude-opus-4-7", "gpt-5-1", "gemini-2.5-pro"],
  triggerProbes: [{ id: "p1", prompt: "Extract text from this PDF" }],
  nonTriggerProbes: [{ id: "n1", prompt: "Write a haiku about elephants" }],
});
 
// Or fetch any prior report by id
await assay.skillEvals.get(report.id);

A GitHub Action (promptassay/skill-critique@v1) that calls evaluate from CI and posts the report as a PR comment ships immediately after this release.

Quietly underneath
  • Schema: seven new tables (skills, skill_versions, skill_files, skill_eval_reports, skill_eval_results, shareable_skill_reports, skill_tags) across migrations 050-052 + 055. The skill_files.path column is whitelisted via DB CHECK to SKILL.md, scripts/*.{py,js,ts,sh}, or references/*.md (defense in depth against path traversal). skill_tags is the Skills↔Tags junction; it reuses the existing workspace tags table, so the same tag rows back both Prompts and Skills.
  • The generic increment_share_view(p_kind, p_id) RPC was extended with the skill-report branch; no parallel RPC was added.
  • The Critique, Improve, Rewrite, and ScoreRadar components now accept labels + variant props. The same components serve both the prompt and skill surfaces; no parallel files ship.
  • Cron retention adds a 90-day prune for skill_eval_results (the high-volume per-cell table). Parent skill_eval_reports rows are retained forever to back the public share artifact.
  • The GDPR data-export ZIP now includes skills.json, skill_versions.json, skill_files.json, skill_eval_reports.json, skill_eval_results.json, and share_skill_reports.json.
  • Behavioral Eval scripts are mentioned to the model as available but never executed. Prompt Assay does not operate a sandbox.
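The skill_files.path whitelist reads naturally as a client-side mirror of that DB CHECK; since the SQL expression itself isn't shown, these regexes are an approximation:

```typescript
// Approximate client-side mirror of the skill_files.path DB CHECK:
// SKILL.md, scripts/*.{py,js,ts,sh}, or references/*.md, one level deep.
// The actual SQL expression is not shown, so this is an assumption.
const ALLOWED_PATHS: RegExp[] = [
  /^SKILL\.md$/,
  /^scripts\/[^/]+\.(py|js|ts|sh)$/,
  /^references\/[^/]+\.md$/,
];

const isAllowedSkillFilePath = (path: string): boolean =>
  ALLOWED_PATHS.some((re) => re.test(path));
```

Because each pattern is anchored and disallows extra path separators, traversal strings like scripts/../SKILL.md fail every branch.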

What's next

L1A.2 (the GitHub Action promptassay/skill-critique@v1 for posting Skill Reports as PR comments) ships in the next batch. After that, L1B (MCP server trust + behavioral-quality workbench) joins Skills under the same architectural pattern.

promptassay.ai · Changelog · Week of May 04, 2026 · By Jon Lasley