Ship prompts and Agent Skills that hold up in production.
The authoring workbench. Critique, version, and evaluate both on your keys, across Claude, GPT, and Gemini.
- 6 in-editor AI actions
- 6 critique dimensions
- 3 LLM providers
Find the leaks before they ship.
Every prompt scored on six dimensions: clarity, completeness, structure, technique usage, robustness, efficiency. Each with a written reason. The radar shows where the prompt leaks.
Try it yourself: three drafts of one task at different quality levels.
You are a customer support assistant. Help the customer. Be friendly and helpful. Always answer the question. Don't be rude.
- Technique Usage: 2/10
No XML tags, no few-shot, no chain-of-thought, no output format.
- Completeness: 3/10
No format, no boundaries, no examples, no input slot.
- Robustness: 3/10
No handling for missing, ambiguous, or hostile input.
- Clarity: 4/10
Role and goal are vague. "Help" is underspecified.
- Structure: 4/10
Flat list of imperatives. No sections or hierarchy.
- Efficiency: 4/10
Short, but vague enough that outputs will balloon.
Excerpts shown. Real prompts run longer. Inside the editor, run the critique any time.
Everyone moved upstream or downstream. We stayed where the craft lives.
Every other tool moved upstream to agents or downstream to traces. The middle, where prompts and Agent Skills get written, is empty. We sit there.
You open the editor. You write. The AI pair brainstorms, critiques, improves, rewrites, and compares · on both prompts and Skills. You version, evaluate, and ship. Everything else, we leave to everyone else.
An AI pair you can talk to, not just trigger · ask where the artifact should go and the suggestions land inline as accept-or-reject diff hunks. Most workbenches that ship a chat companion put the rewrite in a separate panel; we put it in the prompt or the SKILL.md itself.
Run one prompt across every provider. Let a judge call the winners.
Pick a prompt. Pick 2-5 BYOK models across Anthropic, OpenAI, and Google. Run them all in parallel; output, latency, tokens, and cost stream side by side. Optional second pass: an LLM-as-Judge scores every output against your rubric and surfaces per-criterion winners as checkmarks.
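To make the judge pass concrete, here is a minimal sketch of how per-criterion winners could be picked once a judge has scored every output. The `JudgedOutput` shape, the `perCriterionWinners` function, and the tie handling are illustrative assumptions; the page does not describe how Prompt Assay computes its checkmarks.

```typescript
// Hypothetical shape: one judged output per model, with one score per rubric criterion.
type JudgedOutput = { model: string; scores: Record<string, number> };

// For each criterion, return every model that holds the top score.
// Assumption: ties share the win rather than being broken arbitrarily.
function perCriterionWinners(judged: JudgedOutput[]): Record<string, string[]> {
  const best: Record<string, number> = {};
  const winners: Record<string, string[]> = {};
  for (const j of judged) {
    for (const criterion of Object.keys(j.scores)) {
      const score = j.scores[criterion];
      if (!(criterion in best) || score > best[criterion]) {
        // New top score for this criterion: reset the winner list.
        best[criterion] = score;
        winners[criterion] = [j.model];
      } else if (score === best[criterion]) {
        // Tie with the current top score: add this model as a co-winner.
        winners[criterion].push(j.model);
      }
    }
  }
  return winners;
}
```

Keeping winners per criterion, rather than collapsing to one overall score, matches the decision described here: which provider ships this prompt, for which criteria.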
Save the report. Share the public link. Or just use the run to decide which provider ships this prompt, for which criteria, at what cost.
Separately, the AI pair’s two-version Compare runs the same model-graded diff between two revisions of one prompt · useful when you’ve just rewritten one and want to know whether it actually improved.
Author Agent Skills. Run evals across every provider. Ship the badge.
Agent Skills are the multi-file capability bundles from the agentskills.io open spec · a SKILL.md plus optional scripts/ and references/ · that Anthropic, OpenAI Codex, Cursor, VS Code, GitHub Copilot, and Gemini CLI all consume. Prompt Assay gives you the cross-provider workbench around them: author the bundle, lint with a seven-rule security tier (no hardcoded secrets, no curl ... | bash), critique on six dimensions, and run the Behavioral Eval that scores how reliably each provider activates the skill on the cases that matter.
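As a rough illustration of what a security-tier lint can look like, the sketch below flags the two patterns named above in a SKILL.md or script body. The rule names, the regular expressions, and the `lintSkill` function are hypothetical; Prompt Assay's actual rules are not published on this page and are likely stricter.

```typescript
// Hypothetical finding shape: which rule fired, and on which 1-based line.
type Finding = { rule: string; line: number };

// Two illustrative rules only. Both patterns are assumptions, not the product's rules.
const RULES: { rule: string; pattern: RegExp }[] = [
  // Piping a download straight into a shell, e.g. "curl <url> | bash".
  { rule: "no-pipe-to-shell", pattern: /\bcurl\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba)?sh\b/ },
  // A key-like name assigned a long literal value, e.g. API_KEY=....
  {
    rule: "no-hardcoded-secret",
    pattern: /\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*["']?[A-Za-z0-9_-]{16,}/i,
  },
];

// Scan the text line by line and collect every rule that matches.
function lintSkill(text: string): Finding[] {
  const findings: Finding[] = [];
  text.split("\n").forEach((content, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(content)) findings.push({ rule, line: index + 1 });
    }
  });
  return findings;
}
```

Line-level regex checks are cheap enough to run on every save, which is why this sketch favors them over parsing; the trade-off is false negatives on commands split across lines.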
The Behavioral Eval runs your skill against positive cases (the requests the skill should activate on) and negative cases (the requests it should stay dormant on) across the providers you pick. A judge call per cell scores activation and instruction adherence. Three aggregate scores fall out: Discovery Fidelity, Instruction Adherence, and Cross-Provider Consistency. A neutral grader · scoring the same skill against Claude, GPT, and Gemini side by side · is something no first-party tool ships, because each provider only knows its own model.
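The three aggregates are named here but not defined, so the sketch below shows one plausible reading under stated assumptions. The `Cell` shape, the `aggregate` function, and all three formulas are hypothetical: Discovery Fidelity as the share of cells whose activation matched expectation, Instruction Adherence as the mean judge score over positive cases that activated, and Cross-Provider Consistency as the share of cases on which every provider made the same activation call.

```typescript
// Hypothetical cell: one judge verdict for one case on one provider.
type Cell = {
  caseId: string;
  provider: string;
  shouldActivate: boolean; // true for a positive case, false for a negative case
  activated: boolean; // judge verdict: did the skill activate?
  adherence: number; // judge score in [0, 1]; only meaningful when activated
};

type Aggregate = {
  discoveryFidelity: number;
  instructionAdherence: number;
  crossProviderConsistency: number;
};

function aggregate(cells: Cell[]): Aggregate {
  if (cells.length === 0) throw new Error("no cells to aggregate");

  // Assumed formula: fraction of cells where activation matched expectation.
  const correct = cells.filter((c) => c.activated === c.shouldActivate).length;
  const discoveryFidelity = correct / cells.length;

  // Assumed formula: mean adherence over positive cases that actually activated.
  const followed = cells.filter((c) => c.shouldActivate && c.activated);
  const instructionAdherence =
    followed.length === 0
      ? 0
      : followed.reduce((sum, c) => sum + c.adherence, 0) / followed.length;

  // Assumed formula: fraction of cases where all providers agreed on activation.
  const byCase: Record<string, boolean[]> = {};
  for (const c of cells) {
    (byCase[c.caseId] = byCase[c.caseId] || []).push(c.activated);
  }
  const caseIds = Object.keys(byCase);
  const agreeing = caseIds.filter((id) =>
    byCase[id].every((v) => v === byCase[id][0])
  ).length;
  const crossProviderConsistency = agreeing / caseIds.length;

  return { discoveryFidelity, instructionAdherence, crossProviderConsistency };
}
```

Under this reading, a skill that one provider ignores on a positive case loses points on both Discovery Fidelity and Cross-Provider Consistency, which is the kind of gap a single-provider grader cannot see.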
Already have a working prompt? The AI pair's Convert action decomposes it into a complete Skill bundle in one shot · SKILL.md, scripts, and references · with a preview gate before anything lands in your library. Bridge from prompt to Skill without rewriting from scratch.
Save the run as a public Skill Report · noindex by default, owner-revocable, opt-in body publish. Drop the Shields.io README badge in your repo so the score on your skill stays in sync with the latest published run. Same workbench. Same six-dimension scorecard pattern. Same BYOK ethics: every cell runs on your provider keys, every bill goes to your provider account.
A day in the life of your artifacts.
Prompts and Agent Skills share a workbench but follow different lifecycles. Both end on production; neither lives in a chat window.
Talk it through with the AI pair, then author in the editor.
Score it on six dimensions. See where it leaks and why.
Apply targeted edits one by one, or accept the full rewrite.
Stream output from any provider. Track cost, latency, and run history.
Run a suite. See pass-fail per case and the aggregate roll-up.
Version it, tag it, pull it via API or SDK from your app.
Compose SKILL.md with the AI pair, attach scripts and references.
Nineteen rules including a seven-rule security tier. On save and on demand.
Six dimensions tuned for capability bundles, not single-prompt text.
Apply targeted rewrites or accept the full rewrite. Same instruments as prompts.
Trigger and non-trigger probes across Claude, GPT, and Gemini in parallel.
Publish a Skill Report. Embed the Shields.io badge in your README.
Your keys. Your bill. No markup on a single token.
BYOK on every tier · including free. Inference is billed by your provider, never by us.
Keys are encrypted per organization. Never logged, never sent to a client, never used outside the workflows you trigger.
Draft once, run on Anthropic, OpenAI, or Google. Switch the model from a menu, not a migration.
One workbench. Two ways to work in it.
Ship production prompts and Skills by yourself · no team required.
- Personal library of prompts and Agent Skills with fragments and versions.
- The full AI pair on both artifacts: brainstorm, critique, improve, rewrite, compare, convert.
- Multi-provider Behavioral Eval for Skills across Claude, GPT, and Gemini.
- Every provider, one editor, BYOK.
- Public API and TypeScript SDK for prompts and Skills when you're ready to ship.
Review prompts and Skills the way you review code.
- Org workspaces with owner, admin, and member roles.
- Shared prompt and Skill libraries with versions, reviewed like code.
- Create and run evaluation suites and Behavioral Evals the whole team can see.
- SAML SSO and custom tier controls on Enterprise.
Pull any version into production.
REST or TypeScript SDK. API-key auth, org-scoped, rate-limited, documented. Your code pulls the prompt or the Skill. The workbench owns the history.
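npm install @prompt-assay/sdk @anthropic-ai/sdk

import Anthropic from "@anthropic-ai/sdk";
import { Assay } from "@prompt-assay/sdk";

const client = new Assay({
  apiKey: process.env.PROMPT_ASSAY_API_KEY!,
});
// The Anthropic client reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

// Pull the resolved prompt with all fragments assembled.
// promptId is the ID of a prompt in your library.
const { data: prompt } = await client.prompts.getResolved(promptId);

// Send it to your provider. Prompt Assay never touches the call.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: "user", content: prompt.resolved_content }],
});

Trust, before testimonials.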
We do not retain provider responses on our servers. The one exception: evaluation test outputs, saved with each test case so you can review past runs.
Provider keys are encrypted at rest and never leave the server.
Owner, admin, and member roles enforced at every layer.
Prompt Assay does not train, fine-tune, or aggregate your content into any model. Documented in section 5 of the privacy policy.
No demo call. No card. Free tier ships every AI instrument.
Self-serve through Team. Contact us only for SSO, BAA, or DPA. BYOK at every tier · including free.
For trying the editor and seeing what the AI pair does to your prompts and skills.
- 7 platform-funded calls to start (no key needed)
- Personal workspace
- 250 AI calls a month, enough to explore
- Every in-editor AI action on prompts
- Skills authoring + multi-provider Behavioral Eval
- Evaluation suites with LLM-as-Judge scoring
- 7-day version view
- Single seat
- BYOK: bring an Anthropic, OpenAI, or Google key
For the prompt engineer who is shipping prompts and skills.
- Unlimited AI calls on your keys
- Full prompt + skill version history
- Every in-editor AI action
- Skills authoring + multi-provider Behavioral Eval
- Evaluation suites with regression on every version
- Public REST API + TypeScript SDK (prompts + skills)
- Single seat
For teams reviewing prompts and skills like code.
- Everything in Solo
- Shared org library for prompts + skills
- Owner, admin, member roles
- Shared evaluation suites across the team
- Shared Skills + Behavioral Eval reports across the team
- Invitations and seat management
- 3 to 15 seats
For engineering orgs that need SSO and custom controls.
- Everything in Team
- SAML SSO
- Unlimited seats
- Custom tier controls
- Data processing agreement
- Priority support
Questions, answered plainly.
Open the workbench. Ship prompts and Skills that hold up.
Free to start. Your keys, your bill, no demo call.