I. Opening

Ship prompts and Agent Skills that hold up in production.

─── this is where they get written.

The authoring workbench. Critique, version, and evaluate prompts and Skills on your own keys, across Claude, GPT, and Gemini.

6 in-editor AI actions
6 critique dimensions
3 LLM providers
BYOK · Free tier · No demo call
Run your prompts and Skills on any major model.
Anthropic Claude · OpenAI GPT · Google Gemini
Your keys. Your bill. No markup.
II. Try the Critique

Find the leaks before they ship.

Every prompt scored on six dimensions: clarity, completeness, structure, technique usage, robustness, efficiency. Each with a written reason. The radar shows where the prompt leaks.

EXHIBIT · CRITIQUE PANEL

Try it yourself: three drafts of one task at different quality levels.

Weak draft · First pass, lots of leaks
You are a customer support assistant.
Help the customer.
Be friendly and helpful.
Always answer the question.
Don't be rude.
Overall score: 3.5 / 10 · Poor
Dimensions · weakest first
  • Technique Usage: 2/10

    No XML tags, no few-shot, no chain-of-thought, no output format.

  • Completeness: 3/10

    No format, no boundaries, no examples, no input slot.

  • Robustness: 3/10

    No handling for missing, ambiguous, or hostile input.

  • Clarity: 4/10

    Role and goal are vague. "Help" is underspecified.

  • Structure: 4/10

    Flat list of imperatives. No sections or hierarchy.

  • Efficiency: 4/10

    Short, but vague enough that outputs will balloon.

Excerpts shown. Real prompts run longer. Inside the editor, run the critique any time.
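
For readers who want to see how the panel's numbers could relate, here is a minimal sketch that sorts the six dimension scores weakest first and derives an overall score. The `DimensionScore` shape and the half-point rounding are assumptions, not the documented formula. Rounding the plain mean (3.33) to the nearest half point happens to reproduce the 3.5 in the example, but the real weighting may differ.

```typescript
// Hypothetical shape for one critique dimension; not the product's actual schema.
interface DimensionScore {
  name: string;
  score: number; // 0-10
}

// Scores copied from the weak-draft example, in the order the dimensions are introduced.
const weakDraft: DimensionScore[] = [
  { name: "Clarity", score: 4 },
  { name: "Completeness", score: 3 },
  { name: "Structure", score: 4 },
  { name: "Technique Usage", score: 2 },
  { name: "Robustness", score: 3 },
  { name: "Efficiency", score: 4 },
];

// Sort ascending so the weakest dimensions come first.
// Array.prototype.sort is stable, so ties keep their input order.
function weakestFirst(scores: DimensionScore[]): DimensionScore[] {
  return [...scores].sort((a, b) => a.score - b.score);
}

// Assumption: overall = mean of the dimensions, rounded to the nearest 0.5.
function overallScore(scores: DimensionScore[]): number {
  const mean = scores.reduce((sum, d) => sum + d.score, 0) / scores.length;
  return Math.round(mean * 2) / 2;
}
```

With these inputs, `weakestFirst(weakDraft)` puts Technique Usage at the top of the list, matching the panel's ordering.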

III. A Different Lane

Everyone moved upstream or downstream. We stayed where the craft lives.

Every other tool moved upstream to agents or downstream to traces. The middle, where prompts and Agent Skills get written, is empty. We sit there.

You open the editor. You write. The AI pair brainstorms, critiques, improves, rewrites, and compares · on both prompts and Skills. You version, evaluate, and ship. Everything else, we leave to everyone else.

An AI pair you can talk to, not just trigger · ask where the artifact should go and the suggestions land inline as accept-or-reject diff hunks. Most workbenches that ship a chat companion put the rewrite in a separate panel; we put it in the prompt or the SKILL.md itself.

The category, end to end
Upstream · The agent
Agent frameworks
Orchestration, memory, planning loops.
Here · The artifact
Authoring
Where prompts and Skills get written, assayed, and shipped.
Downstream · The output
Logs and traces
Observability, replay, eval analytics.
IV. Side by Side

Run one prompt across every provider. Let a judge call the winners.

Pick a prompt. Pick 2-5 BYOK models across Anthropic, OpenAI, and Google. Run them all in parallel; output, latency, tokens, and cost stream side by side. Optional second pass: an LLM-as-Judge scores every output against your rubric and surfaces per-criterion winners as checkmarks.
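
The parallel run described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the product's implementation: `RunRow`, `CallModel`, and `runSideBySide` are hypothetical names, the provider call is supplied by the caller, and the sketch records only output and latency (no streaming, tokens, or cost).

```typescript
// Hypothetical result row for one model; field names are illustrative.
interface RunRow {
  model: string;
  ok: boolean;
  output: string; // model output, or the error message when ok is false
  latencyMs: number;
}

// Caller-supplied function that sends the prompt to one model using your own key.
type CallModel = (model: string, prompt: string) => Promise<string>;

// Run one prompt against every selected model concurrently.
// A failure in one model is recorded in its row and does not cancel the others.
async function runSideBySide(
  prompt: string,
  models: string[],
  callModel: CallModel,
): Promise<RunRow[]> {
  if (models.length < 2 || models.length > 5) {
    throw new Error("pick between 2 and 5 models");
  }
  return Promise.all(
    models.map(async (model): Promise<RunRow> => {
      const started = Date.now();
      try {
        const output = await callModel(model, prompt);
        return { model, ok: true, output, latencyMs: Date.now() - started };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { model, ok: false, output: message, latencyMs: Date.now() - started };
      }
    }),
  );
}
```

Catching errors per model, rather than letting one rejection fail the whole batch, keeps the comparison usable when a single provider is rate-limited or misconfigured.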

Save the report. Share the public link. Or just use the run to decide which provider ships this prompt, for which criteria, at what cost.

Separately, the AI pair’s two-version Compare runs the same model-graded diff between two revisions of one prompt · useful when you’ve just rewritten one and want to know whether it actually improved.

EXHIBIT · COMPARE MODELS · 2026-05
V. Skills, Same Workbench

Author Agent Skills. Run evals across every provider. Ship the badge.

Agent Skills are the multi-file capability bundles from the agentskills.io open spec · a SKILL.md plus optional scripts/ and references/ · that Anthropic, OpenAI Codex, Cursor, VS Code, GitHub Copilot, and Gemini CLI all consume. Prompt Assay gives you the cross-provider workbench around them: author the bundle, lint with a seven-rule security tier (no hardcoded secrets, no curl ... | bash), critique on six dimensions, and run the Behavioral Eval that scores how reliably each provider activates the skill on the cases that matter.
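
To make the two named security checks concrete, here is a minimal sketch of a line-based linter. The rule names and regular expressions are hypothetical and deliberately simple. They illustrate the idea of "no hardcoded secrets" and "no curl ... | bash", not the actual seven-rule tier.

```typescript
interface LintFinding {
  rule: string;
  line: number; // 1-based line number in the scanned file
  text: string;
}

// Illustrative patterns only; the real rule set is not reproduced here.
const SECURITY_RULES: { rule: string; pattern: RegExp }[] = [
  // Pipe-to-shell installs such as `curl https://... | bash`.
  {
    rule: "no-pipe-to-shell",
    pattern: /\b(curl|wget)\b[^|\n]*\|\s*(sudo\s+)?(ba|z)?sh\b/,
  },
  // Assignments that look like inline credentials, e.g. API_KEY="...".
  {
    rule: "no-hardcoded-secret",
    pattern: /\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*["'][^"'\s]{8,}["']/i,
  },
];

// Scan a file's text line by line and collect every rule violation.
function lintSecurity(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of SECURITY_RULES) {
      if (pattern.test(text)) {
        findings.push({ rule, line: index + 1, text: text.trim() });
      }
    }
  });
  return findings;
}
```

Reading a key from the environment (`process.env.API_KEY`) passes this sketch, while a quoted literal assigned to `API_KEY` does not. A production linter would need far more patterns and fewer false negatives than this.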

The Behavioral Eval runs your skill against positive cases (the requests the skill should activate on) and negative cases (the requests it should stay dormant on) across the providers you pick. A judge call per cell scores activation and instruction adherence. Three aggregate scores fall out: Discovery Fidelity, Instruction Adherence, and Cross-Provider Consistency. A neutral grader · scoring the same skill against Claude, GPT, and Gemini side by side · is something no first-party tool ships, because each provider only knows its own model.
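
The page names the three aggregate scores but does not define them, so the sketch below uses one plausible reading, labeled as an assumption throughout. Discovery Fidelity is taken as the share of cells where activation matched the expectation. Instruction Adherence is the mean judge score on correctly activated positive cases. Cross-Provider Consistency is the share of cases where every provider made the same activation decision. The `EvalCell` shape is hypothetical.

```typescript
// One cell = one test case run on one provider, scored by the judge call.
interface EvalCell {
  caseId: string;
  provider: string;
  shouldActivate: boolean; // true for positive cases, false for negative cases
  activated: boolean; // judge's verdict on whether the skill fired
  adherence: number; // judge's 0-1 score; only meaningful when activated
}

interface Aggregates {
  discoveryFidelity: number;
  instructionAdherence: number;
  crossProviderConsistency: number;
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Assumed definitions; the product's actual formulas may differ.
function aggregate(cells: EvalCell[]): Aggregates {
  // Share of cells where the activation decision matched the expectation.
  const discoveryFidelity = mean(
    cells.map((c) => (c.activated === c.shouldActivate ? 1 : 0)),
  );

  // Mean adherence over positive cases where the skill did activate.
  const instructionAdherence = mean(
    cells.filter((c) => c.shouldActivate && c.activated).map((c) => c.adherence),
  );

  // Share of cases on which all providers made the same activation decision.
  const byCase = new Map<string, boolean[]>();
  for (const c of cells) {
    byCase.set(c.caseId, [...(byCase.get(c.caseId) ?? []), c.activated]);
  }
  const crossProviderConsistency = mean(
    Array.from(byCase.values()).map((decisions) =>
      decisions.every((d) => d === decisions[0]) ? 1 : 0,
    ),
  );

  return { discoveryFidelity, instructionAdherence, crossProviderConsistency };
}
```

Scoring activation separately from adherence distinguishes a skill that never fires from one that fires but ignores its own instructions.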

Already have a working prompt? The AI pair's Convert action decomposes it into a complete Skill bundle in one shot · SKILL.md, scripts, and references · with a preview gate before anything lands in your library. Bridge from prompt to Skill without rewriting from scratch.

Save the run as a public Skill Report · noindex by default, owner-revocable, opt-in body publish. Drop the Shields.io README badge in your repo so the score on your skill stays in sync with the latest published run. Same workbench. Same six-dimension scorecard pattern. Same BYOK ethics: every cell runs on your provider keys, every bill goes to your provider account.
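
One way a README badge can stay in sync with a published score is Shields.io's endpoint badge, which renders whatever a JSON URL returns. The sketch below assumes that mechanism. The 0-100 score scale, the color thresholds, the label text, and the `example.com` URL are illustrative assumptions, not details of the published Skill Report.

```typescript
// Payload shape expected by Shields.io's endpoint badge.
interface ShieldsEndpoint {
  schemaVersion: 1;
  label: string;
  message: string;
  color: string;
}

// Map a score to a badge payload. Scale, thresholds, and label are assumptions.
function badgePayload(score: number): ShieldsEndpoint {
  const color = score >= 80 ? "brightgreen" : score >= 50 ? "yellow" : "red";
  return {
    schemaVersion: 1,
    label: "skill eval",
    message: `${Math.round(score)}/100`,
    color,
  };
}

// Build the README markdown that points Shields.io at a JSON endpoint.
function badgeMarkdown(endpointUrl: string, altText = "Skill eval score"): string {
  const src = `https://img.shields.io/endpoint?url=${encodeURIComponent(endpointUrl)}`;
  return `![${altText}](${src})`;
}

// Hypothetical endpoint URL, for illustration only.
const readmeLine = badgeMarkdown("https://example.com/badge.json");
```

Because the badge image is fetched from the endpoint on render, republishing a run updates the README without a commit.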

VI. The Procedure

A day in the life of your artifacts.

Prompts and Agent Skills share a workbench but follow different lifecycles. Both end on production; neither lives in a chat window.

Prompts · Author to ship in six stations.
Author

Talk it through with the AI pair, then author in the editor.

Critique

Score it on six dimensions. See where it leaks and why.

Improve

Apply targeted edits one by one, or accept the full rewrite.

Run

Stream output from any provider. Track cost, latency, and run history.

Assay

Run a suite. See pass-fail per case and the aggregate roll-up.

Ship

Version it, tag it, pull it via API or SDK from your app.
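
As a small illustration of the Assay station's "pass-fail per case and the aggregate roll-up", the sketch below reduces per-case results to a summary. The `CaseResult` and `RollUp` shapes are hypothetical. The product's report format is not shown on this page.

```typescript
// Hypothetical per-case result from one suite run.
interface CaseResult {
  caseId: string;
  passed: boolean;
}

interface RollUp {
  total: number;
  passed: number;
  failedCases: string[]; // IDs of failing cases, in run order
  passRate: number; // 0-1; defined as 0 for an empty suite
}

// Reduce per-case pass/fail results to a single summary.
function rollUp(results: CaseResult[]): RollUp {
  const failedCases = results.filter((r) => !r.passed).map((r) => r.caseId);
  const passed = results.length - failedCases.length;
  return {
    total: results.length,
    passed,
    failedCases,
    passRate: results.length === 0 ? 0 : passed / results.length,
  };
}
```

Keeping the failing case IDs alongside the rate makes it possible to compare roll-ups between two versions and see which cases regressed.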

Agent Skills · Author to badge in six stations.
Author

Compose SKILL.md with the AI pair, attach scripts and references.

Lint

Nineteen rules including a seven-rule security tier. On save and on demand.

Critique

Six dimensions tuned for capability bundles, not single-prompt text.

Improve

Apply targeted rewrites or accept the full rewrite. Same instruments as prompts.

Behavioral Eval

Trigger and non-trigger probes across Claude, GPT, and Gemini in parallel.

Badge

Publish a Skill Report. Embed the Shields.io badge in your README.

VII. The Keys

Your keys. Your bill. No markup on a single token.

  • BYOK on every tier · including free. Inference is billed by your provider, never by us.

  • Keys are encrypted per organization. Never logged, never sent to a client, never used outside the workflows you trigger.

  • Draft once, run on Anthropic, OpenAI, or Google. Switch the model from a menu, not a migration.

Your provider connections
Anthropic · Claude Sonnet 4.6 · connected
OpenAI · GPT-4.1 · connected
Google · Gemini 2.5 Pro · connected
Billed to your provider · $0.00 from us
VIII. Solo and Teams

One workbench. Two ways to work in it.

For the prompt engineer

Ship production prompts and Skills by yourself · no team required.

  • Personal library of prompts and Agent Skills with fragments and versions.
  • The full AI pair on both artifacts: brainstorm, critique, improve, rewrite, compare, convert.
  • Multi-provider Behavioral Eval for Skills across Claude, GPT, and Gemini.
  • Every provider, one editor, BYOK.
  • Public API and TypeScript SDK for prompts and Skills when you're ready to ship.
Starts free. Solo tier at $49 / month, unlimited calls on your keys.
Start solo
For the team

Review prompts and Skills the way you review code.

  • Org workspaces with owner, admin, and member roles.
  • Shared prompt and Skill libraries with versions, reviewed like code.
  • Create and run evaluation suites and Behavioral Evals the whole team can see.
  • SAML SSO and custom tier controls on Enterprise.
Team tier at $99 / seat / month. SAML SSO on Enterprise.
IX. Under Glass

Pull any version into production.

REST or TypeScript SDK. API-key auth, org-scoped, rate-limited, documented. Your code pulls the prompt or the Skill. The workbench owns the history.

[rest][typescript][prompts][skills][org-scoped][versioned]
install · bash
npm install @prompt-assay/sdk
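pull-prompt.ts · typescript
// Requires the Anthropic SDK as well: npm install @anthropic-ai/sdk
import Anthropic from "@anthropic-ai/sdk";
import { Assay } from "@prompt-assay/sdk";

const client = new Assay({
  apiKey: process.env.PROMPT_ASSAY_API_KEY!,
});

// The Anthropic client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

// Placeholder: replace with the ID of a prompt in your library.
const promptId = "your-prompt-id";

// Pull the resolved prompt with all fragments assembled.
const { data: prompt } = await client.prompts.getResolved(promptId);

// Send it to your provider. Prompt Assay never touches the call.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024, // required by the Anthropic Messages API
  messages: [{ role: "user", content: prompt.resolved_content }],
});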
X. On the Record

Trust, before testimonials.

Responses stay with your provider

We do not retain provider responses on our servers. The one exception: evaluation test outputs, saved with each test case so you can review past runs.

Encrypted key storage

Provider keys are encrypted at rest and never leave the server.

Role-based access

Owner, admin, and member roles enforced at every layer.

We do not train on your content

Prompt Assay does not train, fine-tune, or aggregate your content into any model. Documented in section 5 of the privacy policy.

XI. Subscription

No demo call. No card. Free tier ships every AI instrument.

Self-serve through Team. Contact us only for SSO, BAA, or DPA. BYOK at every tier · including free.

Free
$0 / month

For trying the editor and seeing what the AI pair does to your prompts and skills.

  • 7 platform-funded calls to start (no key needed)
  • Personal workspace
  • 250 AI calls a month, enough to explore
  • Every in-editor AI action on prompts
  • Skills authoring + multi-provider Behavioral Eval
  • Evaluation suites with LLM-as-a-judge scoring
  • 7-day version view
  • Single seat
  • BYOK: bring an Anthropic, OpenAI, or Google key
Start free
No card. Upgrade any time.
Most popular
Solo
$49 / month

For the prompt engineer who is shipping prompts and skills.

  • Unlimited AI calls on your keys
  • Full prompt + skill version history
  • Every in-editor AI action
  • Skills authoring + multi-provider Behavioral Eval
  • Evaluation suites with regression on every version
  • Public REST API + TypeScript SDK (prompts + skills)
  • Single seat
Start solo
BYOK required. Cancel any time.
Team
$99 / seat / month

For teams reviewing prompts and skills like code.

  • Everything in Solo
  • Shared org library for prompts + skills
  • Owner, admin, member roles
  • Shared evaluation suites across the team
  • Shared Skills + Behavioral Eval reports across the team
  • Invitations and seat management
  • 3 to 15 seats
Start a team
Minimum 3 seats. Annual billing available.
Enterprise
Contact us

For engineering orgs that need SSO and custom controls.

  • Everything in Team
  • SAML SSO
  • Unlimited seats
  • Custom tier controls
  • Data processing agreement
  • Priority support
Talk to us
Custom contract. Talk to us.
BYOK at every tier. Your keys. Your bill. No markup.
XII. Marginalia

Questions, answered plainly.

Do I need my own provider key?

Yes for ongoing use, on every tier. New accounts get 7 platform-funded calls (Critique, Improve, Rewrite) to explore the workbench. Connect a Claude, OpenAI, or Google key when you're ready to keep going. Inference is billed by your provider, never by us.

Open the workbench. Ship prompts and Skills that hold up.

Free to start. Your keys, your bill, no demo call.