OPEN SOURCE · SKILL REGRESSION TESTING
Catch skill regressions.
Keep the proof.
HOW IT WORKS
RUN
Run real agent sessions after every SKILL.md change. Check what the agent did, not what you expected.
ASSERT
Use TypeScript assertions to check the right skill was loaded, the right files were read, the right commands ran, and the final answer was correct.
INSPECT
When a skill edit changes behavior, compare artifacts, inspect the run, and see how token usage changed.
RUNNER
When you change a SKILL.md, rerun the same cases and see what changed. skillgym records each run so failed behavior is easy to inspect.
npm install --save-dev skillgym npx skillgym run ./skillgym/my-suite.ts RUNNERS
The same skill edit can behave differently on each runner. Compare OpenCode, Codex, Claude Code, and Cursor Agent on the same cases before you trust the change.
OpenCode
an open source AI coding agent
Codex
OpenAI's terminal coding agent
Claude Code
Anthropic's coding agent CLI
Cursor Agent
Cursor CLI agent mode
import { assert } from "skillgym";
// assert the skill was loaded
assert.skills.has(report, "agent-device");
// assert the reference was loaded
assert.fileReads.includes(report, /bootstrap-install\.md$/);
// assert commands ran in the right order
assert.commands.before(
report,
/^agent-devices+open/,
/^agent-devices+snapshot/
);
assert.commands.before(
report,
/^agent-devices+snapshot/,
/^agent-devices+close/
); ASSERTIONS
The right skill. The right files. The right steps. Use assertions to turn expectations into pass/fail checks after every SKILL.md edit.
assert.skills.* — check which skills were usedassert.commands.* — check commands and orderingassert.fileReads.* — check what the agent openedassert.toolCalls.* — inspect tool usage in detailassert.output.* — check the final answerSNAPSHOTS
Set a baseline and catch when a skill edit makes the agent much more expensive. Track token growth before it becomes the baseline.
npx skillgym run ./suite.ts --update-snapshots
WORKSPACES
Use isolated workspaces when repo state matters, or share one when you want speed. Make skill failures easier to reproduce and debug.
export const workspace = {
mode: "isolated",
templateDir: "./fixtures/base",
bootstrap: {
command: "npm",
args: ["install"],
},
}; QUICK START
01
npm install --save-dev skillgym
02
// skillgym.config.ts
import type { SkillgymConfig } from "skillgym";
const config: SkillgymConfig = {
runners: {
"my-agent": {
agent: {
type: "opencode",
model: "openai/gpt-4o",
},
},
},
};
export default config; 03
// skillgym/my-suite.ts
import { assert, type Case } from "skillgym";
const suite: Case[] = [
{
id: "agent-device-settings-check",
prompt: [
"Use agent-device to open Settings,",
'verify "Privacy" is visible with snapshot,',
"then close the session.",
].join(" "),
async assert(report) {
assert.skills.has(report, "agent-device");
assert.commands.before(
report,
/^agent-devices+open/,
/^agent-devices+snapshot/
);
assert.commands.before(
report,
/^agent-devices+snapshot/,
/^agent-devices+close/
);
},
},
];
export default suite; OPEN SOURCE · MIT LICENSE
Catch skill regressions, compare agents, and keep proof from every run.