OPEN SOURCE · SKILL REGRESSION TESTING

Changed the skill.
Did it still do the right thing?

Catch skill regressions.
Keep the proof.

skillgym run
$ skillgym run ./examples/basic-suite.ts
Suite examples/basic-suite.ts
Workspace ~/Projects/acme-app
Cases 3
Runners 2
Executions 6
caseexecutionsstatus
always-passes2/2passed
assertion-fails1/2failed
assert-crashes0/2failed
Runner: open-main (opencode, openai/gpt-4o)
casetimeinoutbillable
always-passes12.3s1,2408908,204
assertion-fails14.8s2,0401,10011,240
assert-crashes3.1s410120892
2 failed · 1 passed · 1m 42s
Catch skill regressionsTypeScript assertionsToken usage snapshotsCount tokens per executionSame skill, every agentProof from every executionInspect commands and file readsOpenCodeCodexClaude CodeCursor AgentReal agent sessionsIsolated workspacesOpen sourceCatch skill regressionsTypeScript assertionsToken usage snapshotsCount tokens per executionSame skill, every agentProof from every executionInspect commands and file readsOpenCodeCodexClaude CodeCursor AgentReal agent sessionsIsolated workspacesOpen source

HOW IT WORKS

Change the skill. Check the run.

RUN

Test the skill for real.

Run real agent sessions after every SKILL.md change. Check what the agent did, not what you expected.

ASSERT

Assert the skill and the steps.

Use TypeScript assertions to check the right skill was loaded, the right files were read, the right commands ran, and the final answer was correct.

INSPECT

Every skill run leaves evidence.

When a skill edit changes behavior, compare artifacts, inspect the run, and see how token usage changed.

RUNNER

Turn skill edits
into checks.

When you change a SKILL.md, rerun the same cases and see what changed. skillgym records each run so failed behavior is easy to inspect.

npm install --save-dev skillgym npx skillgym run ./skillgym/my-suite.ts
standard reporter
Suite examples/basic-suite.ts
Workspace ~/Projects/acme-app
Cases 3
Runners 2
Executions 6
caseexecutionsstatus
always-passes2/2passed
assertion-fails1/2failed
assert-crashes0/2failed
Runner: open-main (opencode, openai/gpt-4o)
casetimeinoutbillable
always-passes12.3s1,2408908,204
assertion-fails14.8s2,0401,10011,240
assert-crashes3.1s410120892
2 failed · 1 passed · Duration 1m 42s

RUNNERS

Same skill.
Different agents.

The same skill edit can behave differently on each runner. Compare OpenCode, Codex, Claude Code, and Cursor Agent on the same cases before you trust the change.

OpenCode

an open source AI coding agent

opencode

Codex

OpenAI's terminal coding agent

codex

Claude Code

Anthropic's coding agent CLI

claude-code

Cursor Agent

Cursor CLI agent mode

cursor-agent
assertions.ts
import { assert } from "skillgym";

// assert the skill was loaded
assert.skills.has(report, "agent-device");

// assert the reference was loaded
assert.fileReads.includes(report, /bootstrap-install\.md$/);

// assert commands ran in the right order
assert.commands.before(
  report,
  /^agent-devices+open/,
  /^agent-devices+snapshot/
);
assert.commands.before(
  report,
  /^agent-devices+snapshot/,
  /^agent-devices+close/
);

ASSERTIONS

Good skill behavior
is testable.

The right skill. The right files. The right steps. Use assertions to turn expectations into pass/fail checks after every SKILL.md edit.

  • assert.skills.* — check which skills were used
  • assert.commands.* — check commands and ordering
  • assert.fileReads.* — check what the agent opened
  • assert.toolCalls.* — inspect tool usage in detail
  • assert.output.* — check the final answer

SNAPSHOTS

Count tokens.
Catch costly skill regressions.

Set a baseline and catch when a skill edit makes the agent much more expensive. Track token growth before it becomes the baseline.

npx skillgym run ./suite.ts --update-snapshots

WORKSPACES

Fresh workspace.
Honest skill executions.

Use isolated workspaces when repo state matters, or share one when you want speed. Make skill failures easier to reproduce and debug.

export const workspace = {
  mode: "isolated",
  templateDir: "./fixtures/base",
  bootstrap: {
    command: "npm",
    args: ["install"],
  },
};

QUICK START

Go from prompt to benchmark
in three steps.

01

Install

npm install --save-dev skillgym

02

Configure

// skillgym.config.ts
import type { SkillgymConfig } from "skillgym";

const config: SkillgymConfig = {
  runners: {
    "my-agent": {
      agent: {
        type: "opencode",
        model: "openai/gpt-4o",
      },
    },
  },
};

export default config;

03

Run

// skillgym/my-suite.ts
import { assert, type Case } from "skillgym";

const suite: Case[] = [
  {
    id: "agent-device-settings-check",
    prompt: [
      "Use agent-device to open Settings,",
      'verify "Privacy" is visible with snapshot,',
      "then close the session.",
    ].join(" "),
    async assert(report) {
      assert.skills.has(report, "agent-device");
      assert.commands.before(
        report,
        /^agent-devices+open/,
        /^agent-devices+snapshot/
      );
      assert.commands.before(
        report,
        /^agent-devices+snapshot/,
        /^agent-devices+close/
      );
    },
  },
];

export default suite;

OPEN SOURCE · MIT LICENSE

Stop guessing.
Check the skill.

Catch skill regressions, compare agents, and keep proof from every run.