OPEN SOURCE · AGENT BENCHMARKING

Did your agent
do the right thing?

Real CLI sessions. TypeScript assertions.
Evidence you can trust.

skillgym run
$ skillgym run ./examples/basic-suite.ts
Suite examples/basic-suite.ts
Workspace ~/Projects/acme-app
Cases 3
Runners 2
Runs 6
always-passes    2/2   2 runners
assertion-fails  1/2   2 runners
assert-crashes   0/2   2 runners
Runner: open-main (opencode, openai/gpt-4o)
case              time    billable
always-passes     12.3s   8,204
assertion-fails   14.8s   11,240
assert-crashes    3.1s    892
2 failed · 1 passed · 1m 42s
Real CLI sessions · No mocked chats · Normalized SessionReport · TypeScript assertions · Multi-runner matrix · Preserved telemetry · Token usage snapshots · OpenCode · Codex · Claude Code · Evidence-based skills · Pluggable reporters · Isolated workspaces · Open source

HOW IT WORKS

Run it. Assert it. Trust it.

RUN

Real CLIs, real environments.

No mocked sessions. No simulated output. skillgym invokes the actual agent CLI and captures everything it does.

ASSERT

TypeScript checks on skills and commands.

Write assertions against skills loaded, commands run, and files read. Standard node:assert/strict under the hood.

INSPECT

Artifacts you can diff across runs.

Every execution writes a normalized report. Token counts, event traces, and session output — preserved for every run.
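Because every run lands as plain files under the output directory, comparing runs can be a few lines of script. A minimal sketch — the `billableTokens` field is an assumption for illustration; the real report schema may differ:

```typescript
import { readFileSync } from "node:fs";

// Illustrative shape only: the real SessionReport schema isn't shown here,
// so we assume a top-level `billableTokens` field for the sketch.
interface RunSummary {
  billableTokens: number;
}

// Compare billable-token counts between two saved run reports.
export function tokenDelta(baselinePath: string, currentPath: string): number {
  const baseline = JSON.parse(readFileSync(baselinePath, "utf8")) as RunSummary;
  const current = JSON.parse(readFileSync(currentPath, "utf8")) as RunSummary;
  return current.billableTokens - baseline.billableTokens;
}
```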

RUNNER

One run.
Structured results.

Point skillgym at any suite file. It expands every case against every configured runner, captures the session, and writes a normalized report you can assert against.

npm install --save-dev skillgym
npx skillgym run ./skillgym/my-suite.ts
standard reporter
Suite examples/basic-suite.ts
Output .skillgym-results/run-2026-04-13
case              runs   status
always-passes     2/2    passed
assertion-fails   1/2    failed
assert-crashes    0/2    failed
Runner: open-main (opencode, openai/gpt-4o)
case              time    in      out     billable
always-passes     12.3s   1,240   890     8,204
assertion-fails   14.8s   2,040   1,100   11,240
assert-crashes    3.1s    410     120     892
Runner: code-main (codex, openai/gpt-4.1)
case              time    in      out     billable
always-passes     11.9s   1,180   820     7,960
assertion-fails   16.2s   2,200   1,240   12,000
assert-crashes    2.8s    380     95      801
2 failed · 1 passed · Duration 1m 42s

RUNNERS

Same prompts.
Surface where agents diverge.

One suite. Every agent. Differences exposed. Run the same cases against OpenCode, Codex, and Claude Code — results land side by side in the same structured report.
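A multi-runner setup might look like the sketch below, reusing the config shape from the quick start. The runner names mirror the demo output; the `claude-code` agent type and its model string are assumptions, not documented values:

```typescript
// skillgym.config.ts — sketch only; the claude-code model string is a
// placeholder, not a documented identifier.
import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  runners: {
    "open-main":   { agent: { type: "opencode", model: "openai/gpt-4o" } },
    "code-main":   { agent: { type: "codex", model: "openai/gpt-4.1" } },
    "claude-main": { agent: { type: "claude-code", model: "claude-sonnet" } },
  },
};

export default config;
```

Every case in the suite then runs once per runner, and the results land side by side in the same report.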

OpenCode

An open source AI coding agent

opencode

Codex

OpenAI's terminal coding agent

codex

Claude Code

Anthropic's coding agent CLI

claude-code
assertions.ts
import { assert } from "skillgym";

// skill was loaded before any action
assert.skills.has(report, "find-skills");

// skills find ran before pnpm install
assert.commands.before(
  report,
  /skills find/,
  /pnpm install/
);

// the right SKILL.md was read
assert.fileReads.includes(
  report,
  /find-skills\/SKILL\.md$/
);

// agent produced output
assert.output.notEmpty(report);

ASSERTIONS

Encode what good looks like.
Enforce it every run.

Assertions turn telemetry into a quality gate. Skills loaded. Commands run in the right order. Files read. Output produced. Each check runs against the normalized session report.

  • assert.skills.* — skill detection by confidence
  • assert.commands.* — exact and pattern matchers
  • assert.fileReads.* — file access tracking
  • assert.toolCalls.* — tool call inspection
  • assert.output.* — final output checks

SNAPSHOTS

Token baselines.
Automatic regressions.

Set a baseline. Get alerted when token usage regresses beyond your configured tolerance. Absolute and percent thresholds per metric.

npx skillgym run ./suite.ts --update-snapshots
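A tolerance config might be sketched as follows — the field names here are assumptions, not the documented API; the point is one absolute and one percent threshold per metric, as described above:

```typescript
// Hypothetical snapshot tolerances — names are illustrative, not the real schema.
export const snapshots = {
  tolerance: {
    billableTokens: { absolute: 500, percent: 5 },
    inputTokens: { absolute: 250, percent: 5 },
  },
};
```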

WORKSPACES

Isolated.
Or shared. Your call.

A fresh workspace per run when you need clean state. A shared one when you don't. Template directories and bootstrap commands included.

export const workspace = {
  mode: "isolated",
  templateDir: "./fixtures/base",
  bootstrap: {
    command: "npm",
    args: ["install"],
  },
};

QUICK START

Up in three steps.

01

Install

npm install --save-dev skillgym

02

Configure

// skillgym.config.ts
import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  runners: {
    "my-agent": {
      agent: {
        type: "opencode",
        model: "openai/gpt-4o",
      },
    },
  },
};

export default config;

03

Run

// skillgym/my-suite.ts
import type { TestSuite } from "skillgym";
import { assert } from "skillgym";

const suite: TestSuite = [
  {
    id: "smoke",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(
        ctx.finalOutput(),
        /skillgym ready/
      );
    },
  },
];

export default suite;

OPEN SOURCE · MIT LICENSE

Real runs.
Real evidence.

Your agent's behavior, in plain TypeScript assertions.