OPEN SOURCE · AGENT BENCHMARKING

Did your agent
do the right thing?

Real CLI sessions. TypeScript assertions.
Evidence you can trust.

skillgym run
$ skillgym run ./examples/basic-suite.ts
Suite examples/basic-suite.ts
Workspace ~/Projects/acme-app
Cases 3
Runners 2
Runs 6
always-passes    2/2   2 runners
assertion-fails  1/2   2 runners
assert-crashes   0/2   2 runners
Runner: open-main (opencode, openai/gpt-4o)
case              time    billable
always-passes     12.3s   8,204
assertion-fails   14.8s   11,240
assert-crashes    3.1s    892
2 failed · 1 passed · 1m 42s
Real CLI sessions · No mocked chats · Normalized SessionReport · TypeScript assertions · Multi-runner matrix · Preserved telemetry · Token usage snapshots · OpenCode · Codex · Claude Code · Evidence-based skills · Pluggable reporters · Isolated workspaces · Open source

HOW IT WORKS

Run it. Assert it. Trust it.

RUN

Real CLIs, real environments.

No mocked sessions. No simulated output. skillgym invokes the actual agent CLI and captures everything it does.

ASSERT

TypeScript checks on skills and commands.

Write assertions against skills loaded, commands run, and files read. Standard node:assert/strict under the hood.

INSPECT

Artifacts you can diff across runs.

Every execution writes a normalized report. Token counts, event traces, and session output — preserved for every run.
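Because every run lands as plain files under the output directory, comparing runs can be a few lines of script. A minimal sketch — the `billableTokens` field is an assumption for illustration; the real report schema may differ:

```typescript
import { readFileSync } from "node:fs";

// Illustrative shape only: the real SessionReport schema isn't shown here,
// so we assume a top-level `billableTokens` field for the sketch.
interface RunSummary {
  billableTokens: number;
}

// Compare billable-token counts between two saved run reports.
export function tokenDelta(baselinePath: string, currentPath: string): number {
  const baseline = JSON.parse(readFileSync(baselinePath, "utf8")) as RunSummary;
  const current = JSON.parse(readFileSync(currentPath, "utf8")) as RunSummary;
  return current.billableTokens - baseline.billableTokens;
}
```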

RUNNER

One run.
Structured results.

Point skillgym at any suite file. It expands every case against every configured runner, captures the session, and writes a normalized report you can assert against.

npm install --save-dev skillgym
npx skillgym run ./skillgym/my-suite.ts
standard reporter
Suite examples/basic-suite.ts
Output .skillgym-results/run-2026-04-13
case              runs   status
always-passes     2/2    passed
assertion-fails   1/2    failed
assert-crashes    0/2    failed
Runner: open-main (opencode, openai/gpt-4o)
case              time    in      out     billable
always-passes     12.3s   1,240   890     8,204
assertion-fails   14.8s   2,040   1,100   11,240
assert-crashes    3.1s    410     120     892
Runner: code-main (codex, openai/gpt-4.1)
case              time    in      out     billable
always-passes     11.9s   1,180   820     7,960
assertion-fails   16.2s   2,200   1,240   12,000
assert-crashes    2.8s    380     95      801
2 failed · 1 passed · Duration 1m 42s

RUNNERS

Same prompts.
Surface where agents diverge.

One suite. Every agent. Differences exposed. Run the same cases against OpenCode, Codex, and Claude Code — results land side by side in the same structured report.
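A multi-runner setup might look like the sketch below, reusing the config shape from the quick start. The runner names mirror the demo output; the `claude-code` agent type and its model string are assumptions, not documented values:

```typescript
// skillgym.config.ts — sketch only; the claude-code model string is a
// placeholder, not a documented identifier.
import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  runners: {
    "open-main":   { agent: { type: "opencode", model: "openai/gpt-4o" } },
    "code-main":   { agent: { type: "codex", model: "openai/gpt-4.1" } },
    "claude-main": { agent: { type: "claude-code", model: "claude-sonnet" } },
  },
};

export default config;
```

Every case in the suite then runs once per runner, and the results land side by side in the same report.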

OpenCode

An open source AI coding agent

opencode

Codex

OpenAI's terminal coding agent

codex

Claude Code

Anthropic's coding agent CLI

claude-code
assertions.ts
import { assert } from "skillgym";

// skill was loaded before any action
assert.skills.has(report, "find-skills");

// skills find ran before pnpm install
assert.commands.before(
  report,
  /skills find/,
  /pnpm install/
);

// the right SKILL.md was read
assert.fileReads.includes(
  report,
  /find-skills\/SKILL\.md$/
);

// agent produced output
assert.output.notEmpty(report);

ASSERTIONS

Encode what good looks like.
Enforce it every run.

Assertions turn telemetry into a quality gate. Skills loaded. Commands run in the right order. Files read. Output produced. Each check runs against the normalized session report.

  • assert.skills.* — skill detection by confidence
  • assert.commands.* — exact and pattern matchers
  • assert.fileReads.* — file access tracking
  • assert.toolCalls.* — tool call inspection
  • assert.output.* — final output checks

SNAPSHOTS

Token baselines.
Automatic regressions.

Set a baseline. Get alerted when token usage regresses beyond your configured tolerance. Absolute and percent thresholds per metric.

npx skillgym run ./suite.ts --update-snapshots
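A tolerance config might be sketched as follows — the field names here are assumptions, not the documented API; the point is one absolute and one percent threshold per metric, as described above:

```typescript
// Hypothetical snapshot tolerances — names are illustrative, not the real schema.
export const snapshots = {
  tolerance: {
    billableTokens: { absolute: 500, percent: 5 },
    inputTokens: { absolute: 250, percent: 5 },
  },
};
```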

WORKSPACES

Isolated.
Or shared. Your call.

A fresh workspace per run when you need clean state. A shared one when you don't. Template directories and bootstrap commands included.

export const workspace = {
  mode: "isolated",
  templateDir: "./fixtures/base",
  bootstrap: {
    command: "npm",
    args: ["install"],
  },
};

QUICK START

Up in three steps.

01

Install

npm install --save-dev skillgym

02

Configure

// skillgym.config.ts
import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  runners: {
    "my-agent": {
      agent: {
        type: "opencode",
        model: "openai/gpt-4o",
      },
    },
  },
};

export default config;

03

Run

// skillgym/my-suite.ts
import type { TestSuite } from "skillgym";
import { assert } from "skillgym";

const suite: TestSuite = [
  {
    id: "smoke",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(
        ctx.finalOutput(),
        /skillgym ready/
      );
    },
  },
];

export default suite;

OPEN SOURCE · MIT LICENSE

Real runs.
Real evidence.

Your agent's behavior, in plain TypeScript assertions.