OPEN SOURCE · AGENT BENCHMARKING
Real CLI sessions. TypeScript assertions.
Evidence you can trust.
HOW IT WORKS
RUN
No mocked sessions. No simulated output. skillgym invokes the actual agent CLI and captures everything it does.
ASSERT
Write assertions against skills loaded, commands run, and files read. Standard node:assert/strict under the hood.
INSPECT
Every execution writes a normalized report. Token counts, event traces, and session output — preserved for every run.
RUNNER
Point skillgym at any suite file. It expands every case against every configured runner, captures the session, and writes a normalized report you can assert against.
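The expansion step described above can be pictured as a cross product of cases and runners. The sketch below is illustrative only: the `Case` and `Run` shapes, the `expandRuns` helper, and the runner labels are assumptions made for this example, not skillgym's actual internals.

```typescript
// Hypothetical shapes for illustration; not skillgym's real types.
interface Case {
  id: string;
  prompt: string;
}

interface Run {
  runner: string;
  caseId: string;
  prompt: string;
}

// Pair every case with every configured runner (a cross product).
function expandRuns(cases: Case[], runners: string[]): Run[] {
  return runners.flatMap((runner) =>
    cases.map((c) => ({ runner, caseId: c.id, prompt: c.prompt })),
  );
}

const runs = expandRuns(
  [
    { id: "smoke", prompt: "Say only: skillgym ready" },
    { id: "install", prompt: "Install the dependencies" },
  ],
  ["opencode", "codex", "claude-code"], // assumed runner labels
);

console.log(runs.length); // 6 (2 cases x 3 runners)
```

Under this model, adding a runner multiplies the number of sessions rather than requiring a new suite.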
npm install --save-dev skillgym
npx skillgym run ./skillgym/my-suite.ts
RUNNERS
One suite. Every agent. Differences exposed. Run the same cases against OpenCode, Codex, and Claude Code — results land side by side in the same structured report.
OpenCode
An open source AI coding agent
Codex
OpenAI's terminal coding agent
Claude Code
Anthropic's coding agent CLI
import { assert } from "skillgym";
// skill was loaded before any action
assert.skills.has(report, "find-skills");
// skills find ran before pnpm install
assert.commands.before(
  report,
  /skills find/,
  /pnpm install/
);
// the right SKILL.md was read
assert.fileReads.includes(
  report,
  /find-skills\/SKILL\.md$/
);
// agent produced output
assert.output.notEmpty(report);
ASSERTIONS
Assertions turn telemetry into a quality gate. Skills loaded. Commands run in the right order. Files read. Output produced. Each check runs against the normalized session report.
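To make the idea of an ordering check concrete, here is a minimal sketch of how a "ran before" assertion could be built on node:assert/strict. The `SessionReport` shape and the `assertCommandBefore` helper are hypothetical and written for this example; they do not describe how skillgym's own matchers are implemented.

```typescript
import assert from "node:assert/strict";

// Hypothetical report shape; the real normalized report may differ.
interface SessionReport {
  commands: string[]; // commands in the order the agent ran them
}

// Passes when the first command matching `first` appears
// earlier than the first command matching `second`.
function assertCommandBefore(
  report: SessionReport,
  first: RegExp,
  second: RegExp,
): void {
  const firstIndex = report.commands.findIndex((c) => first.test(c));
  const secondIndex = report.commands.findIndex((c) => second.test(c));
  assert.notEqual(firstIndex, -1, `no command matched ${first}`);
  assert.notEqual(secondIndex, -1, `no command matched ${second}`);
  assert.ok(
    firstIndex < secondIndex,
    `expected ${first} to run before ${second}`,
  );
}

const sampleReport: SessionReport = {
  commands: ["npx skills find pnpm", "pnpm install"],
};

// Does not throw: the lookup ran before the install.
assertCommandBefore(sampleReport, /skills find/, /pnpm install/);
```

A failed check throws an AssertionError, which is what lets a test runner treat it as a failing case.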
assert.skills.*: skill detection by confidence
assert.commands.*: exact and pattern matchers
assert.fileReads.*: file access tracking
assert.toolCalls.*: tool call inspection
assert.output.*: final output checks
SNAPSHOTS
Set a baseline. Get alerted when token usage regresses beyond your configured tolerance. Absolute and percent thresholds per metric.
npx skillgym run ./suite.ts --update-snapshots
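As a rough illustration of absolute and percent thresholds, the sketch below compares a current metric against a baseline. The `Tolerance` shape and `hasRegressed` helper are hypothetical, and the rule that exceeding either configured threshold counts as a regression is an assumption; skillgym's actual snapshot logic may combine thresholds differently.

```typescript
// Hypothetical tolerance shape; not skillgym's actual snapshot config.
interface Tolerance {
  absolute?: number; // allowed increase in raw units, e.g. tokens
  percent?: number; // allowed increase relative to the baseline
}

// Assumption: a metric regresses when its increase exceeds
// either configured threshold. With no thresholds set, nothing is flagged.
function hasRegressed(
  baseline: number,
  current: number,
  tolerance: Tolerance,
): boolean {
  const increase = current - baseline;
  if (increase <= 0) return false; // same or lower usage is never a regression
  const overAbsolute =
    tolerance.absolute !== undefined && increase > tolerance.absolute;
  const overPercent =
    tolerance.percent !== undefined &&
    baseline > 0 &&
    (increase / baseline) * 100 > tolerance.percent;
  return overAbsolute || overPercent;
}

console.log(hasRegressed(1000, 1040, { percent: 5 })); // false: 4% increase
console.log(hasRegressed(1000, 1100, { percent: 5 })); // true: 10% increase
console.log(hasRegressed(1000, 1100, { absolute: 200 })); // false: +100 tokens
```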
WORKSPACES
Give each run its own workspace when you need clean state. Share one when you don't. Template directories and bootstrap commands included.
export const workspace = {
  mode: "isolated",
  templateDir: "./fixtures/base",
  bootstrap: {
    command: "npm",
    args: ["install"],
  },
};
QUICK START
01
npm install --save-dev skillgym
02
// skillgym.config.ts
import type { SkillGymConfig } from "skillgym";
const config: SkillGymConfig = {
  runners: {
    "my-agent": {
      agent: {
        type: "opencode",
        model: "openai/gpt-4o",
      },
    },
  },
};
export default config;
03
// skillgym/my-suite.ts
import type { TestSuite } from "skillgym";
import { assert } from "skillgym";
const suite: TestSuite = [
  {
    id: "smoke",
    prompt: "Say only: skillgym ready",
    assert(report, ctx) {
      assert.match(
        ctx.finalOutput(),
        /skillgym ready/
      );
    },
  },
];
export default suite;
OPEN SOURCE · MIT LICENSE
Your agent's behavior, in plain TypeScript assertions.