Workflow OS · Validated · Updated: not dated

Complete Agent Skills Evaluation OS

A blueprint for evaluating an entire coding-agent setup, not just one skill: installed skills, agents, commands, hooks, MCP, routers, subagents, workflow discipline, token cost, safety, and CI evidence.

Source: balyakin/skill-eval-runner + eval stack

What this builds

A reusable agent workflow packet.

A validated Buildprint for evaluating complete agent+skills installations from static validity through real behavior, with skill-eval-runner as the core module but not the whole system.

Setup snapshot and install parity model
Static lint gates for agent config files
Loadout and token-cost inventory
Skill unit/regression test harness pattern
Activation and routing eval pattern
Transcript/process invariant checks
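
As an illustration of the setup snapshot and install parity idea, here is a minimal sketch that compares two snapshots of an agent setup. The snapshot shape (a plain object mapping a relative path to a content digest) and the function name `diffSnapshots` are assumptions made for this example; they are not taken from the package files.

```javascript
// Hypothetical snapshot shape: { "relative/path": "content digest" }.
// How digests are produced (for example, a hash of the file bytes) is left to the caller.
function diffSnapshots(expected, actual) {
  const missing = []; // in the expected snapshot, absent from the actual install
  const extra = [];   // present in the actual install, not in the expected snapshot
  const changed = []; // present in both, but with different digests

  for (const path of Object.keys(expected)) {
    if (!Object.hasOwn(actual, path)) missing.push(path);
    else if (expected[path] !== actual[path]) changed.push(path);
  }
  for (const path of Object.keys(actual)) {
    if (!Object.hasOwn(expected, path)) extra.push(path);
  }

  const inParity = missing.length === 0 && extra.length === 0 && changed.length === 0;
  return { inParity, missing: missing.sort(), extra: extra.sort(), changed: changed.sort() };
}
```

Running a comparison like this before any behavioral eval is one way to avoid measuring a drifted install: if `inParity` is false, the later results describe a different setup than the one that was snapshotted.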

Core capabilities

The useful parts the finished build should expose.

01 Setup snapshot and install parity model

02 Static lint gates for agent config files

03 Loadout and token-cost inventory

04 Skill unit/regression test harness pattern

05 Activation and routing eval pattern

06 Transcript/process invariant checks
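
A transcript/process invariant check can be as small as an ordering rule over recorded events. The sketch below assumes a hypothetical transcript format (an array of objects with a `name` field) and a hypothetical helper `checkOrder`; the real transcript format used by the package is not shown on this page.

```javascript
// Invariant: every occurrence of `after` must be preceded by at least one `before`.
// `events` is an assumed transcript shape: [{ name: "run_tests" }, { name: "commit" }, ...]
function checkOrder(events, before, after) {
  let seenBefore = false;
  const violations = []; // indexes of `after` events that came too early

  events.forEach((event, index) => {
    if (event.name === before) seenBefore = true;
    if (event.name === after && !seenBefore) violations.push(index);
  });

  return { ok: violations.length === 0, violations };
}
```

For example, `checkOrder(events, "run_tests", "commit")` would flag any commit recorded before the first test run, which an output-only check would not catch.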

What you need

Local first, live proof explicit.

A target agent setup to evaluate
Offline fixture cases for deterministic mode
Optional live agent/provider credentials for live adapters
A safety policy for external/destructive actions
A list of critical workflow invariants
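
To show what offline fixture cases for deterministic mode might look like, here is a sketch of an activation eval. The fixture fields (`prompt`, `expectedSkill`), the skill names, and the stand-in `keywordRouter` are all hypothetical; a real run would call the target agent's router through an adapter instead.

```javascript
// Runs each fixture through `route` and records mismatches.
// `expectedSkill: null` means "no skill should activate" (a negative case).
function evaluateActivation(route, cases) {
  const failures = [];
  for (const fixture of cases) {
    const actual = route(fixture.prompt);
    if (actual !== fixture.expectedSkill) {
      failures.push({ prompt: fixture.prompt, expected: fixture.expectedSkill, actual });
    }
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}

// Stand-in deterministic router, for illustration only.
function keywordRouter(prompt) {
  const text = prompt.toLowerCase();
  if (text.includes("migration")) return "db-migrate";
  if (text.includes("review")) return "code-review";
  return null;
}
```

Including negative cases matters here: a router that activates a skill on every prompt would pass a fixture set made only of positive examples.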

System shape

What kind of artifact this becomes.

Workflow surface

A validated Buildprint for evaluating complete agent+skills installations from static validity through real behavior, with skill-eval-runner as the core module but not the whole system.

Runtime layer

Any coding agent, JavaScript proof, CI-ready adapters

Build materials

Snapshot / Lint / Inventory / Skill tests

Proof boundary

Deep stack design / Offline proof passed

Build scope

Included, required from you, and outside the claim.

Included

Setup snapshot and install parity model
Static lint gates for agent config files
Loadout and token-cost inventory
Skill unit/regression test harness pattern
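
A loadout and token-cost inventory can start as a ranked list of what gets loaded into context. The sketch below is an assumption-heavy illustration: the item shape (`name`, `kind`, `text`) is hypothetical, and the four-characters-per-token estimate is only a rough heuristic, not a real tokenizer and not the package's method.

```javascript
// Rough heuristic: about 4 characters per token. Swap in a real tokenizer for accurate counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// `items` is an assumed shape: [{ name: "code-review", kind: "skill", text: "..." }, ...]
function buildInventory(items) {
  const rows = items
    .map((item) => ({ name: item.name, kind: item.kind, tokens: estimateTokens(item.text) }))
    .sort((a, b) => b.tokens - a.tokens); // most expensive first

  const total = rows.reduce((sum, row) => sum + row.tokens, 0);
  return { rows, total };
}
```

Sorting by estimated cost makes it easy to see which skills, commands, or hook instructions dominate the loadout before deciding what to trim.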

Required from you

A target agent setup to evaluate
Offline fixture cases for deterministic mode
Optional live agent/provider credentials for live adapters
A safety policy for external/destructive actions

Out of scope

Mistaking per-skill tests for full setup proof
Ignoring activation failures
Skipping transcript/order evidence
Measuring a drifted install

Agent handoff

Start from the packet, not the UI.

agb start https://agent-buildprint.com/buildprints/complete-agent-skills-evaluation-os/package.json

Key files

The first files an agent should read.

BUILDPRINT.md compatibility bootstrap or package contract

All package files

ACTIVATION_EVALS.md Buildprint package file

BUILDPRINT.md compatibility bootstrap or package contract

checks/acceptance.md acceptance checklist

CONTRACTS.md legacy interface/data contracts, when present

E2E_TASK_BENCH.md Buildprint package file

LOADOUT_INVENTORY.md Buildprint package file

MULTI_AGENT_SAFETY.md Buildprint package file

PLAN.md legacy execution index, when present

proof/package-lock.json offline proof artifact

proof/package.json offline proof artifact

proof/src/eval-os.mjs offline proof artifact

proof/test/eval-os.test.mjs offline proof artifact

publication.json machine-readable mirror

README.md human overview, non-authoritative

SAFETY_POLICY.md Buildprint package file

SCORECARD.md Buildprint package file

SKILL_UNIT_EVALS.md Buildprint package file

SPEC.md legacy behavior requirements, when present

STATIC_LINT.md Buildprint package file

TEST_MATRIX.md legacy risk-to-test alignment, when present

TRANSCRIPT_PROCESS_EVALS.md Buildprint package file

VALIDATION_REPORT.md Buildprint package file

VALIDATION_TEMPLATE.md legacy completion report template, when present