README.md Raw

Agentic Chat Eval Harness Capability Buildprint

This Capability Buildprint installs a rigorous evaluation harness for existing Agentic Chat systems.

It does not build a chat product. It adds a repeatable proof layer that can evaluate multi-turn conversation behavior, tool/action correctness, provider routing, memory/state continuity, recovery behavior, UI evidence, and optional RAG grounding.

What it adds

scenario fixtures for agentic chat tasks
simulated user turns and interruption paths
trace/span collection for model calls, tool calls, handoffs, guardrails, retrieval, UI events, and custom host events
expected tool/action side-effect checks
model/provider routing and fallback checks
memory/state continuity checks
streaming and UI proof checks
optional RAG profile for retrieval, citation, grounding, permission, and stale-index checks
deterministic scorers plus bounded model-judge scorers
regression command and machine-readable receipts

Design thesis

Final-answer grading is not enough.

An Agentic Chat harness must inspect:

what the agent saw
what it asked
what tool it chose
what arguments it sent
what the tool changed
how it recovered from failures
what state or memory changed
whether the UI represented the real state
whether retrieved evidence actually supported the answer

If those artifacts are unavailable, the harness must downgrade the proof level instead of reporting success.

Preferred baseline

Use existing host tooling when it is already available. Otherwise prefer:

runner style: pytest/Vitest or equivalent host test runner
trace model: OpenTelemetry-compatible spans or host-neutral JSON spans
scenario format: versioned YAML/JSON fixtures
deterministic scorers: assertions over traces, tool results, state diffs, citations, and UI evidence
model-judge scorers: optional, rubric-bound, example-calibrated, and never sole proof for high-risk claims
optional integrations: OpenAI Evals, Inspect AI, DeepEval, Braintrust, Phoenix, Langfuse, OpenLLMetry, Ragas, TruLens, or LlamaIndex evaluation adapters

The durable contract is the proof behavior, not a specific vendor.

Profiles

Core chat

Evaluates multi-turn task completion, instruction adherence, user-questioning, blocked states, refusal boundaries, transcript quality, and final answer usefulness.

Tool actions

Evaluates tool selection, tool argument correctness, side effects, retries, idempotency, and recovery from tool failure.

Memory and state

Evaluates durable state changes, memory writes, compaction behavior, state diffs, and stale-memory avoidance.

Provider routing

Evaluates model/provider selection, fallback, retry policy, latency/cost capture, and degraded-mode behavior.

UI proof

Evaluates streaming, action visibility, loading/error/blocked states, receipt access, and absence of fake success or raw debug leakage.

RAG

Optional profile. Evaluates retrieval allow/deny behavior, context precision, context recall, groundedness, answer relevance, citation coverage, unsupported claim rate, stale/deleted content exclusion, and weak-evidence uncertainty.

Non-negotiables

No pass from final text alone.
No tool/action pass without expected side-effect proof.
No RAG pass without retrieved context, citations, and deny-path proof.
No UI pass from prose description alone.
No model-judge-only pass for security, billing, legal, destructive, or permission-sensitive behavior.
No benchmark claim without pinned scenario/dataset version.
No production claim without a regression command and .buildprint/agentic-chat-eval-receipt.md.

Where to start

Start with BUILDPRINT.md. The README is the human overview; the Buildprint files are the executable contract.

See examples/core-chat-scenario.yaml for a minimal scenario shape and examples/eval-receipt.md for the expected receipt structure.