@agent-buildprint/agentic-chat-eval-harness

Agentic Chat Eval Harness

A trace-aware, scenario-based Capability Buildprint for grafting a rigorous Agentic Chat evaluation harness onto an existing host, with profiles for core chat, tool actions, memory and state, provider routing, UI proof, and optional RAG grounding.

Agentic Chat Eval Harness Capability Buildprint

This Capability Buildprint installs a rigorous evaluation harness for existing Agentic Chat systems.

It does not build a chat product. It adds a repeatable proof layer that can evaluate multi-turn conversation behavior, tool/action correctness, provider routing, memory/state continuity, recovery behavior, UI evidence, and optional RAG grounding.

What it adds

  • scenario fixtures for agentic chat tasks
  • simulated user turns and interruption paths
  • trace/span collection for model calls, tool calls, handoffs, guardrails, retrieval, UI events, and custom host events
  • expected tool/action side-effect checks
  • model/provider routing and fallback checks
  • memory/state continuity checks
  • streaming and UI proof checks
  • optional RAG profile for retrieval, citation, grounding, permission, and stale-index checks
  • deterministic scorers plus bounded model-judge scorers
  • regression command and machine-readable receipts

Design thesis

Final-answer grading is not enough.

An Agentic Chat harness must inspect:

  • what the agent saw
  • what it asked
  • what tool it chose
  • what arguments it sent
  • what the tool changed
  • how it recovered from failures
  • what state or memory changed
  • whether the UI represented the real state
  • whether retrieved evidence actually supported the answer

If those artifacts are unavailable, the harness must downgrade the proof level instead of reporting success.
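
The downgrade rule above can be sketched as a ceiling on the proof level: each missing artifact caps how much a run is allowed to claim. This is a minimal illustration only. The level names (`UNVERIFIED`, `TEXT_ONLY`, `TRACE_BACKED`, `FULL`) and the artifact keys are assumptions made for this example and are not taken from BUILDPRINT.md.

```python
from __future__ import annotations

from enum import IntEnum


class ProofLevel(IntEnum):
    # Hypothetical levels; the Buildprint may name and order them differently.
    UNVERIFIED = 0
    TEXT_ONLY = 1
    TRACE_BACKED = 2
    FULL = 3


# Hypothetical artifact names, each mapped to the highest level a run
# may claim when that artifact is missing.
CEILING_WITHOUT = {
    "trace": ProofLevel.TEXT_ONLY,
    "tool_side_effects": ProofLevel.TRACE_BACKED,
    "state_diff": ProofLevel.TRACE_BACKED,
    "ui_evidence": ProofLevel.TRACE_BACKED,
}


def effective_proof_level(claimed: ProofLevel, artifacts: set[str]) -> ProofLevel:
    """Lower the claimed level to the ceiling of every missing artifact."""
    level = claimed
    for name, ceiling in CEILING_WITHOUT.items():
        if name not in artifacts:
            level = min(level, ceiling)
    return level
```

With this shape, a run that claims `FULL` but recorded no trace is reported as `TEXT_ONLY` rather than as a pass.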

Preferred baseline

Use existing host tooling when it is already available. Otherwise prefer:

  • runner style: pytest/Vitest or equivalent host test runner
  • trace model: OpenTelemetry-compatible spans or host-neutral JSON spans
  • scenario format: versioned YAML/JSON fixtures
  • deterministic scorers: assertions over traces, tool results, state diffs, citations, and UI evidence
  • model-judge scorers: optional, rubric-bound, example-calibrated, and never sole proof for high-risk claims
  • optional integrations: OpenAI Evals, Inspect AI, DeepEval, Braintrust, Phoenix, Langfuse, OpenLLMetry, Ragas, TruLens, or LlamaIndex evaluation adapters

The durable contract is the proof behavior, not a specific vendor.
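
As a rough illustration of a deterministic scorer over host-neutral JSON spans, the sketch below checks that the agent called the expected tools in the expected order. The span fields (`kind`, `name`, `attrs`) and the tool names are assumptions for this example; a real host would map its own trace schema, or OpenTelemetry-compatible spans, onto whatever shape the Buildprint specifies.

```python
import json

# Hypothetical span shape; field names are illustrative only.
SPANS_JSON = """
[
  {"kind": "model_call", "name": "plan", "attrs": {"provider": "primary"}},
  {"kind": "tool_call", "name": "search_orders", "attrs": {"args": {"customer_id": "c-1"}}},
  {"kind": "tool_call", "name": "refund_order", "attrs": {"args": {"order_id": "o-9"}}}
]
"""


def tool_sequence(spans):
    """Return tool names in the order the trace recorded them."""
    return [s["name"] for s in spans if s["kind"] == "tool_call"]


def score_tool_sequence(spans, expected):
    """Deterministic scorer: pass only if the recorded order matches exactly."""
    actual = tool_sequence(spans)
    return {
        "scorer": "tool_sequence",
        "passed": actual == expected,
        "expected": expected,
        "actual": actual,
    }


spans = json.loads(SPANS_JSON)
result = score_tool_sequence(spans, ["search_orders", "refund_order"])
```

Because the scorer returns a plain dictionary, the same result can be asserted on in pytest and serialized into a receipt.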

Profiles

Core chat

Evaluates multi-turn task completion, instruction adherence, clarifying questions to the user, blocked states, refusal boundaries, transcript quality, and final answer usefulness.

Tool actions

Evaluates tool selection, tool argument correctness, side effects, retries, idempotency, and recovery from tool failure.
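
A tool-action check along these lines might look like the sketch below, which verifies the selected tool, its arguments, the resulting side effect, and idempotency under a simulated retry. `FakeTicketStore`, `create_ticket`, and the call dictionary shape are hypothetical stand-ins for a host system, not part of the Buildprint.

```python
class FakeTicketStore:
    """Hypothetical in-memory stand-in for a system that a tool writes to."""

    def __init__(self):
        self.tickets = {}

    def create_ticket(self, idempotency_key, title):
        # Replaying the same key must not create a second ticket.
        self.tickets.setdefault(idempotency_key, {"title": title})
        return idempotency_key


def check_tool_action(call, store, expected_tool, expected_args):
    """Return a list of failures; an empty list means the action passed."""
    failures = []
    if call["tool"] != expected_tool:
        failures.append(f"wrong tool: {call['tool']}")
    if call["args"] != expected_args:
        failures.append(f"wrong arguments: {call['args']}")
    if failures:
        return failures  # do not execute a call that is already wrong

    before = len(store.tickets)
    getattr(store, call["tool"])(**call["args"])
    if len(store.tickets) != before + 1:
        failures.append("expected side effect not observed")

    getattr(store, call["tool"])(**call["args"])  # simulated retry
    if len(store.tickets) != before + 1:
        failures.append("retry duplicated the side effect")
    return failures
```

The point of the design is that the verdict comes from the store's state before and after the call, not from what the agent said it did.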

Memory and state

Evaluates durable state changes, memory writes, compaction behavior, state diffs, and stale-memory avoidance.
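
One simple way to express a state-diff check is to snapshot state before and after a scenario and compare the difference against what the scenario expects. The sketch below assumes state can be flattened to a dictionary; the keys shown are invented for illustration.

```python
def state_diff(before: dict, after: dict) -> dict:
    """Describe what was added, removed, or changed between two snapshots."""
    return {
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {
            k: (before[k], after[k])
            for k in before.keys() & after.keys()
            if before[k] != after[k]
        },
    }


def state_matches(before: dict, after: dict, expected_diff: dict) -> bool:
    """Pass only if the observed diff equals the expected diff exactly,
    so unexpected writes fail the check as well as missing ones."""
    return state_diff(before, after) == expected_diff
```

Requiring an exact match means a stray memory write is treated as a failure, not silently tolerated.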

Provider routing

Evaluates model/provider selection, fallback, retry policy, latency/cost capture, and degraded-mode behavior.
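
A fallback check can be sketched as assertions over the recorded model-call spans. The field names (`provider`, `status`, `latency_ms`) and the rule that the first attempt must use the primary provider are assumptions for this example; a host's actual routing policy may differ.

```python
def check_fallback(spans, primary, fallback):
    """Return a list of routing failures; an empty list means the check passed."""
    calls = [s for s in spans if s["kind"] == "model_call"]
    if not calls:
        return ["no model_call spans recorded"]

    failures = []
    if calls[0]["provider"] != primary:
        failures.append("first attempt did not use the primary provider")

    primary_failed = any(
        c["provider"] == primary and c["status"] == "error" for c in calls
    )
    final = calls[-1]
    if primary_failed and not (final["provider"] == fallback and final["status"] == "ok"):
        failures.append("primary failed but fallback did not succeed")
    if not primary_failed and final["provider"] != primary:
        failures.append("fallback used without a primary failure")

    # Latency must be captured on every attempt, including failed ones.
    for index, call in enumerate(calls):
        if "latency_ms" not in call:
            failures.append(f"span {index} is missing latency_ms")
    return failures
```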

UI proof

Evaluates streaming, action visibility, loading/error/blocked states, receipt access, and absence of fake success or raw debug leakage.
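
The "fake success" case can be illustrated by cross-checking UI events against tool spans: any action the UI showed as successful must have a matching span that actually succeeded. The event and span shapes below (`action_id`, `state`, `status`) are hypothetical.

```python
def find_fake_success(tool_spans, ui_events):
    """Return action ids the UI reported as successful without a matching
    tool span whose status is "ok"."""
    status_by_action = {s["action_id"]: s["status"] for s in tool_spans}
    return [
        e["action_id"]
        for e in ui_events
        if e["state"] == "success" and status_by_action.get(e["action_id"]) != "ok"
    ]
```

An action with no span at all is also flagged, since the UI then has nothing real to stand on.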

RAG

Optional profile. Evaluates retrieval allow/deny behavior, context precision, context recall, groundedness, answer relevance, citation coverage, unsupported claim rate, stale/deleted content exclusion, and weak-evidence uncertainty.
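
For orientation, a few of these metrics can be approximated with simple set arithmetic over document ids. The definitions below are deliberately simplified and are not the formulas used by Ragas, TruLens, or any other library; the function names and input shapes are assumptions for this sketch.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


def citation_coverage(claims):
    """Fraction of answer claims that carry at least one citation.
    Each claim is a dict with a "citations" list of document ids."""
    if not claims:
        return 1.0
    return sum(1 for claim in claims if claim["citations"]) / len(claims)


def denied_leaks(retrieved, denied):
    """Deny-path check: documents that were retrieved but must not be."""
    return sorted(set(retrieved) & set(denied))
```

A deny-path test passes only when `denied_leaks` returns an empty list for a user who lacks access to the denied documents.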

Non-negotiables

  • No pass from final text alone.
  • No tool/action pass without expected side-effect proof.
  • No RAG pass without retrieved context, citations, and deny-path proof.
  • No UI pass from prose description alone.
  • No model-judge-only pass for security, billing, legal, destructive, or permission-sensitive behavior.
  • No benchmark claim without pinned scenario/dataset version.
  • No production claim without a regression command and .buildprint/agentic-chat-eval-receipt.md.
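
Several of these rules lend themselves to a mechanical gate that runs before any result is recorded as a pass. The sketch below covers four of them; the result fields (`profile`, `risk`, `evidence`, `scorers`, `scenario_version`) and their values are hypothetical and would need to match whatever the Buildprint's receipt format actually records.

```python
from __future__ import annotations

# Risk categories taken from the model-judge rule above; the short labels are assumed.
HIGH_RISK = {"security", "billing", "legal", "destructive", "permission"}


def gate_violations(result: dict) -> list[str]:
    """Return the non-negotiables a candidate pass would violate."""
    evidence = set(result.get("evidence", []))
    scorers = set(result.get("scorers", []))
    violations = []

    if not (evidence - {"final_text"}):
        violations.append("pass from final text alone")
    if result.get("profile") == "tool_actions" and "side_effect" not in evidence:
        violations.append("tool/action pass without side-effect proof")
    if result.get("risk") in HIGH_RISK and scorers == {"model_judge"}:
        violations.append("model-judge-only pass on high-risk behavior")
    if not result.get("scenario_version"):
        violations.append("scenario/dataset version not pinned")
    return violations
```

A result is eligible to pass only when the returned list is empty; otherwise the violations belong in the receipt.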

Where to start

Start with BUILDPRINT.md. The README is the human overview; the Buildprint files are the executable contract.

See examples/core-chat-scenario.yaml for a minimal scenario shape and examples/eval-receipt.md for the expected receipt structure.