# E2E_TASK_BENCH

End-to-end tasks exercise the whole setup in a single run, rather than testing one component in isolation.

## Starter benchmark classes

1. Tiny bugfix with failing test.
2. Feature addition with docs and tests.
3. Refactor with no behavior change.
4. Security review with planted bug.
5. Migration requiring plan and rollback notes.
6. Browser/UI task with screenshot or DOM proof.
7. Multi-file architecture change.
8. Incident/debugging task with logs and hypothesis tracking.
9. Documentation-only task requiring source verification.
10. External-action request that must be approval-gated.
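
The classes above could be made machine-readable with a small registry. This is a minimal sketch, not an existing harness: `TaskClass`, `BenchTask`, and `make_task` are hypothetical names, and the rule that only the external-action class is approval-gated is an assumption read from the wording of item 10.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    """The ten starter classes, in the order listed above."""
    TINY_BUGFIX = "Tiny bugfix with failing test"
    FEATURE_ADDITION = "Feature addition with docs and tests"
    REFACTOR = "Refactor with no behavior change"
    SECURITY_REVIEW = "Security review with planted bug"
    MIGRATION = "Migration requiring plan and rollback notes"
    BROWSER_UI = "Browser/UI task with screenshot or DOM proof"
    ARCHITECTURE_CHANGE = "Multi-file architecture change"
    INCIDENT_DEBUGGING = "Incident/debugging task with logs and hypothesis tracking"
    DOCS_ONLY = "Documentation-only task requiring source verification"
    EXTERNAL_ACTION = "External-action request that must be approval-gated"


@dataclass(frozen=True)
class BenchTask:
    """One benchmark task instance (hypothetical shape)."""
    task_id: str
    task_class: TaskClass
    prompt: str
    requires_approval: bool = False


def make_task(task_id: str, task_class: TaskClass, prompt: str) -> BenchTask:
    # Assumption: only external-action requests are approval-gated.
    gated = task_class is TaskClass.EXTERNAL_ACTION
    return BenchTask(task_id, task_class, prompt, requires_approval=gated)
```

For example, `make_task("t-010", TaskClass.EXTERNAL_ACTION, "Send the release email")` yields a task with `requires_approval=True`, while a task of any other class defaults to `False`.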

## Scoring

- Completion correctness.
- Tests, build, and lint results.
- Process compliance.
- Safety compliance.
- Evidence quality.
- Minimality (no unnecessary churn).
- Recovery from errors.
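
The dimensions above can be recorded per task and combined into a single number. The document does not say how scores are scaled or weighted, so the sketch below assumes a 0.0 to 1.0 scale per dimension and equal weights by default. `Scorecard` and `overall` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class Scorecard:
    """One score per dimension; each is assumed to lie in [0.0, 1.0]."""
    completion_correctness: float
    tests_build_lint: float
    process_compliance: float
    safety_compliance: float
    evidence_quality: float
    minimality: float
    error_recovery: float


def overall(card: Scorecard, weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of all dimensions (equal weights unless given)."""
    names = [f.name for f in fields(card)]
    if weights is None:
        weights = {name: 1.0 for name in names}
    missing = [name for name in names if name not in weights]
    if missing:
        raise ValueError(f"missing weights for: {missing}")
    for name in names:
        value = getattr(card, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(getattr(card, name) * weights[name] for name in names) / total_weight
```

A weighted mean is only one option. A stricter harness might treat safety compliance as a hard gate, failing the task outright when it scores below a threshold, instead of averaging it with the other dimensions.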
