01
Load
A realistic support case — a fake Stripe and CRM, seeded with one trap.
Five ways a refund agent quietly burns a real customer. Each scenario is seeded with one.
Your agent works the case. Every tool call it makes is captured.
A deterministic state diff grades the outcome. PASS or FAIL, nothing in between.
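One way to picture the grading step, as a minimal sketch rather than the product's actual grader: snapshot the fake Stripe and CRM state before and after the run, diff the two, and compare the diff to the changes the scenario allows. The flat key layout and the names `diff_state`, `grade`, and `ABSENT` are assumptions made for illustration.

```python
from typing import Any

# Hypothetical marker for a key that exists in only one snapshot.
ABSENT = "<absent>"


def diff_state(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (old, new)} for every key whose value differs."""
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, ABSENT)
        new = after.get(key, ABSENT)
        if old != new:
            changes[key] = (old, new)
    return changes


def grade(before: dict[str, Any], after: dict[str, Any],
          expected: dict[str, tuple[Any, Any]]) -> str:
    """PASS only if the observed diff matches the expected diff exactly."""
    return "PASS" if diff_state(before, after) == expected else "FAIL"


# Illustrative snapshots (amounts in cents); the keys are made up.
before = {"stripe.ch_1.refunded": 0, "crm.cust_42.credit": 0}
expected = {"stripe.ch_1.refunded": (0, 1200)}

after_ok = {"stripe.ch_1.refunded": 1200, "crm.cust_42.credit": 0}
after_trap = {"stripe.ch_1.refunded": 1200, "crm.cust_42.credit": 1200}

print(grade(before, after_ok, expected))    # refund only
print(grade(before, after_trap, expected))  # refund plus an extra credit
```

Because the comparison is exact equality on the diff, the same snapshots always give the same verdict, and any unexpected side effect fails the run.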
Every run emits structured JSON, so the same verdict drops straight into CI or a regression suite. No screenshots, no judgement calls.
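As a rough sketch of how such a report could gate a CI job: the field names (`scenario`, `verdict`, `tool_calls`) and both helper functions below are hypothetical, not the tool's documented schema.

```python
import json


def build_report(scenario: str, verdict: str, tool_calls: list[dict]) -> str:
    """Serialize one run as JSON; the field names here are assumed."""
    if verdict not in ("PASS", "FAIL"):
        raise ValueError(f"verdict must be PASS or FAIL, got {verdict!r}")
    report = {"scenario": scenario, "verdict": verdict, "tool_calls": tool_calls}
    # sort_keys keeps the output byte-stable, which helps regression diffs.
    return json.dumps(report, sort_keys=True)


def exit_code(report_json: str) -> int:
    """Map a report to a process exit code: 0 for PASS, 1 for anything else."""
    return 0 if json.loads(report_json)["verdict"] == "PASS" else 1


# Illustrative run with a single captured tool call (all values invented).
report = build_report(
    scenario="example-scenario",
    verdict="FAIL",
    tool_calls=[{"tool": "create_refund", "args": {"charge": "ch_1", "amount": 1200}}],
)
print(report)
print(exit_code(report))
```

A CI step would pass `exit_code(report)` to `sys.exit`, so a FAIL verdict breaks the build with no human review.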