Real-world agent evals
The browser is the real agent benchmark
Most agent demos skip the part that breaks in production: real websites. That means auth flows, iframes, Shadow DOM, rich-text editors, redirects, passkeys, and verifiable receipts.
What the benchmark cares about
- Persistent authenticated browser profiles.
- MFA and CAPTCHA handoff without leaking credentials.
- Iframe switching and Shadow DOM traversal.
- ProseMirror and Quill rich-text editors.
- Passkey/WebAuthn prompts that sit outside the DOM.
- Responsive layout shifts and hidden controls.
- Final-state verification instead of trusting a process exit code.
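To make the iframe and Shadow DOM item concrete, here is a minimal sketch in plain Python. It uses a toy page model, not a real browser API: `Node`, `flat_walk`, `deep_walk`, and `find_by_id` are hypothetical names invented for illustration. The point is only that a lookup confined to the top-level document never reaches a control nested inside a frame or a shadow root.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Node:
    """Toy stand-in for a DOM element (hypothetical, not a browser API)."""
    tag: str
    node_id: str = ""
    children: List["Node"] = field(default_factory=list)
    shadow_root: Optional["Node"] = None     # attached shadow tree, if any
    frame_document: Optional["Node"] = None  # document loaded in an iframe, if any


def flat_walk(node: Node) -> Iterator[Node]:
    """Visit regular children only, like a lookup scoped to one document."""
    yield node
    for child in node.children:
        yield from flat_walk(child)


def deep_walk(node: Node) -> Iterator[Node]:
    """Visit regular children, then descend into shadow roots and frames."""
    yield node
    for child in node.children:
        yield from deep_walk(child)
    if node.shadow_root is not None:
        yield from deep_walk(node.shadow_root)
    if node.frame_document is not None:
        yield from deep_walk(node.frame_document)


def find_by_id(walker: Callable[[Node], Iterator[Node]],
               root: Node, node_id: str) -> Optional[Node]:
    """Return the first node with the given id, or None if it is unreachable."""
    return next((n for n in walker(root) if n.node_id == node_id), None)


# A pay button inside a shadow root, inside an iframe.
page = Node("html", children=[
    Node("iframe", node_id="checkout-frame",
         frame_document=Node("html", children=[
             Node("pay-widget", shadow_root=Node("shadow-root", children=[
                 Node("button", node_id="pay-now"),
             ])),
         ])),
])

print(find_by_id(flat_walk, page, "pay-now"))      # None
print(find_by_id(deep_walk, page, "pay-now").tag)  # button
```

In a real browser the same descent means switching frame context and entering each shadow root explicitly. The toy model just shows why an agent that skips those steps reports "element not found" on a page where a human sees the button.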
Receipts over vibes
If an agent says it posted, submitted, updated, or paid for something, it needs a receipt: a final URL, a DOM state, a server response, a public page check, or a screenshot.
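One way to picture a receipt is as a small record attached to every claimed action. The sketch below is an assumption about how that could look, not a description of any particular benchmark's implementation: `Receipt`, `evidence`, and `is_verified` are hypothetical names, and the fields simply mirror the five kinds of evidence listed above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class Receipt:
    """Evidence that an action took effect (hypothetical structure)."""
    final_url: Optional[str] = None        # where the browser ended up
    dom_state: Optional[str] = None        # e.g. text of a confirmation element
    server_response: Optional[str] = None  # e.g. a status line or response body
    public_page_url: Optional[str] = None  # page where the result is publicly visible
    screenshot_path: Optional[str] = None  # saved screenshot of the final state

    def evidence(self) -> List[str]:
        """Names of the evidence fields that were actually captured."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


def is_verified(exit_code: int, receipt: Optional[Receipt]) -> bool:
    """Accept a claim only if the run succeeded AND left at least one piece of evidence."""
    return exit_code == 0 and receipt is not None and bool(receipt.evidence())


# A clean exit with no receipt is not accepted.
print(is_verified(0, None))  # False

# The same exit code with a final URL and a screenshot is accepted.
receipt = Receipt(final_url="https://example.com/orders/123",
                  screenshot_path="order-123.png")
print(is_verified(0, receipt))  # True
print(receipt.evidence())       # ['final_url', 'screenshot_path']
```

The design choice worth noting is that the exit code is necessary but never sufficient: a harness built this way cannot mark a task as passed on the agent's word alone.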
That is the difference between a chatbot and infrastructure.