Real-world agent evals

The browser is the real agent benchmark

Most agent demos skip the part that breaks in production: real websites. That means auth flows, iframes, Shadow DOM, rich-text editors, redirects, passkeys, and verification receipts.


What the benchmark cares about

Persistent authenticated browser profiles.
MFA and CAPTCHA handoff without leaking credentials.
Iframe switching and Shadow DOM traversal.
ProseMirror and Quill rich-text editors.
Passkey/WebAuthn prompts that sit outside the DOM.
Responsive layout shifts and hidden controls.
Final-state verification instead of trusting a process exit.
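
To make the iframe and Shadow DOM item concrete, here is a minimal sketch in Python over a toy node tree. It is not a real browser API: the Node class, its shadow_root and frame_document fields, and the two walkers are hypothetical names introduced only for illustration. It shows why a lookup that walks regular children misses a control that a boundary-aware walk finds.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Toy stand-in for a DOM element (hypothetical, not a browser API)."""
    tag: str
    node_id: str = ""
    children: list["Node"] = field(default_factory=list)
    shadow_root: Optional["Node"] = None      # content attached via a shadow root
    frame_document: Optional["Node"] = None   # document loaded inside an iframe


def walk_light(node: Node) -> Iterator[Node]:
    """Visit only regular children, like a naive selector query."""
    yield node
    for child in node.children:
        yield from walk_light(child)


def walk_deep(node: Node) -> Iterator[Node]:
    """Visit children and also cross shadow-root and iframe boundaries."""
    yield node
    for child in node.children:
        yield from walk_deep(child)
    if node.shadow_root is not None:
        yield from walk_deep(node.shadow_root)
    if node.frame_document is not None:
        yield from walk_deep(node.frame_document)


def find_by_id(root: Node, node_id: str, deep: bool) -> Optional[Node]:
    """Return the first node with the given id, or None if it is not reachable."""
    walker = walk_deep if deep else walk_light
    return next((n for n in walker(root) if n.node_id == node_id), None)


# A page whose submit button lives in a shadow root inside an iframe.
button = Node("button", node_id="submit")
widget = Node("checkout-widget", shadow_root=Node("shadow-root", children=[button]))
page = Node("html", children=[
    Node("iframe", frame_document=Node("html", children=[widget])),
])

print(find_by_id(page, "submit", deep=False))      # None
print(find_by_id(page, "submit", deep=True).tag)   # button
```

In a real browser the same idea usually means switching frame context and querying through open shadow roots explicitly. The toy tree only illustrates why the naive lookup comes back empty.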

Receipts over vibes

If an agent says it posted, submitted, updated, or paid for something, it needs a receipt: a final URL, a DOM state, a server response, a public page check, or a screenshot.
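
One way to enforce this is to refuse to count an action as done until evidence is attached and matches the claim. The sketch below is a minimal illustration in Python. The Receipt type, its field names, and verify_claim are assumptions made up for this example, not part of any existing tool, and the example.com URLs are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    """Evidence for one claimed action (hypothetical schema)."""
    final_url: Optional[str] = None
    dom_text: Optional[str] = None         # text read back from the page
    server_status: Optional[int] = None    # HTTP status of the submit request
    screenshot_path: Optional[str] = None  # kept for human review, not auto-checked


def verify_claim(exit_code: int, receipt: Optional[Receipt],
                 expected_url_prefix: str, expected_text: str) -> tuple[bool, str]:
    """Return (verified, reason). A clean exit alone never counts."""
    if receipt is None:
        return False, "no receipt: exit code %d proves nothing" % exit_code
    checks = []
    if receipt.final_url is not None:
        checks.append(receipt.final_url.startswith(expected_url_prefix))
    if receipt.dom_text is not None:
        checks.append(expected_text in receipt.dom_text)
    if receipt.server_status is not None:
        checks.append(200 <= receipt.server_status < 300)
    if not checks:
        return False, "receipt has no machine-checkable evidence"
    if all(checks):
        return True, "verified by %d check(s)" % len(checks)
    return False, "evidence contradicts the claim"


posted = Receipt(
    final_url="https://example.com/posts/123",
    dom_text="Your post is live",
    server_status=201,
)
print(verify_claim(0, posted, "https://example.com/posts/", "post is live"))
# (True, 'verified by 3 check(s)')
print(verify_claim(0, None, "https://example.com/posts/", "post is live"))
# (False, 'no receipt: exit code 0 proves nothing')
```

The design choice worth noting is that the exit code appears only in the failure message. It is never an input to the pass decision.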

That is the difference between a chatbot and infrastructure.