Real-world agent evals

The browser is the real agent benchmark

Most agent demos skip the part that breaks in production: real websites. That means auth flows, iframes, Shadow DOM, rich-text editors, redirects, passkeys, and verification receipts.


What the benchmark cares about

Receipts over vibes

If an agent says it posted, submitted, updated, or paid for something, it needs a receipt: a final URL, a DOM state, a server response, a public page check, or a screenshot.

That is the difference between a chatbot and infrastructure.
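
As a rough illustration of the receipt rule, here is a minimal Python sketch. Everything in it is an assumption made for the example, not part of any existing benchmark harness: the `Receipt` and `Claim` types, their field names, and the `is_verified` helper are all hypothetical. It only checks that a claim carries at least one piece of evidence; it does not check that the evidence is true.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    """Evidence attached to an agent's claim. All field names are hypothetical."""
    final_url: Optional[str] = None            # where the browser ended up
    dom_state: Optional[str] = None            # e.g. text of a confirmation element
    server_response: Optional[int] = None      # e.g. an HTTP status code
    public_page_check: Optional[bool] = None   # did the change show up publicly?
    screenshot_path: Optional[str] = None      # path to a saved screenshot

    def evidence(self) -> list[str]:
        """Return the names of the fields that actually carry evidence."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) not in (None, False, "")
        ]


@dataclass(frozen=True)
class Claim:
    """Something the agent says it did: posted, submitted, updated, paid."""
    action: str
    receipt: Optional[Receipt] = None


def is_verified(claim: Claim) -> bool:
    """A claim counts only if it comes with at least one piece of evidence."""
    return claim.receipt is not None and len(claim.receipt.evidence()) > 0


if __name__ == "__main__":
    bare = Claim("posted")
    backed = Claim(
        "paid",
        Receipt(final_url="https://example.com/orders/123", server_response=200),
    )
    print(is_verified(bare), is_verified(backed))
```

A real harness would go further and re-check each field independently, for example by reloading the public page instead of trusting the agent's own report.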