Evaluating and improving Replit Agent at scale
Most teams shipping AI products can't build evals that predict how a model will actually perform in production. Michele Catasta, President & Head of AI at Replit, shares how his team closed that gap with ViBench — a public vibe-coding benchmark that scores whether the generated app works — and the offline/online evaluation loop behind Replit Agent that turns weeks of engineering into compounding overnight gains. Anthropic's Hannah Moran joins to share what separates evals that look rigorous from ones that actually help teams adopt new models with confidence.
Details
City: San Francisco, USA
Date: May 6, 2026
Time: 4:50 PM – 5:20 PM
Speaker(s): Michele Catasta (Replit), Hannah Moran (Anthropic)

Anthropic's developer conference, recorded
Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.