Evals for subjective, stateful agents
Building a rubric-driven, replayable eval system from real user projects that delivers quality, cost, latency, error, and token signals in under 6 hours per model change, and how it evolved into a development flywheel powered by real user dissatisfaction signals.
Details
City: San Francisco, USA
Date: May 7, 2026
Time: 11:30 AM – 12:00 PM
Speaker(s): Yikai Zhu (Descript), Ajay Arasanipalai (Descript)
Anthropic's developer conference, recorded
Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.