Evals for subjective, stateful agents
A rubric-driven, replayable eval system built from real user projects, delivering quality, cost, latency, error, and token signals in under 6 hours per model change. It has since evolved into a development flywheel powered by real user dissatisfaction signals.
Details
Speaker(s)
Yikai Zhu
Descript
Ajay Arasanipalai
Descript
Anthropic's developer conference, recorded
Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.