Evals for subjective, stateful agents

A rubric-driven, replayable eval system built from real user projects delivers quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

Details

Speaker(s)

Yikai Zhu
Software Engineer, Descript

Ajay Arasanipalai
AI Researcher, Descript

Anthropic's developer conference, recorded

Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.