Evals for subjective, stateful agents

We built a rubric-driven, replayable eval system from real user projects that reports quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

Details

City

San Francisco, USA

Date

May 7, 2026

Time

11:30AM – 12:00PM

Speaker(s)

Yikai Zhu

Software Engineer, Descript

Ajay Arasanipalai

AI Researcher, Descript

Anthropic's developer conference, recorded

Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.