Evals for taste: Hill-climbing a slide-generation agent

A rubric-driven, replayable eval system built from real user projects, giving quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

Details

City
Tokyo, JP
Date
11 June 2026
Time
13:00 – 13:45
Speaker(s)
Jonah Dueck
Member of Technical Staff, Anthropic

Agenda

08:00 – 09:00
Check-in and breakfast
09:30 – 09:45
Opening keynote
Angela Jiang
Cat Wu
Katelyn Lesse
Morning sessions
11:45 – 13:30
Lunch
Afternoon sessions
Evening
15:45 – 18:00
Closing reception