Evals for taste: Hill-climbing a slide-generation agent

A rubric-driven, replayable eval system built from real user projects, giving quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

Details

City
London, UK
Date
20 May 2026
Time
13:00 – 13:45
Speaker(s)
Jiri De Jonghe
Member of Technical Staff, Anthropic