Evals for taste: Hill-climbing a slide-generation agent

A rubric-driven, replayable eval system built from real user projects, delivering quality, cost, latency, error, and token signals in under 6 hours per model change. It evolved into a development flywheel powered by real user dissatisfaction signals.

Details

City

London, UK

Date

20 May 2026

Time

13:00 – 13:45

Speaker(s)

Jiri De Jonghe

Member of Technical Staff, Anthropic

Anthropic's developer conference, recorded

Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.