Eval-driven agent development

Start from a working-but-flaky agent. Write a 10-case eval set, run it, and watch it fail; then iterate on the system prompt, run again, and watch the score move. The session ends with the eval wired in as a CI gate so regressions can't ship.
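The loop described above can be sketched in a few lines. This is an illustrative sketch only, not the workshop's actual material: the names `EvalCase`, `run_eval`, `ci_gate`, and `stub_agent`, the substring grader, and the 0.9 threshold are all assumptions made for the example, and the stub stands in for a real agent call. Only two cases are shown for brevity; the session works with ten.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring the agent's reply must contain (assumed grading rule)


def run_eval(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose reply contains the expected substring."""
    passed = sum(
        1 for case in cases
        if case.expected.lower() in agent(case.prompt).lower()
    )
    return passed / len(cases)


def ci_gate(score: float, threshold: float = 0.9) -> int:
    """Map a score to a process exit code: 0 passes the gate, 1 fails it."""
    return 0 if score >= threshold else 1


def stub_agent(prompt: str) -> str:
    """Hypothetical stand-in for the real agent; deliberately flaky."""
    return "4" if "2 + 2" in prompt else "I am not sure"


# Hypothetical eval cases; a real set would cover the agent's actual tasks.
CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
]


def main() -> int:
    score = run_eval(stub_agent, CASES)
    print(f"pass rate: {score:.0%}")
    return ci_gate(score)
```

In a CI job, a script would end with `sys.exit(main())`, so a pass rate below the threshold produces a nonzero exit code and fails the build. Iterating on the system prompt then means changing the agent behind `stub_agent` and rerunning until the gate passes.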

Details

City

Date

Time

Session type

Workshop

Speaker(s)

Felix Becker

Member of Technical Staff, Anthropic

Anthropic's developer conference, recorded

Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.