Eval-driven agent development

Start from a working-but-flaky agent. Write a 10-case eval set, run it, watch it fail, iterate on the system prompt, run again, and watch the score move. The session ends with the eval wired in as a CI gate so regressions can't ship.
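
As a rough illustration of the loop described above, the sketch below scores an agent against a small case set and turns the score into a CI exit code. It is not the workshop's material: `EvalCase`, `run_eval`, `gate`, `stub_agent`, the substring-match grading, and the 0.9 threshold are all assumptions made for this example, and only two cases are shown instead of ten to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring the agent's answer must contain (assumed grading rule)


def run_eval(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected substring."""
    passed = sum(
        1 for case in cases if case.expected.lower() in agent(case.prompt).lower()
    )
    return passed / len(cases)


def gate(score: float, threshold: float) -> int:
    """Map a score to a process exit code: 0 passes the CI job, 1 fails it."""
    return 0 if score >= threshold else 1


def stub_agent(prompt: str) -> str:
    # Hypothetical stand-in for the real agent call (model + system prompt).
    return "The answer is 4." if "2 + 2" in prompt else "I am not sure."


# Hypothetical cases; a real set would hold the ten cases for your agent.
CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
]


def main() -> int:
    score = run_eval(stub_agent, CASES)
    print(f"eval score: {score:.2f}")
    return gate(score, threshold=0.9)  # threshold is an assumed value
```

In a CI job, the script's entry point would end with `raise SystemExit(main())`, so a score below the threshold fails the build. Improving the system prompt behind the agent and rerunning is what moves the score.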

Details

City
Date
Time
Session type
Workshop
Speaker(s)
Felix Becker
Member of Technical Staff, Anthropic