You should build evaluation engineers instead of evaluation sets
How to design evals for long-running agent harnesses that Claude Code can hill-climb. Current evaluation approaches are insufficient for agents that work for hours on complex tasks.
Details
City
Date
Time
Speaker(s)
Anker Bach Ryhl
Parahelp
Anthropic's developer conference, recorded
Keynotes, demos, and conversations with the teams behind Claude. Recorded at Code w/ Claude 2026 San Francisco and ready to replay.