Terminal Bench
The standard benchmark for AI agent evaluation. 10,000+ expert-crafted tasks in stateful Docker environments, used by virtually every frontier lab to measure what their agents can actually do.
Powered by Caudal's expert network
Caudal's domain experts design the tasks that break frontier models. Not crowdsourced — vetted specialists who understand where AI reasoning falls apart.
Adversarial Task Design
Tasks built to work against LLMs, not help them succeed.
Comparative Signal Analysis
We verify that harder tasks still measure the skills they claim to measure, rather than adding noise.
Oracle-Validated Quality
Every task goes through oracle verification and dual-agent testing before acceptance.
Domain Expert Contributors
Vetted engineers specializing in security, DevOps, backend, and data science.
1,000+ expert-level topics
High-precision data development for the challenges and tasks that generalist workflows can't address.
Built by domain experts. Verified by programmatic execution. Designed for frontier reasoning.
Tasks
1,000+ verifiable CLI tasks across 30+ domains.
Infrastructure
Secure sandbox with stateful rollbacks.
Evaluation
Automated bash assertions, not LLM-as-a-judge.
Coverage
Sysadmin, DevOps, Data Science, and more.
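A minimal sketch of what a bash-assertion check can look like. The task, file name, and expected contents here are hypothetical, and the agent's output is simulated; the point is that success is decided by exit codes, not by a judge model.

```shell
#!/usr/bin/env bash
# Hypothetical verifier for a task that asks the agent to write
# the word "done" into result.txt. A real harness would run this
# after the agent finishes; here we simulate the agent's output.
set -euo pipefail

workdir=$(mktemp -d)
echo "done" > "$workdir/result.txt"   # stand-in for the agent's work

# Assertion 1: the required file must exist.
[ -f "$workdir/result.txt" ] || { echo "FAIL: result.txt missing"; exit 1; }

# Assertion 2: its contents must match exactly -- no judge model, just exit codes.
grep -qx "done" "$workdir/result.txt" || { echo "FAIL: wrong contents"; exit 1; }

echo "PASS"
```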
What our contributors build
Every task is an evaluation artifact — designed to work against LLMs, validated by oracle execution, and accepted only if it exposes real reasoning failures. These are not coding problems. They are traps.
Fix a Slow Application
An app that lists books starts freezing as more users visit. The problem? It asks the database hundreds of separate questions when it should ask just two or three. Our contributor finds the bottleneck, restructures how the app talks to the database, and proves it stays fast even under heavy traffic. AI agents usually fix one part but miss the other, leaving the app still slow.
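The bottleneck described above is the classic N+1 query pattern. A minimal sketch using the sqlite3 CLI, with an invented books-and-authors schema for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail
db=$(mktemp)

# Invented schema: books and their authors.
sqlite3 "$db" <<'SQL'
CREATE TABLE authors(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books(id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
INSERT INTO authors VALUES (1,'Ann'),(2,'Bo');
INSERT INTO books VALUES (1,'First',1),(2,'Second',1),(3,'Third',2);
SQL

# Slow pattern: one round trip per book to look up its author (N+1 queries).
for aid in $(sqlite3 "$db" "SELECT author_id FROM books;"); do
  sqlite3 "$db" "SELECT name FROM authors WHERE id=$aid;"
done

# Fast pattern: a single JOIN fetches the same data in one query.
sqlite3 "$db" "SELECT b.title, a.name FROM books b JOIN authors a ON a.id = b.author_id;"
```

With hundreds of books, the loop issues hundreds of separate queries; the JOIN stays at one regardless of table size.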
Rebuild an App on a New Foundation
An entire web application needs to move from one technology stack to another — new login system, new way of handling data, new API design. The contributor rewires 15–20 files, replacing old patterns with modern ones while making sure nothing breaks for existing users. AI agents commonly mix old and new patterns together, creating code that looks right but fails under real use.
Upgrade a Core Library Without Breaking Everything
A critical library the app depends on releases a major update that changes how everything works. The contributor traces every ripple effect across the entire codebase and updates each piece to match the new version. AI agents frequently make surface-level fixes that pass initial checks but silently break when real users interact with the app.
Build the benchmark with us
We're looking for domain experts who can design tasks that push AI agents to their limits.