Terminal Bench

The standard benchmark for AI agent evaluation: 2,000+ expert-crafted tasks in stateful Docker environments, used by frontier labs to measure what their agents can actually do.

2,000+
Tasks
30+
Domains
100+
Contributors

Powered by Caudal's expert network

Caudal's domain experts design the tasks that break frontier models. Not crowdsourced — vetted specialists who understand where AI reasoning falls apart.

Adversarial Task Design

Tasks built to work against LLMs, not help them succeed.

Comparative Signal Analysis

We verify that harder tasks still measure the capabilities they claim to measure, rather than rewarding noise.

Oracle-Validated Quality

Every task goes through oracle verification and dual-agent testing before acceptance.

Domain Expert Contributors

Vetted engineers specializing in security, DevOps, backend engineering, and data science.

1,000+ expert-level topics

High-precision data development for the challenges and tasks that generalist workflows can't address.

Talk to an expert

Deep Dive

Built by domain experts. Verified by programmatic execution. Designed for frontier reasoning.

Tasks

2,000+ verifiable CLI tasks across 30+ domains.

Infrastructure

Secure sandbox with stateful rollbacks.

Evaluation

Automated bash assertions, not LLM-as-a-judge.

Coverage

Sysadmin, DevOps, Data Science, and more.
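To make the evaluation model concrete, here is a minimal sketch of what a programmatic check can look like. The task ("deduplicate the lines in a file"), the file paths, and the fixture data are all hypothetical illustrations, not actual Terminal Bench tasks; the point is that acceptance is decided by bash assertions executing against container state, not by an LLM judging the transcript.

```shell
#!/usr/bin/env bash
# Hypothetical verification script for a toy task:
# "deduplicate the lines in /tmp/data.txt".
set -euo pipefail

printf 'a\nb\na\nc\n' > /tmp/data.txt            # task fixture (illustrative)
sort -u /tmp/data.txt > /tmp/data.dedup.txt      # an oracle solution

# Acceptance is decided by plain bash assertions, not by an LLM judge:
[ "$(wc -l < /tmp/data.dedup.txt)" -eq 3 ]       # exactly three unique lines
grep -qx 'b' /tmp/data.dedup.txt                 # expected content present
echo PASS
```

A script like this either exits zero or it doesn't, which is what makes task outcomes reproducible across runs.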

Our Work

What our contributors build

Every task is an evaluation artifact — designed to work against LLMs, validated by oracle execution, and accepted only if it exposes real reasoning failures. These are not coding problems. They are traps.

Fix a Slow Application

An app that lists books starts freezing as more users visit. The problem? It asks the database hundreds of separate questions when it should ask just two or three. Our contributor finds the bottleneck, restructures how the app talks to the database, and proves it stays fast even under heavy traffic. AI agents usually fix one part but miss the other, leaving the app still slow.
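The bottleneck described above is the classic N+1 query pattern. A rough sketch of the shape of the problem, using flat files and coreutils in place of a real database (the file names and data are invented for illustration): the slow version runs one lookup per book, while the fast version answers the same question with a single join.

```shell
#!/usr/bin/env bash
# Illustrative N+1 sketch: "books" reference "authors" by id.
printf '1 Ann\n2 Bo\n'         > /tmp/authors.txt   # columns: id name
printf '1 X 1\n2 Y 2\n3 Z 1\n' > /tmp/books.txt     # columns: id title author_id

# Slow pattern: one separate lookup per book (N round trips for N books).
while read -r id title author_id; do
  awk -v a="$author_id" '$1 == a {print $2}' /tmp/authors.txt
done < /tmp/books.txt

# Fast pattern: a single join over both datasets answers the same question.
join -1 3 -2 1 -o 1.2,2.2 <(sort -k3 /tmp/books.txt) <(sort -k1 /tmp/authors.txt)
```

With hundreds of books and real network latency per query, the per-row loop is what makes the app freeze; a well-designed task checks that the agent removed the loop, not just that the output still looks correct.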

Rebuild an App on a New Foundation

An entire web application needs to move from one technology stack to another — new login system, new way of handling data, new API design. The contributor rewires 15–20 files, replacing old patterns with modern ones while making sure nothing breaks for existing users. AI agents commonly mix old and new patterns together, creating code that looks right but fails under real use.

Upgrade a Core Library Without Breaking Everything

A critical library the app depends on releases a major update that changes how everything works. The contributor traces every ripple effect across the entire codebase and updates each piece to match the new version. AI agents frequently make surface-level fixes that pass initial checks but silently break when real users interact with the app.

Build the benchmark with us

We're looking for domain experts who can design tasks that push AI agents to their limits.