Independent benchmark analysis

Workspace-Bench 1.0

Workspace-Bench evaluates how well AI agents operate in realistic workspaces with large-scale file dependencies. It measures whether agents can discover relevant files, reason across implicit context, modify artifacts safely, and deliver the right output under rubric-based evaluation.
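To make "rubric-based evaluation" concrete, here is a minimal sketch of one plausible aggregation scheme: each task carries a set of binary rubric criteria, a task scores the fraction it passes, and the benchmark score averages across tasks. The TaskResult structure and the unweighted mean are illustrative assumptions, not the benchmark's published scoring protocol.

```python
from dataclasses import dataclass

# Hypothetical structures -- Workspace-Bench's actual scoring protocol
# is defined in its paper; this only illustrates rubric aggregation.
@dataclass
class TaskResult:
    task_id: str
    rubric_passes: list[bool]  # one boolean per rubric criterion

def task_score(result: TaskResult) -> float:
    """Fraction of rubric criteria the agent satisfied on one task."""
    return sum(result.rubric_passes) / len(result.rubric_passes)

def benchmark_score(results: list[TaskResult]) -> float:
    """Unweighted mean of per-task rubric pass rates (an assumption)."""
    return sum(task_score(r) for r in results) / len(results)

if __name__ == "__main__":
    demo = [
        TaskResult("t1", [True, True, False]),          # 2/3 criteria passed
        TaskResult("t2", [True, False, False, False]),  # 1/4 criteria passed
    ]
    print(f"{benchmark_score(demo):.1%}")  # 45.8%
```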


Human reference: 80.7%
Best public agent: 68.7%
Average public agent: 47.4%
Full benchmark tasks: 388
Manifest-listed files: 3,854
Rubric criteria: 7,399

Workspace-Bench highlights

Independent summaries covering workspace agents, model backbones, and benchmark composition.

Evaluation focus

What the benchmark tests

01 Workspace Exploration
02 Task-Supporting Files Utilization
03 Result-Providing Files Utilization
04 Content Relations Understanding
05 Semantic Heterogeneous File Understanding
06 Lineage Tracing
Performance gap

Human reference remains ahead

Workspace-Bench reports a clear gap between human performance and the agents evaluated to date.

01 Human + Tools: 80.7%
02 Best public agent: 68.7%
03 Average public agent: 47.4%

Latest benchmark updates

Major milestones and feature additions since the benchmark's public launch.

2026.05.04 Release

Benchmark release

Workspace-Bench and Workspace-Bench-Lite were released together with the public repository, benchmark structure, task metadata, rubrics, and file-dependency information.

2026.05.05 Paper

arXiv paper

The official paper introduced the benchmark design, workspace-learning evaluation dimensions, scoring protocol, Lite split, and public leaderboard framing.

2026.05.07 Leaderboard

Public leaderboard

The public Lite leaderboard was released with agent/model rankings, rubric pass rates, threshold markers, and distribution figures aligned with the released benchmark artifacts.

Framework

The official Workspace-Bench framework figure summarizes how role-specific workspaces, benchmark tasks, agent execution, and capability-oriented evaluation fit together.

Workspace-Bench framework figure

What Workspace-Bench Measures

The benchmark focuses on workspace learning: the ability to identify explicit and implicit file dependencies, integrate context across artifacts, choose the right actions, and produce dependable deliverables in large workspaces.
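As a rough illustration of what discovering "explicit file dependencies" can look like, the sketch below scans a workspace and records which files mention other workspace files by name. The explicit_dependency_graph helper and its mention heuristic are hypothetical; implicit links (shared entities, derivation history) are invisible to a scan like this.

```python
import os

def explicit_dependency_graph(workspace: str) -> dict[str, set[str]]:
    """Map each readable file to the other workspace files it mentions
    by name -- a crude stand-in for 'explicit' dependencies."""
    paths = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(workspace)
        for name in files
    ]
    by_name = {os.path.basename(p): p for p in paths}
    graph: dict[str, set[str]] = {p: set() for p in paths}
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
        except OSError:
            continue  # unreadable artifact: leave its edge set empty
        graph[path] = {
            target for name, target in by_name.items()
            if target != path and name in text
        }
    return graph
```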

Workspace Exploration

Navigating deeply nested directory structures and identifying relevant files from noisy candidates.
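A toy version of this exploration step, assuming relevance can be approximated by keyword overlap between the task description and a file's path or contents. Real agents use far richer signals; rank_candidates is an illustrative name, not part of the benchmark.

```python
import os

def rank_candidates(workspace: str, task_description: str, top_k: int = 10):
    """Score each file by how many task keywords appear in its path
    or, for small text files, its content; return the top matches."""
    keywords = {w.lower() for w in task_description.split() if len(w) > 3}
    scored = []
    for root, _dirs, files in os.walk(workspace):
        for name in files:
            path = os.path.join(root, name)
            haystack = path.lower()
            try:
                if os.path.getsize(path) < 200_000:  # skip large binaries
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        haystack += f.read().lower()
            except OSError:
                pass
            score = sum(1 for kw in keywords if kw in haystack)
            if score:
                scored.append((score, path))
    return [p for _score, p in sorted(scored, reverse=True)[:top_k]]
```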

Task-Supporting Files Utilization

Finding files that provide essential context, references, and domain knowledge needed to complete a task.

Result-Providing Files Utilization

Aggregating result files that contain required outputs, formats, and baseline information.

Content Relations Understanding

Tracing explicit references, semantic connections, and contextual links between related documents.

Semantic Heterogeneous File Understanding

Connecting information across diverse modalities including documents, spreadsheets, presentations, and code.
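One common way to make mixed artifacts comparable is to dispatch on file type before reading. The sketch below handles only stdlib-readable formats; extract_text is a hypothetical helper, and office formats (.docx, .xlsx, .pptx) would need third-party parsers such as python-docx or openpyxl.

```python
import csv
import json
from pathlib import Path

def extract_text(path: Path) -> str:
    """Normalize heterogeneous files into plain text by extension."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md", ".py", ".log"}:
        return path.read_text(encoding="utf-8", errors="ignore")
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8", errors="ignore") as f:
            return "\n".join(", ".join(row) for row in csv.reader(f))
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8", errors="ignore"))
        return json.dumps(data, indent=2)
    raise ValueError(f"no extractor registered for {suffix}")
```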

Lineage Tracing

Understanding file versions, revisions, and derivation relationships (e.g., report_v1, report_final).
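The report_v1 / report_final example suggests a simple heuristic: strip version suffixes to find a shared stem, then order each group from oldest to newest. The sketch below implements exactly that heuristic; lineage_groups and its suffix rules are illustrative assumptions, not the benchmark's method.

```python
import re
from collections import defaultdict

# Heuristic ranks for non-numeric suffixes; "final" sorts after any _vN.
_ORDER = {"draft": -1, "final": 10_000}

def _version_key(name: str) -> int:
    m = re.search(r"_v(\d+)", name)
    if m:
        return int(m.group(1))
    for tag, rank in _ORDER.items():
        if tag in name.lower():
            return rank
    return 0

def lineage_groups(filenames: list[str]) -> dict[str, list[str]]:
    """Group files sharing a stem once version suffixes are stripped,
    then sort each group from oldest to newest."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in filenames:
        stem = re.sub(r"_(v\d+|draft|final)", "", name, flags=re.IGNORECASE)
        groups[stem].append(name)
    return {s: sorted(g, key=_version_key) for s, g in groups.items()}

# lineage_groups(["report_v1.docx", "report_v2.docx", "report_final.docx"])
# -> {"report.docx": ["report_v1.docx", "report_v2.docx", "report_final.docx"]}
```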

Current Performance Gap

Workspace-Bench exposes a large gap between human performance and the best AI agents: roughly 12 points between the human reference (80.7%) and the best public agent (68.7%), and more than 33 points to the average agent (47.4%). The gap widens in workspace profiles with more hidden dependencies, denser file graphs, and stricter rubric requirements.