Workspace-Bench 1.0
Workspace-Bench evaluates how well AI agents operate in realistic workspaces with large-scale file dependencies. It measures whether agents can discover relevant files, reason across implicit context, modify artifacts safely, and deliver the right output under rubric-based evaluation.
Workspace-Bench highlights
At-a-glance summaries of workspace agents, model backbones, and overall benchmark composition.
Human reference remains ahead
Workspace-Bench reports a clear gap between human performance and currently evaluated agents.
Latest benchmark updates
Major milestones and feature additions since the benchmark's public launch.
Benchmark release
Workspace-Bench and Workspace-Bench-Lite were released together with the public repository, benchmark structure, task metadata, rubrics, and file-dependency information.
arXiv paper
The official paper introduced the benchmark design, workspace-learning evaluation dimensions, scoring protocol, Lite split, and public leaderboard framing.
Public leaderboard
The public Lite leaderboard was released with agent/model rankings, rubric pass rates, threshold markers, and distribution figures aligned with the released benchmark artifacts.
Explore Workspace-Bench
Jump into the official leaderboard, dataset composition, and scoring methodology from a single entry point.
Official rankings
Browse public Lite systems, full-summary rows, threshold views, and capability-oriented leaderboard analysis.
Task composition
Inspect worker profiles, collaboration types, rubric distributions, file dependency coverage, and task metadata structure.
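For intuition, a single task's metadata might be organized roughly as sketched below. Every field name here is a hypothetical assumption for illustration, not the released schema:

```python
# Hypothetical sketch of how one Workspace-Bench task's metadata could be
# organized. Field names are illustrative assumptions, not the released schema.
task = {
    "task_id": "wb-0042",
    "worker_profile": "financial_analyst",   # role-specific workspace
    "collaboration_type": "single_agent",
    "workspace_root": "workspaces/wb-0042/",
    "file_dependencies": {
        "task_supporting": ["refs/style_guide.docx", "data/q3_raw.xlsx"],
        "result_providing": ["reports/report_v1.docx"],
    },
    "rubric": [
        {"criterion": "Uses the latest revision of the report", "weight": 1.0},
        {"criterion": "Final figures match q3_raw.xlsx totals", "weight": 2.0},
    ],
    "deliverable": "reports/report_final.docx",
}
```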
Evaluation protocol
Review the rubric-based scoring process, submission requirements, and how benchmark evidence is structured.
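As a rough illustration of how rubric-based scores could be aggregated, here is a minimal sketch assuming binary, weighted criteria. The function names and weighting scheme are assumptions; the official protocol is authoritative:

```python
# Minimal sketch of rubric-based scoring, assuming each criterion is judged
# pass/fail and carries a weight. The real weighting and aggregation rules
# live in the official evaluation protocol; these names are illustrative.
def rubric_pass_rate(judgments: list[dict]) -> float:
    """judgments: [{"passed": bool, "weight": float}, ...] for one task."""
    total = sum(j["weight"] for j in judgments)
    passed = sum(j["weight"] for j in judgments if j["passed"])
    return passed / total if total else 0.0

def benchmark_score(per_task_judgments: list[list[dict]]) -> float:
    """Unweighted macro average of per-task rubric pass rates."""
    rates = [rubric_pass_rate(j) for j in per_task_judgments]
    return sum(rates) / len(rates) if rates else 0.0
```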
Framework
The official Workspace-Bench framework figure summarizes how role-specific workspaces, benchmark tasks, agent execution, and capability-oriented evaluation fit together.
What Workspace-Bench Measures
The benchmark focuses on workspace learning: the ability to identify explicit and implicit file dependencies, integrate context across artifacts, choose the right actions, and produce dependable deliverables in large workspaces.
Workspace Exploration
Navigating deeply nested directory structures and identifying relevant files from noisy candidates.
Task-Supporting Files Utilization
Finding files that provide essential context, references, and domain knowledge needed to complete a task.
Result-Providing Files Utilization
Aggregating result files that contain required outputs, formats, and baseline information.
Content Relations Understanding
Tracing explicit references, semantic connections, and contextual links between related documents.
Semantic Heterogeneous File Understanding
Connecting information across diverse modalities including documents, spreadsheets, presentations, and code.
Lineage Tracing
Understanding file versions, revisions, and derivation relationships (e.g., report_v1, report_final).
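To make the Lineage Tracing dimension concrete, here is a minimal sketch of how an agent might group workspace files into version chains and rank revisions. The naming convention and regex are illustrative assumptions, not benchmark tooling:

```python
import re
from collections import defaultdict
from pathlib import Path

# Illustrative sketch only: group workspace files into lineage chains by a
# shared stem (e.g., report_v1.docx, report_final.docx) and order revisions
# so that "final" outranks numbered versions. The suffix convention is an
# assumption for illustration.
VERSION_RE = re.compile(r"^(?P<stem>.+?)_(?:v(?P<num>\d+)|(?P<tag>draft|final))$")

def revision_rank(path: str) -> int:
    m = VERSION_RE.match(Path(path).stem)
    if not m:
        return 0                  # unversioned file
    if m.group("tag") == "final":
        return 10**6              # "final" beats any numbered version
    if m.group("tag") == "draft":
        return -1                 # drafts sort before v1, v2, ...
    return int(m.group("num"))

def lineage_chains(files: list[str]) -> dict[str, list[str]]:
    chains: dict[str, list[str]] = defaultdict(list)
    for f in files:
        m = VERSION_RE.match(Path(f).stem)
        chains[m.group("stem") if m else Path(f).stem].append(f)
    return {k: sorted(v, key=revision_rank) for k, v in chains.items()}
```

For instance, `lineage_chains(["report_v1.docx", "report_final.docx", "notes.txt"])` groups the two report revisions under one chain with `report_final.docx` ordered last, so an agent that prefers the newest revision picks the right file.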
Current Performance Gap
Workspace-Bench exposes a large gap between human performance and the best AI agents. The gap grows in profiles with more hidden dependencies, denser file graphs, and stricter rubric requirements.
Resources
Use the benchmark website as a compact index to the paper, repository, dataset composition, and submission process.
Leaderboard
Browse official scores, cost comparisons, and capability views.
Dataset Visualizations
Inspect task composition, file types, workspace sizes, and profile distributions.
Methodology
Read the workspace learning definition, scoring rules, and reproducibility protocol.
Representative Tasks
See concrete examples of hidden dependencies and rubric-based grading.