Methodology
Workspace-Bench evaluates workspace learning: whether an agent can discover, connect, update, and validate information across realistic file ecosystems while producing rubric-graded deliverables.
| Stage | Benchmark object | What is measured | Failure mode |
|---|---|---|---|
| Task initialization | Split, task metadata, workspace files | Correct benchmark context and starting state | Wrong split, wrong workspace snapshot |
| Workspace interaction | Agent actions over heterogeneous files | Search, reading, editing, tracing, synthesis | Shallow lookup or missed dependencies |
| Deliverable production | Output files and final answer | Task completion and artifact correctness | Partial, inconsistent, or malformed output |
| Rubric grading | Rubrics attached to each task | Fine-grained pass/fail evidence per requirement | Unverified claims or incomplete evidence |
| Score aggregation | Task outcomes across a split | Rubric pass rate and task-level success | Over-reading one metric without the other |
The benchmark is not simple QA over one document. Success depends on navigating file ecosystems and linking implicit evidence.
Rubrics make partial failures visible: correctness, completeness, evidence use, constraint satisfaction, and output quality are checked separately.
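One way to picture this separation is a per-category tally for a single task. The sketch below is illustrative only: the `RubricCheck` record, its field names, and the sample checks are hypothetical and do not reflect the benchmark's released rubric schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricCheck:
    """One pass/fail requirement (hypothetical schema, not the released format)."""
    category: str
    passed: bool


def summarize_by_category(checks):
    """Return {category: (passed_count, total_count)} for one task."""
    tally = defaultdict(lambda: [0, 0])
    for check in checks:
        tally[check.category][1] += 1
        if check.passed:
            tally[check.category][0] += 1
    return {category: tuple(counts) for category, counts in tally.items()}


# Made-up checks for a single task, for illustration only.
sample_checks = [
    RubricCheck("correctness", True),
    RubricCheck("correctness", False),
    RubricCheck("completeness", True),
    RubricCheck("evidence use", False),
]

summary = summarize_by_category(sample_checks)
for category, (passed, total) in summary.items():
    print(f"{category}: {passed}/{total}")
```

Keeping the tally per category means a deliverable that is complete but poorly evidenced shows up as exactly that, rather than as one blended score.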
The official release currently exposes 388 full tasks and 100 Lite tasks. Lite is the public repeated-measurement slice used for the current leaderboard.
Scoring Dimensions
The paper separates workspace ability into interpretable dimensions so leaderboard numbers can be explained by concrete capability gaps.
Workspace Exploration
Can the system search large workspaces, identify relevant folders and files, and avoid getting trapped by irrelevant surface matches?
Task-Providing File Utilization
Can it find files that define requirements, constraints, policies, or context needed to understand what the task really asks for?
Result-Providing File Utilization
Can it retrieve data-bearing files whose contents must be extracted, merged, summarized, or transformed into the final deliverable?
Semantic Content Relations Understanding
Can it connect semantically related files even when names differ, references are indirect, or information is distributed across sources?
Heterogeneous File Understanding
Can it reason across documents, spreadsheets, code, slides, PDFs, images, and mixed-format work products in one task flow?
Lineage Tracing
Can it identify version relationships, derived artifacts, and update chains such as drafts, summaries, exports, and later revisions?
Public Metrics
The current public site and leaderboard rely on the released paper summary and the Lite rubric-pass rows, so each metric is defined explicitly here.
Rubric pass rate
The share of released rubric checks satisfied by a system. This is the main public metric exposed in the Lite leaderboard figure.
Task success rate
Task-level completion derived from rubric outcomes or paper aggregation. It is reported in the paper summary but not yet released as a full public per-system table.
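The two metrics can diverge, which a small sketch makes concrete. It rests on two assumptions that the source does not fix: rubric checks are pooled across tasks before taking the pass rate, and a task counts as successful only when every one of its checks passes. The task IDs and outcomes are made up.

```python
def rubric_pass_rate(results):
    """Share of all rubric checks passed, pooled across tasks (assumed pooling)."""
    all_checks = [passed for checks in results.values() for passed in checks]
    return sum(all_checks) / len(all_checks) if all_checks else 0.0


def task_success_rate(results):
    """Share of tasks whose checks all pass (assumed rule, not the official one)."""
    if not results:
        return 0.0
    successes = sum(1 for checks in results.values() if checks and all(checks))
    return successes / len(results)


# Hypothetical outcomes: task id -> list of pass/fail rubric results.
results = {
    "task-a": [True, True, True, True],
    "task-b": [True, True, True, False],
    "task-c": [True, False, False, False],
    "task-d": [True, True, True, True],
}

print(f"rubric pass rate:  {rubric_pass_rate(results):.2f}")
print(f"task success rate: {task_success_rate(results):.2f}")
```

In this made-up split, 12 of 16 checks pass (0.75) while only 2 of 4 tasks fully succeed (0.50). A system can therefore score well on rubric pass rate while still failing many tasks outright, which is why the two numbers should be read together.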
Verification status
Rows on this site are limited to paper-reported or repository-released public results. Submission-based verification is not open yet.