Evaluation pipeline
Paper-aligned protocol
Stage | Benchmark object | What is measured | Failure mode
Task initialization | Split, task metadata, workspace files | Correct benchmark context and starting state | Wrong split, wrong workspace snapshot
Workspace interaction | Agent actions over heterogeneous files | Search, reading, editing, tracing, synthesis | Shallow lookup or missed dependencies
Deliverable production | Output files and final answer | Task completion and artifact correctness | Partial, inconsistent, or malformed output
Rubric grading | Rubrics attached to each task | Fine-grained pass/fail evidence per requirement | Unverified claims or incomplete evidence
Score aggregation | Task outcomes across a split | Rubric pass rate and task-level success | Over-reading one metric without the other
Workspace Learning

The benchmark is not simple QA over one document. Success depends on navigating file ecosystems and linking implicit evidence.

Rubric-based Scoring

Rubrics make partial failures visible: correctness, completeness, evidence use, constraint satisfaction, and output quality are checked separately.

Full vs Lite Benchmark

The official release currently exposes 388 full tasks and 100 Lite tasks. Lite is the public repeated-measurement slice used for the current leaderboard.

Scoring Dimensions

The paper separates workspace ability into interpretable dimensions so leaderboard numbers can be explained by concrete capability gaps.

Dimension 01

Workspace Exploration

Can the system search large workspaces, identify relevant folders and files, and avoid getting trapped by irrelevant surface matches?

Dimension 02

Task-Providing File Utilization

Can it find the files that define the requirements, constraints, policies, or context needed to understand what the task really asks for?

Dimension 03

Result-Providing Files Utilization

Can it retrieve data-bearing files whose contents must be extracted, merged, summarized, or transformed into the final deliverable?

Dimension 04

Semantic Content Relations Understanding

Can it connect semantically related files even when names differ, references are indirect, or information is distributed across sources?

Dimension 05

Heterogeneous File Understanding

Can it reason across documents, spreadsheets, code, slides, PDFs, images, and mixed-format work products in one task flow?

Dimension 06

Lineage Tracing

Can it identify version relationships, derived artifacts, and update chains such as drafts, summaries, exports, and later revisions?
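
As a rough illustration of how dimension-level numbers could explain a leaderboard score, the sketch below groups rubric outcomes by dimension and reports a pass rate for each. It assumes every rubric check carries a dimension tag; the source does not say whether the released rubrics do, and the function name, data layout, and sample values are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_dimension(results):
    """Return {dimension: pass rate} from (dimension, passed) pairs.

    Assumption: each rubric check is tagged with exactly one dimension.
    """
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dimension, passed in results:
        totals[dimension][0] += int(passed)
        totals[dimension][1] += 1
    return {d: p / n for d, (p, n) in totals.items()}


# Illustrative outcomes only; these are not benchmark results.
results = [
    ("Workspace Exploration", True),
    ("Workspace Exploration", True),
    ("Lineage Tracing", True),
    ("Lineage Tracing", False),
    ("Lineage Tracing", False),
    ("Lineage Tracing", False),
]

rates = pass_rate_by_dimension(results)
for dimension, rate in rates.items():
    print(f"{dimension}: {rate:.2f}")
```

In this made-up sample the overall pass rate is 0.50, while the per-dimension view shows that every failure sits in Lineage Tracing.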

Public Metrics

The current public site and leaderboard rely on the released paper summary and Lite rubric-pass rows, so the metric definitions need to stay explicit.

Rubric pass rate

The share of released rubric checks satisfied by a system. This is the main public metric exposed in the Lite leaderboard figure.

Task success rate

Task-level completion derived from rubric outcomes or paper aggregation. It is reported in the paper summary but not yet released as a full public per-system table.
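
The two metrics can diverge, which a small sketch makes concrete. It rests on two assumptions that the source does not confirm: rubric pass rate is pooled over all checks rather than averaged per task, and a task counts as successful only when every one of its checks passes. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricCheck:
    task_id: str
    passed: bool


def rubric_pass_rate(checks):
    """Share of all rubric checks that passed (pooled across tasks)."""
    if not checks:
        raise ValueError("no rubric checks to aggregate")
    return sum(c.passed for c in checks) / len(checks)


def task_success_rate(checks):
    """Share of tasks whose checks all passed (assumed strict rule)."""
    by_task = {}
    for c in checks:
        by_task.setdefault(c.task_id, []).append(c.passed)
    if not by_task:
        raise ValueError("no tasks to aggregate")
    return sum(all(outcomes) for outcomes in by_task.values()) / len(by_task)


# Illustrative outcomes only; these are not benchmark results.
checks = [
    RubricCheck("task-a", True),
    RubricCheck("task-a", True),
    RubricCheck("task-b", True),
    RubricCheck("task-b", True),
    RubricCheck("task-b", True),
    RubricCheck("task-b", False),
]

print(f"rubric pass rate:  {rubric_pass_rate(checks):.3f}")   # 0.833
print(f"task success rate: {task_success_rate(checks):.3f}")  # 0.500
```

Under these assumptions a single failed check in task-b halves the task success rate while the rubric pass rate stays above 0.8, which is why the two numbers are best read together.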

Verification status

Rows on this site are limited to paper-reported or repository-released public results. Submission-based verification is not open yet.

Reproducibility Requirements

A future verified submission should declare split, benchmark version, harness name, model identifier, runtime configuration, workspace constraints, logs, final outputs, and a stable report link or replay route. Until that contract is published, the official site is restricted to released public benchmark artifacts.
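
Because the submission contract is not yet published, the following is only a sketch of how the declared items might be checked for completeness. The field names, the manifest layout, and the example values are all hypothetical; only the list of items comes from the text above.

```python
# Hypothetical field names mirroring the items listed above.
REQUIRED_FIELDS = (
    "split",
    "benchmark_version",
    "harness_name",
    "model_identifier",
    "runtime_configuration",
    "workspace_constraints",
    "logs",
    "final_outputs",
)
# A stable report link OR a replay route satisfies the last requirement.
EITHER_OF = ("report_link", "replay_route")


def missing_fields(manifest):
    """Return the names of declared items that are absent or empty."""
    missing = [name for name in REQUIRED_FIELDS if not manifest.get(name)]
    if not any(manifest.get(name) for name in EITHER_OF):
        missing.append(" or ".join(EITHER_OF))
    return missing


# Placeholder values for illustration; not a real submission.
example_manifest = {
    "split": "lite",
    "benchmark_version": "example-version",
    "harness_name": "example-harness",
    "model_identifier": "example-model",
    "runtime_configuration": {"max_steps": 50},
    "workspace_constraints": {"network_access": False},
    "logs": "logs/run-001.jsonl",
    "final_outputs": "outputs/run-001/",
    "report_link": "https://example.org/reports/run-001",
}

print(missing_fields(example_manifest))  # []
```

A check like this only confirms that each item is declared; it says nothing about whether the logs or outputs are valid.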