Methodology
Workspace-Bench defines workspace learning as the ability to identify, reason over, update, and validate dependencies among large collections of files while producing task-correct deliverables.
| Stage | Input | Output | Failure mode |
|---|---|---|---|
| Workspace Setup | Selected split and task files | Controlled workspace state | Wrong benchmark context |
| Agent Execution | Harness, model, tools, limits | Logs and intermediate actions | Undeclared runtime conditions |
| Output Collection | Modified artifacts and answer | Deliverable bundle | Missing evidence trail |
| Rubric Evaluation | Deliverables and rubrics | Task and overall scores | Unverifiable scoring basis |
| Verification | Configs, logs, cost, runtime | Maintainer-reviewable report | Non-reproducible submission |
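Read as a data contract, each stage consumes the previous stage's outputs and is rejected for exactly one failure mode. Below is a minimal sketch of one run's stage trace, assuming a hypothetical `StageRecord` structure; the field and stage names are illustrative, not part of the Workspace-Bench harness.

```python
from dataclasses import dataclass, field

@dataclass
class StageRecord:
    """One stage of the workflow: what it consumed, what it produced,
    and a failure mode set only when the stage cannot be trusted."""
    stage: str
    inputs: list[str]
    outputs: list[str] = field(default_factory=list)
    failure_mode: str | None = None

# Illustrative trace of a single run, mirroring the table above.
trace = [
    StageRecord("workspace_setup", ["split", "task_files"], ["workspace_state"]),
    StageRecord("agent_execution", ["harness", "model", "tools", "limits"], ["logs", "actions"]),
    StageRecord("output_collection", ["modified_artifacts", "answer"], ["deliverable_bundle"]),
    StageRecord("rubric_evaluation", ["deliverables", "rubrics"], ["task_scores", "overall_score"]),
    StageRecord("verification", ["configs", "logs", "cost", "runtime"], ["review_report"]),
]

# A run is reviewable only if no stage reports a failure mode.
assert all(record.failure_mode is None for record in trace)
```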
Success depends on understanding both explicit and implicit relationships among files, not just on looking up a direct answer.
Evaluation checks correctness, evidence use, completeness, policy adherence, and consistency.
The benchmark provides 388 full tasks, plus a 100-task Lite split for lower-cost repeated measurement.
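One way to picture those checks is as a per-task rubric entry with one score per criterion. The sketch below is a hypothetical schema for illustration; the field names, score scale, and unweighted aggregation are assumptions, not Workspace-Bench's actual rubric format.

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    """Hypothetical per-task rubric entry, one score per evaluation check."""
    correctness: float       # does the deliverable answer the task?
    evidence_use: float      # are the right workspace files consulted and cited?
    completeness: float      # are all requested parts of the deliverable present?
    policy_adherence: float  # were the task's stated constraints respected?
    consistency: float       # do the answer and the edited artifacts agree?

    def overall(self) -> float:
        # Unweighted mean, purely illustrative; real aggregation may differ.
        scores = (self.correctness, self.evidence_use, self.completeness,
                  self.policy_adherence, self.consistency)
        return sum(scores) / len(scores)

print(RubricResult(1.0, 0.5, 1.0, 1.0, 0.5).overall())  # 0.8
```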
Scoring Dimensions
Scoring dimensions are intentionally separable so benchmark failures reveal where a system breaks down.
Workspace Exploration
Navigating deeply nested directory structures and identifying relevant files from noisy candidates.
Task-Supporting Files Utilization
Finding files that provide essential context, references, and domain knowledge needed to complete a task.
Result-Providing Files Utilization
Aggregating result files that contain required outputs, formats, and baseline information.
Content Relations Understanding
Tracing explicit references, semantic connections, and contextual links between related documents.
Semantic Heterogeneous File Understanding
Connecting information across diverse modalities including documents, spreadsheets, presentations, and code.
Lineage Tracing
Understanding file versions, revisions, and derivation relationships (e.g., report_v1, report_final).
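Because the dimensions are separable, a failing task can be attributed to the dimension where the breakdown occurred, which turns raw scores into a failure profile. A minimal sketch of such a tally, assuming hypothetical task IDs and annotations rather than real benchmark data:

```python
from collections import Counter
from enum import Enum

class Dimension(Enum):
    WORKSPACE_EXPLORATION = "workspace_exploration"
    SUPPORTING_FILES = "task_supporting_files_utilization"
    RESULT_FILES = "result_providing_files_utilization"
    CONTENT_RELATIONS = "content_relations_understanding"
    HETEROGENEOUS_FILES = "semantic_heterogeneous_file_understanding"
    LINEAGE_TRACING = "lineage_tracing"

# Hypothetical failure annotations for a few tasks (illustrative only).
failures = [
    ("task_017", Dimension.LINEAGE_TRACING),    # graded report_v1 instead of report_final
    ("task_042", Dimension.CONTENT_RELATIONS),  # missed a cross-document reference
    ("task_101", Dimension.LINEAGE_TRACING),
]

# Separable dimensions let the profile show where a system breaks down.
profile = Counter(dim for _, dim in failures)
for dim, count in profile.most_common():
    print(f"{dim.value}: {count}")
```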
Evaluation Pipeline
A verified run needs an inspectable workflow, not just a final score.
Initialize
Prepare benchmark split, workspace constraints, and selected task environment.
Execute
Run the declared harness and model with stable tool access and budget settings.
Collect
Capture logs, edited files, generated outputs, and task metadata.
Grade
Apply rubric-based evaluation and aggregate per-task outcomes into benchmark metrics.
Verify
Attach runtime configuration, versioning, cost, and latency metadata.
Release
Publish a stable report link that maintainers can inspect or replay.
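End to end, the six steps compose into a single driver whose final output is the report maintainers review. The sketch below is an illustrative skeleton; every function, field, and URL is a hypothetical stand-in, not the benchmark's actual harness API.

```python
# Hypothetical stand-ins for the six pipeline steps (illustrative only).

def initialize(split: str, task_id: str) -> dict:
    return {"split": split, "task_id": task_id, "workspace": f"/tmp/{task_id}"}

def execute(workspace: dict, harness: str, model: str) -> dict:
    return {"harness": harness, "model": model, "logs": [], "actions": []}

def collect(workspace: dict, run_log: dict) -> dict:
    return {"artifacts": [], "answer": None, "metadata": {**workspace, **run_log}}

def grade(bundle: dict) -> dict:
    return {"task_score": 0.0, "rubric": {}}  # rubric-based scoring would go here

def verify(bundle: dict, scores: dict, run_log: dict) -> dict:
    return {"scores": scores, "config": run_log, "cost_usd": 0.0, "runtime_s": 0.0}

def release(report: dict) -> str:
    return "https://example.invalid/report/0000"  # placeholder report link

# Initialize -> Execute -> Collect -> Grade -> Verify -> Release
workspace = initialize("lite", "task_001")
run_log = execute(workspace, harness="my-harness", model="my-model")
bundle = collect(workspace, run_log)
scores = grade(bundle)
report = verify(bundle, scores, run_log)
print(release(report))
```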