Evaluation pipeline
Reproducible workflow
Stage              | Input                         | Output                        | Failure mode
Workspace Setup    | Selected split and task files | Controlled workspace state    | Wrong benchmark context
Agent Execution    | Harness, model, tools, limits | Logs and intermediate actions | Undeclared runtime conditions
Output Collection  | Modified artifacts and answer | Deliverable bundle            | Missing evidence trail
Rubric Evaluation  | Deliverables and rubrics      | Task and overall scores       | Unverifiable scoring basis
Verification       | Configs, logs, cost, runtime  | Maintainer-reviewable report  | Non-reproducible submission
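
The same stage structure can be written down as data so a run can be checked stage by stage. A minimal sketch, assuming a hypothetical `Stage` record and `PIPELINE` table; the names are illustrative, not part of any released tooling:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    """One pipeline stage, its inputs and output, and the failure mode to watch for."""
    name: str
    inputs: str
    output: str
    failure_mode: str

# Hypothetical encoding of the table above.
PIPELINE = (
    Stage("Workspace Setup", "Selected split and task files",
          "Controlled workspace state", "Wrong benchmark context"),
    Stage("Agent Execution", "Harness, model, tools, limits",
          "Logs and intermediate actions", "Undeclared runtime conditions"),
    Stage("Output Collection", "Modified artifacts and answer",
          "Deliverable bundle", "Missing evidence trail"),
    Stage("Rubric Evaluation", "Deliverables and rubrics",
          "Task and overall scores", "Unverifiable scoring basis"),
    Stage("Verification", "Configs, logs, cost, runtime",
          "Maintainer-reviewable report", "Non-reproducible submission"),
)

def diagnose(missing_outputs: set[str]) -> list[str]:
    """Map missing stage outputs back to the failure modes they usually signal."""
    return [s.failure_mode for s in PIPELINE if s.output in missing_outputs]
```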
Workspace Learning

Success depends on understanding both explicit and implicit relationships between workspace files, not on direct answer lookup alone.
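
As one concrete illustration of what relationship understanding involves, the sketch below builds a naive cross-reference graph by scanning each workspace file for literal mentions of other files' names. This is an assumption-heavy toy, not the benchmark's methodology; implicit, semantic relationships need more than name matching.

```python
import os
import re
from collections import defaultdict

def build_reference_graph(workspace: str) -> dict[str, set[str]]:
    """Map each file to the other workspace files whose names it mentions."""
    paths: dict[str, str] = {}
    for root, _dirs, files in os.walk(workspace):
        for name in files:
            paths[name] = os.path.join(root, name)
    if not paths:
        return {}

    # Explicit references only: one file literally naming another.
    pattern = re.compile("|".join(re.escape(n) for n in paths))
    graph: dict[str, set[str]] = defaultdict(set)
    for name, path in paths.items():
        try:
            with open(path, encoding="utf-8", errors="ignore") as handle:
                text = handle.read()
        except OSError:
            continue
        for hit in set(pattern.findall(text)):
            if hit != name:
                graph[name].add(hit)
    return dict(graph)
```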

Rubric-based Scoring

Evaluation checks correctness, evidence use, completeness, policy adherence, and consistency.
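
A minimal sketch of how those checks might be aggregated, assuming hypothetical criterion keys and equal weights; the actual rubrics and any weighting are task-specific:

```python
from dataclasses import dataclass

# Criterion keys mirror the checks listed above; the names are illustrative.
CRITERIA = ("correctness", "evidence_use", "completeness",
            "policy_adherence", "consistency")

@dataclass
class RubricResult:
    """Per-criterion scores in [0, 1] for a single task."""
    scores: dict[str, float]

    def task_score(self) -> float:
        # Equal weighting is an assumption; real rubrics may weight criteria differently.
        return sum(self.scores.get(c, 0.0) for c in CRITERIA) / len(CRITERIA)

def benchmark_score(results: list[RubricResult]) -> float:
    """Average per-task rubric scores into an overall benchmark score."""
    return sum(r.task_score() for r in results) / len(results) if results else 0.0
```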

Full vs Lite Benchmark

The full benchmark contains 388 tasks; the Lite split contains 100 tasks for lower-cost repeated measurement.
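
If task metadata carried a split marker, choosing between the two could look like the sketch below; the `in_lite_split` key is an assumption made for illustration.

```python
def select_tasks(all_tasks: list[dict], split: str = "full") -> list[dict]:
    """Return all 388 tasks, or only the 100-task Lite subset for cheaper reruns."""
    if split == "lite":
        return [t for t in all_tasks if t.get("in_lite_split")]  # assumed marker
    return all_tasks
```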

Scoring Dimensions

Scoring dimensions are intentionally separable so benchmark failures reveal where a system breaks down.
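
Because the dimensions are separable, a per-dimension breakdown can localize the weakness. A minimal aggregation sketch, with dimension keys chosen only for illustration:

```python
from collections import defaultdict

def dimension_breakdown(task_results: list[dict[str, float]]) -> dict[str, float]:
    """Average per-dimension scores across tasks to show where a system breaks down.

    Each task result maps dimension keys (e.g. "workspace_exploration",
    "lineage_tracing") to a score in [0, 1]; the keys are illustrative.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for result in task_results:
        for dim, score in result.items():
            totals[dim] += score
            counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```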

Dimension 01: Workspace Exploration
Navigating deeply nested directory structures and identifying relevant files from noisy candidates.
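
A toy version of this kind of exploration, assuming a simple filename-keyword filter rather than whatever retrieval strategy an actual agent uses:

```python
from pathlib import Path

def find_candidates(workspace: str, keywords: list[str],
                    exts: tuple[str, ...] = (".md", ".txt", ".csv", ".docx", ".xlsx")) -> list[Path]:
    """Walk a deeply nested workspace and keep files whose names hint at relevance."""
    hits = []
    for path in Path(workspace).rglob("*"):
        if path.is_file() and path.suffix.lower() in exts:
            name = path.name.lower()
            if any(k.lower() in name for k in keywords):
                hits.append(path)
    return sorted(hits)
```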

Dimension 02: Task-Supporting Files Utilization
Finding files that provide essential context, references, and domain knowledge needed to complete a task.

Dimension 03: Result-Providing Files Utilization
Aggregating result files that contain required outputs, formats, and baseline information.

Dimension 04: Content Relations Understanding
Tracing explicit references, semantic connections, and contextual links between related documents.

Dimension 05: Semantic Heterogeneous File Understanding
Connecting information across diverse modalities including documents, spreadsheets, presentations, and code.
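
A sketch of one way to normalize heterogeneous files into comparable plain text; the library choices (python-docx, openpyxl, python-pptx) are common options, not necessarily what any particular system uses:

```python
from pathlib import Path

def extract_text(path: Path) -> str:
    """Pull plain text out of a file so content can be compared across formats."""
    suffix = path.suffix.lower()
    if suffix == ".docx":
        from docx import Document             # python-docx
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix == ".xlsx":
        from openpyxl import load_workbook     # openpyxl
        wb = load_workbook(str(path), read_only=True, data_only=True)
        cells = []
        for ws in wb.worksheets:
            for row in ws.iter_rows(values_only=True):
                cells.extend(str(v) for v in row if v is not None)
        return "\n".join(cells)
    if suffix == ".pptx":
        from pptx import Presentation           # python-pptx
        texts = []
        for slide in Presentation(str(path)).slides:
            for shape in slide.shapes:
                if shape.has_text_frame:
                    texts.append(shape.text_frame.text)
        return "\n".join(texts)
    # Treat everything else (code, markdown, csv) as plain text.
    return path.read_text(encoding="utf-8", errors="ignore")
```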

Dimension 06: Lineage Tracing
Understanding file versions, revisions, and derivation relationships (e.g., report_v1, report_final).
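
A small sketch of one lineage heuristic, grouping files by stem and ordering revisions with an unnumbered `_final` treated as latest; the naming convention handled here is only the example above:

```python
import re

# Heuristic: "report_final" sorts after "report_v1", "report_v2", ...
_VERSION = re.compile(r"^(?P<stem>.+?)_(?:v(?P<num>\d+)|(?P<final>final))$")

def lineage_key(filename: str) -> tuple[str, float]:
    """Group files by stem and order revisions, treating '_final' as the latest."""
    stem = filename.rsplit(".", 1)[0]              # drop the extension
    m = _VERSION.match(stem)
    if not m:
        return stem, 0.0                           # unversioned base document
    if m.group("final"):
        return m.group("stem"), float("inf")
    return m.group("stem"), float(m.group("num"))

files = ["report_v2.docx", "report_final.docx", "budget.xlsx", "report_v1.docx"]
print(sorted(files, key=lineage_key))
# ['budget.xlsx', 'report_v1.docx', 'report_v2.docx', 'report_final.docx']
```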

Evaluation Pipeline

A verified run needs an inspectable workflow, not just a final score.

1. Initialize: Prepare benchmark split, workspace constraints, and selected task environment.

2. Execute: Run the declared harness and model with stable tool access and budget settings.

3. Collect: Capture logs, edited files, generated outputs, and task metadata.

4. Grade: Apply rubric-based evaluation and aggregate per-task outcomes into benchmark metrics.

5. Verify: Attach runtime configuration, versioning, cost, and latency metadata.

6. Release: Publish a stable report link that maintainers can inspect or replay.
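
A skeletal driver mirroring these six steps; every callable is supplied by the caller and every name here is hypothetical, shown only to make the step boundaries concrete:

```python
from typing import Any, Callable

def run_verified_evaluation(split: str, steps: dict[str, Callable[..., Any]]) -> str:
    """Hypothetical end-to-end driver for the workflow above."""
    workspace = steps["initialize"](split)        # 1. Initialize the split and workspace
    trace = steps["execute"](workspace)           # 2. Execute the declared harness and model
    bundle = steps["collect"](trace, workspace)   # 3. Collect logs, edits, outputs, metadata
    scores = steps["grade"](bundle)               # 4. Grade with the task rubrics
    report = steps["verify"](scores, trace)       # 5. Verify by attaching config, cost, latency
    return steps["release"](report)               # 6. Release a stable, inspectable report link
```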

Reproducibility Requirements. Verified submissions should include the benchmark split, harness name, model identifier, runtime configuration, workspace constraints, logs, final outputs, cost and latency summary, and a stable report link that lets maintainers replay or inspect the run.
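
A minimal manifest check that mirrors this list; the field names are illustrative labels for the required items, not a defined schema:

```python
REQUIRED_FIELDS = (
    "benchmark_split", "harness_name", "model_identifier",
    "runtime_configuration", "workspace_constraints", "logs",
    "final_outputs", "cost_summary", "latency_summary", "report_link",
)

def missing_fields(submission: dict) -> list[str]:
    """Return the required reproducibility fields absent from a submission manifest."""
    return [field for field in REQUIRED_FIELDS if not submission.get(field)]
```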