Evaluation pipeline
Reproducible workflow
Stage              | Input                         | Output                        | Failure mode
Workspace Setup    | Selected split and task files | Controlled workspace state    | Wrong benchmark context
Agent Execution    | Harness, model, tools, limits | Logs and intermediate actions | Undeclared runtime conditions
Output Collection  | Modified artifacts and answer | Deliverable bundle            | Missing evidence trail
Rubric Evaluation  | Deliverables and rubrics      | Task and overall scores       | Unverifiable scoring basis
Verification       | Configs, logs, cost, runtime  | Maintainer-reviewable report  | Non-reproducible submission
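
The same stage structure can be written down as data so a run can be checked stage by stage. A minimal sketch, assuming a hypothetical `Stage` record and `PIPELINE` table; the names are illustrative, not part of any released tooling:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    """One pipeline stage, its inputs and output, and the failure mode to watch for."""
    name: str
    inputs: str
    output: str
    failure_mode: str

# Hypothetical encoding of the table above.
PIPELINE = (
    Stage("Workspace Setup", "Selected split and task files",
          "Controlled workspace state", "Wrong benchmark context"),
    Stage("Agent Execution", "Harness, model, tools, limits",
          "Logs and intermediate actions", "Undeclared runtime conditions"),
    Stage("Output Collection", "Modified artifacts and answer",
          "Deliverable bundle", "Missing evidence trail"),
    Stage("Rubric Evaluation", "Deliverables and rubrics",
          "Task and overall scores", "Unverifiable scoring basis"),
    Stage("Verification", "Configs, logs, cost, runtime",
          "Maintainer-reviewable report", "Non-reproducible submission"),
)

def diagnose(missing_outputs: set[str]) -> list[str]:
    """Map missing stage outputs back to the failure modes they usually signal."""
    return [s.failure_mode for s in PIPELINE if s.output in missing_outputs]
```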
Workspace Learning

Success depends on understanding both explicit and implicit relationships between workspace files, not on direct answer lookup alone.
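
As one concrete illustration of what relationship understanding involves, the sketch below builds a naive cross-reference graph by scanning each workspace file for literal mentions of other files' names. This is an assumption-heavy toy, not the benchmark's methodology; implicit, semantic relationships need more than name matching.

```python
import os
import re
from collections import defaultdict

def build_reference_graph(workspace: str) -> dict[str, set[str]]:
    """Map each file to the other workspace files whose names it mentions."""
    paths: dict[str, str] = {}
    for root, _dirs, files in os.walk(workspace):
        for name in files:
            paths[name] = os.path.join(root, name)
    if not paths:
        return {}

    # Explicit references only: one file literally naming another.
    pattern = re.compile("|".join(re.escape(n) for n in paths))
    graph: dict[str, set[str]] = defaultdict(set)
    for name, path in paths.items():
        try:
            with open(path, encoding="utf-8", errors="ignore") as handle:
                text = handle.read()
        except OSError:
            continue
        for hit in set(pattern.findall(text)):
            if hit != name:
                graph[name].add(hit)
    return dict(graph)
```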

Rubric-based Scoring

Evaluation checks correctness, evidence use, completeness, policy adherence, and consistency.
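
A minimal sketch of how those checks might be aggregated, assuming hypothetical criterion keys and equal weights; the actual rubrics and any weighting are task-specific:

```python
from dataclasses import dataclass

# Criterion keys mirror the checks listed above; the names are illustrative.
CRITERIA = ("correctness", "evidence_use", "completeness",
            "policy_adherence", "consistency")

@dataclass
class RubricResult:
    """Per-criterion scores in [0, 1] for a single task."""
    scores: dict[str, float]

    def task_score(self) -> float:
        # Equal weighting is an assumption; real rubrics may weight criteria differently.
        return sum(self.scores.get(c, 0.0) for c in CRITERIA) / len(CRITERIA)

def benchmark_score(results: list[RubricResult]) -> float:
    """Average per-task rubric scores into an overall benchmark score."""
    return sum(r.task_score() for r in results) / len(results) if results else 0.0
```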

Full vs Lite Benchmark

The full benchmark contains 388 tasks; the Lite split contains 100 tasks for lower-cost repeated measurement.
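
If task metadata carried a split marker, choosing between the two could look like the sketch below; the `in_lite_split` key is an assumption made for illustration.

```python
def select_tasks(all_tasks: list[dict], split: str = "full") -> list[dict]:
    """Return all 388 tasks, or only the 100-task Lite subset for cheaper reruns."""
    if split == "lite":
        return [t for t in all_tasks if t.get("in_lite_split")]  # assumed marker
    return all_tasks
```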

Scoring Dimensions

Scoring dimensions are intentionally separable so benchmark failures reveal where a system breaks down.
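
Because the dimensions are separable, a per-dimension breakdown can localize the weakness. A minimal aggregation sketch, with dimension keys chosen only for illustration:

```python
from collections import defaultdict

def dimension_breakdown(task_results: list[dict[str, float]]) -> dict[str, float]:
    """Average per-dimension scores across tasks to show where a system breaks down.

    Each task result maps dimension keys (e.g. "workspace_exploration",
    "lineage_tracing") to a score in [0, 1]; the keys are illustrative.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for result in task_results:
        for dim, score in result.items():
            totals[dim] += score
            counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```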

Dimension 01: Workspace Exploration
Navigating deeply nested directory structures and identifying relevant files from noisy candidates.
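
A toy version of this kind of exploration, assuming a simple filename-keyword filter rather than whatever retrieval strategy an actual agent uses:

```python
from pathlib import Path

def find_candidates(workspace: str, keywords: list[str],
                    exts: tuple[str, ...] = (".md", ".txt", ".csv", ".docx", ".xlsx")) -> list[Path]:
    """Walk a deeply nested workspace and keep files whose names hint at relevance."""
    hits = []
    for path in Path(workspace).rglob("*"):
        if path.is_file() and path.suffix.lower() in exts:
            name = path.name.lower()
            if any(k.lower() in name for k in keywords):
                hits.append(path)
    return sorted(hits)
```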

Dimension 02: Task-Supporting Files Utilization
Finding files that provide essential context, references, and domain knowledge needed to complete a task.

Dimension 03: Result-Providing Files Utilization
Aggregating result files that contain required outputs, formats, and baseline information.

Dimension 04: Content Relations Understanding
Tracing explicit references, semantic connections, and contextual links between related documents.

Dimension 05: Semantic Heterogeneous File Understanding
Connecting information across diverse modalities including documents, spreadsheets, presentations, and code.
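
A sketch of one way to normalize heterogeneous files into comparable plain text; the library choices (python-docx, openpyxl, python-pptx) are common options, not necessarily what any particular system uses:

```python
from pathlib import Path

def extract_text(path: Path) -> str:
    """Pull plain text out of a file so content can be compared across formats."""
    suffix = path.suffix.lower()
    if suffix == ".docx":
        from docx import Document             # python-docx
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix == ".xlsx":
        from openpyxl import load_workbook     # openpyxl
        wb = load_workbook(str(path), read_only=True, data_only=True)
        cells = []
        for ws in wb.worksheets:
            for row in ws.iter_rows(values_only=True):
                cells.extend(str(v) for v in row if v is not None)
        return "\n".join(cells)
    if suffix == ".pptx":
        from pptx import Presentation           # python-pptx
        texts = []
        for slide in Presentation(str(path)).slides:
            for shape in slide.shapes:
                if shape.has_text_frame:
                    texts.append(shape.text_frame.text)
        return "\n".join(texts)
    # Treat everything else (code, markdown, csv) as plain text.
    return path.read_text(encoding="utf-8", errors="ignore")
```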

Dimension 06: Lineage Tracing
Understanding file versions, revisions, and derivation relationships (e.g., report_v1, report_final).
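
A small sketch of one lineage heuristic, grouping files by stem and ordering revisions with an unnumbered `_final` treated as latest; the naming convention handled here is only the example above:

```python
import re

# Heuristic: "report_final" sorts after "report_v1", "report_v2", ...
_VERSION = re.compile(r"^(?P<stem>.+?)_(?:v(?P<num>\d+)|(?P<final>final))$")

def lineage_key(filename: str) -> tuple[str, float]:
    """Group files by stem and order revisions, treating '_final' as the latest."""
    stem = filename.rsplit(".", 1)[0]              # drop the extension
    m = _VERSION.match(stem)
    if not m:
        return stem, 0.0                           # unversioned base document
    if m.group("final"):
        return m.group("stem"), float("inf")
    return m.group("stem"), float(m.group("num"))

files = ["report_v2.docx", "report_final.docx", "budget.xlsx", "report_v1.docx"]
print(sorted(files, key=lineage_key))
# ['budget.xlsx', 'report_v1.docx', 'report_v2.docx', 'report_final.docx']
```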

Evaluation Pipeline

A verified run needs an inspectable workflow, not just a final score.

1. Initialize: Prepare benchmark split, workspace constraints, and selected task environment.

2. Execute: Run the declared harness and model with stable tool access and budget settings.

3. Collect: Capture logs, edited files, generated outputs, and task metadata.

4. Grade: Apply rubric-based evaluation and aggregate per-task outcomes into benchmark metrics.

5. Verify: Attach runtime configuration, versioning, cost, and latency metadata.

6. Release: Publish a stable report link that maintainers can inspect or replay.
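
A skeletal driver mirroring these six steps; every callable is supplied by the caller and every name here is hypothetical, shown only to make the step boundaries concrete:

```python
from typing import Any, Callable

def run_verified_evaluation(split: str, steps: dict[str, Callable[..., Any]]) -> str:
    """Hypothetical end-to-end driver for the workflow above."""
    workspace = steps["initialize"](split)        # 1. Initialize the split and workspace
    trace = steps["execute"](workspace)           # 2. Execute the declared harness and model
    bundle = steps["collect"](trace, workspace)   # 3. Collect logs, edits, outputs, metadata
    scores = steps["grade"](bundle)               # 4. Grade with the task rubrics
    report = steps["verify"](scores, trace)       # 5. Verify by attaching config, cost, latency
    return steps["release"](report)               # 6. Release a stable, inspectable report link
```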

Reproducibility Requirements. Verified submissions should include the benchmark split, harness name, model identifier, runtime configuration, workspace constraints, logs, final outputs, cost and latency summary, and a stable report link that lets maintainers replay or inspect the run.
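
A minimal manifest check that mirrors this list; the field names are illustrative labels for the required items, not a defined schema:

```python
REQUIRED_FIELDS = (
    "benchmark_split", "harness_name", "model_identifier",
    "runtime_configuration", "workspace_constraints", "logs",
    "final_outputs", "cost_summary", "latency_summary", "report_link",
)

def missing_fields(submission: dict) -> list[str]:
    """Return the required reproducibility fields absent from a submission manifest."""
    return [field for field in REQUIRED_FIELDS if not submission.get(field)]
```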