Required Files

Result JSON, report link, logs, environment configuration, and task split details.

Recommended Files

Run scripts, cost summary, patched artifacts, and a short write-up of known failure cases.
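
A quick way to catch a missing deliverable before submitting is to check the package contents against these two lists. The sketch below is illustrative only: it assumes the submission is packaged as a single directory, and the specific file and folder names (results.json, report_url.txt, and so on) are placeholders rather than names mandated by Workspace-Bench.

# Pre-submission check of a submission directory against the required and
# recommended file lists above. All names here are illustrative assumptions.
from pathlib import Path

REQUIRED = [
    "results.json",       # result JSON
    "report_url.txt",     # report link
    "logs",               # run logs
    "environment.yaml",   # environment configuration
    "split.txt",          # task split details
]
RECOMMENDED = [
    "run.sh",             # run scripts
    "cost_summary.md",    # cost summary
    "patches",            # patched artifacts
    "known_failures.md",  # short write-up of known failure cases
]

def check_package(root: str) -> bool:
    base = Path(root)
    ok = True
    for name in REQUIRED:
        if not (base / name).exists():
            print(f"MISSING (required): {name}")
            ok = False
    for name in RECOMMENDED:
        if not (base / name).exists():
            print(f"missing (recommended): {name}")
    return ok

if __name__ == "__main__":
    check_package("submission")
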

Verified Results

To receive the verified label, a submission must be reproducible by maintainers end to end from the provided materials.

Submission Checklist

Every submission should make the evaluation setup explicit.

Agent Metadata

Agent harness name, backbone model, version, tool permissions, and prompt or policy configuration.

Environment Metadata

Runtime versions, hardware notes if relevant, benchmark split, cost budget, timeout policy, and workspace limits.
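
One convenient way to cover the two metadata items above is to record them together in a single machine-readable file shipped with the submission. The sketch below illustrates that under stated assumptions: the file name metadata.json, the harness and model names, and the specific budget and limit values are placeholders, not values required by the benchmark.

# Illustrative capture of agent and environment metadata in one file.
# Field names mirror the checklist above; concrete values are hypothetical.
import json
import platform

metadata = {
    "agent": {
        "harness": "my-harness",            # hypothetical harness name
        "model": "my-backbone-model",       # hypothetical backbone model
        "version": "0.1.0",
        "tool_permissions": ["shell", "browser"],
        "prompt_config": "prompts/default.yaml",  # prompt or policy configuration
    },
    "environment": {
        "python": platform.python_version(),      # runtime version
        "hardware": "8 vCPU / 32 GB RAM",         # note hardware only if relevant
        "split": "lite",                          # benchmark split
        "cost_budget_usd": 50.0,
        "timeout_minutes": 30,                    # timeout policy
        "workspace_limit_mb": 512,                # workspace limits
    },
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
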

Execution Artifacts

Logs, final deliverables, patch summaries, and any auxiliary validation outputs generated during the run.

Reproduction Route

A stable report URL or repository path that maintainers can use to replay or inspect the run end to end.

Result JSON Schema

{ "agent": "string", "harness": "string", "model": "string", "benchmark": "Workspace-Bench", "split": "full | lite", "overall_score": 0.0, "task_success_rate": 0.0, "rubric_pass_rate": 0.0, "cost_usd": 0.0, "runtime_minutes": 0.0, "workspace_size": "string", "date": "YYYY-MM-DD", "verified": false, "report_url": "string" }
Submit benchmark questions or result packages through the project repository: OpenDataBox/Workspace-Bench.