Official Leaderboard
Workspace-Bench compares AI agents on realistic workspace tasks with large-scale file dependencies. Scores below combine task-level success and rubric-level grading. Rows marked Verified correspond to results reported in the paper or verified by the maintainers.
Public Lite Rankings
Public Workspace-Bench-Lite rows, one per harness/model combination, generated from the latest detailed rubric pass table.
Workspace-Bench Leaderboards
Framework x Model Matrix
Matrix view of public Workspace-Bench-Lite rubric pass rates. Blank cells mean the latest detailed result table does not contain that framework/model combination.
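As a rough illustration of how such a matrix can be derived from flat result rows, the sketch below pivots rows into a framework-by-model grid and leaves a cell blank when no row exists for that combination. The field names (framework, model, pass_rate), the helper names, and all values are hypothetical placeholders, not the benchmark's actual schema or results.

```python
def build_matrix(rows):
    """Pivot flat result rows into {framework: {model: pass_rate or None}}."""
    frameworks = sorted({row["framework"] for row in rows})
    models = sorted({row["model"] for row in rows})
    lookup = {(row["framework"], row["model"]): row["pass_rate"] for row in rows}
    # A missing (framework, model) combination becomes None, i.e. a blank cell.
    return {
        fw: {model: lookup.get((fw, model)) for model in models}
        for fw in frameworks
    }


def render_matrix(matrix):
    """Render the matrix as tab-separated text; None is shown as an empty cell."""
    models = list(next(iter(matrix.values()))) if matrix else []
    lines = ["\t".join(["framework"] + models)]
    for fw, cells in matrix.items():
        values = ["" if cells[m] is None else f"{cells[m]:.1f}" for m in models]
        lines.append("\t".join([fw] + values))
    return "\n".join(lines)


# Placeholder rows (assumed field names, made-up values for illustration only).
rows = [
    {"framework": "framework_a", "model": "model_x", "pass_rate": 50.0},
    {"framework": "framework_a", "model": "model_y", "pass_rate": 40.0},
    {"framework": "framework_b", "model": "model_x", "pass_rate": 30.0},
]

matrix = build_matrix(rows)
print(render_matrix(matrix))
```

Keeping missing combinations as None rather than 0 keeps "not evaluated" distinct from "evaluated and failed", which is what a blank cell is meant to convey.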
Threshold Views
Compare the average number of Lite tasks passed under each rubric threshold, using the detailed pass_at columns.
Public threshold summary
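One way to read a threshold view is sketched below. It assumes, hypothetically, that each task in a run has a rubric pass rate between 0 and 1, that a task counts as passed at threshold t when its rate is at least t, and that the per-run counts are then averaged. The function names, thresholds, and numbers are illustrative only; the benchmark's actual pass_at definition may differ.

```python
def passed_at(task_rates, threshold):
    """Count tasks whose rubric pass rate meets or exceeds the threshold."""
    return sum(1 for rate in task_rates if rate >= threshold)


def average_passed(runs, thresholds):
    """Average the passed-task count over runs, separately for each threshold."""
    if not runs:
        raise ValueError("at least one run is required")
    return {
        t: sum(passed_at(run, t) for run in runs) / len(runs)
        for t in thresholds
    }


# Placeholder data: two runs over the same four tasks (made-up rates).
runs = [
    [1.0, 0.8, 0.5, 0.2],
    [0.9, 0.6, 0.5, 0.0],
]

summary = average_passed(runs, thresholds=(0.5, 0.8, 1.0))
for threshold, avg in summary.items():
    print(f"pass_at_{threshold}: {avg}")
```

Under this reading, raising the threshold can only keep the count the same or lower it, so the columns should be non-increasing from the loosest threshold to the strictest.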
Composition Analysis
Compare the task-ability composition of the full and Lite benchmarks directly, based on the latest official metadata analysis.
Leaderboard Analysis
Secondary views derived from the released Lite leaderboard rows and benchmark metadata.