What is Grok 4's performance on METR's task length evaluation?
0 to 1.5 Hours: 40%
1.5 to 2 Hours: 40%
2 to 2.5 Hours: 15%
2.5 to 3 Hours: 4%
More than 3 Hours: 1.3%
Resolves based on METR's measurement of the length of tasks that Grok 4 can complete with a 50% success rate (its 50% time horizon).
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Grok 4 Heavy does not count.
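
To make the resolution rule concrete, here is a minimal Python sketch of how a reported 50% time horizon could be mapped to one of the answer ranges. The function name `answer_bucket` is hypothetical, and the handling of values that land exactly on a boundary (assigned to the lower range here) is an assumption, since the market description does not specify it.

```python
import bisect

# Upper edges, in hours, of the first four answer ranges listed in the market.
EDGES = [1.5, 2.0, 2.5, 3.0]
LABELS = [
    "0 to 1.5 Hours",
    "1.5 to 2 Hours",
    "2 to 2.5 Hours",
    "2.5 to 3 Hours",
    "More than 3 Hours",
]


def answer_bucket(horizon_hours: float) -> str:
    """Return the answer label for a 50% time horizon given in hours.

    Assumption (not stated by the market): a value exactly on a boundary,
    such as 2.0, is assigned to the lower range.
    """
    if horizon_hours < 0:
        raise ValueError("time horizon cannot be negative")
    # bisect_left returns the index of the first edge >= horizon_hours,
    # which is also the index of the matching label.
    return LABELS[bisect.bisect_left(EDGES, horizon_hours)]


if __name__ == "__main__":
    # Hypothetical input: a horizon of 2 hours 15 minutes.
    print(answer_bucket(2.25))
```

With the illustrative input of 2.25 hours, the sketch selects "2 to 2.5 Hours". If the market creator resolves boundary values differently, swapping `bisect_left` for `bisect_right` would assign them to the upper range instead.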