
Will there be an LLM which scores above what a human can do in 2 hours on METR's eval suite before 2026?
Ṁ442 · Jan 2
75% chance
METR has found that current frontier models score on its autonomy benchmark roughly on par with a human given 30 minutes. Will at least one model score at the level of a human given 2 hours before 2026?

Clarifications:
I will try to resolve this market in accordance with the current task suite. If METR makes the suite harder or easier, I will try to account for this in the resolution of this market.
If I am not able to determine the performance of frontier models at the end of 2025, this market will resolve N/A.
Related questions
LLM Hallucination: Will an LLM score >90% on SimpleQA before 2026?
60% chance
Will there be major breakthrough in LLM Continual Learning before 2026?
12% chance
Will an LLM report >50% score on ARC in 2025?
99% chance
Will an LLM agent complete >50% of the lab tasks on the Factorio Learning Environment benchmark in 2025?
48% chance
Will one of the major LLMs be capable of continual lifelong learning (learning from inference runs) by EOY 2025?
4% chance
LLM reaches >90% Brier score on Prophet Arena by 2026?
5% chance
Will LLMs be able to formally verify non-trivial programs by the end of 2025?
21% chance
Will LLMs be better than typical white-collar workers on all computer tasks before 2026?
4% chance
Will a publicly-available LLM achieve gold on IMO before 2026?
20% chance
Will the highest-scoring LLM on Dec 31, 2026 show <10% improvement over 2025's best average benchmark performance?
59% chance