Will there be an LLM which scores above what a human can do in 2 hours on METR's eval suite before 2026? | Manifold

Mini

11

Ṁ442

Jan 2

75% chance

METR has found that current frontier models score on its autonomy benchmark roughly as well as a human who is given 30 minutes. Will at least one model score at the level of a human given 2 hours before 2026?

Clarifications:

I will try to resolve this market in accordance with the current task suite. If METR makes the suite harder or easier, I will try to account for this in the resolution of this market.
If I am not able to determine the performance of frontier models at the end of 2025, this market will be resolved NA.

#Technical AI Timelines

#Machine Learning

Related questions

LLM Hallucination: Will an LLM score >90% on SimpleQA before 2026?

Will an LLM report >50% score on ARC in 2025?

Will one of the major LLMs be capable of continual lifelong learning (learning from inference runs) by EOY 2025?

Will LLMs be able to formally verify non-trivial programs by the end of 2025?

Will a publicly-available LLM achieve gold on IMO before 2026?

Will there be major breakthrough in LLM Continual Learning before 2026?

Will an LLM agent complete >50% of the lab tasks on the Factorio Learning Environment benchmark in 2025?

LLM reaches >90% Brier score on Prophet Arena by 2026?

Will LLMs be better than typical white-collar workers on all computer tasks before 2026?

Will the highest-scoring LLM on Dec 31, 2026 show <10% improvement over 2025's best average benchmark performance?
