Before 2028, will any AI model achieve the same or greater benchmarks as o3 high with <= 1 million tokens per question?
2028 · 86% chance

Specifically, the key benchmarks here are ARC-AGI, Codeforces Elo, and FrontierMath. The scores to match are a 2727 Codeforces Elo, 87.5% on the ARC-AGI semi-private evaluation set, and 25.2% on FrontierMath.

The model must achieve these benchmarks while using no more than 1,000,000 reasoning tokens per question on average.

For context, o3's high-compute configuration used roughly 5.7B tokens across the ARC semi-private evaluation (about 57M tokens per task) to reach its 87.5% score. Its low-compute configuration scored 75.7% using about 33M tokens in total (roughly 330K per task).

https://arcprize.org/blog/oai-o3-pub-breakthrough
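
As a rough, illustrative check of how the blog's reported totals translate into this market's per-question budget, here is a back-of-the-envelope sketch. The 5.7B and 33M token totals and the 100-task size of the semi-private set are taken from the ARC Prize post; the per-task figures are derived here for illustration and are not officially reported numbers.

```python
# Back-of-the-envelope comparison of o3's reported ARC token usage
# against this market's 1M-tokens-per-question limit.

TOKEN_LIMIT_PER_QUESTION = 1_000_000  # this market's threshold

SEMI_PRIVATE_TASKS = 100  # tasks in the ARC semi-private eval (per the ARC Prize post)

# Reported token totals for the two o3 configurations on the semi-private eval.
configs = {
    "o3 high-compute (87.5%)": 5_700_000_000,  # ~5.7B tokens total
    "o3 low-compute (75.7%)": 33_000_000,      # ~33M tokens total
}

for name, total_tokens in configs.items():
    per_task = total_tokens / SEMI_PRIVATE_TASKS
    verdict = "within" if per_task <= TOKEN_LIMIT_PER_QUESTION else "over"
    print(f"{name}: ~{per_task:,.0f} tokens/task -> {verdict} the 1M limit")

# Output:
# o3 high-compute (87.5%): ~57,000,000 tokens/task -> over the 1M limit
# o3 low-compute (75.7%): ~330,000 tokens/task -> within the 1M limit
```

Note that the low-compute configuration already fits comfortably inside the 1M-token budget but falls short of the 87.5% ARC threshold, which is why the question is whether a model can hit the high-compute scores within the smaller budget.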

Also note that if the final released version of o3 posts improved or worsened benchmark scores, the goalposts will not change: a qualifying model must meet or beat the numbers listed here.


Should I resolve this N/A? It is clear now that o3 high does not actually use millions of tokens per question; that figure referred to consensus@1024 prompting.
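
For a sense of scale behind this comment: if the high-compute total did come from consensus@1024 prompting, a single reasoning chain would be far under the 1M limit. The sketch below assumes the ~5.7B total and 100-task semi-private set from the ARC Prize post; the per-sample figure is an illustrative estimate, not a reported number.

```python
# Rough per-sample estimate if o3 high-compute's ARC total came from
# sampling 1024 chains per task (consensus@1024), as the comment suggests.
total_tokens = 5_700_000_000   # ~5.7B tokens across the semi-private eval
tasks = 100                    # tasks in the semi-private eval
samples_per_task = 1024        # consensus@1024

tokens_per_sample = total_tokens / tasks / samples_per_task
print(f"~{tokens_per_sample:,.0f} tokens per individual sample")
# ~55,664 tokens per individual sample -- well below 1M per question
```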

@JaundicedBaboon I vote yes

Filled a Ṁ350 YES order at 88%

I'm not certain that someone will run these evals and report token counts, but on underlying capabilities I'm about 99.5% confident of this.