By the end of Q1 2025, will an open-source model beat OpenAI’s o1 model?


134

Ṁ16k

resolved Nov 27

Resolved

YES


DeepSeek releases R1-Lite

By March 31, 2025, will an open-source AI model—with weights available for commercial use and requiring attribution similar to Meta’s Llama—be released that outperforms OpenAI’s new o1-preview model on established benchmarks?

1. Time Frame: The deadline is the end of the first quarter of 2025 (March 31, 2025).

2. Criteria for the Open-Source Model:

- Availability: Weights must be available for commercial use.

- Attribution and license: Terms must resemble those Meta and others have used previously.

3. Performance Benchmark: The model must outperform OpenAI’s new o1-preview model on established benchmarks (at least 2 major ones) that it currently leads on.
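One way to read the three criteria above is as a simple check. The sketch below is only an illustration: the `Candidate` class, the `meets_criteria` function, the benchmark names, and every score are hypothetical placeholders, not real results and not the market creator's actual resolution procedure.

```python
from dataclasses import dataclass
from datetime import date

DEADLINE = date(2025, 3, 31)  # end of Q1 2025, per criterion 1


@dataclass
class Candidate:
    """Hypothetical record describing an open-weights model."""
    name: str
    released: date
    weights_available: bool
    commercial_use_allowed: bool
    scores: dict[str, float]  # benchmark name -> score (higher is better)


def meets_criteria(candidate: Candidate,
                   baseline_scores: dict[str, float],
                   min_wins: int = 2) -> bool:
    """Return True if the candidate satisfies all three criteria.

    baseline_scores is assumed to hold only benchmarks that
    o1-preview led on when the market was written.
    """
    if candidate.released > DEADLINE:
        return False
    if not (candidate.weights_available and candidate.commercial_use_allowed):
        return False
    # Count benchmarks where the candidate strictly beats the baseline;
    # a benchmark the candidate has no score for never counts as a win.
    wins = [
        bench for bench, base in baseline_scores.items()
        if candidate.scores.get(bench, float("-inf")) > base
    ]
    return len(wins) >= min_wins


# Placeholder numbers for illustration only.
baseline = {"bench_a": 50.0, "bench_b": 80.0, "bench_c": 70.0}
example = Candidate(
    name="example-model",
    released=date(2024, 11, 27),
    weights_available=True,
    commercial_use_allowed=True,
    scores={"bench_a": 55.0, "bench_b": 85.0},
)
result = meets_criteria(example, baseline)
```

The sketch leaves open which benchmarks count as "major" and how to treat ties, so whoever resolves the market still has to make those judgment calls.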

#AI

#Technology

#OpenAI

#Technical AI Timelines


🏅 Top traders

#	Name	Total profit
1		Ṁ639
2		Ṁ331
3		Ṁ324
4		Ṁ251
5		Ṁ248

28 Comments


Qwen's QwQ has met all criteria for the challenge.

- It beats o1-preview on both AIME and MATH-500
- Is available for commercial use under the Apache 2.0 license
- Has weights available on Hugging Face

https://qwenlm.github.io/blog/qwq-32b-preview/

@JohnL ah this question is for o1-preview and not o1 (as the title says)?

@MalachiteEagle o1-preview was the only o1 model with metrics available. This is also made clear in the description.

@JohnL Then there is a conflict between the title and the description for this question

What model resolved the market and what were the benchmarks?

Please god, open source DeepSeek R1 Lite.

opened a Ṁ116 YES at 50% order

Score for o1 just posted at 1355. It's further ahead than I thought, so I think this might make this market less likely to resolve Yes.

o1 seems particularly strong at coding tasks, so you should probably specify which benchmark you will use.

Misleading title. The resolution is about o1-preview, not about o1.

And I bet on the previous spec before it was changed. I would prefer this market be N/Aed and a fresh market made

bought Ṁ50 NO

"3. Performance Benchmark: The model must outperform OpenAI’s new o1-preview model on established benchmarks (at least 2 major ones) that it currently leads on."

There are so many ways of evaluating, and so many benchmarks out there. IMO there's a lot to gain from specifying concretely, e.g. LMSYS code, GPQA, SWE-bench, etc. @JohnL? Probably worth specifying further: use the best available result (any scaffold) at time of resolution for both o1 and the OSS contender.

https://github.com/bklieger-groq/g1 potentially related?

GitHub - bklieger-groq/g1: g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains


@jerkyenox when the independent rivals drop tag me!

@JohnL evals*

The model must outperform OpenAI’s o1 preview (or full) model on at least two widely recognized AI benchmarks

That's already true today. https://github.com/openai/simple-evals?tab=readme-ov-file#benchmark-results

MGSM and DROP are higher on Llama 405B.

@Usaar33 edited to be about beating current benchmarks

bought Ṁ75 YES

o1 is specialized for STEM, so it sucks at creative writing. It wouldn't be surprising for any model to beat it on two major creative writing benchmarks, or something like that.

@Bayesian oh, this seems like a loophole to close, relative to the spirit of this market

Maybe. If so, I think the title or description should make clear that the open source model has to beat o1 on things o1 is good at.

@Bayesian Agreed. Hmm, now description says 'that it currently leads on' which still isn't as clear as I'd like

@JamesBaker3 Could say 2 of [AIME, Codeforces, GPQA Diamond, MMMU]?

@JamesBaker3 alternative?

@Bayesian check my edit and give me thoughts

@JohnL yeah I think this is fine.

Does 'o1' refer to the recently released 'o1-preview', the upcoming 'o1' (which OpenAI has claimed to be meaningfully better than 'o1-preview'), or whatever the best iteration of o1 is that is publicly available by the deadline? What happens in the second case if 'o1' isn't released by the deadline?

@axiomnull clarified to be preview

@JohnL woah, I assumed (and bet on) o1 because you wrote o1.

Because that's not a "clarification", that's a substantial change from one model to a different model
