Will an AI win a gold medal on International Math Olympiad (IMO) 2025?
119 traders · Ṁ90k · closes Aug 20 · 66% chance

Will an AI score well enough on the 2025 International Mathematics Olympiad (IMO) to earn a gold medal score (top ~50 human performance)? Resolves YES if this result is reported no later than 1 month after IMO 2025 (currently scheduled for July 10-20). The AI must complete this task under the same time limits as human competitors. The AI may receive and output either informal or formal problems and proofs. More details below. Otherwise NO.

This is related to https://imo-grand-challenge.github.io/ but with some different rules.

Rules:

  • The result must be achieved on the IMO 2025 problemset and be reported by reliable publications no later than 1 month after the end of the IMO contest dates listed at https://www.imo-official.org/organizers.aspx, i.e. by the end of August 20, 2025 (local timezone at the contest site), if the IMO does not reschedule.

  • The AI has only as much time as a human competitor (4.5 hours for each of the two sets of 3 problems), but there are no other limits on the computational resources it may use during that time.

  • The AI may receive and output either informal (natural language) or formal (e.g. the Lean language) problems as input and proofs as output.

  • The AI cannot query the Internet.

  • The AI must not have access to the problems before being evaluated on them, e.g. the problems cannot be included in the training set.

    • (The deadline of 1 month after the competition is intended to give enough time for results to be finalized and published, while minimizing the chances of any accidental inclusion of the IMO solutions in the training set.)

  • If a gold medal score is achieved on IMO 2024 or an earlier IMO, that would not count for this market.




Leading LLMs get <5% scores on USAMO (part of the process that selects the US team for the IMO): https://arxiv.org/abs/2503.21934

@pietrokc Yeah, I saw this. Very strange; it's hard to see how this squares with the really high performance we see elsewhere. It seems to point to train/test contamination.

But was FrontierMath also contaminated?


Current LLMs trained with RL for reasoning largely do it on short, answer-based problems, not proof-based problems, so they learn to take shortcuts; for proof-based problems they are currently pretty bad. That is the essence of the difference: FrontierMath is not proof-based, USAMO is proof-based, and LLMs currently do well on one and badly on the other. For proofs, the best current systems seem not to be LLMs but systems like Google's AlphaProof.

bought Ṁ500 YES · 10d

@Bayesian Yeah, the way AlphaProof/AlphaGeometry avoid making reasoning mistakes is simply by requiring formal proofs, unlike LLMs, which generate informal proofs.
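
To make that concrete, here is a minimal Lean 4 sketch of what the formal route looks like (toy statements, not actual IMO problems): the problem arrives as a theorem statement and the submission is a proof that the Lean kernel checks mechanically, so an unsound step simply fails to compile.

```lean
-- Toy illustration of the formal problem/proof format (not real IMO problems).
-- The "problem" is a Lean theorem statement; the "solution" is a proof that
-- the kernel verifies mechanically, with no human grading of prose.

theorem toy_statement (n : Nat) : n + 0 = n := by
  rfl

-- A slightly less trivial example, using a lemma from Lean's core library:
theorem toy_statement' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```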


@Bayesian It's very misleading to say that FrontierMath is not proof-based. Of course it's proof-based. All real math is proof-based. They just ask that the proof be of a fact of the form "certain definition picks out a certain number", to make it easier to check automatically.

@CampbellHutcheson There's been a lot of controversy about FrontierMath which I'll not rehash here. In my personal experience, all models fall FAR short of claims that they can do "research-level math that would take a professional mathematician hours or days". They routinely fail relatively trivial things whenever I test them. I have also tried to earnestly use them to learn actual math, like, existing fields that I'm just not that familiar with. I have found them to be worse than useless at that, because they'll confidently state falsehoods that take effort to disprove.

bought Ṁ150 YES · 9d

@pietrokc Gemini 2.5 does a lot better: about 25%.

@Usaar33 USAMO 2025 was on 19-20 March. Then someone evaluates Gemini 2.5 on 2 April and it does massively better than models released before that date. What conclusion do you want to draw from this?


@pietrokc I am not following. How is FrontierMath proof-based? They don’t look at the reasoning, only at whether the answer was correct. The AI can find the right answer by coincidence, or by wrong reasoning cancelling out, and it’s still graded as correct, unlike with proofs.
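
For what it's worth, a minimal sketch of answer-only grading (hypothetical function names, not FrontierMath's actual harness), which is exactly why a lucky or wrongly-derived answer can still score:

```python
# Hypothetical sketch of answer-only grading (not FrontierMath's real harness).
# The grader sees only the final answer, never the reasoning that produced it.

def grade_answer_only(submitted_answer: int, reference_answer: int) -> bool:
    """Full credit if the final answer matches, regardless of how it was derived."""
    return submitted_answer == reference_answer

# A wrong derivation that happens to land on the right number still scores:
print(grade_answer_only(submitted_answer=42, reference_answer=42))  # True

# A proof-based contest (USAMO/IMO) instead requires checking every step,
# either by human graders or by a formal proof checker, so there is no
# one-line equivalent of this function.
```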


@pietrokc

It's very misleading to say that FrontierMath is not proof-based. Of course it's proof-based.

Bayesian is correct: obviously most math is “proof-based” in some trivial sense that isn’t really relevant here. What matters is whether you are scored on the correctness of the complete proof you produce. Many recent LLMs have struggled on such tests, even when they have performed well on other math that just requires a narrow correct answer (you are welcome to call that number a “proof”, fair enough, but it’s not the relevant distinction here).


Human contestants only get one chance to answer each question. I hope that means AI will not be judged on pass@k, where it gets k>1 chances to give a correct proof and gets points if at least one is correct. Each AI should also be judged on one submission for each question, right?
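
To illustrate the concern, a minimal sketch (hypothetical names and probabilities, not any lab's actual evaluation code) of how pass@k differs from single-submission judging:

```python
import random

# Hypothetical sketch contrasting pass@k scoring with single-submission judging.

def attempt_is_correct(p_correct: float) -> bool:
    """Stand-in for 'this attempt produced a correct proof'."""
    return random.random() < p_correct

def pass_at_k(p_correct: float, k: int) -> bool:
    """Counts as solved if ANY of k independent attempts is correct."""
    return any(attempt_is_correct(p_correct) for _ in range(k))

def single_submission(p_correct: float) -> bool:
    """Counts as solved only if the one submitted attempt is correct, like a human contestant."""
    return attempt_is_correct(p_correct)

# With a 30% per-attempt success rate, pass@8 looks far stronger than one submission:
random.seed(0)
trials = 10_000
print(sum(pass_at_k(0.3, 8) for _ in range(trials)) / trials)       # ~0.94  (1 - 0.7**8)
print(sum(single_submission(0.3) for _ in range(trials)) / trials)  # ~0.30
```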

opened a Ṁ2,000 NO at 59% order · 8mo

2000 limit no at 59%


“no later than 1 month after the end of the IMO contest (so by end of August 20 2025, if the IMO does not reschedule its date).”

  1. What time zone?

  2. Even though the timeline on the website has like 10 days, the actual contest is some time in the middle of those dates, so there’s technically a month and a few days where the problems are available.

Good questions.

  1. Local timezone at contest site.

  2. I'm going to use the end of the IMO dates as written at https://www.imo-official.org/organizers.aspx, even though the actual contest is in the middle, because that's what I wrote and the exact number of days doesn't really matter.

bought Ṁ350 YES · 8mo

So by these criteria, it's fine if the AI isn't finalized before the IMO, as long as it doesn't train on the IMO problems? This seems like it opens the possibility for small tweaks to the program to be made that bias the algorithm to be better at some tasks than others, and for the nature of these tweaks to depend on the content of the problems.


Right, you could just try many versions of something like this year's AlphaProof, and one would very likely qualify by chance.

This is also unlikely to be something the public or "reliable publications" could verify (hence the open source requirement for the IMO Grand Challenge), so it seems we'd just be taking the AI developer's word for it.

Note that in a lot of IMO criteria, like Eliezer's, the AI can be produced long after the contest and you mostly just have to trust the AI developers on whether they cheated.

While you can run multiple versions, you could already do that anyway; the only difference is that you might have humans decide which tweaks to try based on the problems (juicing the evals), or sort of cheat the time limits by not counting the time used for earlier versions you tried. So at least the cheats are much more limited.

Most models are closed, and it is quite likely that the model will never be published unless they are specifically going for the IMO Grand Challenge. So it's very hard to set requirements around the AI being finalized before the competition, unless you have an open-model requirement.

Right, you could just try many versions of something like this year's AlphaProof, and one would very likely qualify by chance.

I highly doubt it would be able to solve the combinatorics problems no matter how many versions you tried.

And if that worked, then your winning AI system is just the collection of versions acting as subagents. (Assuming, as mentioned above, that you don't have humans deciding the tweaks based on the questions, and that it isn't cheating the time controls.)

Overall I think my criteria balance false-positive (cheating) and false-negative potential about as well as possible. I haven't seen or thought of any verification requirements that would have prevented the hypothetical cheating scenarios above while still allowing the IMO silver market to resolve YES on the DeepMind announcement (if it had met the time controls), and I definitely want my question to resolve YES on that.


@jack

I highly doubt it would be able to solve the combinatorics problems no matter how many versions you tried.

We are probably referring to different levels of model capabilities. I see a lot of probability mass on models that are correct, say, 5-50% of the time.

I'd agree that trying to resolve YES on the recent GDM announcement makes it hard to use strict criteria.
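
A rough back-of-the-envelope, assuming (generously) that versions succeed independently: with a per-version chance p of reaching a gold score, running n versions gives a 1 - (1 - p)^n chance that at least one qualifies, e.g. about 40% for p = 0.05 and n = 10, and about 97% for p = 0.5 and n = 5. In practice variants of the same system are correlated, so the real chance would typically be lower.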

Yeah, I was referring to capability at the level of AlphaProof right now.

AlphaProof is already trying tons of different proof strategies and checking to see what works!

Similar to this but with better, clearer resolution criteria and an earlier deadline.

See also recent news https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

And another market on whether an AI will hit 1st place on the IMO:
