Will the ARC Prize Foundation succeed at making a new benchmark that is easy for humans but still hard for the best AIs?
2026 · 82% chance

https://techcrunch.com/2025/01/08/ai-researcher-francois-chollet-is-co-founding-a-nonprofit-to-build-benchmarks-for-agi/

Specifically, this resolves YES if:

(1) A new benchmark is announced before the end of 2025; and

(2) The best AI result published within three months after the announcement is less than half of the human-level target. (For example, if human-level performance is claimed to be 80%, an AI will need to reach at least 40%.)

If multiple new benchmarks are created in 2025, this will resolve YES if condition 2 is true for any of them.

  • Update 2025-09-01 (PST) (AI summary of creator comment): Resolution Criteria Update:

    • CPU or compute cost caps will be ignored when evaluating the AI performance.

  • Update 2025-03-24 (PST) (AI summary of creator comment): Human-Level Target Update:

    • The human performance target is defined as 60% based on TechCrunch's report.

    • Consequently, condition (2) requires the best AI result to be less than half of 60% (i.e., below 30%).

  • Update 2025-03-24 (PST) (AI summary of creator comment): Human Performance Target and AI Score Requirement

    • With the average human performance target set at 60%, the best AI system must score below 30% within three months of a new benchmark announcement (see the sketch after these updates).

    • Compute Resources Ignored: any amount of compute is acceptable; CPU or compute cost caps are not considered in the evaluation.

    • Multiple Benchmark Clause: if multiple new benchmarks are released during 2025, the market cannot resolve NO until the end of the year, since a later benchmark could still satisfy condition (2).
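To make the half-of-human-target arithmetic concrete, here is a minimal Python sketch of the resolution check described above. The `resolves_yes` function and its inputs are hypothetical illustrations for this market, not part of any Manifold or ARC Prize tooling; the 60% default is the human average reported by TechCrunch.

```python
def resolves_yes(best_ai_scores, human_target=0.60):
    """Sketch of condition (2): resolve YES if, for at least one new benchmark
    announced in 2025, the best AI score published within three months of the
    announcement is less than half of the human-level target."""
    threshold = human_target / 2  # e.g. a 60% human average gives a 30% cutoff
    return any(score < threshold for score in best_ai_scores.values())

# Hypothetical example: a best published AI score of 25% against a 60% human
# average stays under the 30% cutoff, so condition (2) would be satisfied.
print(resolves_yes({"ARC-AGI-2": 0.25}))  # True
```

The check takes a score per benchmark because of the multiple-benchmark clause: any one benchmark announced in 2025 satisfying the condition is enough for YES.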


19d

Note: There's some discussion on the ARC Prize Discord about mistakes in the tests. So far, 8/120 tasks in the public evaluation set have been updated.

I don't know yet if they plan to fix anything in the private tests.

opened a Ṁ10,000 NO at 80% order · 20d

Fill me at 80%

20d

ngl i'm a coward and moved it up to 90%. 3 months isn't that long. but it's getting to 50% by EOY idk

bought Ṁ250 YES · 20d

@Bayesian if someone makes a market on 50% by EOY I'd bet against you there! (not at 80%, but at somewhere around even odds)

20d

@traders TechCrunch reports average human performance of 60%, so this will resolve YES if no AI system using any amount of compute resources reaches a score of 30% in the next three months.

https://techcrunch.com/2025/03/24/a-new-challenging-agi-test-stumps-most-ai-models/

But note that I also added a clause for multiple new benchmarks being released this year. So this can't resolve NO until the end of the year, in case another benchmark is created that meets the criteria.

20d

@mathvc Thanks, the clock starts today on the three months to see what the best AI result is.

I'm not sure how to define the "human performance" target though. I understand every question was solved successfully by at least two people, but I'm not sure what the average correct percentage is.

20d

Ah, TechCrunch reports a human average of 60%, so I'll go with that.

3mo

Just FYI: it is very easy to find tasks that make current LLMs fail. In particular, if you give them two very large, almost identical texts with 5 changes, they will fail to identify the changes.

Even worse is visual reasoning, due to limited data and transformer-unfriendly formats.

3mo

@mathvc o3 mostly succeeded at visual reasoning for the original ARC-AGI benchmark though. I'm curious how much harder they can make it while still keeping it easy for humans to solve.

3mo

@TimothyJohnson5c16 ARC is a special kind of visual reasoning (discrete 2D grids). There are many visual reasoning tasks beyond that.

3mo

ARC-AGI-1 had a $10,000 cap on compute cost. If ARC-AGI-2 has a similar cap, but a system exceeds the half-of-human target by spending more than the cap, does that still resolve YES?

3mo

@Nick6d8e Hmm, good question. I'm interested in comparing with o3's performance on ARC-AGI-1, and I understand they spent up to $1,000 per question, so I think I'll ignore the CPU cap.
