Will the ARC Prize Foundation succeed at making a new benchmark that is easy for humans but still hard for the best AIs?
2026 · 82% chance

https://techcrunch.com/2025/01/08/ai-researcher-francois-chollet-is-co-founding-a-nonprofit-to-build-benchmarks-for-agi/

Specifically, this resolves YES if:

(1) A new benchmark is announced before the end of 2025; and

(2) The best AI result published within three months after the announcement is less than half of the human-level target. (For example, if human-level performance is claimed to be 80%, an AI will need to reach at least 40%.)

If multiple new benchmarks are created in 2025, this will resolve YES if condition 2 is true for any of them.

  • Update 2025-09-01 (PST) (AI summary of creator comment): Resolution Criteria Update:

    • CPU or compute cost caps will be ignored when evaluating the AI performance.

  • Update 2025-03-24 (PST) (AI summary of creator comment): Human-Level Target Update:

    • The human performance target is defined as 60% based on TechCrunch's report.

    • Consequently, condition (2) requires the best AI result to be less than half of 60% (i.e., below 30%).

  • Update 2025-03-24 (PST) (AI summary of creator comment): Human Performance Target and AI Score Requirement

    • With the average human performance target set at 60%, the best AI system must score below 30% within three months of a new benchmark announcement (see the sketch after these updates).

    • Compute Resources Ignored: any amount of compute is acceptable; CPU or compute cost caps are not considered in the evaluation.

    • Multiple Benchmark Clause: if multiple new benchmarks are released during 2025, the market cannot resolve NO until the end of the year, since a later benchmark could still satisfy condition (2).
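To make the half-of-human-target arithmetic concrete, here is a minimal Python sketch of the resolution check described above. The `resolves_yes` function and its inputs are hypothetical illustrations for this market, not part of any Manifold or ARC Prize tooling; the 60% default is the human average reported by TechCrunch.

```python
def resolves_yes(best_ai_scores, human_target=0.60):
    """Sketch of condition (2): resolve YES if, for at least one new benchmark
    announced in 2025, the best AI score published within three months of the
    announcement is less than half of the human-level target."""
    threshold = human_target / 2  # e.g. a 60% human average gives a 30% cutoff
    return any(score < threshold for score in best_ai_scores.values())

# Hypothetical example: a best published AI score of 25% against a 60% human
# average stays under the 30% cutoff, so condition (2) would be satisfied.
print(resolves_yes({"ARC-AGI-2": 0.25}))  # True
```

The check takes a score per benchmark because of the multiple-benchmark clause: any one benchmark announced in 2025 satisfying the condition is enough for YES.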


19d

Note: There's some discussion on the ARC Prize Discord about mistakes in the tests. So far, 8/120 tasks in the public evaluation set have been updated.

I don't know yet if they plan to fix anything in the private tests.

opened a Ṁ10,000 NO at 80% order · 20d

Fill me at 80%

20d

ngl i'm a coward and moved it up to 90%. 3 months isn't that long. but it's getting to 50% by EOY idk

bought Ṁ250 YES · 20d

@Bayesian if someone makes a market on 50% by EOY I'd bet against you there! (not at 80%, but at somewhere around even odds)

20d

@traders TechCrunch reports average human performance of 60%, so this will resolve YES if no AI system using any amount of compute resources reaches a score of 30% in the next three months.

https://techcrunch.com/2025/03/24/a-new-challenging-agi-test-stumps-most-ai-models/

But note that I also added a clause for multiple new benchmarks being released this year. So this can't resolve NO until the end of the year, in case another benchmark is created that meets the criteria.

20d

@mathvc Thanks, the clock starts today on the three months to see what the best AI result is.

I'm not sure how to define the "human performance" target though. I understand every question was solved successfully by at least two people, but I'm not sure what the average correct percentage is.

20d

Ah, TechCrunch reports a human average of 60%, so I'll go with that.

3mo

Just FYI: it is very easy to find tasks that make current LLMs fail. In particular, if you give them two very large, almost identical texts with 5 changes, they will fail to identify the changes.

Even worse is visual reasoning, due to limited data and transformer-unfriendly formats.

3mo

@mathvc o3 mostly succeeded at visual reasoning for the original ARC-AGI benchmark though. I'm curious how much harder they can make it while still keeping it easy for humans to solve.

3mo

@TimothyJohnson5c16 ARC is a special kind of visual reasoning (discrete 2D grids). There are many visual reasoning tasks beyond that.

3mo

ARC-AGI-1 had a $10,000 cap on compute cost. If ARC-AGI-2 has a similar cap, but a system exceeds the half-of-human target by spending more than the cap, does that still resolve YES?

3mo

@Nick6d8e Hmm, good question. I'm interested in comparing with o3's performance on ARC-AGI-1, and I understand they spent up to $1,000 per question, so I think I'll ignore the CPU cap.
