Will scaling transformers lead to a 60% score on ARC-AGI-2?
18 traders · Ṁ675 · closes 2030 · 66% chance

Will any plain transformer model achieve 60% or more on ARC-AGI-2 by 2030?

The inference cost to achieve this result does not matter.

The model that achieves this result must use the same "transformer recipe" common between 2023 and 2025: techniques like RLHF, RLAIF, CoT, RAG, and vision encoders are allowed, but any specialized components must themselves be made of vanilla transformer blocks. Any new inductive biases, such as tree search or neurosymbolic logic, would not qualify.
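For concreteness, here is a minimal sketch of the kind of component the "transformer recipe" clause refers to: a pre-norm attention-plus-MLP block and nothing else. The layer sizes and layout are illustrative assumptions, not a description of any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A plain pre-norm transformer block: self-attention + MLP, nothing else.
    Hyperparameters are illustrative, not tied to any particular model."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention with a residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Position-wise MLP with a residual connection
        x = x + self.mlp(self.norm2(x))
        return x
```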

The result must be verified by at least one reputable, unaffiliated org (ARC, Epoch, OpenAI Evals, an academic lab, etc.) or demonstrated by a publicly re-runnable result (e.g., a notebook on Kaggle).

Resolution uses the ARC-AGI-2 evaluation set and scoring script as published on arcprize.org on the day this market opens. Later revisions are ignored.
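For illustration, this is roughly what ARC-style exact-match scoring looks like, assuming the published pass@2 convention (up to two attempts per test output, and a task counts only if every test output is matched exactly). This is a hypothetical sketch, not the official arcprize.org script.

```python
from typing import List

Grid = List[List[int]]

def task_solved(attempts: List[List[Grid]], truths: List[Grid]) -> bool:
    """A task counts as solved only if every test output is matched exactly
    by at least one of the (up to two) attempts for that output."""
    return all(
        any(attempt == truth for attempt in output_attempts)
        for output_attempts, truth in zip(attempts, truths)
    )

def score(all_attempts: List[List[List[Grid]]], all_truths: List[List[Grid]]) -> float:
    """Overall score: fraction of evaluation-set tasks solved."""
    solved = sum(task_solved(a, t) for a, t in zip(all_attempts, all_truths))
    return solved / len(all_truths)
```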

  • Update 2025-10-12 (PST) (AI summary of creator comment): PPO, GRPO, and RLVR are allowed training methods.
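For readers unfamiliar with the methods named in the update above, here is a toy sketch of the core GRPO idea (group-normalized advantages) paired with an RLVR-style verifiable reward. Everything here is illustrative and not tied to any published implementation.

```python
import statistics
from typing import List

def verifiable_reward(predicted_grid, target_grid) -> float:
    """RLVR-style reward: 1.0 for an exact grid match, 0.0 otherwise."""
    return 1.0 if predicted_grid == target_grid else 0.0

def grpo_advantages(rewards: List[float]) -> List[float]:
    """GRPO normalizes rewards within a group of samples drawn from the
    same prompt: advantage = (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```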

Generating synthetic data using other models to train the transformer is allowed, as long as the final model follows the common transformer recipe.
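A toy illustration of the allowed synthetic-data path: a teacher of any architecture labels inputs to produce training pairs, and only the final student model is constrained by the transformer-recipe rule. The `solve` callable is a hypothetical placeholder.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]

def generate_synthetic_pairs(
    solve: Callable[[Grid], Grid],  # hypothetical teacher: any model, any architecture
    inputs: List[Grid],
) -> List[Tuple[Grid, Grid]]:
    """Build (input, output) training pairs by labeling inputs with a teacher.
    The teacher is unconstrained; only the final trained model must follow
    the plain transformer recipe."""
    return [(x, solve(x)) for x in inputs]
```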


@lumi are PPO or GRPO, or I guess RLVR, allowed?

also, is it allowed to generate synthetic data (using other models) to train this model on, as long as training follows the common transformer recipe?

@Bayesian both are allowed

bought Ṁ110 YES from 48% to 81%

@CraigDemel wanna bet more on this around market price? I can do a lot more volume

or anyone else; ping me

@Bayesian good for now, thanks!

Honestly, I don't believe a score of 60% or more on ARC-AGI-2 constitutes AGI in any meaningful sense:

Humans can score 100%, not 60.

It's a single benchmark that doesn't really test the full breadth of capabilities. It's definitely possible to have a system that's good at this benchmark while being useless at other tasks.

I propose renaming the question.