Will GPT4/Opus report >50% score on ARC in 2024?
51 · Ṁ23k · Dec 31 · 45% chance

ARC is a general-purpose AI eval designed to test intelligence as opposed to memorization. https://arcprize.org/arc

This market resolves to Yes if there are public demonstrations of GPT4 or Claude Opus solving at least 50% of the ARC questions in 2024.

(Note that this is separate from winning the ARC Prize, which requires using only open-source models.)

bought Ṁ100 YES

https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o#comments

I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples

Just need 1 more %
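
For anyone who hasn't read the linked post, the core of the approach is a sample-and-select loop: ask GPT-4o for many candidate programs, keep the ones that reproduce the worked examples, and submit what the survivors predict. A minimal sketch is below; the helper `ask_llm_for_program`, the task dictionary layout, and the vote-counting step are simplifications for illustration, not Ryan Greenblatt's actual code, which is far more elaborate.

```python
from collections import Counter

def ask_llm_for_program(train_examples):
    """Hypothetical GPT-4o call that returns Python source implementing
    the transformation rule as a string (placeholder, not real code)."""
    raise NotImplementedError

def run_candidate(source, grid):
    """Exec an untrusted candidate program and apply its transform() to a grid."""
    namespace = {}
    try:
        exec(source, namespace)        # in practice this needs sandboxing
        return namespace["transform"](grid)
    except Exception:
        return None                    # broken candidates simply contribute nothing

def solve_task(task, num_samples=8000):
    """Sample many candidate programs, keep those that reproduce every
    training example, and submit the most common prediction on the test input."""
    votes = Counter()
    for _ in range(num_samples):
        source = ask_llm_for_program(task["train"])
        ok = all(run_candidate(source, ex["input"]) == ex["output"]
                 for ex in task["train"])
        if ok:
            prediction = run_candidate(source, task["test"]["input"])
            if prediction is not None:
                votes[tuple(map(tuple, prediction))] += 1
    return max(votes, key=votes.get) if votes else None
```

Programs that fail on any training example are discarded, so only proposals that generalize to all the given examples contribute to the answer.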

Shouldn't this already resolve YES, since the criteria say "at least 50%", not "more than 50%"? What do you think @Lonis?

Just awaiting an update here on the ARC-AGI-PUB leaderboard! https://arcprize.org/leaderboard

I'm confident Ryan's approach can't be counted as YES.
What he did is impressive and a great demonstration of LLMs' capabilities. However, the description of this market mentions only the LLMs and nothing about assisting scripts. So I think it's a huge stretch to resolve it as YES. What do you imagine when you read the question "Will X report >50% score on Y"? I'm sure it's something like "give X some Y input with instructions, then expect 50% of the output to be correct." Does that sound like the solution we have so far?
Sure, we don't have to be so technical, and I'd agree with YES even if the solution included any of the following:

  • converting the puzzles to ASCII representations to work around weak vision

  • giving the LLM very detailed instructions in the prompt, and giving different prompts for different types of puzzles.

  • allowing multiple tries with additional prompts explaining the mistakes to the LLM

But it's more than that. The Python program generates thousands of prompts with different representations of the grid and detailed instructions on how to approach the puzzles (I could accept YES if it stopped at this point). Then Ryan's script uses some sophisticated calculations to find the best 12 solutions and, based on those, generates new prompts that compare the expected results with what GPT's code produced, explaining where the LLM fell short (see the rough sketch below). The second stage also involves 3,000 new samples and picking the best ones.
I have nothing against this approach, but it doesn't fall under the "ChatGPT solving ARC" definition. It's more like "Ryan Greenblatt's script solves ARC by generating 8,000 prompts for ChatGPT".
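
To make that two-stage description concrete, here is a rough sketch of what such a revision pass could look like. The helper `llm_complete`, the prompt wording, the candidate format, and the sample counts are assumptions for illustration only, not the actual script.

```python
def llm_complete(prompt):
    """Hypothetical GPT-4o call returning a revised candidate program as source code."""
    raise NotImplementedError

def make_revision_prompt(program_source, train_examples, produced_outputs):
    """Build a follow-up prompt showing where a candidate program fell short."""
    lines = ["This Python program was meant to implement the transformation rule:",
             program_source, ""]
    for ex, got in zip(train_examples, produced_outputs):
        lines += [f"Input:    {ex['input']}",
                  f"Expected: {ex['output']}",
                  f"Produced: {got}",
                  ""]
    lines.append("Explain the discrepancy and output a corrected program.")
    return "\n".join(lines)

def revision_pass(best_candidates, task, total_samples=3000):
    """Second stage: resample corrected programs for the top candidates (e.g. the best 12).
    Each candidate is assumed to carry its source and a callable under 'run'."""
    per_candidate = total_samples // max(len(best_candidates), 1)
    revised = []
    for cand in best_candidates:
        produced = [cand["run"](ex["input"]) for ex in task["train"]]
        prompt = make_revision_prompt(cand["source"], task["train"], produced)
        revised += [llm_complete(prompt) for _ in range(per_candidate)]
    return revised
```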

@SashaPomirkovany I don't think that's a fair description. AFAIU it's certainly a kind of hybrid approach, where the scripting and the LLM's intelligence work together, but they strongly complement each other and bring out an unprecedentedly general problem-solving capability in each other, something neither could have achieved before. We can't discount this: a simple script could never achieve this result by itself; it leans very heavily on the LLM, and vice versa.

@Sjeyua, but what's not fair about my description? I mentioned how impressive the script itself is and how it opens up the LLM's capabilities. I'd say your description is a bit harsh, as you called Ryan's code a "simple script." It's not a simple script but a smart, sophisticated solution.
I agree that the way I mentioned 8,000 prompts may have sounded toxic. But it wasn't meant to criticize the solution, only to emphasize that it can't count as YES for this bet.
You mentioned that it's a hybrid solution, and the description doesn't say hybrid solutions are allowed. If the description said, "Automated solution scoring 50% on ARC with the use of ChatGPT," I wouldn't vote NO.

@SashaPomirkovany I'm sorry if I come off as harsh. I've thought a bit more about it since then, and my position is now somewhat softer, somewhere between resolving this as YES and resolving at 50%. If only a 75% resolution were possible, or some such.

For the record, I did not call his script simple, I just made the assertion that a simple script couldn't have achieved this by itself. But re-reading the context, I should likely omit the word 'simple' from there altogether, as maybe no currently human-writable script could achieve 50% by itself, no matter how complex.

I have another observation: even though there was (however advanced) scripting, scripting is in a very important sense static. Write once, execute many times. Which means that either

1) the script by itself embodies (at least some aspect of) general intelligence to some extent,

2) the script helps amplify general intelligence out of the LLM, even if it's only dimly present there,

3) the ARC-AGI challenge falls short in an important way in measuring the generality of intelligence, or

4) some combination of the above. Is there any other possible option?

And if no other option is possible, I'm once again leaning more towards resolving as YES (85%?), as I think those are good reasons to resolve YES.

Let me know what you think of that.

75% resolution is possible

@Sjeyua

I'm sorry if I come off as harsh

No, I didn't feel like you were harsh, it was mostly a joke. Sorry

scripting is in a very important sense static. Write once, execute many times.

The ARC evaluation set is also static. You can use the power of IF statements if you know beforehand what kinds of puzzles you will get. But such a static script would fail as soon as we replaced the static set with auto-generated puzzles of different (non-rectangular) shapes. You can verify that by checking the code.

the script by itself embodies (at least some aspect of) general intelligence to some extent

Even if this statement is true, it's a strong point for resolving the market as NO. ARC is an attempt at a metric for general intelligence. If it's the script that embodies GI, then the credit for scoring 50% should go to the script and not to ChatGPT.

helps general intelligence be amplified from the LLM, even if it's dimly present there

How would you prove this claim? Let me start with a counter-take. I think the general intelligence here is provided by human input. The script is the result of human intelligence generating a strategy and converting it to code. Ryan's code encodes the strategy of how to get better prompts, how many tries to make, and how to choose the best results. That's what GI does: implementing, trying, and then adapting strategies to solve a problem. However, the script's "GI" is limited to the hardcoded implementation, which is just enough to solve tasks from the public evaluation set. For example, notice how the script has hardcoded logic to split the puzzles into two buckets: (1) input and output grids of the same size, and (2) grids of different sizes. The script will fail if you give it a puzzle with a grid shaped like, say, a heptagon. To adapt, you would need a human to extend the script again. In the chain of Human/Script/ChatGPT, only the human can adapt when the puzzle becomes slightly different.
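
A toy illustration of the kind of hardcoded bucketing described here (a deliberate simplification, not the real code):

```python
def choose_prompt_style(task):
    """Toy example of hardcoded routing: pick one of two prompt styles depending
    on whether the output grid keeps the input grid's dimensions.  Anything
    outside these two rectangular-grid buckets has no branch at all, so a human
    would have to extend the script by hand."""
    train = task["train"]
    same_size = all(
        len(ex["input"]) == len(ex["output"]) and
        len(ex["input"][0]) == len(ex["output"][0])
        for ex in train
    )
    return "same-size-grid prompt" if same_size else "resized-grid prompt"
```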

the ARC-AGI challenge falls short in an important way from measuring the generality of intelligence

This could be true, but it has nothing to do with this market. The question is specifically about ChatGPT/Opus getting a 50% score on ARC.

@Lonis, the leaderboard won't be updated for now; as Ryan mentioned, he has to write a Kaggle notebook to follow the rules for Public leaderboard submission. That would require RAM optimization of his script and could take some time.
I think you should clarify whether you will count his solution as YES. I don't know why you would, though, as the question is about ChatGPT solving it. Ryan's approach is a program-synthesis script that uses ChatGPT. ChatGPT does a significant part of the task, but in the end it wouldn't be possible without the script that separates types of puzzles into buckets, generates prompts, runs statistical analysis on the results, makes adjustments, etc.

@Lonis what are you planning to do about this discussion? They raise potentially valid points, but you also already stated that this would count for a YES resolution, knowledge on which people such as myself traded.

@vansh I'd say this poll is compromised. It doesn't provide any insight, as people vote with different understandings of the question. It also wouldn't be fair to some of the bettors regardless of the resolution.

N/Aing the market seems reasonable to me

Hey! As mentioned in a previous comment, we're tracking the ARC-AGI-PUB leaderboard, which at the time of writing is at 42% https://arcprize.org/leaderboard

@Lonis ChatGPT is at 9% there; 42% is for the hybrid solution.

@Lonis You liked the post supporting N/A. Given the unclear resolution criteria, I suggest one of two things. Either you stick to your original decision that hybrid models such as the one here (https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o#comments) count, and we wait to see if such a model gets to 50%. Or, if you think that is not in the spirit of the market, as per the convincing arguments made above, you ask the mods to N/A the market ASAP to return liquidity to the holders on either side.

That was an accidental like. The resolution criteria were stated to be based on the public evaluation set, which clearly accepts 'hybrid solutions': https://arcprize.org/guide#public

I don't think the NO bettors are arguing over whether the ARC Prize accepts "hybrid solutions". The question is whether this market should.
"This market resolves to Yes if there are public demonstrations of GPT4 or Claude Opus solving at least 50% of the ARC questions in 2024."
I believe they are arguing over whether the "hybrid" solution above counts as "GPT4 solving" the questions, or if there is too much human assistance and manual scripting for it to count. I suppose it is in my interest for it to count, but I thought I would make the case anyway.

opened a Ṁ1,000 NO at 50% order

@Lonis is this the training or the test dataset?

The public evaluation set! https://arcprize.org/guide#public