
https://www.twitch.tv/claudeplayspokemon
Claude Plays Pokemon is a Twitch stream where the AI chatbot Claude attempts to beat Pokemon Red. Once the game is reset, all remaining answers resolve NO, even if the stream continues with a new game.
I am N/Aing anything that is annoying to resolve. If I have to pore over multiple days of Twitch VODs to figure out which way an answer resolves, I am not going to bother.
Update 2025-03-25 (PST) (AI summary of creator comment): Clarification on valid run criteria:
Only the instance where Claude goes from Mt. Moon to Cerulean City counts as a valid run.
A subsequent entry into Mt. Moon that leads him to exit through the entrance and wander back to Viridian City does not count.
In cases where the second attempt deviates from the stated route, the option will resolve NO.
All answers pertaining to Gemini are ill-defined. Gemini's setup is very different from Claude's. Gemini automatically gets a large-scale map that is filled in as tiles are explored, so it can see beyond the current screen. This is an external input, not something Gemini makes by itself.
This is a huge advantage, and makes the two experiments incomparable.
Gemini also gets a definition of each tile, so it can see cuttable trees, for example. It doesn't have to understand the images.
@Mqrius Does anyone care that Gemini is cheating? I can resolve the answers N/A if people are up in arms about this. I'm inclined to just let its results stand, though, even though it has an advantage over Claude.
@SaviorofPlant I dunno, I have no stakes in this. Just saying it's ambiguous what would count or not. Gemini is a few steps along a continuous scale toward things like "A human is playing and the AI tells the human what to do" or vice versa.
This is more of a problem for an answer like this ("another AI model surpasses Claude (gym badges)"), which is about any AI model. Less of a problem with an answer like "Gemini advances further (overall; not instantaneously) than Claude at any point", where we can assume it's about the GeminiPlaysPokemon stream no matter its advantages (but it would still be better if that was explicitly stated!).
@No_uh i realized i didn't include the screenshot. i think 34 to 1 is the best ive ever come out on a bet here

@JoeandSeth I'm pretty sure this Claude has only made it through Mt. Moon to Cerulean once. The second time he entered Mt. Moon after Cerulean (~80 hours ago), he exited through the entrance and wandered back to Viridian City, which means this option resolves NO. Someone correct me if this is wrong.
@SaviorofPlant This shouldn't resolve NO yet, unless it was always going to resolve NO, since Claude can't go back in time. I thought this was "he traverses Mt. Moon, from Pewter to Cerulean, faster than the first time, at any point in the remainder of the run".
@JoeandSeth Though I'll admit I've only seen him in Mt. Moon a few times (like right now) and haven't been watching continuously to see which way he left.
@JoeandSeth This is a good point: the phrasing of the answer implies it doesn't matter whether it's his second or fourth attempt. But I'm struggling to define what counts as a "re-entry", as Attempt 1's 69 hours includes some time wandering around Route 3 and Route 4. Should going back to Pewter and then returning to Mt. Moon count as a "re-entry" while going back to Route 3 does not? One could also interpret it as "time of return to Cerulean" minus "time of first Mt. Moon entry after last reaching Cerulean". And technically Claude could leave Mt. Moon through the entrance and then take Diglett Cave to Cerulean; does that count as an escape?
Tempted to N/A this option, my phrasing was atrocious
@JoeandSeth 31 hours since the most recent entry after Pewter, 109 hours since the first entry after Diglett Cave. I can resolve YES based on the former value, since it seems most traders interpreted it that way, unless anyone has any complaints.
@PestoPastel this appears to be a duplicate of a previous answer: "Run is Reset or Terminated before Claude enters Rock Tunnel"
Pidgeotto leveled up, feel free to resolve `Claude's starter is lower level than another party member at any point`:

@SaviorofPlant the idea was to have a generous but non-trivial skill threshold so no
(however if the model's second ever Pokémon is somehow a swell Magikarp or another non-catch i'd probably want to resolve yes?)
@sandrone devs were never clear about where the best claude ended up, all we know is that claude has never achieved 4 badges
@SaviorofPlant It’s implicit in the graph because it mentions reaching Viridian Forest and Mt. Moon, which are two other mazes required to progress. And Lavender Town would also be mentioned if the benchmark Claude had ended up there.
@sandrone devs have apparently been running test claudes on the side though, some of which have also beaten surge. i think it's highly unlikely any claude reached lavender but entering rock tunnel seems like it could have happened