[MISRESOLVED BY IAN] Which of these language models will I beat at chess?

Ṁ4932

resolved Sep 16

GPT-3.5: Resolved YES
GPT-4: Resolved YES
GPT-4o: Resolved YES
GPT-4o mini: Resolved YES
o4-mini: Resolved YES
Claude Sonnet 4: Resolved YES
Grok 3: Resolved YES
Grok 4: Resolved YES
Any model released in 2025: Resolved YES
Claude 3 Haiku: Resolved YES
Claude 3.5 Haiku: Resolved YES
Llama 4 Scout: Resolved YES
Llama 4 Maverick: Resolved YES
Grok 3 Mini: Resolved YES
Every model released before 2025: Resolved N/A
Gemini 2.0 Flash: Resolved N/A
Gemini 2.5 Flash-Lite: Resolved N/A
GPT-4.1 nano: Resolved N/A
GPT-4.1 mini: Resolved N/A
Claude 3 Opus: Resolved N/A

Which of these models will I beat at chess? Resolves YES if I win, NO if they win, and 50% for a draw.

Credit for this market goes to @mr_mino, who is much better at chess than I am.

I will close this market every Saturday. When it closes, I will play a game of chess against the model with the highest market price, if the model is publicly available. Otherwise, I'll move on to the model with the second-highest price, and so on. If no models on this market are available to the public, the market will reopen until one is.
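The weekly selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_opponent`, the price dictionary, and the availability set are hypothetical, and the caller is assumed to pass in only individual models that have not been played yet (the "any/every model" answers are not opponents).

```python
from typing import Dict, Optional, Set


def pick_opponent(prices: Dict[str, float], available: Set[str]) -> Optional[str]:
    """Return the highest-priced model that is publicly available, or None.

    `prices` maps a model name to its market price at close (hypothetical input).
    `available` holds the names of models the public can currently use.
    """
    # Walk the answers from the highest market price to the lowest.
    for model in sorted(prices, key=prices.get, reverse=True):
        if model in available:
            return model
    # No listed model is available, so the market would reopen.
    return None
```

With made-up prices such as `{"model-a": 0.9, "model-b": 0.8}` and only `model-b` available, the sketch falls through to `model-b`, matching the "second-highest price, and so on" wording.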

During the game, I may use a chessboard to keep track of the moves. I am not playing blindfold chess. I will not use the Internet or any chess engines during the game.

On each move, I'll provide the LLM with the game state in PGN and FEN notation. If a model makes three illegal moves, it loses. Notation discrepancies such as Nbd2 vs. Nd2 (extra or missing disambiguation) will not count towards this. The model also loses if it attempts to use external tools or the Internet during the game. I will play white. If I make an illegal move, I lose.
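A rough sketch of the three-illegal-move rule, including the Nbd2 vs. Nd2 tolerance, might look like the following. It assumes the caller already has the list of legal moves in SAN for the current position (for example from a chess library); the names `strip_disambiguation` and `StrikeCounter` are hypothetical, and the sketch only handles the disambiguation case, not other notation slips.

```python
import re
from typing import List

# SAN piece move: piece letter, optional file/rank disambiguator,
# optional capture marker, destination square, optional check/mate suffix.
_PIECE_MOVE = re.compile(r"^([KQRBN])([a-h]?[1-8]?)(x?)([a-h][1-8])([+#]?)$")


def strip_disambiguation(san: str) -> str:
    """Drop the disambiguator from a piece move, e.g. Nbd2 becomes Nd2."""
    san = san.strip()
    match = _PIECE_MOVE.match(san)
    if match is None:
        return san  # pawn moves, castling, promotions are left unchanged
    piece, _disambiguator, capture, dest, suffix = match.groups()
    return piece + capture + dest + suffix


class StrikeCounter:
    """Counts illegal move attempts; three strikes forfeit the game."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.strikes = 0

    def submit(self, san: str, legal_sans: List[str]) -> bool:
        """Return True if the move is accepted, otherwise record a strike."""
        wanted = strip_disambiguation(san)
        # Assumption: a move that differs only in disambiguation is accepted.
        if any(strip_disambiguation(legal) == wanted for legal in legal_sans):
            return True
        self.strikes += 1
        return False

    @property
    def forfeited(self) -> bool:
        return self.strikes >= self.limit
```

If the stripped move matches more than one legal move, this sketch simply accepts it; deciding which piece was meant is left to the human running the game.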

An unreleased model will resolve N/A if it's clear that the model will never be released. I'll periodically add models to this market which I find interesting. Once I play a game, I'll post the PGN in the comments before resolving. Multiple answers can resolve YES.

The "every model released before X year" options resolve YES if, at any point after the start of that year, I have played and won against every listed model in this market that was released before the start of that year, and I am confident I would beat any omitted models from that time period. They resolve NO if I lose or draw against any eligible model released before that year.
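The resolution rule for the "every model released before X year" options can be written out as a small decision function. This is a sketch under assumptions: the function name, the input shape, and the `"win"`/`"loss"`/`"draw"` labels are hypothetical, the timing condition ("at any point after the start of that year") is not modeled, and requiring at least one eligible listed model is an extra assumption.

```python
from typing import Dict, Optional, Tuple


def resolve_every_before(
    year: int,
    models: Dict[str, Tuple[int, Optional[str]]],
    confident_about_omitted: bool,
) -> Optional[str]:
    """Return "YES", "NO", or None (still open) for an "every model before `year`" option.

    `models` maps a model name to (release_year, result), where result is
    "win", "loss", "draw", or None if the game has not been played.
    """
    # Only listed models released before the start of `year` are eligible.
    eligible = {
        name: result
        for name, (released, result) in models.items()
        if released < year
    }
    # Any loss or draw against an eligible model resolves NO.
    if any(result in ("loss", "draw") for result in eligible.values()):
        return "NO"
    # YES needs a win against every eligible model plus confidence about
    # models missing from the market (assumption: at least one eligible model).
    all_won = bool(eligible) and all(r == "win" for r in eligible.values())
    if all_won and confident_about_omitted:
        return "YES"
    return None
```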

The current system prompt is below. This may change over time.

“Let’s play a game of chess! I will be White; you will be Black. On each turn, I will give you the PGN and the FEN of the current position. Think as long as you like, and respond with the best move, ‘resign’ if you wish to resign, or ‘draw?’ if you wish to make a draw offer. Please do not respond with the updated PGN, etc. Also, do not use any external tools or search queries when making your decision.

If you attempt to make three illegal moves throughout the game, or if you use any external tools, the game will be adjudicated as a win for me.”
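For illustration, the per-turn exchange described in the rules (PGN and FEN in, a move or "resign" or "draw?" out) could be assembled roughly as follows. This is a sketch, not the format actually used in the games: the `PGN:`/`FEN:` layout and the function names are assumptions, and the FEN string is expected to come from the caller rather than being computed here.

```python
from typing import List, Optional, Tuple


def movetext(san_moves: List[str]) -> str:
    """Build PGN movetext such as "1. c4 e5 2. g3" from a list of SAN moves."""
    parts = []
    for index, san in enumerate(san_moves):
        if index % 2 == 0:  # White's move starts a new move number
            parts.append(f"{index // 2 + 1}.")
        parts.append(san)
    return " ".join(parts)


def turn_message(san_moves: List[str], fen: str) -> str:
    """Format the game state sent to the model on each turn (assumed layout)."""
    return f"PGN: {movetext(san_moves)}\nFEN: {fen}"


def parse_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Classify a reply as a resignation, a draw offer, or a candidate move."""
    text = reply.strip()
    if text.lower() == "resign":
        return ("resign", None)
    if text.lower() == "draw?":
        return ("draw_offer", None)
    # Anything else is treated as a move and still needs a legality check.
    return ("move", text)
```

After 1. c4, for example, `turn_message(["c4"], "rnbqkbnr/pppppppp/8/8/2P5/8/PPPPPPPP/RNBQKBNR b KQkq c3 0 1")` would give a two-line message with the movetext on the first line and the position on the second.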


🏅 Top traders

#	Name	Total profit
1		Ṁ31
2		Ṁ19
3		Ṁ10
4		Ṁ8
5		Ṁ8

25 Comments


@mods sadly, the remaining answers on this market need to be resolved N/A unless you want to entertain the possibility of allowing Evan back in the future (same with https://manifold.markets/evan/how-many-language-models-will-i-bea)

@SimonWestlake nooooooo I can't believe we actually lost him...


Are you keeping track of which in your opinion played the best game?

Claude Sonnet 4 lost due to the three illegal move rule. Here is the PGN:
1. c4 e5 2. g3 Nf6 3. Bg2 Bc5 4. Nf3 O-O 5. Nxe5 d6 6. Nd3 Bb6 7. Nc3 Bf5 8. Bxb7 Nbd7 9. Bxa8 Qxa8 10. O-O Qb8 11. Nf4 Re8 12. Ncd5 Nxd5 13. Nxd5 1-0

Grok 3 held out for quite a while, and played better than its successor did. It lost due to making three illegal moves. Here is the PGN: 1. c4 e5 2. g3 Nf6 3. Bg2 d5 4. cxd5 Nxd5 5. Qb3 Nb6 6. Qb5+ c6 7. Qxe5+ Be7 8. Qxg7 Bf6 9. Qh6 Qe7 10. Qe3 O-O 11. Qxe7 Bxe7 12. Nf3 Be6 13. O-O N8d7 14. d4 Rac8 15. Ng5 Bxg5 16. Bxg5 f6 17. Bh6 Rfe8 18. Nc3 Nf8 19. Ne4 Re7 20. Nxf6+ Kh8 21. Ne4 Ng6 22. Bg5 Ree8 23. Nd6 Rcd8 24. Bxd8 Rxd8 25. Nxb7 Rd7 26. Bxc6 Rxb7 27. Bxb7 Na4 28. b3 Nb6 29. Rfc1 Nd5 30. Bxd5 Bxd5 31. Rc8+ Nf8 32. Rxf8+ Kg7 33. Rd8 Be6 34. d5 Bc8 35. Rxc8 a5 36. Rac1 Kf6 37. d6 Ke6 38. Rd1 Kd7 39. Rc7+ Ke8 40. d7+ Kd8 41. Rc8+ Ke7 42. d8=Q+ 1-0

Grok 3 Mini played three illegal moves and lost. Here is the PGN:

1. c4 e5 2. g3 Nc6 3. Bg2 Nf6 4. Nc3 Bb4 5. Nf3 O-O 6. e4 d6 7. O-O Bg4 8. d3 Nd4 9. Be3 Nxf3+ 10. Bxf3 Bxf3 11. Qxf3 1-0

I wonder, are some of these bets getting into the overconfident zone... (thinking of myself as well)

Since I never got around to playing Grok 4 last week, I'm playing two models this Saturday.

Claude 3.5 Haiku played passably at first but rapidly got worse. It lost due to the three illegal move rule while facing mate in one. The PGN is 1. c4 e5 2. Nc3 Nf6 3. g3 d5 4. cxd5 Nxd5 5. Nf3 Nf6 6. Nxe5 Bb4 7. e3 O-O 8. Bc4 Bxc3 9. dxc3 Nfd7 10. Nxd7 Nxd7 11. O-O c6 12. b3 Qe7 13. a4 Nf6 14. Ba3 Bg4 15. Bxe7 Rad8 16. Bxd8 Rxd8 17. Qxd8+ 1-0

Grok 4 played worse than I thought for such a new model and resigned, also on move 17 and facing a mate in one. Here is the PGN: 1. c4 Nf6 2. Nc3 e5 3. Nf3 Nc6 4. g3 d5 5. b3 Bb4 6. cxd5 Nxd5 7. Bb2 Bxc3 8. dxc3 Be6 9. c4 Nf4 10. Qxd8+ Kxd8 11. gxf4 exf4 12. Ne5 Nxe5 13. Bxe5 Rg8 14. O-O-O+ Bd7 15. Bh3 Re8 16. Rxd7+ Kc8 17. Rxc7+ 1-0

@evan Do you think Claude deliberately made an illegal move to avoid being mated?

@JussiVilleHeiskanen Probably not, these models always have trouble when in check. Claude's only legal move was to block the check with Ne8, but Claude wanted to play Kf8 on move 17 which would have allowed its king to be captured on the next turn. If Claude wanted to end the game with an illegal move, there would certainly have been more entertaining options available.

Grok 4

Having trouble using Grok 4 through the API. Will try again later today

If Every model released before 2025 gets to the top, will the remaining models be played before models released 2025 etc.?

@JussiVilleHeiskanen Yeah, the every/any model options don't affect which models I play.

Llama 4 Maverick lost due to the three illegal move rule. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. Bg2 d6 4. Nc3 g6 5. e3 Bg7 6. Nge2 O-O 7. O-O Nc6 8. b3 Re8 9. Bb2 a6 10. Re1 b5 11. cxb5 axb5 12. Nxb5 Rb8 13. Bxc6 Rb6 14. Bxe8 Qe7 15. Qc2 Nxe8 16. Na7 1-0

GPT-4o played pretty well at the start but fell off in the middlegame. It still seems stronger than any of the other models I have played so far, and its game was the first to last over thirty moves. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. Bg2 Nc6 4. Nc3 Bb4 5. Nd5 O-O 6. a3 Bd6 7. Nf3 Nxd5 8. cxd5 Ne7 9. e4 c6 10. Qb3 cxd5 11. exd5 b6 12. d3 Bb7 13. Nh4 f5 14. O-O Qe8 15. d4 e4 16. Bf4 Bxf4 17. gxf4 Nxd5 18. Rac1 Qh5 19. Rc7 Qxh4 20. Rxb7 Kh8 21. Qxd5 Qxf4 22. Qxd7 Rg8 23. Rxa7 e3 24. Bxa8 e2 25. Re1 Qg4+ 26. Bg2 f4 27. Qxg4 f3 28. Bxf3 g5 29. d5 h5 30. Qd4+ Rg7 31. Qxg7# 1-0

Claude 3 Haiku resigned after 10 moves. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. Bg2 d6 4. Nc3 Bg4 5. Nf3 Nc6 6. O-O a6 7. h3 h6 8. hxg4 Nxg4 9. d3 Qd7 10. Bh3 1-0

My guess is you will at some point lose focus and make an illegal move...

@JussiVilleHeiskanen Which chess client allows you to make an illegal move?

@FergusArgyll It is in the description; the program would of course not allow the attempt.

GPT-3.5 lost its queen and then the game a few moves later due to the three illegal move rule. The illegal moves were ones that would have been legal if the model hadn't been in check. I think repeatedly putting the LLM in check could turn out to be a viable strategy against the weaker models.
Here is the PGN:

1. c4 e5 2. g3 f5 3. d4 exd4 4. Qxd4 Nc6 5. Qd5 Qf6 6. Bg2 Bb4+ 7. Bd2 Bxd2+ 8. Qxd2 Nge7 9. Nc3 d6 10. Nd5 Ne5 11. Nxf6+ Kd8 12. Nd5 Nxd5 13. cxd5 Rf8 14. Qg5+ 1-0

@evan GPT-3.5 supposedly the best llm at chess 😂

@evan have you by the way considered feeding them your past games?

@JussiVilleHeiskanen I don't think it would help, a lot of these models have short context windows and the extra tokens would probably make the responses worse if anything

Llama 4 Scout resigned even though I had just made a bad move. Here is the PGN:

1. c4 e5 2. g3 Nc6 3. Bg2 Nf6 4. Nc3 d6 5. d3 Bd7 6. Bd2 Qe7 7. e4 O-O-O 8. Nf3 a6 9. O-O b5 10. cxb5 axb5 11. Nxb5 Na5 12. Na7+ Kb7 13. Bxa5 Bc6 14. Qb3+ Kxa7 15. Ng5 h6 16. Qxf7 1-0

@evan When you castle on opposite sides, all that matters is king safety. You have to go after the king aggressively. Anyway, nice win :)
