Which of these language models will I beat at chess?
Jul 6
96% Grok 4
95% Any model released in 2025
93% o4-mini
91% GPT-3.5
85% Grok 3
85% Every model released before 2025
82% o3
82% DeepSeek-V4
80% GPT-4o
76% o1-mini
74% Claude Sonnet 4
71% o4
66% GPT-4.5
66% Gemini 2.5 Flash-Lite
66% Llama 4 Scout
66% Llama 4 Maverick
66% Grok 3 Mini
60% Claude Opus 4
59% GPT-4.1
59% Claude 3 Haiku

Which of these models will I beat at chess? Resolves YES if I win, NO if they win, and 50% for a draw.

Credit for this market goes to @mr_mino, who is much better at chess than I am.

This market should be interesting, as I expect that some existing models could already beat me. I have never played rated chess; I have not played a game of chess of any kind in years.

I will close this market every Saturday. When it closes, I will play a game of chess against the model with the highest market price, if the model is publicly available. Otherwise, I'll move on to the model with the second-highest price, and so on. If no models on this market are available to the public, the market will reopen until one is.
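The weekly selection rule can be sketched in a few lines of Python. This is only an illustration: the `Answer` record, its field names, and the sample prices in the usage note are hypothetical, and the sketch assumes that only individual models (not the "any model" or "every model" options) are passed in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    # Hypothetical record for one model option in the market.
    name: str
    price: float              # market probability at close, 0.0 to 1.0
    publicly_available: bool  # can the model be played right now?


def pick_opponent(answers: list[Answer]) -> Optional[Answer]:
    """Return the highest-priced model that is publicly available.

    Unavailable models are skipped, so the second-highest price is used
    when the top one cannot be played, and so on. Returns None when no
    listed model is available, in which case the market would reopen.
    """
    candidates = [a for a in answers if a.publicly_available]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.price)
```

For example, with a hypothetical unreleased "Model A" at 0.96 and a released "Model B" at 0.93, `pick_opponent` returns "Model B".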

During the game, I may use a chessboard to keep track of the moves. I am not playing blindfold chess. I will not use the Internet or any chess engines during the game.

On each move, I'll provide the LLM with the game state in PGN and FEN notation. If a model makes three illegal moves, it loses. Differences in disambiguation, such as answering Nbd2 where Nd2 would suffice, will not count as illegal moves. The model also loses if it attempts to use external tools or the Internet during the game. I will play White. If I make an illegal move, I lose.
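One way to picture the three-illegal-move rule and the Nbd2 vs. Nd2 tolerance is the sketch below. It is an assumption-laden illustration, not how the games are actually adjudicated: the names `normalize` and `IllegalMoveCounter` are made up, the list of legal moves is passed in as plain SAN strings (generating it is left to a chess library or a human), and "resign" or "draw?" replies are assumed to be handled before `check` is called. Because it only compares strings, it is more lenient than a real board check: it cannot tell whether a disambiguating file or rank names the right piece.

```python
import re

MAX_ILLEGAL = 3  # the third illegal move forfeits the game

# Piece letter, optional disambiguation (file and/or rank), optional capture, destination.
_PIECE_MOVE = re.compile(r"^([KQRBN])([a-h]?[1-8]?)(x?)([a-h][1-8])$")


def normalize(san: str) -> str:
    """Drop check/mate/annotation marks and any disambiguation from a piece move."""
    san = san.strip().rstrip("+#!?")
    m = _PIECE_MOVE.match(san)
    if m:
        piece, _disambiguation, capture, dest = m.groups()
        return f"{piece}{capture}{dest}"
    return san  # pawn moves, castling, and promotions are left unchanged


class IllegalMoveCounter:
    """Counts illegal replies across the whole game."""

    def __init__(self) -> None:
        self.strikes = 0

    def check(self, reply: str, legal_sans: list[str]) -> bool:
        """Return True if the reply is accepted; otherwise record a strike."""
        legal = {normalize(s) for s in legal_sans}
        if normalize(reply) in legal:
            return True
        self.strikes += 1
        return False

    @property
    def forfeited(self) -> bool:
        return self.strikes >= MAX_ILLEGAL
```

With this sketch, a reply of "Nbd2" is accepted when the legal list contains "Nd2", and no strike is recorded.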

An unreleased model will resolve N/A if it's clear that the model will never be released. I'll periodically add models to this market which I find interesting. Once I play a game, I'll post the PGN in the comments before resolving. Multiple answers can resolve YES.

The "every model released before X year" options resolve YES if, at any point after the start of that year, I have played and won against every listed model in this market that was released before the start of that year, and I am confident I would beat any omitted models from that time period. They resolve NO if I lose or draw against any eligible model released before that year.
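The resolution logic for a single model and for the "every model released before X year" options might be sketched as follows. The function names, the `"win"`/`"loss"`/`"draw"` labels, and the use of whole release years are assumptions made for illustration. The sketch also leaves out the two judgment calls in the rule: whether I am confident about omitted models, and the requirement that the wins be in place at some point after the start of year X.

```python
from typing import Optional


def resolve_single(result: str) -> str:
    """Resolution for one model: YES on a win, NO on a loss, 50% on a draw."""
    return {"win": "YES", "loss": "NO", "draw": "50%"}[result]


def resolve_every_before(
    year: int,
    release_years: dict[str, int],
    results: dict[str, str],
) -> Optional[str]:
    """Resolve an 'every model released before <year>' option.

    release_years maps each listed model to its release year (hypothetical data).
    results maps models already played to 'win', 'loss', or 'draw'.
    Returns 'YES', 'NO', or None if the option cannot be resolved yet.
    """
    eligible = [m for m, y in release_years.items() if y < year]
    outcomes = [results.get(m) for m in eligible]
    if any(o in ("loss", "draw") for o in outcomes):
        return "NO"   # a single loss or draw against an eligible model is enough
    if eligible and all(o == "win" for o in outcomes):
        return "YES"  # every eligible listed model has been played and beaten
    return None       # some eligible models are still unplayed
```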

The current system prompt is below. This may change over time.

“Let’s play a game of chess! I will be White; you will be Black. On each turn, I will give you the PGN and the FEN of the current position. Think as long as you like, and respond with the best move, ‘resign’ if you wish to resign, or ‘draw?’ if you wish to make a draw offer. Please do not respond with the updated PGN, etc. Also, do not use any external tools or search queries when making your decision.

If you attempt to make three illegal moves throughout the game, or if you use any external tools, the game will be adjudicated as a win for me.”
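For readers curious what the per-turn message could look like, here is a small sketch that builds PGN movetext from a list of SAN moves and pairs it with a FEN string. The helper names and the "PGN: ... / FEN: ..." layout are hypothetical; the exact wording I send may differ. The FEN has to come from a board or a chess library, since this sketch does not compute positions.

```python
def pgn_movetext(sans: list[str]) -> str:
    """Turn ['c4', 'e5', 'g3'] into '1. c4 e5 2. g3'."""
    parts = []
    for i, san in enumerate(sans):
        if i % 2 == 0:  # White's move starts a new move number
            parts.append(f"{i // 2 + 1}.")
        parts.append(san)
    return " ".join(parts)


def turn_message(sans: list[str], fen: str) -> str:
    """Build the text sent to the model on each turn (layout is an assumption)."""
    return f"PGN: {pgn_movetext(sans)}\nFEN: {fen}"


# Position after 1. c4, with Black to move.
fen_after_c4 = "rnbqkbnr/pppppppp/8/8/2P5/8/PP1PPPPP/RNBQKBNR b KQkq c3 0 1"
print(turn_message(["c4"], fen_after_c4))
```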


GPT-4 never made any illegal moves, but it ignored the back-rank mate threat and lost. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. d3 d5 4. cxd5 Qxd5 5. Nf3 Nc6 6. Nc3 Bb4 7. a3 Bxc3+ 8. bxc3 O-O 9. e4 Qd6 10. Be2 Bg4 11. a4 Rad8 12. Ba3 Qd7 13. Bxf8 Rxf8 14. O-O Rd8 15. Qb3 Bxf3 16. Bxf3 b6 17. h4 Na5 18. Qb5 Qxd3 19. Qxd3 Rxd3 20. Rfd1 Nb3 21. Rxd3 Nxa1 22. Rd8+ Ne8 23. Rxe8#

@traders Grok 4 hasn't been released yet, which means my next chess game will be against GPT-4. I'm busy today, so I'll play the game tomorrow.

@evan Looking more like Monday actually.

GPT-4o mini was beating me for much of the game, but it started making weird blunders in the middlegame and forfeited at the end under the three-illegal-move rule. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. d3 d5 4. cxd5 Nxd5 5. Nc3 Nxc3 6. bxc3 c5 7. Bg2 Nc6 8. Nf3 Be7 9. O-O O-O 10. h4 f5 11. Be3 f4 12. gxf4 exf4 13. Bxf4 Rxf4 14. h5 Bf6 15. e3 Bf5 16. exf4 Bxd3 17. Qb3+ c4 18. Qxb7 Na5 19. Qb5 Qc7 20. Rfe1 Qc5 21. Qxc5 Nb3 22. Qd5+ 1-0

  • How will you confirm whether the model used tools or web searches? This is not always apparent from the interface. ChatGPT will frequently refer to web sources without citing them, for example.

  • Will you use structured output sampling to constrain the model to only output legal moves? (If so, that should be a completely separate market - I expect current models will DNF without legality constraints)

  • What format prompt will you use if the model outputs one illegal move?

@KJW_01294
1. I will turn web browsing off for the model, if possible. I know that ChatGPT says "searching the web" when it uses the Internet during a response, and I assume the situation is similar for other models.

2. No.

3. Hmm, not sure. Maybe just "Illegal move!"

@creator What's your Elo rating?

@FergusArgyll Don't have one. If I did, I expect it would be novice level.

@evan Would you consider opening a chess.com or lichess account and playing ~10 games so I can get a general idea?

I'll commit to placing limit orders (or market orders, if I choose) on most of the options if you do that. (It has to be legit: no playing one move and resigning immediately, etc.)

@FergusArgyll Yeah, for sure. Sounds fun. I'll try and do that after I get off work tonight.

@FergusArgyll I registered an account at https://lichess.org/@/EvanFour. I finished, I think, 6 games. Lots of blunders. It's getting late, but I will play more Lichess tomorrow.

@evan Wow, classical time control! Quite the commitment...

@FergusArgyll Okay, I've played enough games for the question mark next to my rating to go away.

I think I've improved, even though my rating actually went down today. I worked on my opening yesterday to reduce the number of mistakes I make in the early game. I've liked playing 1. c4 as White, but I'm not sure which move I'll use as my opening against the LLM. My openings are definitely better now, but I still lost the only game I played yesterday, although it was against a player who I suspect was using an engine (I can't be sure because I'm new to Lichess, but it was a suspicious game). I still routinely lose material due to blunders.

I think the game tomorrow will be decided by whether I can avoid a blunder in the early game. Once I get far enough into the middlegame, I predict the LLM will lose track of the pieces and start making mistakes, and it should be easy then to checkmate or just wait until it has made three illegal moves.

@evan Prediction validated!

I pushed 4o-mini to the top; I think a smart non-reasoning model performs much worse than a decent reasoning model. I'll be pushing those up as the weeks go by.

I think c4 is an excellent choice against LLMs. I've played 1. a4 or 1. h4 just to throw them off.

If you can keep the games going past 20 moves or so, I think you'll beat them all. It might be a plus that you blunder sometimes: a blunder is definitely not in the opening book.

bought Ṁ10 o4 YES

Cool experiment!

Playing a game against that many models might take a while. I'm curious how much time you will end up spending to play all of them.

@LarsOsborne I only have to play one game per model, and with 14 models listed, even at 2 hours per game it wouldn't take more than 30 hours total over 14 weeks, or about 20 minutes a day.
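The arithmetic behind that estimate, written out as a quick Python check. It assumes one game per week and uses the figures from the comment above (14 models, 2 hours per game, a 30-hour ceiling); the variable names are just for illustration.

```python
models = 14
hours_per_game = 2
weeks = 14                 # assuming one game per week
days = weeks * 7           # 98 days

total_hours = models * hours_per_game        # 28 hours, under the 30-hour ceiling
minutes_per_day = total_hours * 60 / days    # average over the whole period
ceiling_minutes_per_day = 30 * 60 / days     # same average using the 30-hour ceiling

print(total_hours, round(minutes_per_day, 1), round(ceiling_minutes_per_day, 1))
```

Both averages land a little under 20 minutes a day.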