
An estimated rating is fine. Filtering out illegal moves is allowed.
Update 2025-08-12 (PST) (AI summary of creator comment): - The rating standard for resolution will be FIDE rating; it may be estimated.
I just played a bit against it. It played 20 standard moves against a Ruy Lopez, then got confused. Asking it to relist all the moves reset it and let us play further, but it gave up a pawn for nothing 2 moves later.
I simplified a bit, and then it gave up a rook on move 29 (although otherwise I would have won another pawn, and my position was better overall in terms of space and pawn structure). It kept getting more confused, making more and more illegal moves and blunders. For example, it left a knight hanging on c5, so I took it, and it replied dxc5 even though there was no pawn on d6, so I just got a free knight. After 34 moves I was up a rook, a knight, and multiple pawns, so I declared myself the winner.
@ChaosIsALadder Oh neat. I'll probably use this then, unless it takes too long to come out or seems otherwise unreliable.
@geuber Being able to plan one move ahead like that is maybe 1100-level play. 2000 is much stronger.
(Also that could have just happened by accident.)
@Soren Yeah I think it should. Someone who can't even move legally is clearly not "a good player" by the common-sense meaning of the term.
An estimated rating is fine.
Estimated what? Bona fide FIDE Elo? TCEC-compatible computer rating? Stockfish tournament result? Lichess??
@mqp What's the concern? If GPT-5 is able to consistently beat people rated 2000 or above, then it resolves YES.
@IsaacKing With a rating of 2000, you will only beat people rated 2000 half of the time, so either this comment or the title is unclear (or do you consider 50% to be "consistently"?).
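For reference, the "half of the time" figure follows from the standard Elo expected-score formula. The snippet below is only a minimal sketch of that formula; the function name `expected_score` is made up for illustration, and "expected score" counts a draw as half a point, so 0.5 does not strictly mean winning half the games.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B under the standard Elo model.

    A win counts as 1, a draw as 0.5, and a loss as 0.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an expected score of exactly one half.
print(expected_score(2000, 2000))            # 0.5
# A 200-point edge gives roughly three quarters of the points.
print(round(expected_score(2200, 2000), 2))  # 0.76
```

So "consistently beats 2000-rated players" would imply a rating noticeably above 2000, not a rating of 2000.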
@IsaacKing Results against humans are not a good benchmark (for one thing, there may not be enough games for a statistically sound evaluation).
But also, the principal question is: what Elo does this refer to? Many, if not most, online discussions these days refer to online rating systems, like those of chess.com or Lichess, rather than classical FIDE Elo. And this is a very important distinction in the context of this market! The baseline Elo performance for GPT-3 (and presumably GPT-4, which actually appears to "play" worse, but that is likely just a statistical fluke) is about 1700 on the CCRL scale. This would translate to easy blitz wins over most 2000-rated players on either chess.com or Lichess, whose rating scales are inflated by several hundred points.
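To put a rough number on the sample-size concern above, here is a small sketch that inverts the Elo expected-score formula and attaches a normal-approximation interval to the result. Everything in it is an assumption for illustration: the helper names are hypothetical, all opponents are treated as having the same rating, and each game is treated as an independent win/loss trial (draws are ignored for the variance), so the interval is only indicative.

```python
import math


def rating_diff_from_score(p: float) -> float:
    """Rating difference implied by a score fraction p (inverse Elo formula)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def rating_interval(opponent_rating: float, points: float, games: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a performance rating.

    Assumes every opponent has `opponent_rating` and uses a binomial
    standard error for the score fraction (a simplification).
    """
    p = points / games
    se = math.sqrt(p * (1.0 - p) / games)
    # Clamp so the logarithm stays defined at extreme scores.
    low_p = max(p - z * se, 1e-6)
    high_p = min(p + z * se, 1.0 - 1e-6)
    return (opponent_rating + rating_diff_from_score(low_p),
            opponent_rating + rating_diff_from_score(high_p))


low, high = rating_interval(2000, points=5, games=10)
print(round(low), round(high))
```

Under these assumptions, an even score over 10 games against 2000-rated opposition is consistent with anything from roughly 1750 to 2250, which is why a handful of games cannot settle the question.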