08 March 2018

Stockfish in a Straitjacket?

It was one of those coincidences you can never plan. Near the end of last year, Houdini won TCEC Season 10 just as AlphaZero appeared on the scene. I covered both of those significant computer chess events in a single post, Houdini, Komodo, Stockfish, and AlphaZero (December 2017). The first three names are the top three chess engines in the world, of roughly equal strength, but AlphaZero had apparently crushed one of the trio in a match. In my post I wrote,

We can quibble about whether the AlphaZero - Stockfish match was indeed a fair fight -- 1 GB hash size is a severe restriction -- but the final score of +28-0=72 for AlphaZero was more than convincing to all but the most vehement skeptics.

I was reminded of those words while writing my most recent post, TCEC Season 11 in Full Swing. One of the sources I consulted, without referencing it in the post, was TCEC 11: Premier Division starts (chessbase.com; February 2018). The Chessbase site is well known and well respected for its expertise in computer chess and always attracts comments from informed readers. This particular article launched a discussion on why AlphaZero wasn't participating in TCEC Season 11 and whether the AlphaZero - Stockfish match had been too heavily rigged in AlphaZero's favor. The discussion mentioned four factors that could have hurt Stockfish's performance:-

  • Restricted hash size
  • Fast time control
  • No opening book
  • No endgame tablebases

I knew that the first two points were genuine issues, but wasn't certain whether the last two were true. I went back to the DeepMind paper that had announced AlphaZero to the world (titled 'Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm') and re-read the relevant section:-

Evaluation • To evaluate performance in chess, we used Stockfish version 8 (official Linux release) as a baseline program, using 64 CPU threads and a hash size of 1GB. [...] The Elo rating of the baseline players was anchored to publicly available values. We also measured the head-to-head performance of AlphaZero against each baseline player. Settings were chosen to correspond with computer chess tournament conditions: each player was allowed 1 minute per move, resignation was enabled for all players (-900 centipawns for 10 consecutive moves for Stockfish and Elmo, 5% winrate for AlphaZero). Pondering was disabled for all players.
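
To see what those conditions mean in practice, here is a minimal sketch of how the settings above translate into UCI options when driving Stockfish directly. It assumes a Stockfish binary named 'stockfish' on the PATH; the commented-out 'TCEC-style' lines use illustrative values rather than any official tournament configuration, and the opening book is something the tournament GUI supplies, not an engine option.

  import subprocess

  # Rough sketch: send the settings in question to Stockfish over UCI.
  # Assumes a binary named "stockfish" on the PATH; the commented-out
  # "TCEC-style" values are illustrative, not an official tournament setup.
  engine = subprocess.Popen(
      ["stockfish"],
      stdin=subprocess.PIPE,
      stdout=subprocess.PIPE,
      universal_newlines=True,
      bufsize=1,
  )

  def send(cmd):
      engine.stdin.write(cmd + "\n")

  send("uci")

  # The match conditions described in the DeepMind paper:
  send("setoption name Threads value 64")
  send("setoption name Hash value 1024")     # 1 GB -- the restricted hash size
  send("setoption name Ponder value false")  # pondering disabled

  # What a TCEC-style configuration would change (illustrative values):
  # send("setoption name Hash value 16384")                  # far more hash
  # send("setoption name SyzygyPath value /path/to/syzygy")  # endgame tablebases
  # An opening book would be supplied by the tournament GUI, not the engine.

  send("isready")
  send("ucinewgame")
  send("position startpos")
  send("go movetime 60000")  # one minute per move, as in the match

  for line in engine.stdout:
      if line.startswith("bestmove"):
          print(line.strip())
          break

  engine.terminate()

The exact values matter less than the point that hash, tablebases, and (via the GUI) an opening book are all knobs a tournament operator would normally tune in the engine's favor.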

Since 'publicly available [Elo] values' depend both on configuring engines properly and on a level playing field, I started to have serious concerns that this controversy was more than a quibble. What did the Stockfish developers think about the match? On Stockfish's Fishcooking forum, in a long thread titled Open letter to Google DeepMind (December 2017), the opening message said,

AlphaZero won the 100 game match against Stockfish very impressively by a total score of 28 wins and 72 draws and 0 [losses]. This translates to an Elo difference of 100. However the details of the match described in your paper show that this match might have been much closer and more interesting had it not been for some IMO rather unfair conditions.
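
The 'Elo difference of 100' is easy to verify from the score with the standard logistic Elo model (a quick sketch of the arithmetic, not the poster's own calculation):

  import math

  # Sanity check of the forum's "Elo difference of 100" using the standard
  # logistic Elo model; draws count as half a point.
  wins, draws, losses = 28, 72, 0
  score = (wins + 0.5 * draws) / (wins + draws + losses)  # expected score 0.64

  elo_diff = -400 * math.log10(1 / score - 1)
  print(round(elo_diff))  # -> 100

A 64 percent score corresponds to roughly a 100-point rating gap under that model.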

That first post and the subsequent discussion repeated the four complaints from the Chessbase comments listed above, and added,

In the match version 8 of Stockfish was used which is now over a year old. The latest version of Stockfish is over 40 Elo stronger in fast self play.

That makes five significant objections to the conduct of the match. Later in the same Fishcooking thread, TCEC insider Nelson Hernandez wrote,

This "match" was like a boxing match where one fighter had no seconds in his corner, the referee and judges were picked by his opponent, there was no audience to validate what happened in the ring as it happened, and the post-match story was written by the opponent's hirelings. It may well be that Alpha Zero is indeed better than the latest version of Stockfish in fair test conditions. But it is almost criminal to announce very biased test results such as these, thereby rubbishing the work of hundreds of people, in order to gain some PR benefit. What the computer chess community expects is fairness and decency.

The clincher to the above discussion is that three months have passed since DeepMind's bombshell announcement, which made available only ten games from the match. None of the other 90 games have been released for dissection by the experts. AlphaZero might be a better chess engine than Stockfish, but it might also be much worse. If we can't have a match where the Stockfish developers configure their creation for its full strength, let's at least have the other games from the first match.
