02 November 2020

Stockfish NNUE - Three Useful Pins

A month ago, in Stockfish NNUE - Three Early Threads (October 2020), I tried to understand the structure of the technology behind Stockfish NNUE, but ran into some basic questions. It turns out that some of my questions are answered in the Stockfish Discord pinned messages, i.e. posts that members of Stockfish Discord have flagged as particularly useful.

Many of the Discord pins are intended for Stockfish developers who need guidance on how to access various elements of the underlying software. Other pins explain the theory behind NNUE. All three of the following messages were posted in July 2020 by nodchip, who is identified by the Chessprogramming wiki page Stockfish NNUE (chessprogramming.org) as the father of the concept:-

Stockfish NNUE [is] a Stockfish branch by Hisayori Noda, aka Nodchip.

First, I wondered whether the board representation included both Kings. Indeed it does:-

A standard input feature is called HalfKP. "K" is the one-hot encoding [NB: a group of bits with a single '1'; the rest are '0'] of the position of a King. "P" is the one-hot encoding of the position and the type of a piece other than the Kings. Note that "P" consists of both the friend's pieces and the opponent's pieces. "P" also contains a feature representing that a piece has been captured and removed from the game; this is the "+1". The number of dimensions of "P" is 10 * 64 + 1 = 641. HalfKP is the direct product of "K" and "P". The input feature vector represents a board state.

NNUE holds two HalfKP vectors: one for White's King and the other pieces, and one for Black's King and the other pieces. On White's turn, it feeds White's feature vector into the upper half of the input layer and Black's into the lower half; on Black's turn, it feeds Black's into the upper half and White's into the lower half.
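To make the "direct product" concrete, here is a minimal sketch of how a single HalfKP feature index might be computed. The exact ordering of the piece planes and the board flipping for the Black perspective vary by implementation, so treat the layout below as an assumption, not Stockfish's actual code:-

```python
# Hypothetical HalfKP index layout, assuming 0-63 square numbering.
# The plane ordering and Black-perspective flipping are illustrative.
NUM_P = 10 * 64 + 1  # 641: ten non-King (piece type, colour) planes, plus the "+1"

def halfkp_index(king_sq: int, piece_plane: int, piece_sq: int) -> int:
    """Index into the 64 * 641 = 41024-dimensional HalfKP vector.

    piece_plane is 0..9, one per (piece type, colour) pair excluding Kings.
    """
    p = 1 + piece_plane * 64 + piece_sq  # slot 0 is the "captured" feature
    return king_sq * NUM_P + p
```

Every (King square, piece) pair gets its own input, so the same piece on the same square activates a different feature for each location of the friendly King.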

I also wondered about the apparent inefficiency of the encoding. It is related to the 'UE' portion of the NNUE acronym, which stands for 'Efficiently Updatable' (the acronym reads in reverse: Efficiently Updatable Neural Network).

If we calculate the output of the neural network in a straightforward way, we need to execute the calculation above for each layer. That takes a long time, and the nps [NB: nodes per second] will be extremely low.

Here we can speed up the calculation by using a special characteristic of the feature vector. When a move is played in a position, only a few elements in the corresponding feature vector change. For example, if a Pawn is advanced, the element in the feature vector corresponding to the Pawn's previous square is set to 0, and the element corresponding to its current square is set to 1. If we calculate only the changed elements, we can speed up the calculation. This technique is called "difference calculation" in computer shogi.

Difference calculation can be applied only to the calculation between the input layer and the first hidden layer, because the input vectors of the second and later hidden layers change drastically when a move is played. This is fine, however, because the number of network parameters between the input layer and the first hidden layer is very large, while the other layers have very few.

We cannot use difference calculation in HalfKP if a King moves, because all the elements in the feature vector corresponding to that King change.
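The idea behind difference calculation can be shown in a few lines. This is a toy sketch in NumPy with made-up small dimensions (real HalfKP has 41024 inputs per perspective and a much wider hidden layer); it shows why a non-King move costs only a couple of vector additions instead of a full matrix-vector product:-

```python
import numpy as np

# Toy first-layer weights; sizes are illustrative, not Stockfish's.
NUM_FEATURES, HIDDEN = 641, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((NUM_FEATURES, HIDDEN)).astype(np.float32)

def full_accumulator(active):
    """Naive path: sum the weight rows of every active feature."""
    return W[sorted(active)].sum(axis=0)

def update_accumulator(acc, removed, added):
    """Difference calculation: touch only the features a move changed."""
    for f in removed:
        acc -= W[f]
    for f in added:
        acc += W[f]
    return acc
```

A Pawn push, for instance, removes one feature (the old square) and adds one (the new square), so the accumulator is refreshed with a single subtraction and a single addition.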

Finally, much of the early Discord discussion centered on selecting the positions for training. This assumes some familiarity with neural network training, but will make sense to anyone who knows the basics.

Training data: The training data are FENs [NB: chess positions] generated from games where Stockfish plays itself at fixed depth or a fixed node count. The recommended move, the searched score, and the outcome of the game are recorded for each position.

Training: The network is trained on these positions, and aims to output the searched score and the predicted outcome of the game. (Note that the predicted outcome can be converted to/from the searched score.) As it trains, it automatically updates the network parameters after each iteration. Training finishes when the loss can no longer be lowered, and you have your network.

Evaluation: In a general neural network, all the outputs of each layer are computed every time. But NNUE only calculates the difference between the previous position and the current position to compute the output of the first hidden layer. This drastically speeds up the calculation, because almost all of NNUE's network parameters sit between the input layer and the first hidden layer, which is why it is called an efficiently updatable neural network. That is what makes it so interesting.
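The conversion between a searched score and a predicted outcome mentioned above is commonly done with a sigmoid. The sketch below assumes that technique; the scaling constant is purely illustrative, not Stockfish's actual value:-

```python
import math

SCALE = 400.0  # hypothetical centipawn scaling constant

def score_to_winprob(cp: float) -> float:
    """Map a centipawn score to a 0..1 expected outcome via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-cp / SCALE))

def winprob_to_score(p: float) -> float:
    """Inverse mapping: expected outcome (0 < p < 1) back to a score."""
    return -SCALE * math.log(1.0 / p - 1.0)
```

An even score maps to an expected outcome of 0.5, and the two functions are inverses, which is what lets training mix searched scores and game results in a single target.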

Together that makes three important pieces of the NNUE implementation: the structure of the input positions, the reason for that structure, and how known positions are used. I think I'm finally starting to grasp the NNUE concept. It's only taken three and a half months!
