Doing Your Own Engine Testing

How can you tell which chess engine is strongest? Or whether a proposed modification really makes an engine stronger? Or whether using the Syzygy tablebases adds any value? Many of these questions can be answered by looking at other people's engine tests, either on the Stockfish and Lc0 Discord channels or by observing the results of broadcast engine vs. engine matches (such as Chess.com, TCEC, and Navs "Engine Battle" Twitch site). These tests are often run on computers with extremely powerful (and expensive!) CPUs and GPUs. For example, the Navs tests use two Nvidia 4090 GPUs, with a total MSRP of $3,200, for running Lc0, and an AMD 3990X with 128 threads, which costs around $4,000, for running Stockfish. These tests and matches can take days (or weeks) to complete, but they typically produce reliable and repeatable results.

However, there may also be times when you want to test something on your own computer. You may want to see whether results others are getting can be reproduced on your machine. Or you may want to try a new version of Lc0 that has not been tested under conditions that match how you would actually use it. Or you may want to run your own test of Syzygy's usefulness.

The good news is that it is easy to set up your own engine vs. engine tests. The bad news is that it takes a lot of time to run a test that produces statistically significant results.

Cute Chess

Cute Chess is a well-established and well-documented engine testing platform. There is both a GUI version and a terminal client. I recommend downloading both.

After you download the terminal client, you can uncompress it with:
tar -xzf ./cutechess-cli-1.3.1-linux64.tar.gz

This will create a directory "cutechess-cli" where you executed the command above.  You can move the GUI version you downloaded to that directory as well.

You can run the GUI (./Cute_Chess-1.3.1-x86_64.AppImage) if you want to set up a test and watch the moves being played out. The GUI is also useful for creating an "engines.json" file that you can use with the terminal client. In the GUI, go to settings, add your engines, and configure them (e.g., threads, hash, Syzygy table path). This will automatically create an "engines.json" file in your ~/.config/cutechess/ directory. Copy this file to the cutechess-cli directory you created above.
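
Since the terminal client refers to engines by the names stored in "engines.json", it can help to check those names and options before starting a long test. The following is a minimal Python sketch, assuming the file is a JSON array of objects with "name", "command", "protocol", and "options" keys (my understanding of the Cute Chess format; check it against your own file). The sample data and the helper name summarize_engines are made up for illustration.

```python
import json

# Hypothetical sample standing in for ~/.config/cutechess/engines.json.
# The engine paths and option values are placeholders, not real settings.
SAMPLE = """
[
  {"name": "sf", "command": "./stockfish", "protocol": "uci",
   "options": [{"name": "Threads", "value": 4},
               {"name": "SyzygyPath", "value": "/path/to/syzygy"}]},
  {"name": "sf-noSyzygy", "command": "./stockfish", "protocol": "uci",
   "options": [{"name": "Threads", "value": 4}]}
]
"""


def summarize_engines(text):
    """Map each engine name to a dict of its option names and values."""
    summary = {}
    for engine in json.loads(text):
        options = {opt["name"]: opt.get("value")
                   for opt in engine.get("options", [])}
        summary[engine["name"]] = options
    return summary


if __name__ == "__main__":
    for name, options in summarize_engines(SAMPLE).items():
        print(name, options)
```

To use it on a real file, read the file's contents and pass the text to summarize_engines in place of SAMPLE.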

Running the cutechess-cli client in the terminal is straightforward. To get a feel for all of the parameters you can set, run './cutechess-cli --help'.  

Opening Books

For engine testing you will want to start from unbalanced opening positions in which one side already has an advantage. Balanced openings produce too many draws, making it difficult to get statistically significant results. Stefan Pohl has created a series of opening books with increasing degrees of imbalance that you can download. Each book is quite large. For example, the UHO_XXL_2022_+110_+139.pgn file has approximately 253,000 openings in which White has an advantage of between 110 and 139 centipawns (a 100-centipawn advantage means one side is up approximately one pawn). Telling cutechess-cli to sample these randomly is useful in case you need to interrupt and resume your test: given the large number of openings in the book, the chance that your resumed games will replay an opening already used is quite low.
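
To put a rough number on "quite low", here is a small Python sketch. It assumes openings are drawn uniformly and independently after the resume, which is a simplification of whatever cutechess-cli actually does. The figure of 253,000 openings comes from the text above; the counts of openings played before and after the interruption are illustrative.

```python
def repeat_probability(book_size, already_played, remaining):
    """Chance that at least one of `remaining` uniform random draws
    repeats one of the `already_played` distinct openings."""
    p_fresh = 1.0 - already_played / book_size
    return 1.0 - p_fresh ** remaining


if __name__ == "__main__":
    # 50 openings played before the interruption, 50 still to play.
    p = repeat_probability(253_000, 50, 50)
    print(f"Chance of repeating an opening: {p:.2%}")
```

Under these assumptions the chance of even one repeated opening is about 1%.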

Stefan's books are not the only opening books, however. Others include the books used in prior TCEC tournaments.

Running A Test From The Terminal

A typical command to run a test would look something like this:
./cutechess-cli -tournament gauntlet -pgnout results.pgn -wait 1000 -event 'sf vs sf-noSyzygy' -tb /run/media/hugh/data/dtz -resign movecount=3 score=300 -draw movenumber=20 movecount=6 score=25 -concurrency 1 -openings file=UHO_XXL_2022_+110_+139.pgn format=pgn policy=round order=random -repeat -recover -rounds 100 -games 2 -engine conf=sf tc=G/120+3 timemargin=200 -engine conf=sf-noSyzygy tc=G/120+3 timemargin=200

Here we use the Syzygy tablebases to help adjudicate wins and draws (-tb /run/media/hugh/data/dtz), set the conditions under which a game is adjudicated as a draw (-draw) or as a resignation (-resign), and specify our opening book, the number of rounds, and the number of games per round. With -repeat, each opening is played twice, with the engines swapping colors, before moving on to the next opening. Our engine parameters have already been set up in "engines.json" in the same directory where we are running cutechess-cli, so we can refer to the engines by the names used in the json file. We also set the time control. There are many options here; G/120+3 gives each engine two minutes to complete all of its moves, plus 3 seconds more for each move it makes. It's also useful to set a timemargin (in milliseconds) to prevent a timeout loss caused by the overhead required to initiate each engine.
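
If you run several variations of this test, it may be easier to assemble the command in a script than to edit one long line by hand. The Python sketch below is one way to do that. It uses only flags that appear in the command above and leaves out -wait, -event, -resign, -draw, and -concurrency for brevity. The function name build_command and its parameters are my own, not part of cutechess-cli.

```python
import shlex


def build_command(book, rounds, tc, engines, tb_path=None):
    """Return the cutechess-cli argument list for a paired-opening test."""
    cmd = ["./cutechess-cli", "-tournament", "gauntlet",
           "-pgnout", "results.pgn",
           "-openings", f"file={book}", "format=pgn",
           "policy=round", "order=random",
           "-repeat", "-recover",
           "-rounds", str(rounds), "-games", "2"]
    if tb_path:
        # Tablebase path used for adjudication.
        cmd += ["-tb", tb_path]
    for name in engines:
        # Each engine is looked up by its name in engines.json.
        cmd += ["-engine", f"conf={name}", f"tc={tc}", "timemargin=200"]
    return cmd


if __name__ == "__main__":
    cmd = build_command("UHO_XXL_2022_+110_+139.pgn", 100, "G/120+3",
                        ["sf", "sf-noSyzygy"],
                        tb_path="/run/media/hugh/data/dtz")
    # Print a shell-ready line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids quoting problems if a path or name contains spaces.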

Once you invoke the command, the tournament starts and cutechess-cli will keep you informed of progress.

While your tournament/test is running, you can examine the finished games in your results.pgn file.
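
For a quick running score without opening every game, you can tally the header tags in the PGN. This is a minimal Python sketch, assuming each game's White and Black tags appear before its Result tag (the standard PGN tag order). The sample games and the function name score_table are made up for illustration.

```python
import re
from collections import defaultdict

TAG = re.compile(r'^\[(\w+) "(.*)"\]\s*$')
POINTS = {"1-0": (1.0, 0.0), "0-1": (0.0, 1.0), "1/2-1/2": (0.5, 0.5)}

# Hypothetical sample standing in for the contents of results.pgn.
SAMPLE = """[White "sf"]
[Black "sf-noSyzygy"]
[Result "1-0"]

1. e4 e5 1-0

[White "sf-noSyzygy"]
[Black "sf"]
[Result "1/2-1/2"]

1. d4 d5 1/2-1/2

[White "sf"]
[Black "sf-noSyzygy"]
[Result "0-1"]

1. c4 e5 0-1
"""


def score_table(pgn_text):
    """Return {engine: [points, games]} from the PGN header tags."""
    table = defaultdict(lambda: [0.0, 0])
    tags = {}
    for line in pgn_text.splitlines():
        match = TAG.match(line)
        if not match:
            continue  # movetext or blank line
        key, value = match.groups()
        tags[key] = value
        # Unfinished games (Result "*") are skipped.
        if key == "Result" and value in POINTS:
            white_pts, black_pts = POINTS[value]
            for player, pts in ((tags["White"], white_pts),
                                (tags["Black"], black_pts)):
                table[player][0] += pts
                table[player][1] += 1
    return dict(table)


if __name__ == "__main__":
    for engine, (points, games) in score_table(SAMPLE).items():
        print(f"{engine}: {points}/{games}")
```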

Finally, using Miguel A. Ballicora's "ordo" program, you can examine your intermediate results.

What you will be watching is the rating difference and the associated error bars. Large error bars mean less confidence that your results are meaningful, and that more games are needed. The CFS(%) column expresses the confidence that the engine in that line is better than the engine below it.
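
To get a feel for how error bars shrink as games accumulate, here is a rough Python sketch that turns a win/draw/loss count into an Elo difference with an approximate 95% interval. This is not how ordo computes its numbers. It uses the standard logistic Elo formula and a normal approximation, treats every game as independent (which paired openings are not, strictly), and requires that the score stay strictly between 0 and 1. The function names are my own.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) using a normal approximation on the score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * score ** 2) / n
    margin = z * math.sqrt(variance / n)
    return (elo_from_score(score),
            elo_from_score(score - margin),
            elo_from_score(score + margin))


if __name__ == "__main__":
    elo, low, high = elo_with_margin(wins=60, draws=100, losses=40)
    print(f"Elo {elo:+.1f} (95% interval {low:+.1f} to {high:+.1f})")
```

With the illustrative 60 wins, 100 draws, and 40 losses, this gives roughly +35 Elo with an interval from about +1 to +69, which shows how wide the error bars still are after 200 games.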

Time

...you will need a lot of it! If you run 200 games and each game lasts 7 minutes, you will need 23 hours and 20 minutes of computer time to complete your test. You can reduce this by shortening your time control. In practice, however, most users let an engine analyze a position for more than a few seconds, so results found at very short time controls may not apply to anything but bullet games. Rating differences between engines tend to decrease at longer time controls, as engines are allowed to do more calculation.
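
The arithmetic above is easy to redo for your own settings. The Python sketch below assumes a fixed average game length, and that running games in parallel (the -concurrency setting) divides the wall-clock time evenly, which only holds if your machine has enough cores for the engines not to slow each other down. The function name run_time is my own.

```python
def run_time(games, minutes_per_game, concurrency=1):
    """Return the estimated wall-clock time as (hours, minutes)."""
    total_minutes = games * minutes_per_game / concurrency
    return divmod(round(total_minutes), 60)


if __name__ == "__main__":
    hours, minutes = run_time(games=200, minutes_per_game=7)
    print(f"{hours} h {minutes} min")  # 200 * 7 = 1400 minutes
```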

Reporting Your Results

If you find interesting results that you think might help engine developers or other users, you can report them in the appropriate Discord channel. If you do, you should also report all of the set-up details, including your hardware and engine configurations, so that readers can understand your results. A good test report will look something like:



