Step 1: Building a Database of High-Quality Chess Games

While chess engines are incredibly powerful, they have limited use in the opening phase of a chess game, because it's very unlikely you will play an opening that has not been played before. Game databases are more useful than engines in this phase because they can tell you:

  • How common the opening you are considering is.
  • How successful it has been.
  • Where the initial opening moves are likely to lead.
  • Which new and novel openings GMs have played recently.

Quality over quantity

In building your chess database, the most important criterion is the quality of the games. Specifically, you want games without blunders. Let's say you are 9 moves into a game and your database tells you that Nc3 has been played 5 times, leading to 1 win, 3 draws and 1 loss; Bf4 has been played 3 times, all resulting in draws; and Qb3 has been played 3 times, resulting in 2 wins and 1 draw. Qb3 looks like a strong move. However, if the Qb3 games were all 5-minute blitz games, Qb3 could actually be a bad move that only led to wins because the opponents blundered afterwards under extreme time pressure. If all of your database games are 2-hour grandmaster games, it's more likely that Qb3 is indeed a strong move.
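To make the comparison concrete, the usual scoring convention (1 point for a win, half a point for a draw) can be applied to the three candidate moves in the example. This is only a small sketch; the function name and the data layout are illustrative and not taken from any chess tool.

```python
def score_pct(wins, draws, losses):
    """Percentage score for the side to move: win = 1, draw = 0.5, loss = 0."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games recorded for this move")
    return 100.0 * (wins + 0.5 * draws) / games


# Figures from the example in the text: (wins, draws, losses)
candidates = {"Nc3": (1, 3, 1), "Bf4": (0, 3, 0), "Qb3": (2, 1, 0)}

for move, (w, d, l) in candidates.items():
    print(f"{move}: {w + d + l} games, {score_pct(w, d, l):.1f}%")
```

Qb3 comes out far ahead of the other two moves on paper, but the figure rests on three games and says nothing about the conditions they were played under, which is why the quality of the source games matters so much.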

Sources of quality games

1. The Archive section of The Week in Chess (TWIC) is one of the best sites to download games from. On a weekly basis (Monday evenings) it collects the best games from around the world. It is relatively easy to download games of interest back to 2012 using a simple bash script (this blog assumes a Linux environment; if you are on Windows, you can install the Arch Linux app from the Microsoft Store).
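As a rough illustration of such a download loop, here is a sketch in Python rather than bash. It assumes that each weekly issue is published as a zip file named twic<NUMBER>g.zip under the path shown; that URL pattern, the function names and the destination folder are assumptions to verify against the TWIC archive page before use. The issue numbers that correspond to 2012 onward also have to be read off that page.

```python
import time
import urllib.request
from pathlib import Path

# ASSUMPTION: weekly PGN zips follow this naming pattern. Check the archive
# page on the site before relying on it.
BASE_URL = "https://theweekinchess.com/zips/twic{issue}g.zip"


def issue_url(issue):
    """Build the (assumed) download URL for one TWIC issue number."""
    return BASE_URL.format(issue=issue)


def download_issues(first, last, dest="twic"):
    """Download issues first..last (inclusive) into dest, skipping existing files."""
    out = Path(dest)
    out.mkdir(parents=True, exist_ok=True)
    for issue in range(first, last + 1):
        target = out / f"twic{issue}g.zip"
        if target.exists():
            continue  # already fetched on an earlier run
        request = urllib.request.Request(
            issue_url(issue), headers={"User-Agent": "Mozilla/5.0"}
        )
        with urllib.request.urlopen(request) as response:
            target.write_bytes(response.read())
        time.sleep(1)  # be polite to the server between requests
```

Calling download_issues(first, last) with the issue range you want leaves one zip per week in the twic folder; each zip then needs to be unpacked to get at the PGN file inside.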

Once you've downloaded the games, you can create a new database in Scid (or the similar Scid vs PC) and import all of the games using the "Import file(s) of PGN games" option under the "Database" tab. To reduce the likelihood of keeping games with blunders in them, my recommendation is to use "Header Search" within Scid to do an "Or" search that selects the games to remove: games with the word "titled" (a huge number of blitz games on chess.com) or "blitz" in the "Event" field, games where either White or Black is rated lower than 2100, and games shorter than 21 half-moves. With those games in the filter, select the "Delete filter games" option in the "Maintenance Window" and then select the "Compact Database" option in the Database/Maintenance tab.
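For readers who would rather pre-filter the PGN before importing it, the same deletion criteria can be written as a single predicate. This is a sketch, not a description of how Scid works internally: the function name is made up, the headers are assumed to be available as a plain dictionary of PGN tags, and treating a missing rating as "below 2100" is my assumption.

```python
def should_delete(headers, ply_count, min_elo=2100, min_plies=21):
    """Return True if a game matches any of the removal criteria from the text.

    headers   -- dict of PGN tags, e.g. {"Event": "...", "WhiteElo": "2450"}
    ply_count -- number of half-moves actually played in the game
    """
    event = headers.get("Event", "").lower()
    if "titled" in event or "blitz" in event:
        return True
    for tag in ("WhiteElo", "BlackElo"):
        try:
            elo = int(headers.get(tag, ""))
        except ValueError:
            # ASSUMPTION: a missing or unparseable rating counts as too low.
            return True
        if elo < min_elo:
            return True
    return ply_count < min_plies
```

Because the conditions are combined with "or", a game is removed as soon as any one of them applies, which mirrors the "Or" search described above.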


2. An excellent source of very high quality games is the game archive at iccf.com. The ICCF allows players to use chess engines, so very few games contain blunders (especially among players rated higher than 2200). Access to these games requires ICCF membership, but you can join for free. The ICCF has made it very easy to download a complete archive of games, and new games are added monthly. If you're going to play in the ICCF, having these games is critical to understanding lines that your opponents have already played.

3. Another source of games is the Lichess database. The vast majority of these games are blitz games, and the large data files are more difficult to deal with. While a typical month's download contains approximately 95 to 100 million games, filtering it for games lasting longer than 16 minutes where both players are rated at least 2000 leaves only about 15,000 games per month. If you don't want to do the work of downloading and filtering games yourself, Lichess user nikonoel produces a monthly file filtered to contain non-bullet games played by highly rated players. Downloading his historical files can be an easy way to back-fill some Lichess history into your database. However, while you can filter nikonoel's file to exclude blitz games, there is no way to filter it for games that are only slightly longer than blitz but still quite short and likely to include blunders, because nikonoel does not retain all of the header information. Thus I prefer to download and filter the Lichess files myself.
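One way such a filter could look in Python is sketched below. It streams over the lines of an already decompressed monthly file and keeps a game only if both ratings are at least 2000 and the time control is long enough. Several things here are assumptions: game length is judged from the TimeControl tag using the estimate "base seconds plus 40 times the increment", the 16-minute threshold is applied to that estimate, and games without a clock (TimeControl "-") are dropped. The function names are illustrative.

```python
import re

TAG_RE = re.compile(r'^\[(\w+) "(.*)"\]\s*$')


def estimated_seconds(time_control):
    """Estimate game length from a 'base+increment' tag; None if there is no clock."""
    match = re.fullmatch(r"(\d+)\+(\d+)", time_control)
    if not match:
        return None  # e.g. "-" for correspondence games
    base, increment = map(int, match.groups())
    return base + 40 * increment


def keep(headers, min_elo=2000, min_seconds=16 * 60):
    """True if the game is long enough and both players are rated min_elo or higher."""
    seconds = estimated_seconds(headers.get("TimeControl", ""))
    if seconds is None or seconds <= min_seconds:
        return False  # ASSUMPTION: games without a usable clock are dropped
    try:
        lowest = min(int(headers["WhiteElo"]), int(headers["BlackElo"]))
    except (KeyError, ValueError):
        return False  # missing or non-numeric rating
    return lowest >= min_elo


def filter_games(lines):
    """Yield the full PGN text of each game in lines that passes keep()."""
    headers, chunk = {}, []
    for line in lines:
        # Each game is assumed to start with its [Event "..."] tag.
        if line.startswith("[Event ") and chunk:
            if keep(headers):
                yield "".join(chunk)
            headers, chunk = {}, []
        match = TAG_RE.match(line)
        if match:
            headers[match.group(1)] = match.group(2)
        chunk.append(line)
    if chunk and keep(headers):
        yield "".join(chunk)
```

Because the generator only ever holds one game in memory, it can be fed directly from a decompression pipe, for example by decompressing the monthly .pgn.zst file to standard output and writing the result of filter_games(sys.stdin) to a new PGN file.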

Games you probably don't want in your database

There are other sources of free games that focus on famous historical games (e.g., games by Capablanca, Fischer, etc.). From a pure strategy perspective, having famous old games in your database is less useful, because many of their lines contain errors that have since been discovered using chess engines.

Also, while there are a significant number of engine vs. engine games you can download (e.g., the TCEC engine competition archive), these games all start from an unbalanced position a number of moves into the game, so that one engine already has an advantage of 100 centipawns or more (100 centipawns is roughly the value of one pawn). This is done because the engine games would otherwise likely end in draws. Starting from unbalanced positions makes it easier to determine which chess engine is strongest. However, these kinds of starting imbalances almost never occur in GM or ICCF games. There is also only a remote chance that you will encounter the positions that arise in those games' middlegames and endgames, given the enormous number of possible board positions.
