28 September 2010

Olympiad Size PGN

In the Introduction to the current issue of TWIC, no.829, Mark Crowther offered some thoughts on PGN and on the FIDE rating lists:-

My enjoyment of the Olympiad has somewhat been curtailed by the processes I have created for dealing with the games in this Olympiad. The official site coverage has not been at all bad but the games are capable of some improvement.

I created a master list of players from the chess-results.com list which is the site which has the correct results. However they made life hard by not giving the FIDE ID for the player and not punctuating the players' names. However once I did that I collected the results from there and compared them with the results from the games. I also added board numbers (Match followed by Board) which allowed me to find some incorrect results. I have also created a completely new pair of tags, WhiteFideId and BlackFideId.

The PGN standard is badly in need of some work on it. (FIDE would be the place for such discussions but I would have to think long and hard about contributing if they reelect Ilyumzhinov.) Its simplicity has made the dissemination of games possible and standard. But additional tags would be helpful. If (as it would have been possible to do because of the way I process games) I had processed games in the past with black and white FideIds in it then the material would have gained a huge amount of value (although even here there are problems as unbelievably FIDE reuse the IDs of dead players, like there aren't enough numbers).

I also want to move to full names, a complex issue, the FIDE rating list, particularly in some areas such as the Indian part of the list, is a mess and as always what do you do about duplicate names? (the many Alexey Ivanovs or other such common name pairs, Hungarian names also seem to have a lot of duplicates).

Anyhow the Olympiad is the place to test some ideas because there is so much material. Comments and errors in the PGN on the website with full names and in this issue with the shortened form, are welcome. I hope to move to a better and more streamlined way of processing games. Anyway I think I made big progress and I hope in the second week I can settle down to enjoy the chess.
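To make the idea of the new tags concrete, here is a minimal Python sketch of how a game header might be extended with WhiteFideId and BlackFideId. Only the two tag names come from Crowther's text; the master list (a plain dictionary from player name to ID), the function names, and the sample players and ID are all hypothetical, and this is not a description of how TWIC actually processes its files.

```python
import re

# Matches one PGN tag pair such as [White "Player, Alpha"].
# Simplification: escaped quotes inside the value are not handled.
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def parse_tags(header_text):
    """Return the tag pairs of one PGN game header as a dict, in file order."""
    tags = {}
    for line in header_text.splitlines():
        match = TAG_RE.match(line.strip())
        if match:
            tags[match.group(1)] = match.group(2)
    return tags


def add_fide_ids(tags, master_list):
    """Return a copy of tags with WhiteFideId/BlackFideId added where known.

    master_list is a hypothetical dict mapping a player's name, spelled
    exactly as in the White/Black tags, to a FIDE ID.
    """
    out = dict(tags)
    for colour in ("White", "Black"):
        fide_id = master_list.get(tags.get(colour, ""))
        if fide_id is not None:
            out[colour + "FideId"] = str(fide_id)
    return out


def format_tags(tags):
    """Write the tag pairs back out, one per line."""
    return "\n".join('[%s "%s"]' % (name, value) for name, value in tags.items())


# Made-up header and master list, for illustration only.
header = (
    '[Event "Example"]\n'
    '[White "Player, Alpha"]\n'
    '[Black "Player, Beta"]\n'
    '[Result "1-0"]'
)
master_list = {"Player, Alpha": 1000001}

tagged = add_fide_ids(parse_tags(header), master_list)
print(format_tags(tagged))
```

In this sketch a player missing from the master list simply gets no ID tag, which leaves the duplicate-name problem Crowther mentions unsolved: two different players with the same spelling would need more than a name to tell them apart.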

The TWIC PGN game scores are used to populate many downstream databases. Another important online resource is Chessgames.com, who had this to say about the Olympiad on the same day Crowther made his remarks (from User Profile chessgames.com):-

Sep-27-10: We have software that automatically imports data several times a day. Sometimes it tries to get the games from the official site, but for some events like the Olympiad we prefer to get the PGN from an indirect source, in this case from The Week in Chess (http://www.chess.co.uk/twic/twic.html). We prefer TWIC over the "real thing" because their version of the PGN tends to be much more refined, especially when it comes to normalized name spellings, dates, and details like that.

An important nuance to understand is that this software does not recognize what round we are on; it doesn't say "OK, this should be round 9, where are the round 9 games?" It simply takes a gigantic file of PGN and says "OK, which games in here aren't already in the database?" Usually these are the games from the most recent round, but not necessarily -- if some game from round 1 suddenly became available at the end of the event, it would grab that too.

So: if TWIC currently has bad game scores for the early rounds of the Olympiad, and at the same time we embarked on a project to delete the bad games and import from the official site, eventually our import software would identify the erroneous TWIC games we deleted as "new" and end up inserting them all over again. We'd end up with duplicates like crazy (both the wrong scores and the right ones side by side).

Now if TWIC was an unreliable source -- if we weren't sure that they would fix this problem in a timely manner or perhaps not at all -- then I agree we should go elsewhere. But Mark Crowther is immaculate and his fans are quick to alert him to problems. If he hasn't fixed round 1 already I'm sure he'll get to it soon. When that happens it's just a matter of deleting the aberrant games and then our automatic import script will take care of the rest.

So in short, if we simply wait, this problem is 99% likely to take care of itself and probably within a day or two. If we try too hard to fix it ASAP we could end up with an even bigger mess on our hands.
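The import behaviour described in that comment, and the duplicate problem it warns about, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: Chessgames.com does not say how it decides that two games are the same, so the game_key function (players, round and move text), the dictionary standing in for the database, and the sample games are all hypothetical.

```python
def game_key(game):
    """Hypothetical identity of a game: players, round and normalised move text."""
    moves = " ".join(game["Moves"].split())
    return (game["White"], game["Black"], game["Round"], moves)


def import_new_games(database, source_games):
    """Add every source game not already in the database; return the ones added.

    The round number is only part of the key. The importer never asks
    which round is current; it just looks for games it has not seen.
    """
    added = []
    for game in source_games:
        key = game_key(game)
        if key not in database:
            database[key] = game
            added.append(game)
    return added


# Made-up games: the source still carries a faulty round 1 score.
bad_r1 = {"White": "Player, Alpha", "Black": "Player, Beta",
          "Round": "1.1", "Moves": "1. e4 e5 2. Nf3 1-0"}
fixed_r1 = {"White": "Player, Alpha", "Black": "Player, Beta",
            "Round": "1.1", "Moves": "1. e4 e5 2. Nf3 Nc6 1-0"}
new_r9 = {"White": "Player, Gamma", "Black": "Player, Delta",
          "Round": "9.1", "Moves": "1. d4 d5 1/2-1/2"}

source = [bad_r1, new_r9]
database = {}

first_run = import_new_games(database, source)    # both games are new

# Manual repair: delete the bad score and insert the corrected one by hand.
del database[game_key(bad_r1)]
database[game_key(fixed_r1)] = fixed_r1

second_run = import_new_games(database, source)   # bad_r1 looks "new" again

print(len(first_run), len(second_run), len(database))
```

After the second run the database holds the corrected game and the faulty one side by side, which is the mess the comment predicts. Once the source itself is corrected, deleting the faulty entry is enough, because the next automatic run no longer finds it in the source.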

As someone who has also grappled with PGN inconsistencies, although not on the scale that these giants do regularly, I wholeheartedly support any effort to improve the PGN standard, whether managed by FIDE or not.
