MJH on Grades

What is a chess grade used for?

Three things:

Determination of eligibility,
Calculating other people's grades,
Personal uses.

The first of these, which can be taken to include selection for teams and regulations for board order in team events as well as eligibility for entry to grade-limited events, requires that the grading system orders players sensibly. It is also useful that the standards associated with various grades are consistent over time so limits and regulations do not need to be changed regularly. The backward-looking nature of grades effectively means that you end up qualifying for those things that you should have been eligible for last year.

When using grades to calculate those of other players, the maxim 'garbage in, garbage out' that applies to all formulae must be borne in mind. Rogue values will mess up the system. In particular attention will need to be paid to the treatment of previously ungraded players.

Note also that if there are distinct pools of players, simply applying the same procedures to both is no guarantee that meaningful comparisons can be made between the results, since of course there is no way of ensuring that the originally allocated grades are comparable. This caveat also holds to a lesser extent when pools of players compete largely amongst themselves, with just a small proportion of their games being against players from other pools, as it is unreasonable to suppose that the minority of 'global' games will pull the separate grading pools together. To some extent this is the position that exists in England today, with the players in leagues in some parts of the country forming pools that have minimal interaction with each other. Consequently claims that grades are not consistent across the country may not be merely a natural defensive claim by someone losing to a lower graded player from outside their normal pool, but actually contain an element of truth.

Personal uses include

target setting: 'I want to get my grade over 150',
assessment of performance over a short period of time: 'I had a good congress at Scarborough - my performance was 163',
to put into context a single result: 'how could I lose to a 100?',
to provide an immediate assessment of the standard of people you have never played.

Ideally for these uses the grading system should be simple and consistent over time and geography.

Accuracy

Most players will not only be aware of the dictum 'form is temporary, class is permanent', but could also quote experiences within their own chess history to support this. Should a grade indicate form, or see through to the underlying class? The former is of little use for formal purposes since form is only temporary and liable to change over much shorter timescales than those covered by a single published grade. The latter on the other hand is fixed, so any measure of it which varies every year is clearly failing.

Some players are convinced that their preferred system is more accurate, even though they cannot say what they mean by this, nor have any idea whether the claimed extra accuracy is enough to survive any errors introduced into the process by rounding or catering for special cases. Clearly 'more accurate' cannot have the normal English meaning of closer to the correct value, as we could only tell this if we knew what the correct value was, in which case we wouldn't need a grading system anyway. Instead we are dealing with the rather slippery statistical sense that in the more accurate method the greatest error we might reasonably expect is less than the greatest error we might reasonably expect in the less accurate method. This doesn't prevent the less accurate method producing a more accurate answer in any particular case, nor does it stop the error occasionally being more than we would reasonably expect.

A brief look at me

Let's look at some performances, namely mine in season 03/04.
The chart right summarizes the (ECF) points I was allocated in each game. The first column shows I twice got between 50 and 60 points - presumably losses against people graded 100 to 110 as my own grade of 139 means draws against lower rated players would still have been rewarded with 99 points.
The idea of a grade consisting of a single number not only representing that lot, but doing so accurately is rather amusing.

A 30 game rolling grade can be found by working out your grade for the first 30 games of a season, then games 2 to 31, then 3 to 32, and so on. To some extent it could be considered to track form. However with form being such a fragile commodity perhaps a 20 or even 10 game rolling grade would in some sense be a better though inevitably more erratic tracker. Below is a graph of my 30 game moving grade for 425 games played over 5 seasons. The figures below the years are my grades for those years.

(Disclaimers: The data included a small number of ungraded players whose grades I estimated myself. Further, although I played in the Derby league throughout this period, in one year they did not have their games graded - I have not removed these games from the data. Whether this causes more damage to my graph or the ECF grade calculation for me is open to discussion.)

Each point represents my performance over the previous 30 games. In order for the graph to rise the latest game must contain a better performance than the game now dropping out of the calculation. Thus it is not just at the peaks that I was playing well, but also where the graph is climbing. Similarly my play was relatively poor where the graph is falling as well as at the troughs.
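A rolling grade of this kind is simply a moving average of the grading points earned in each game. The sketch below assumes you already have your points for each game in playing order; the function name `rolling_grade` and the sample figures are made up for illustration and are not my actual results.

```python
from typing import List


def rolling_grade(points: List[float], window: int = 30) -> List[float]:
    """Average grading points over each run of `window` consecutive games.

    points[i] is the grading points awarded for game i, in playing order.
    The first value covers games 1..window, the next games 2..window+1, etc.
    """
    return [
        sum(points[start:start + window]) / window
        for start in range(len(points) - window + 1)
    ]


# Hypothetical points for six games, with a 3 game window to keep it short.
sample_points = [130, 150, 99, 170, 60, 145]
print(rolling_grade(sample_points, window=3))
```

Each new value differs from the previous one only by the game that enters and the game that drops out, which is why the graph rises exactly when the latest game beats the one leaving the window. With fewer games than the window the list is empty.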

I think I can make a fair claim to being basically a one forty player even though my performance levels can vary substantially from this value. Superficially this may seem to be an argument for publishing grades more frequently. However, one of the uses for grades is in the calculation of other people's grades. For example had my grade been published as 124 in a new year list in Jan 2007, then all those people I played during my purple patch later the same season would have found their own grades suffering even more than they have done. Even an incremental grade would have taken time to adjust to my higher level of performance. As a grade is taken to represent one's playing strength through the season an interval estimate such as 140 ± 15 might be more sensible than the point estimate of 140, but somehow I doubt that the chess playing fraternity would warm to interval estimates.

You can generate a similar graph for your own results using your preferred system. I'm rather a streaky player - a more consistent one would not hit the same peaks or plumb the same depths. Nevertheless they will find that the graph still displays some sort of irregular pattern of relative peaks and troughs. If your records do not contain enough games to graph a 30 game rolling grade try a 20 or even 10 game rolling grade. The patterns you get will be more jagged with more extreme peaks and troughs - as an example my 10, 20 and 30 game rolling grades for 2003/4 are shown right.

ECF system.

(I) New grade = average grading points per game, where

grading points in a single game = opponent's grade + 100 × your score (0/½/1) - 50

and opponent's grade is taken to differ by at most 40 from your own.

This can be rearranged to

(II) New grade = average grade of opponent + your %age score - 50
or

(III) New grade = old grade + 100 × (total actual scores - total expected score)/N,

N being the number of games played.

and expected score = (your grade - opponent's grade + 50)/100

= score needed to maintain your grade.
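That the three versions agree can be checked numerically. The following is a minimal sketch, assuming the 40 point rule is applied to the opponent's grade before anything else; the function names and the five sample games are invented for the purpose.

```python
from typing import List, Tuple

Game = Tuple[float, float]  # (opponent's grade, your score: 0, 0.5 or 1)


def clamp_opponent(own: float, opp: float, limit: float = 40) -> float:
    """Treat the opponent's grade as at most `limit` points from your own."""
    return max(own - limit, min(own + limit, opp))


def ecf_grade_I(old: float, games: List[Game]) -> float:
    """(I): average grading points per game."""
    pts = [clamp_opponent(old, opp) + 100 * score - 50 for opp, score in games]
    return sum(pts) / len(pts)


def ecf_grade_II(old: float, games: List[Game]) -> float:
    """(II): average opponent's grade + percentage score - 50."""
    n = len(games)
    avg_opp = sum(clamp_opponent(old, opp) for opp, _ in games) / n
    pct = 100 * sum(score for _, score in games) / n
    return avg_opp + pct - 50


def ecf_grade_III(old: float, games: List[Game]) -> float:
    """(III): old grade + 100 * (actual - expected) / N."""
    n = len(games)
    actual = sum(score for _, score in games)
    expected = sum(
        (old - clamp_opponent(old, opp) + 50) / 100 for opp, _ in games
    )
    return old + 100 * (actual - expected) / n


# Hypothetical season for a player graded 139; the 90 is treated as 99.
games = [(150, 1), (120, 0.5), (90, 0.5), (105, 0), (160, 1)]
for formula in (ecf_grade_I, ecf_grade_II, ecf_grade_III):
    print(formula.__name__, round(formula(139, games), 2))
```

All three give the same figure because the old grade added in (III) is exactly cancelled by the old grade hidden inside each expected score.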

Chess Scotland System - an Elo based system

The Chess Scotland website contains a fairly extensive description of the procedure they adopt.

In particular their starting formula is

(IV) New grade = old grade + 800 × (total actual scores - total expected score)/N

What I gain, you lose

Clearly in any game the actual scores of the two players sum to 1, as do their expected scores, so

my score + your score = my expected score + your expected score

and hence

my score - my expected score = -(your score - your expected score)

The change in value of the bracketed expression (in either (III) or (IV)) for the two players due to a game between them will be negatives of each other. What one gains the other loses. Points are effectively transferred from one player to the other. In the ECF system the number of points transferred is of course the difference between the number of grading points awarded for the game and the player's grade.
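The transfer can be seen by working out both players' side of a single game. This is a small sketch using the ECF expected score; the names `ecf_expected` and `transfer` are mine, and the `multiplier` argument is only there so that 800 can be tried in place of 100.

```python
from typing import Tuple


def ecf_expected(own: float, opp: float) -> float:
    """ECF expected score, with the grade difference capped at 40 points."""
    diff = max(-40.0, min(40.0, own - opp))
    return (diff + 50.0) / 100.0


def transfer(grade_a: float, grade_b: float, score_a: float,
             multiplier: float = 100.0) -> Tuple[float, float]:
    """Points gained by A and by B from one game, before dividing by N."""
    gain_a = multiplier * (score_a - ecf_expected(grade_a, grade_b))
    gain_b = multiplier * ((1.0 - score_a) - ecf_expected(grade_b, grade_a))
    return gain_a, gain_b


gain_a, gain_b = transfer(130, 110, 1.0)
print(gain_a, gain_b)  # equal in size, opposite in sign (up to rounding)
```

Whatever expected score function is substituted, the two gains cancel as long as the two expected scores for a game add up to 1.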

We can regard the ECF and Elo systems as examples of transfer systems, as indeed would be any other system similarly based on the concept of expected score. Not surprisingly all such systems have to confront the same problems, and their relative success as grading systems is much more down to the measures they take to address these problems than to the evaluation of expected score.

There are several differences between the Scottish formula and version (III) of the ECF formula.

Firstly the visual difference of × 800 as opposed to × 100. Clearly the effect of this is to spread grades out further in the Scottish system, and gives rise to the × 8 factor in the conversion formula ELO = 8 × ECF + 600. However it does nothing to reorder grades and so is merely a cosmetic difference.

The + 600 in the conversion formula simply aligns the two systems once the spreads have been catered for. It may well be a number chosen by design - for example were the ECF to have chosen 100 and ELO 1400 as the grade for an average player, then on the assumption that the average players playing under either grading system would be of similar standard 100 ECF needs to convert to 1400 ELO and the need for the + 600 becomes apparent. Unfortunately I have to confess that I have not delved enough into the original designs to make the assertion that this is how the +600 term arose.

Short digression: A conversion formula of FIDE = 5 × ECF + 1250 is sometimes quoted for players below 215 ECF although FIDE is an Elo based system. This is not because of problems with the ECF formula, but because FIDE grades are only awarded for performances reaching some minimum level. Consequently lower rated players only get a FIDE grade when they outperform, so that the lower FIDE grades overstate the players' general standard. This conversion factor is, I believe, based on what works rather than any underlying theory. End digression.
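The two conversion formulae quoted above are easy to put side by side. The function names here are invented, and the second formula is only the rule of thumb mentioned in the digression, not an official conversion.

```python
def ecf_to_elo(ecf: float) -> float:
    """Conversion quoted in the text: ELO = 8 x ECF + 600."""
    return 8 * ecf + 600


def ecf_to_fide_rule_of_thumb(ecf: float) -> float:
    """Rule of thumb sometimes quoted below 215 ECF: FIDE = 5 x ECF + 1250."""
    return 5 * ecf + 1250


for ecf in (100, 150, 215):
    print(ecf, ecf_to_elo(ecf), ecf_to_fide_rule_of_thumb(ecf))
```

The two lines cross at about 217 ECF, close to the 215 limit quoted. Below that point the rule of thumb gives the higher figure, which fits the remark that the lower FIDE grades overstate general standard.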

Next the expected scores are defined differently. Thirdly N is different. Lastly the old grade can be considered different.

How are Elo expected scores defined?

The next difference is also largely irrelevant, though that statement will cause a lot of angst amongst Elo supporters. The expected scores in the two formulae are defined differently. The ECF one I have already given - it is a truncated linear function of the difference of the two players' grades. Under Elo expected scores are given by a Duckworth-Lewis type table which clearly complicates the calculation. (OK, D-L is based on observation whereas Elo is based on a statistical curve, but for most people they are both black box calculations). Graphs of the two are shown right, the horizontal scale giving the difference in grades of two players in both ELO and ECF points and the vertical axis the expected score.

The maximum difference between these functions is 0.032 for a difference of 40 ECF points. Switching to ELO expected scores within the ECF formula would for a single game make a difference of at most 100 × 0.032 = 3.2 to the numerator of the fraction in (III) and hence of less than 0.11 in the grade whenever the denominator N is at least 30 as desired. For a small number of people who consistently play against much stronger opposition (or consistently in much weaker fields) the cumulative effect over a season might be noticeable. For most players it is a refinement without effect. It is worth pointing out that any probability curve claiming to give the expected scores has to be symmetric about (0, 0.5), and this ensures that anyone who plays a selection of opponents both weaker and stronger than themselves will get roughly the same total expected score for many such functions.
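A comparison of this sort can be sketched as follows. One loud assumption: in place of the Elo table I use the logistic curve 1/(1 + 10^(-d/400)) that is commonly used as the Elo expected score, so the largest gap printed will be close to, but not exactly, the 0.032 quoted above. Grade differences are converted at 8 Elo points per ECF point, and only differences up to 40 ECF points are examined.

```python
def ecf_expected(diff_ecf: float) -> float:
    """Truncated linear ECF expected score for a difference in ECF points."""
    clamped = max(-40.0, min(40.0, diff_ecf))
    return (clamped + 50.0) / 100.0


def elo_expected_logistic(diff_elo: float) -> float:
    """Logistic Elo curve (an assumption standing in for the Elo table)."""
    return 1.0 / (1.0 + 10.0 ** (-diff_elo / 400.0))


# Gap between the two functions for differences of 0 to 40 ECF points.
gaps = {d: ecf_expected(d) - elo_expected_logistic(8 * d) for d in range(0, 41)}
worst_diff = max(gaps, key=lambda d: abs(gaps[d]))
worst_gap = gaps[worst_diff]

print(f"largest gap {worst_gap:.3f} at {worst_diff} ECF points")
print(f"effect of one such game on a 30 game grade: {100 * abs(worst_gap) / 30:.2f}")
```

Both functions satisfy E(d) + E(-d) = 1, which is the symmetry about (0, 0.5) referred to above, so gaps against stronger opponents offset gaps against weaker ones.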

English N v Scottish N

The third difference, rarely commented upon, is in N. Needless to say this one does matter. In the English system N should be at least 30. If fewer than 30 games are available in the current season then a proportion of games are taken from previous years to get to 30, though calculations are never based on more than 3 years of games. The Scots also believe that calculations should be based on N being at least 30, so they define N to be the greater of the number of games played and 30. This has implications for the grades of anyone who plays fewer than 30 games in a season and so propagates throughout the system via their opponents and their opponents' opponents in subsequent seasons.

These different Ns may reflect different philosophies in the approach to grading. Although (III) is equivalent to it, (I) can be regarded as the definition for an English grade. This effectively starts each year afresh saying 'What evidence have we got to base this year's grade on?' Including previous years' games when fewer than 30 games are available in the current year is only justified if the player's level of performance is constant over the extended period, a condition not satisfied by all players even if their participation is limited.

The Scottish formula (IV) on the other hand by emphasizing the role of an old grade appears to ask the question 'What evidence do we have for a change in the player's grade?' If the evidence is based on fewer than 30 games they regard it as less reliable. By using N as 30 rather than the number of games played they ensure that the change from old to new grade is a proportionally smaller amount. The Scottish approach is also equivalent to making up to the required 30 games by including phantom games that are draws against someone of your own grade.
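The equivalence with phantom draws can be checked directly. The sketch below works in ECF grading points rather than Elo points (so it is the Scottish N applied to formula (III), not the Chess Scotland calculation itself), and the function names are my own. A draw against someone of your own grade is worth exactly your old grade in points.

```python
from typing import List


def scottish_n_grade(old: float, points: List[float], min_games: int = 30) -> float:
    """Formula (III) with N = max(games played, min_games).

    points[i] is the ECF grading points for game i, so points[i] - old
    equals 100 * (actual score - expected score) for that game.
    """
    n = max(len(points), min_games)
    return old + sum(p - old for p in points) / n


def phantom_draw_grade(old: float, points: List[float], min_games: int = 30) -> float:
    """Formula (I) after padding with draws against players of grade `old`."""
    padding = [old] * max(0, min_games - len(points))
    padded = list(points) + padding
    return sum(padded) / len(padded)


# Hypothetical: old grade 100, fifteen games averaging 60 points each.
print(scottish_n_grade(100, [60] * 15), phantom_draw_grade(100, [60] * 15))
```

Each phantom draw adds nothing to the numerator of (III) and one to its denominator, which is all the max(N, 30) rule does.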

Consider someone who plays 15 games a year and whose performance history (grade calculation making use of the actual number of games even if less than 30) is year 1: 100, year 2: 60, year 3: 90. Their ECF grades go:

Year 1: 100 (no other games to call upon).

Year 2: 80 - based on 30 games over 2 years. 'I knew I'd had a bad season'.

Year 3: 75, again based on 30 games over 2 years. 'I thought I'd played better this year, but my grade's gone down again.'

Applying a Scottish N to the English formula would lead to successive reported grades of 100, 80 and 85. Had there been 30 games in each year the new grades would have been the same as the performances. With only 15 (half of 30) games to go on the difference between old and new grades is only half the difference between the old grade and the new performance.

On the other hand had their performances been 40, 80, 90 then their ECF grades go 40, 60, 85 whereas applying a Scottish N produces 40, 60, 75.
In each case the Scottish N and the English N give grades differing by 10 in the third year. Whilst the performance histories chosen may be a little extreme it does show that the choice of N can have a much larger effect on grading calculations than the choice of ECF or ELO expected result function. Which sequence of grades produced is better is a moot point. You may even believe each treatment of N is superior for one of my examples.
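Both sequences can be reproduced with a short calculation. This is a sketch under simplifying assumptions of mine: every game in a season is taken to score that season's performance figure, earlier games are drawn from the most recent season first, and the first grade is simply the first performance. The function names are invented.

```python
from typing import List

TARGET_GAMES = 30   # N should be at least this
MAX_SEASONS = 3     # English rule: never more than 3 years of games


def english_n_grades(performances: List[float], games_per_year: int) -> List[float]:
    """Top up the current season with earlier games until 30 are available."""
    grades = []
    for year in range(len(performances)):
        total_points = 0.0
        games_used = 0
        for back in range(MAX_SEASONS):
            season = year - back
            if season < 0 or games_used >= TARGET_GAMES:
                break
            if back == 0:
                take = games_per_year  # all of this season's games count
            else:
                take = min(games_per_year, TARGET_GAMES - games_used)
            total_points += take * performances[season]
            games_used += take
        grades.append(total_points / games_used)
    return grades


def scottish_n_grades(performances: List[float], games_per_year: int) -> List[float]:
    """Formula (III) with N = max(games played, 30)."""
    grades = [float(performances[0])]
    for performance in performances[1:]:
        old = grades[-1]
        n = max(games_per_year, TARGET_GAMES)
        grades.append(old + games_per_year * (performance - old) / n)
    return grades


for history in ([100, 60, 90], [40, 80, 90]):
    print(history, english_n_grades(history, 15), scottish_n_grades(history, 15))
```

Changing `games_per_year` to 30 makes both functions return the performances unchanged, as stated above.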

The Old Grade Term

Lastly there may be differences in the 'old grade' terms in formulae (III) and (IV), as this is where any differences in methodology for calculating the grades of previously ungraded players will show up. We can also regard any adjustments used, such as for juniors, as appearing here rather than as being separate items.

Pink or Blue?

Rumour has it that white has an underlying advantage, yet grading calculations make no attempt to allow for the differing proportions of whites and blacks that players have in a season. The opening book that came with my copy of Fritz gives white scoring 57% suggesting being white gives a 14 point boost to your grade relative to being black. This should be reflected by a change in your ECF expected score of +/-0.07 depending on colour. Even allowing for White's advantage being smaller than this (which may or may not be the case) colour change is in general going to have a bigger effect on your grade calculation than a change from ECF to ELO expected results curve - averaged out over 30 games, those extra 14 points for being white rather than black in a single game add nearly 0.5 to your grade.

The End is Nigh

How do your games end? By player actions alone or do you use phone a friend or ask the audience (technically known as adjournment and adjudication respectively) to complete the game?

Are the outcomes of your games ever influenced by other events - agreeing a draw in a superior position to secure a match victory, a congress prize or get to the pub before closing time? Ever pressed on in a position you would prefer to agree drawn in an attempt to secure the win that your team needs, only to find that the risks you have to take backfire and you lose? Accepting a draw in a won position will cost you up to 50 grading points as used in (I) which over thirty games will affect your final grade by up to 1.67, again drowning out the effects of changing expected probability curve from ECF to Elo. (I say up to since as anyone who doesn't use adjudications knows, not all won games end up being won.) People involved in adjournments may concede an inferior result rather than travel to play the second session needed to get the result they believe possible.

The grade calculation ignores colour, mode of finish and outside influences - only the result and the grades of the participants count. Truly grades only indicate performance in the most basic sense, not how well you have played or how strong a player you are.
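The sizes of the effects mentioned in the last few sections can be put side by side. All the figures below are the ones quoted in the text (0.032 for the curves, 57% for white, 50 points for a conceded draw); the dictionary and its labels are just an illustrative layout.

```python
GAMES = 30  # a season of thirty games

# Grading points gained or lost in ONE game, as quoted in the text.
points_per_game = {
    "switching ECF to Elo expected score (worst case)": 100 * 0.032,
    "having white rather than black": 57 - (100 - 57),
    "agreeing a draw in a won position": 50,
}

# Effect of that one game on a grade averaged over the season.
grade_effect = {name: pts / GAMES for name, pts in points_per_game.items()}

for name, effect in grade_effect.items():
    print(f"{name}: {effect:.2f}")
```

On these figures the choice of expected score curve is the smallest of the three influences on a single game.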

What are Inflation and deflation?

The average grade within a grading database should move up and down in line with the average standard of the players within it. In times of expansion one might expect the average to go down because of an influx of new players mainly of a modest standard. What happens during periods of contraction will depend on the relative defection rates at different standards. Such changes are natural and should occur. They do not represent inflation or deflation.

A grading system suffers inflation if the numerical grade associated with a fixed standard goes up over time, deflation if it goes down.

Inflation/deflation arises in two forms: average grade changes, average standard does not; average grade remains constant, average standard does not. The former is caused by what I call asymmetric transfer of points, the latter by the presence of players whose standard is rising or declining.

Non-Symmetrical Transfer

Non-symmetry of transfer introduces deflation or inflation into the system. In ECF terms a 130 beating a 110 scores 160 points so gains 30 points, the 110 loses 30 points. If say the 130 plays 60 games in the season those extra 30 points average out to increase his grade by 0.5. If the loser plays 30 games his 30 point loss averages out to reduce his grade by 1. Result: deflation. Thus if more successful players are generally more active than their less successful counterparts we can expect some deflation in the system because of these non-symmetric transfers. There is no way of eradicating this sort of deflation from an averaging system, though the Scottish treatment of N will much reduce the effect. In the ECF system some published grades are calculated on as few as 10 games - if such a player was the 110 person above their defeat will have decreased their published grade by 3 points, a much larger deflationary effect.
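The arithmetic of this example, and the way a minimum N dampens it, can be written out as below. The function `grade_change` and its `min_games` argument are my own shorthand: `min_games=1` divides by the games actually played, while `min_games=30` imitates the Scottish treatment of N.

```python
def grade_change(points_transferred: float, games_played: int,
                 min_games: int = 1) -> float:
    """Change in grade caused by one game's transfer of points."""
    return points_transferred / max(games_played, min_games)


winner = grade_change(+30, games_played=60)   # the 130, an active player
loser = grade_change(-30, games_played=30)    # the 110
print(winner, loser, winner + loser)          # the changes do not cancel

# The same defeat for a player graded on only 10 games:
print(grade_change(-30, games_played=10))
print(grade_change(-30, games_played=10, min_games=30))
```

The 30 points transferred are equal, but the grades only move by equal amounts when both players divide by the same N.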

Improving and declining players

Improving players cause deflation, those in decline cause inflation. For an improving player's grade to rise, other players must fall even if they maintain their standard, since the grading process is a transfer one. Consequently the average grade associated with a fixed standard goes down - deflation. Most players have a period of improvement at the start of their career which contributes to grade deflation. On the other hand a significant proportion of players quit or greatly reduce their activity before their standard declines, so they do not make a balancing contribution of inflation later in their careers.

The majority of improvers can be found in the lower reaches of the grading system. Lower graded players can find that a significant proportion of their games are against these improvers and so their grades suffer accordingly with noticeable deflation happening here as a result. Further up the grading system more games are between people of roughly constant standard so that the effect is much reduced.

The effect of games from yesteryear

The inclusion of games from earlier seasons in the ECF calculation may turn out to be inflationary or deflationary depending on whether the players' performances in those earlier seasons were better or worse than their old grade. For example if your old grade is 90, but your last year's performance was 100, including some of these in your new grade calculation means putting scores of 10 more than your grade into the calculations without a balancing deficit elsewhere in the database which should normally be present in a transfer system. Inflation. The phantom draws that the Scottish treatment effectively introduces are neutral since for these both the expected and actual score are 0.5.

Fighting Inflation/Deflation

Weapons to fight this type of deflation are using adjustments such as junior increments and treating certain graded players as ungraded. The ECF appears to put its faith almost totally in junior increments, though it does treat players with negative grades as ungraded. The Scottish system also applies increments to newly graded adults, treats rapid improvers as ungraded and is prepared to apply an across the board anti-drift supplement. I don't know whether the credit for this array should go to Chess Scotland, some other graders that they took advice from, or whether Elo himself included these tools as an integral part of his system.

Does Deflation matter?

Yes. Whilst the effect of deflation is seen in averages, not everyone is affected equally. Deflation first affects the opponents of improvers and those on the wrong end of asymmetric transfer, with secondary effects on these victims' opponents in subsequent years. In areas in which there are a relatively large number of active improvers deflation will be more pronounced than those in which either there are no improvers or in which improvers have a low activity. Consequently the grades in areas can move relative to each other creating the phenomenon that grades in some parts of the country reflect different standards than in others.

Tarred with the same brush

Clearly chess grading systems do not handle changing standards well, since such changes necessarily introduce an element of inflation/deflation. Far from being superior to the systems used by other activities the formulae that drive chess grades are not up to the job on their own - an array of special cases needs to be identified and treated differently in order that the scheme can do its job. This is true regardless of the expected result function used. In particular to avoid deflation improvers have to be treated as special cases, either being kept out of the system or having adjustments made for them.

The Extent of the deflation Problem

Sean Hewitt has attempted to investigate the extent of the deflation problem by examining the 2006 grades. He concludes that a correction should be applied to bring results into line with the ECF expected results function. The graph shows the effect of applying what I shall call the Hewitt Correction to the 2006 grades. The gap between the lines shows the way in which he believes deflation is worst at the lower grading levels.

Calculating junior increments

The adjustment needs to be 'right' on average not on a per player basis but on a per game played basis, i.e. we need a weighted average where the weights are the number of games played.

Consider two juniors each given increments of 10 points though subsequent performances suggest that one 'ought' to have had an increment of 15, the other 5. On average the increments of 10 are right. However suppose the former now plays 20 games, the latter only 10. Then 30 increments of 10 will be fed into the system to counter the deflationary effect of improving juniors. However what is needed is 20 lots of 15 and 10 of 5 - a total of 350 points instead of the 300 put in. 350/30 is nearly 12 and it is this value rather than 10 that should have been used as the uniform increment for the players in this example. Unfortunately the number of games being referred to is the number to be played in the forthcoming season, not exactly a known value!
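The difference between the per player and per game averages in this example can be computed as follows. The function names are invented, and the two juniors are the ones described above.

```python
from typing import List, Tuple

Junior = Tuple[float, int]  # (increment the junior 'ought' to have had, games played)


def per_player_increment(juniors: List[Junior]) -> float:
    """Simple average: every junior counts once."""
    return sum(inc for inc, _ in juniors) / len(juniors)


def per_game_increment(juniors: List[Junior]) -> float:
    """Weighted average: every game played counts once."""
    total_games = sum(games for _, games in juniors)
    return sum(inc * games for inc, games in juniors) / total_games


juniors = [(15, 20), (5, 10)]
print(per_player_increment(juniors))  # the increment actually used
print(per_game_increment(juniors))    # the increment needed
```

The weighted figure can only be computed after the season, since the weights are games not yet played when the increment is set; in practice some estimate of activity would have to stand in for them.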

Clearly though if there is any indication that more active juniors improve more rapidly this needs to be taken into account if we wish to avoid deflation, and the table below based on the published 2007 grading list certainly suggests there is:

games played:	At most 10		11 to 20		21 to 30		more than 30		All
age	av inc	juniors	av inc	juniors	av inc	juniors	av inc	juniors	av inc	juniors	wt av
8	7.5	2	-0.25	3			23.0	3	10.4	8	18.8
9	3.0	2	3.0	5	15.0	1	38.0	1	8.2	9	11.6
10	1.6	8	8.2	7	6.2	5	21.2	12	11.2	32	17.5
11	0.8	23	5.1	12	12.4	9	15.0	17	7.3	61	12.8
12	4.9	34	6.7	29	14.7	12	15.9	37	10.1	112	14.9
13	0.2	45	6.4	24	9.3	14	18.0	35	7.9	118	15.2
14	1.4	34	6.4	28	13.6	10	14.4	33	8.0	105	13.4
15	1.9	41	3.6	26	6.5	12	9.6	32	5.0	111	7.8
16	1.8	46	5.8	37	6.4	12	8.0	22	4.7	117	6.5
17	2.2	56	5.3	28	10.6	5	6.0	25	3.8	114	4.6
totals		291		199		80		217		787

For each age group, average year-on-year grade increases have been worked out for 4 levels of activity given by the main column headings. Juniors means the number of juniors. Finally the average increase for all juniors in each age group is given - these seem fairly close to the junior increments used - plus the weighted average for each age group, calculated using the number of games played as the weights.

Faded Scales

Let's use the Scottish formula (IV) to calculate our new grade. We play our 30 games. Find our opponents' grades. Look up the differences between theirs and ours to find our expected scores. We are now ready to weigh our results. But our actual scores must go up in increments of a half. Every extra half point we score gets multiplied by 800 to give 400, and divided by the number of games played - 30. Thus every extra half point increases our grade by 13 and a third - call it 13 between friends. Thus most of the markings on our 4 digit weighing machine are actually missing - only every thirteenth one is there! It is very unlikely we'll get the same grade 2 years in a row regardless of whether our standard changes. This is an example of what I call pseudo-accuracy. The answer gives the impression of an accuracy that the calculation does not justify. Obviously the more games you play the more markings will be revealed on your weighing machine, but even with 100 games only every fourth mark will be there. You can do a similar calculation with the ECF formula (III) to see how much of its scale is missing for players of differing activity.
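The spacing of the markings is half the multiplier divided by N. A small sketch, with a function name of my own choosing, covers both the Scottish multiplier of 800 and the ECF multiplier of 100:

```python
def grade_step(multiplier: float, games: int) -> float:
    """Change in grade produced by one extra half point over `games` games."""
    return (multiplier / 2) / games


for games in (30, 60, 100):
    print(
        games,
        round(grade_step(800, games), 2),  # Scottish formula (IV)
        round(grade_step(100, games), 2),  # ECF formula (III)
    )
```

Relative to the width of its own scale each system has the same proportion of markings missing, since the Scottish step is always eight times the ECF one.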

Conclusions.

The most obvious conclusion is that grades can never be accurate in any meaningful sense. Given this, it seems that neither ECF nor Elo can claim significant superiority for formal uses such as eligibility tests or calculation of opponents new grades. For personal uses of course people will have their own preference, though some will doubtless confuse personal preference with knowledge of superiority. The 'simplicity of ECF' v 'international comparability of Elo' argument is not one that you can win with someone whose views are different than your own. Perhaps the casting vote should be given to that oft-derided quality: familiarity. So, as you have probably already guessed I'm a diehard (Definition: Diehard: one who disagrees with the changes you wish to see made). If simple to understand or stealth changes (ones whose effects are largely unseen) can be made to the ECF grading process to improve it, go ahead, but please spare me the suggestion that by switching to Elo our grades will somehow become more accurate (at doing what? - define accuracy in a grading context) or enable you to compare your strength with that of the runner-up of the Patagonian championships or someone with an internet chess grade.

Indeed teammates will often have conversations about opponents along the lines of 'He's an ex one sixties player'. This is not because they cannot remember whether the person concerned peaked at 163, 168 or even 173, though probably they can't. Rather it is indicative that from their perspective on his day said opponent is liable to be dangerous. This suggests there might be as strong a case for the ECF publishing current grades with the last digit suppressed as for moving to another system in which even more values on its scale are allocated to players within a given ability range, though the common practice of rounding Elo grades to the nearest five renders them more 3 and a bit digit grades rather than four digit ones.

The second conclusion is that the Scottish system is more able to cope with deflationary forces. Their treatment of N reduces deflation due to asymmetric transfer and inflation due to low activity improvers, whilst their treatment of improvers is more comprehensive. The Elo expected result function and the multiplier (800 rather than 100) have no role in the fight against deflation so can be ignored. However the Scottish treatment of special cases - players playing fewer than 30 games a season, new players, rapid improvers - should be adopted. Note that making formula (III) the standard definition of an ECF grade with N being the larger of the number of games played and 30 doesn't stop players from using the current approach to calculate a performance over a small number of games. It does however mean the end of the messy business of including a proportion of games from previous seasons in the calculation when a player plays fewer than 30 games in the current season.

The Junior increment needs to be revisited to be made a function of both age and activity.

How grades put you off chess

And finally, back to the first paragraph. Grading systems can conspire to put some people off chess. Not just because the system shows when the player is no longer able to maintain their standard. Awareness of the effect of bad results on one's grade can make players conservative in their play, afraid to try new openings, lines or sacrifices for fear of the adverse effect on their grade if things go wrong. Regardless of whether their chess is in any meaningful sense better for the conservative approach, it can lead to lowered enjoyment and drift away from the game. Unfortunately similar pressures also arise in team chess where the desire not to let team mates down can lead to a reluctance to experiment.

Appendix - Algebra.

In these manipulations,

P = number of grading points awarded in a single game.

G = your new grade

Y = your old grade

X = your opponent's old grade

A = your actual result (0/½/1)

E = your expected result, which can lie anywhere on the scale from 0 to 1.

Σ means 'the sum of' or 'the total of'

We have