The future of exams: two fundamental principles that must be honoured

Grade inflation strikes again

Now that this year’s school exam results have been declared, attention is focusing on what will happen in summer 2022 and beyond. Mitigating the (uneven) impact of learning loss will continue to be of major importance, and must be recognised wisely in 2022, and probably 2023 too, but the headline issue is grade inflation: some 44% of A level awards in England were A* or A, as compared to 38% in 2020 and 25% in 2019.

If exams-as-normal return in 2022 – as intended by Gavin Williamson and Nick Gibb – it is merely an administrative stroke-of-the-pen for Ofqual to choose the location of the A/B grade boundary to reinstate the grade distribution of 2019, with 25% of scripts awarded A* or A, and 75% all the other grades. To do this, however, would cause the 2022 cohort to exclaim, understandably, ‘why are we being treated so harshly?’ To avoid that, brains are being exercised to discover ways to achieve a stricter rationing of top grades so that no one really notices, and no one complains. One approach might be to do this by stealth, so that the transition from 44% back to 25% (or wherever) takes place over several years; another is to change the structure of the grades, so obscuring comparisons.

The Government, for example, has floated the possibility of replacing the seven current A*, A… grades by numbers. The number of numeric grades cannot be seven, for that makes a before-and-after comparison too easy; and given the mind-set of the Government for more, not less, discrimination, that suggests that more numeric grades – perhaps 9 or 10 – will be introduced, not fewer.

Geoff Barton, the General Secretary of the Association of School and College Leaders, has gone further, suggesting:

But if universities want more differentiation, wouldn’t the simplest solution simply be for them to use the actual marks awarded in exams to identify the highest-performing students – the marks within, say, a grade A that show whether it’s a strong, middle or low A?

If marks are standardised on a scale from 0 to 100, this is not just 9 or 10 numeric grades, but 101, for each mark becomes its own grade. That certainly solves the comparison problem.

In contrast, Mary Curnock Cook, former Chief Executive of UCAS and currently a trustee of HEPI, goes the other way:

The more interesting opportunity that a numbered grade system would present is a potential move to single-level tests – in which a candidate could sit an exam for, say, A level Grade 8 mathematics, rather than a paper designed to accommodate a range of grade performances.

Such a finely-tuned exam does not need umpteen grades: a simple pass/fail could well suffice, or perhaps distinction/pass/fail.

Sam Freedman’s four options

Some proposals that have received wide press coverage have been put forward in a report by Sam Freedman, formerly a senior policy adviser on education to Michael Gove and now a Senior Fellow at the Institute for Government.

The core of Mr Freedman’s report is a section entitled ‘The intractable grading problem for 2022 exams’, which, referring to Ofqual’s recent consultation, opens with these words:

Ofqual does not address the problem of grading, saying it wants to see the 2021 results before doing so. It is not surprising it has left its most intractable problem to deal with later.

Indeed.

Undaunted, Mr Freedman boldly steps into the void, and, with the caveat that ‘While this is not a technically difficult problem it is ethically and politically tricky’, addresses the question of how best to determine the distribution of the 2022 grades. Having given his view of the ‘benefits’ and ‘costs’ associated with each of four options, Mr Freedman selects as his preference to peg 2022 grades to the 2020 distribution, primarily because the students of 2022 and 2020 have both suffered from the consequences of Covid-19, albeit in different ways, whereas those of 2019 and before did not; and the numbers of top grades in 2021 are just too high.

Looking further ahead, Mr Freedman then discusses ‘increasing the resilience of the exam system’, highlighting ‘comparative judgement’ as a way to deliver more reliable grades, for if this is ‘done enough times this produces a rank order that can be turned into grades that evidence suggests are more accurate than traditional marking’.

The final section discusses the destiny of GCSEs. His opinion that ‘scrapping GCSEs without a replacement is not a viable option’ is, I suggest, singularly uncontentious – to my knowledge, none of those exploring reform is seeking to leave a vacuum; rather, the debate is very much about what the best alternative might be. But Mr Freedman’s recommendation that ‘DfE should set up a review group to debate whether the ongoing disruption caused by COVID is a good opportunity for wider reform and rebalancing of secondary assessment’ might be more hotly argued.

‘Whether…’ suggests there may be some doubt as to the opportunity: to me, there is no doubt at all; on the contrary, I am dismayed that the opportunity offered in 2020 was not taken. And surely the heart of the matter is not that ‘DfE should set up a review group’, but whether DfE is listening, and paying any attention, to the many ideas that are on the table now, such as those from EDSK and Rethinking Assessment, to name just two.

But something important is missing…

Mr Freedman’s report is highly persuasive, and by presenting four options, he achieves the subtle ‘nudge’ that these are the only four from which to choose. But might there be others? What is missing?

And then it struck me that there is something missing altogether from Mr Freedman’s report. Something that I believe to be important, central. Something that I then searched for in the suggestions that have been made elsewhere over the last few weeks. I looked; but I could not find any reference to it anywhere. No mention whatsoever of just a single word. But a most significant concept.

Appeals.

The word ‘appeal’ does not feature in any of the recent discussions of the future of exams that I have read so far. That may be because appeals are regarded (or perhaps disregarded) as merely an add-on, something that happens right at the end, something of no consequence.

Yes, appeals do happen right at the end. But to me, appeals are the key feature of the whole exam system: they are the only way by which any grading errors can be discovered and corrected; by which the integrity of that system can be validated. For the exam system is based totally on trust: when that envelope is opened, the candidate cannot know whether those grades are right or wrong. Appeals are therefore the only way to determine whether that trust is honoured. Or not.

My starting point in thinking about what the exam-system-of-the-future might look like is therefore not to be obsessed by how to control grade inflation and how to determine grade boundaries, but to approach the problem from a totally different perspective: to design an exam system in which a fair appeals process would discover no grading errors, but would confirm that essentially all the originally-awarded grades are valid, that the entire system has integrity. If this were to happen, this would prove that originally-awarded grades were fully reliable and trustworthy – which current grades are not, for Ofqual have now acknowledged that so-called ‘gold standard’ exams are only ‘reliable to one grade either way’.

Since exam grades are determined by marks, the only way a grade can be tested is by a fair re-mark, a possibility largely denied by Ofqual since 2016. So that’s the first change that must be made: to allow fair re-marks. And the second is to make appeals free so that there is no barrier, especially to state-funded schools.

But if that were to happen in 2022, the outcome would be a huge number of grade changes, as the truth of Ofqual’s statement that grades are ‘reliable to one grade either way’ becomes fully visible: in fact, if all grades were to be appealed – in England, that’s about 6 million A level, AS and GCSE grades every year – approximately 25%, around 1.5 million, would be changed.

…and something else important is missing too

Which then poses another, important, question: what also has to be in place so that all grades, as originally awarded, are reliable and trustworthy, so that there is a very high likelihood that they will be confirmed, not changed, as the result of a fair re-mark?

Mr Freedman mentions this in his report, and suggests the solution to be the use of comparative judgement. But as he (correctly) points out, this needs to be ‘done enough times’. And for a cohort of more than 700,000 GCSE English scripts, the ‘enough times’ required amounts to a very big number indeed.

Which identifies something else missing from Mr Freedman’s report. A pragmatic way of delivering reliable assessments, assessments that will have a high probability of being confirmed, not changed, on a fair re-mark.

In fact, there are many ways to do this, and one in particular builds on the idea suggested by Geoff Barton – to show on the certificate not grades, but marks. But not just marks. In addition to the mark, the certificate should also show that subject’s ‘fuzziness’ – a number that indicates the likely range of marks that might have been given to that script, had it been marked by another equally qualified examiner. ‘Fuzziness’ recognises the fundamental truth that ‘it is possible for two examiners to give different but appropriate marks to the same answer’, with some subjects, such as English and History, being inherently ‘fuzzier’ than others, such as Maths and Physics.

As an example, suppose that a Biology script is given 65 marks, and that the ‘fuzziness’ of Biology has been determined as 4 marks. The certificate would therefore show an award expressed in a form such as ‘Biology: 65 (± 4)’, meaning that the script’s mark of 65 might have been as low as 61, or as high as 69, or anywhere in-between, had a different examiner marked the script, but it is most unlikely that any examiner would have given a mark higher than 69 or lower than 61.

The candidate is unhappy, and appeals; the script is fairly re-marked and given, say, 67. Since 67 lies within the range 61 to 69, this confirms the original assessment, for this possibility has already been taken into account in the award 65 (± 4). Only if the re-mark were greater than 69, or less than 61, would the assessment be changed – so a re-mark of 70 would result in a revised award of 70 (± 4). But if the measure of that subject’s ‘fuzziness’, 4 marks, has been determined statistically correctly (which is technically very easy to do), then the likelihood that an original assessment would be changed is very low. Which is why original assessments expressed in this way are reliable and trustworthy.
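For readers who find a rule easier to follow when written out step by step, the appeal logic in the Biology example might be sketched as below. This is only an illustration under stated assumptions: the names `Award` and `resolve_appeal` are hypothetical, fuzziness is assumed to be a whole number of marks fixed for each subject, and a re-mark falling exactly on 61 or 69 is assumed to confirm the original award.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Award:
    """An assessment shown as a mark plus the subject's fuzziness (hypothetical type)."""
    subject: str
    mark: int
    fuzziness: int  # assumption: a whole number of marks, fixed per subject

    def __str__(self) -> str:
        return f"{self.subject}: {self.mark} (± {self.fuzziness})"


def resolve_appeal(award: Award, remark: int) -> Award:
    """Return the award that stands after a fair re-mark.

    The original award is confirmed if the re-mark lies within
    mark - fuzziness .. mark + fuzziness (boundaries included, by assumption).
    Otherwise the award is re-issued at the re-mark, with the same fuzziness.
    """
    low = award.mark - award.fuzziness
    high = award.mark + award.fuzziness
    if low <= remark <= high:
        return award  # re-mark already allowed for: award confirmed
    return Award(award.subject, remark, award.fuzziness)


original = Award("Biology", 65, 4)
confirmed = resolve_appeal(original, 67)  # 67 lies within 61 to 69
revised = resolve_appeal(original, 70)    # 70 lies outside 61 to 69
print(confirmed)
print(revised)
```

With the figures used above, the re-mark of 67 leaves the award at 65 (± 4), while the re-mark of 70 produces a revised award of 70 (± 4).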

Two essential principles for any future exam system

Whatever happens in the future as regards exams, my view is that two principles are fundamental, and must be in place:

  1. Appeals must be free, and allow unfettered access to an expert second opinion.
  2. Assessments must be fully reliable and trustworthy when originally awarded: exam grades that are ‘reliable to one grade either way’ are just not reliable enough.

Dennis Sherwood has been writing about the unreliability of grades for HEPI for many years.

