In my FE News piece of 28 March, I examined the statement that GCSE, AS and A level grades are “reliable to one grade either way” (Q1059 here) – a statement made at a hearing of the Commons Education Select Committee on 2 September 2020 by Ofqual’s then Chief Regulator, Dame Glenys Stacey.
Two days later, on 30 March, there was a hearing of another Select Committee – that of the House of Lords, relating to Education for 11-16 Year Olds, at which the witnesses were Tim Oates CBE, Group Director of Assessment Research and Development at Cambridge Assessment, and recently-appointed member of Rishi Sunak’s expert advisory group on teaching maths to age 18; Sharon Hague, Senior Vice-President, Pearson School Qualifications; Gavin Busuttil-Reynaud, Director of Operations, Alpha Plus, a service organisation owned by AQA; and Dr Michelle Meadows, Associate Professor of Educational Assessment in the Department of Education at Oxford University, and formerly Ofqual’s Executive Director for Strategy, Risk and Research.
You can watch the full proceedings here, and that is well worth doing, for many important themes were explored, mainly concerned with assessment. This blog, though, will focus on the replies to a question asked by Lord Watson of Invergowrie, inviting comment on Dame Glenys Stacey’s statement (time-stamp about 11:55:50).
Or rather, it will focus on two particular responses, both made by Dr Meadows, the first being her statement that
“It’s really important that people don’t put too much weight on any individual grade.”
“People”, presumably, applies to everyone – students, teachers, parents, employers, admissions officers… If these “people” should not “put too much weight on any individual grade” – and it’s “really important” that they don’t – then I wonder (and I invite you to wonder too) what it is that they should do with them?
If you listen to Dr Meadows’s full response – and that’s a good thing to do, for it will put those words in full context – you will hear that Dr Meadows does not address Lord Watson’s question directly, nor refer to Dame Glenys Stacey’s statement. However, if you put those two together – grades are “reliable to one grade either way”, and “It’s really important that people don’t put too much weight on any individual grade”, then maybe a picture emerges.
So don’t be surprised if, this August, a student holding an offer of ABB but awarded ABC asks the admissions officer, “May I have my place, please?”.
How reliable can exam grades be?
The second response by Dr Meadows that I’d like to examine is this:
“To actually get 100% reliability would be technically pretty much impossible without the most extraordinarily long assessments”.
What Dr Meadows appears to be saying is that the roughly 75% reliability currently delivered is about as good as it gets, and that anything better is simply not feasible.
I agree that the achievement of 100% reliability is indeed impossible. But I believe that reliabilities of, say, 99.9% or 99.99% are not only possible, but very easy to deliver too.
Why grades are unreliable
To verify that, I need to explain why grades are currently as unreliable as they are. It’s very likely you know that already, so please forgive me for “telling grandmothers…”.
Fundamentally, it’s because two different, equally qualified, examiners can legitimately give the same script different marks: one examiner (or team of examiners, if each question is marked by a different person) might give a script, say, 64 marks; another, 66. Both marks are equally valid; there are no “marking errors”; everything complies with the mark scheme. This is simply a legitimate difference in academic opinion.
If grade B is all marks from 61 to 70 inclusive, both marks will result in grade B. But if the B/A grade boundary is 65/66, then the student’s certificate will show either grade B or grade A, depending on the lottery of who marked the script.
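This “grading lottery” can be sketched in a few lines of code. The boundaries below (A from 66, B from 61) are the article’s illustrative figures, not real Ofqual boundaries:

```python
def grade(mark: int) -> str:
    """Map a mark to a grade, using illustrative boundaries:
    A is 66 and above, B is 61-65 (not real exam boundaries)."""
    if mark >= 66:
        return "A"
    if mark >= 61:
        return "B"
    return "C or below"

# Two equally legitimate marks for the same script:
print(grade(64))  # one examiner's mark gives grade B
print(grade(66))  # another examiner's mark gives grade A
```

Both marks comply with the mark scheme, yet the certificate shows a different grade depending on which examiner happened to mark the script.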
This effect isn’t rare. Ofqual’s own research, first published in 2016 and updated in 2018, shows that, across all subjects, and all exam levels, about 1 script in every 4 would have been awarded a different grade had the script been marked by a senior examiner rather than by the ‘ordinary’ – but fully qualified – examiner that actually did the marking.
This does not happen for exams structured as unambiguous multiple-choice questions, which have only one ‘right’ answer. But for questions that invite self-expression, any script can legitimately be given a range of marks, say, from 61 to 67. If this range lies totally within a single grade width, that’s fine – the grade is the same no matter who did the marking. But if the range 61 to 67 straddles (at least) one grade boundary, then the unreliability problem arises – a problem that in practice affects 1 script in every 4, more severely in some subjects (such as History) than others (such as Physics).
The problem-to-solve is therefore this: how can assessments be delivered that are robust given legitimate differences in the academic opinions of examiners?
According to Dr Meadows, this is “technically pretty much impossible”.
I disagree. And here are some solutions…
One examiner – AI to the rescue
The first solution is to use a single examiner to mark all the scripts, so ensuring that the same standards are uniformly applied to every candidate. For some modern foreign languages, for which the number of scripts to be marked is small, that might be possible. But not for 700,000 scripts in GCSE English.
That assumes, however, that the examiner is a human. Enter AI. A machine-learning algorithm can mark any number of scripts as the same ‘examiner’ – one that also has the benefit of never getting tired.
There is much lively current exploration of using AI for marking, and the signs are very positive. And given the way in which ChatGPT has amazed us all with what it can do, progress will be fast. That said, there are some important conditions that must be met before human examiners pass into history – not least the requirement that AI can indeed be fully trusted, not only by the teaching profession, but by the public in general, to deliver fair results. It will also be necessary to prove that AQA-AI, Edexcel-AI and OCR-AI give identical outcomes to the same cohort of scripts in every subject – and if not, to decide who ‘wins’.
So although Dr Meadows is in fact right in saying that this is “technically pretty much impossible” right now, at some time in the perhaps not-so-distant future, it probably will be possible.
It is indeed tempting to say “AI is nearly there, so let’s wait until it happens”, but to me, that’s rather unimaginative, and – in the meantime while we are waiting – does not address the injustice done by the fact that “grades are reliable to one grade either way”, yet students’ destinies are determined by that single grade shown on their certificate.
Here, then, are three possible solutions that could be implemented very quickly, if the authorities were minded to do so.
Using grades – wisely
The first is to use grades.
That might come as a surprise. For surely the problem is with grades!
Sort of. The problem is not so much with grades themselves, but with how Ofqual uses them. Grades are reliable when two conditions are fulfilled:
- The grade widths are broad compared with the range of marks that any single script could legitimately be given. That ensures that relatively few scripts have a range that straddles a grade boundary.
- And for those few, a panel reviews each script to ensure that the grade as awarded is fair.
I hope that happens in HE for degrees close to each class boundary. And it could happen for school exams too – but that would require ditching those 10 GCSE grades. Do we really need them all?
Taking ‘legitimate examiner variability’ into account
Here’s another. If a script is marked 64, and the range is 61 to 67, why not show, on the certificate, something like “64 (minimum, 61; maximum, 67)”, or “64, (range 3 either way)”, or “64 ± 3”? That, after all, is the truth.
Associated with that is a change in the rules for appeals.
Suppose the script marked 64 is appealed, and fairly re-marked 66.
Since the re-mark, 66, is within the range 64 ± 3, this confirms the original award.
That’s because the original award takes into account that a re-mark by another qualified examiner is likely to differ from the original mark, 64, but is unlikely to be higher than 67 or lower than 61. Only if a re-mark is higher than 67 or lower than 61 is the original assessment changed; for any re-mark in the range 61 to 67, the original assessment is confirmed. And if that range of ± 3 marks is chosen wisely on statistical grounds (which is easy to do), then there will be a very high likelihood that assessments are confirmed, not changed. That’s why assessments based on this principle can approach 99.9% – or even 99.99% – reliability.
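To see why a wisely chosen band gives such high confirmation rates, here is a rough sketch. It assumes – purely for illustration, since the article specifies no figure – that legitimate differences between examiners are roughly normally distributed with a standard deviation of 1.5 marks:

```python
import math

def confirm_probability(half_width: float, sd: float) -> float:
    """Probability that a re-mark, differing from the original mark by a
    normally distributed error with standard deviation `sd`, lands within
    ±half_width of the original, so confirming the original assessment."""
    return math.erf(half_width / (sd * math.sqrt(2)))

def band_for(target: float, sd: float) -> float:
    """Smallest ±band (to the nearest 0.01 of a mark) whose
    confirmation probability reaches `target`."""
    w = 0.0
    while confirm_probability(w, sd) < target:
        w += 0.01
    return round(w, 2)

sd = 1.5  # assumed spread of legitimate examiner differences, in marks
print(band_for(0.999, sd))   # band needed for 99.9% confirmation
print(band_for(0.9999, sd))  # band needed for 99.99% confirmation
```

With this assumed spread, a band of roughly ±5 marks would confirm about 99.9% of re-marks. The point is that it is the width of the band, not the length of the exam, that drives the reliability figure.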
The simplest solution
The third solution is very simple. Suppose each certificate were to show, in bold letters “THE GRADES ON THIS CERTIFICATE ARE RELIABLE ONLY TO ONE GRADE EITHER WAY”. Those, after all, are Ofqual’s own words. It’s very simple to implement too, for the only requirement is to change the print programme. Furthermore, the benefit is that it would alert all users “not to put too much weight on any individual grade”, as advised by Dr Meadows.
The way ahead
Those are just three, easily implemented, possibilities. Many more are examined in Chapters 14 and 15 of my book “Missing the Mark – Why so many school exam grades are wrong, and how to get results we can trust”. None is “perfect”; all have implications; some are better than others. Hence the book’s recommendation that an independent, expert, panel should be convened to identify the best option – and for that option to be implemented, perhaps rather like the recently-formed ‘Beyond Ofsted’ inquiry.
As discussed in my earlier FE News piece, grades “reliable to one grade either way” just aren’t reliable enough.
“Dr Meadows is an esteemed member of the academic educational assessment community and a former Ofqual colleague. Dr Meadows’ comments build on the academic literature in this well-researched area. In addition to the comments quoted, we believe it noteworthy to highlight two further comments made by Dr Meadows at the same hearing. First, that: “This is not a failure of our GCSE system. This is the reality of assessment. It is the same around the world.” This comment provides an important and appropriate context for this issue.
Dr Meadows also reflects more generally that “…on the accumulation of evidence, partly why we can have this debate about GCSEs is because the evidence around GCSEs is so transparent.” In other words, as your author does not acknowledge, it is Ofqual that has been proactive in publishing work in this area, reflecting the desire for transparency and understanding of these issues.
Dr Jo Saxton, Ofqual Chief Regulator, was asked about the claims that “1 in 4 grades is wrong” by Parliament in the autumn of 2022, and in a podcast interview, in which she said: “Nowhere in our research does it say that one in four grades could be wrong, but I’m really glad that you asked me about that, Laura [McInerney, who asked the question], because it’s a falsehood that gets trotted out and it’s a deliberate misinterpretation of a really technical piece of research. And I don’t like the way that it’s causing unnecessary anxiety to students.”
We are aware of the approaches claimed by the author to be solutions to the handling of this challenge. We do not, however, share the same views about the simplicity and appropriateness of those solutions.”