Lesson observation: how trustworthy are graded observations?

Recently, graded lesson observations of the type used in England’s FE colleges have come under attack, but are the alarms that have been raised really justified? At first sight we may wonder if the concerns are, in reality, entirely unwarranted. So let’s spend a few moments looking into some of the details.

Widespread use of graded lesson observations has been a part of FE college life for just over two decades. They were introduced at scale shortly after colleges became incorporated and have tended to be aligned with the college inspection arrangements of the time. On the whole, graded lesson observations have changed very little over this period. The amendments to the number of grades and the grade descriptors, along with revisions to the emphasis placed on specific aspects of lessons such as planning and the use of differentiation, have not altered the underpinning premise of this type of lesson observation.

Graded lesson observations in colleges are commonly founded on the common-sense notion that a person can walk into a classroom, spend a short time watching what is happening and looking at what is produced, and then use the information gathered to grade a teacher’s performance or the overall quality of what was seen. This method can be likened to briefly dipping a piece of litmus paper into a solution and using the resulting colour of the paper to determine the strength of alkalinity or acidity of the solution. In the college setting, an observer dips into a lesson and uses a record of what was noticed to form a judgement as to where the teacher, or the lesson, sits within a cluster of ratings most commonly ranging between inadequate and outstanding. Since the wholesale introduction of this approach to lesson observations over twenty years ago, few systematic attempts have been made to examine its value and, most crucially, its trustworthiness.

We can trust the results of the litmus paper test because the method has been proven to work over and over again. But what if we can’t put the same level of trust in the results of a graded lesson observation? Indeed, do we really know if it is possible to make an accurate and consistent judgement about a teacher’s abilities or the quality of teaching using litmus-type lesson observations?

If we turn to research reports in our quest for answers (and the issue is far too important to rely largely on opinions, anecdotes and folklore), we notice some quite striking findings. What follows are some of the outcomes of a handful of relevant research studies. I believe there is merit in contemplating these if we are to get anywhere near an informed response to the questions above.

We should note at the outset that although research into classroom observation as a means of enquiry has been around for some time, studies looking at the practice of grading lesson observations are generally newer and fewer.

Nevertheless, a useful point at which to begin is the work carried out in 1987 by Donald Medley and Homer Coker, who reviewed the literature of the time relating to principals’ ratings of teachers in their schools, including ratings derived from lesson observations. They came to the conclusion that almost all personnel decisions at that time were based on judgements which, according to the research, were only slightly more accurate than decisions based on pure chance.

Over time, and with more experience, we might expect principals to get better at rating teachers in their schools, and to some extent this may have been confirmed in a study conducted by Brian Jacob and Lars Lefgren in 2008. The results of their work suggested that principals’ abilities to accurately identify effective teachers had improved marginally: principals were generally able to identify the most and the least effective teachers in their schools, but they were far less able to distinguish between teachers in the middle of the distribution.

It is important to note, however, that in both of these studies the principals used a range, but not the same range, of information on which to base their judgements, so we are unable to identify accurately to what extent, if at all, their judgements were influenced by the outcomes of the lesson observations that were used.

More directly useful is the work done by Michael Strong and his colleagues in 2011. They investigated whether judges from a range of backgrounds could correctly rate teachers of known ability to raise student achievement by viewing those teachers giving lessons. Astonishingly perhaps, they found that the judges, no matter how experienced, were unable to identify successful teachers. They also found that in every one of the tests completed, the judges achieved relatively high levels of agreement even though their judgements were absolutely inaccurate.

We should be aware that not all the judges taking part in the study above were trained or experienced observers. We might wonder, therefore, what results would emerge from a study in which only trained observers were used to view teachers giving lessons.

Thomas Kane and Douglas Staiger can help us out in this respect. In 2012 they undertook what may be to date the largest and most in-depth study of the accuracy of graded lesson observations. It involved 1,333 teachers and 8,491 lesson observations. The observations were carried out by over 900 observers who had been certified after intensive training. The study reinforced the view that graded lesson observations frequently yield inaccurate judgements: for a given teacher, ratings varied considerably from lesson to lesson, and for any given lesson, ratings varied from observer to observer. They also found that they could only get close to reasonable levels of consistency in ratings for a given teacher by rating four different lessons, each needing to be rated by a different observer.

There are in addition to the above a number of smaller scale studies which test the reliability and validity of specific proprietary lesson observation instruments.

Now, none of the above studies have taken place in England and attention is drawn to them because I am not aware of any local studies that have set out with the primary intention of examining the accuracy of graded lesson observations in a similar systematic way. Naturally, each of the above studies comes with its own caveats but it is not unreasonable to anticipate that similar findings would emerge from similar studies in England if they were to be carried out to similar standards.

These research reports alert us to some of the significant issues that arise when graded lesson observations are used in an attempt to judge the performance of a teacher or judge the quality of a lesson. The results fly in the face of the common-sense view that one-off graded lesson observations are both valuable and trustworthy.

Perhaps most importantly, they prompt us to reconsider the place of graded lesson observations in a college’s arrangements for quality improvement, performance management, staff capability and staff development. They invite us to question very seriously whether we can rely on the results that emanate from graded lesson observations to adequately inform decisions about what is best for staff and what is best for learners.

There is a strong base of evidence to suggest it is unlikely that we can put much faith in the judgements that arise from the graded lesson observations that currently take place in our FE colleges. The question at the forefront of minds might now be: do we have a professional responsibility to discontinue observing classroom activity in the way that we have predominantly done for the last two decades?

Terry Pearson is a former FE senior manager with experience of observing teaching in a wide variety of settings and developing effective systems for lesson observation. He now works as an independent education consultant. He can be contacted on Twitter at @TPLTD.


Jacob, B., & Lefgren, L. (2008). Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics, 26(1), 101-136.
Kane, T. J., & Staiger, D. O. (2012). Gathering Feedback for Teaching: Combining High-Quality Observation with Student Surveys and Achievement Gains. MET Project Research Paper. Seattle, WA: Bill & Melinda Gates Foundation.
Medley, D. M., & Coker, H. (1987). The accuracy of principals’ judgments of teacher performance. Journal of Educational Research, 80(4), 242-247.
Strong, M., Gargani, J., & Hacifazlioğlu, Ö. (2011). Do we know a successful teacher when we see one? Experiments in the identification of effective teachers. Journal of Teacher Education, 62(4), 367-382.
