
Rater reliability


Last week I blogged about student-related reliability. This week we are tackling something a bit more technical, but one that I know is also of great interest to many teachers: rater reliability. I'm going to cover the less technical aspects first, i.e. nothing involving stats. But if stats is your thing, read till the end.

In everyday terms, rater reliability is something we are concerned with when two (or more) markers mark the same test and one is inevitably stricter than the other (=inter-rater reliability). It's possible, then, that the same script marked by different markers would get different scores. Not only is this unfair to the student, but it also becomes difficult to get an accurate picture of how the cohort is doing as a whole. Rater reliability can also be a problem when only one marker is involved, because we are not always consistent in the way we mark (=intra-rater reliability). For instance, marker fatigue can affect the consistency of our judgement if we mark too many scripts in one sitting.

Obviously this is not a problem if the test is multiple-choice, true/false, or any other item type where the answer is either right or wrong (i.e. dichotomous). But in language assessment it is generally considered less than valid to assess proficiency in this way only, especially when assessing the productive skills (speaking and writing). This is a classic case of the tension between validity and reliability: while there is no validity without reliability, it is possible to sacrifice validity by pursuing perfect reliability. As this is undesirable, the solution is to try to maximise rater reliability while staying conscious of the fact that it can never be absolutely consistent as long as human judgement is involved. We might, for instance, want to give students the benefit of the doubt if their scores are borderline.

However, as responsible teachers, we should try to maximise rater reliability within what is practically possible. Here are some things we can do: 


  • Read through half the scripts without awarding scores, then go back to the beginning to mark for real. If there are multiple items in the test, mark item 1 for all the scripts before marking item 2, etc. When marking electronically, I like to keep a record of comments next to student names and award grades/scores only when I finish the lot.
  • Use an analytic rubric to mark if practical/valid. 
  • Hold standardisation meetings to make sure everyone is interpreting the rubric the same way. Pick a few scripts that exemplify a range of performances and get everyone to mark them so as to check how well they are aligned in their judgement. Markers can mark them first at home before the meeting if scripts are electronic/scanned.
  • Before marking starts, the teacher-in-charge can pull out a few scripts across a range of performances from each teacher's pile, copy them and mark them herself. When the teachers finish marking them, the two sets of scores can be compared, and moderation of scores might be necessary if there is a major discrepancy. Obviously this is a lot easier if the scripts are electronic (a quick sketch of this kind of comparison follows the list).
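
Here is a minimal sketch of what such a comparison might look like in Python. It is not taken from the post: the script IDs, the scores and the two-point "major discrepancy" threshold are all invented for illustration.

```python
# A minimal sketch (hypothetical data): compare the teacher-in-charge's
# check-marks against the original marker's scores and flag scripts that
# might need moderation.
scripts = {
    # script ID: (original marker's score, teacher-in-charge's score), out of 20
    "S01": (14, 15),
    "S02": (9, 13),
    "S03": (17, 16),
    "S04": (11, 12),
    "S05": (8, 12),
}

THRESHOLD = 2  # invented cut-off for a "major discrepancy"

for script_id, (marker, checker) in scripts.items():
    gap = abs(marker - checker)
    flag = "  <- discuss / moderate" if gap > THRESHOLD else ""
    print(f"{script_id}: marker {marker}, check-mark {checker}, gap {gap}{flag}")
```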


The above strategies can easily be applied to speaking tests, given the ease of recording and copying audio these days, and they obviously pose no problem at all if we are marking digital artefacts of any kind (e.g. a video, a blog post). If you know of any other good strategies, please share them on Twitter.

Okay now for the stats. So I'm by no means a statistician, merely a user of quantitative methods. If you have the time and interest to investigate reliability (any sort, not just rater) statistically, you might like to give this a go because it really isn't very difficult even if you have an aversion to numbers, like me :)

The easiest and most accessible way, I think, to check reliability is to examine the correlation between two sets of scores using Excel or similar. There are correlation calculators online too, but they can be awkward to use if your dataset is big. Of course, if you have a statistics package like SPSS, that is very convenient, and you can even use it to calculate Cronbach's alpha, which is an alternative to correlation. In both cases, the higher the figure you get (the closer to 1), the better.
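
If you don't have SPSS, Cronbach's alpha is also easy to compute by hand. Below is a minimal sketch in Python (not the method from the video or the post): it treats each rater as an "item" and applies the standard alpha formula with numpy, using invented scores for two raters.

```python
# A minimal sketch with invented data: Cronbach's alpha across two raters,
# treating each rater as an "item".
import numpy as np

# Hypothetical scores two raters gave the same ten scripts (out of 20)
rater_a = np.array([12, 15, 9, 18, 14, 11, 16, 13, 10, 17])
rater_b = np.array([13, 14, 10, 17, 15, 10, 18, 12, 11, 16])

# alpha = k/(k-1) * (1 - sum of the raters' variances / variance of the totals)
scores = np.vstack([rater_a, rater_b])        # rows = raters, columns = scripts
k = scores.shape[0]                           # number of raters ("items")
rater_vars = scores.var(axis=1, ddof=1).sum() # variance of each rater's scores
total_var = scores.sum(axis=0).var(ddof=1)    # variance of the summed scores
alpha = (k / (k - 1)) * (1 - rater_vars / total_var)

print(f"Cronbach's alpha = {alpha:.2f}")      # closer to 1 = more consistent
```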

The handy video below shows you how to calculate the correlation statistic Spearman's rho with Excel (check out the creator's site for his Excel files). Because rater scores are typically ordinal rather than genuinely interval data, I think Spearman's rho is more likely to be suitable than the alternative, Pearson's r, but the PEARSON function is built into Excel, so it is even easier to calculate if you want to.
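
And if you'd rather skip Excel altogether, here is a minimal sketch of the same two statistics in Python with scipy, using the same invented scores as above. It also shows that Spearman's rho is simply Pearson's r calculated on the ranks, which is essentially what the rank-then-correlate routine in Excel does.

```python
# A minimal sketch with invented scores: Spearman's rho vs Pearson's r.
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

rater_a = np.array([12, 15, 9, 18, 14, 11, 16, 13, 10, 17])
rater_b = np.array([13, 14, 10, 17, 15, 10, 18, 12, 11, 16])

rho, _ = spearmanr(rater_a, rater_b)                         # Spearman's rho
r_ranks, _ = pearsonr(rankdata(rater_a), rankdata(rater_b))  # Pearson on the ranks
r_raw, _ = pearsonr(rater_a, rater_b)                        # Pearson on raw scores

print(f"Spearman's rho = {rho:.2f} (Pearson on ranks = {r_ranks:.2f})")
print(f"Pearson's r on raw scores = {r_raw:.2f}")
```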


Maybe you are already familiar with correlation and Cronbach's alpha. You might like to know, then, that calculating (in fact, conceptualising) test reliability this way has its problems. However, given that I am writing this for teachers rather than test developers, I'm not going to go there in this post. If you geek out on this kind of stuff, you might like to read this paper I wrote as part of my PhD coursework. If you want a practical textbook on stats for language testing, I recommend Statistical Analyses for Language Testers by Rita Green.

As always, I welcome questions and comments. (Just don't ask me about formulae, please, because they make my head spin...)