Test reliability

3 min read

This post is for those of you who set your own assessments. Which I guess we all have to sooner or later!

There are all sorts of 'best practices' you can read about test reliability in large-scale, 'standardised' tests (I use 'standardised' here in the true sense, i.e. not exclusively Multiple Choice Questions). As usual, though, I will concentrate on what is practical to do within the context of the classroom.

I want to start with MCQs in fact, because we usually think of them, or any other sort of dichotomously scored items (T/F, matching, etc), as being the most reliable. However, if you've read my post on rater reliability, you'll recall that the validity of such items can be questionable (i.e. do they really test what we want to test?). They can also be unreliable in unexpected ways. For instance, it's quite common to find MCQ items with more than one correct answer, and so students who choose the 'unofficial' correct answer end up being marked wrong. This might not always be apparent to us as test writers, so it's a good idea to get colleagues to check. Having been test takers ourselves, we must know too that it's all too tempting to tikam (i.e. guess) when we don't know the right answer to an MCQ. So your student might get the right answer by sheer luck. Sometimes they guess the right answer because of irrelevant clues (e.g. it's longer/shorter than the other options).
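To put a rough number on 'sheer luck', here is a minimal Python sketch. It assumes every item has the same number of options and that the student guesses blindly on all of them, which is a simplification (real guessing is usually partly informed). The function name and the 10-item, 4-option quiz are made up for illustration.

```python
from math import comb

def chance_of_at_least(k, n_items, n_options):
    """Probability of getting at least k of n_items right by blind guessing.

    Assumes one correct answer per item and equally likely options.
    """
    p = 1 / n_options
    return sum(
        comb(n_items, i) * p**i * (1 - p) ** (n_items - i)
        for i in range(k, n_items + 1)
    )

# Hypothetical quiz: 10 items, 4 options each, 'pass' mark of 5.
n_items, n_options = 10, 4
expected_by_luck = n_items / n_options  # average marks from pure guessing
print(f"Marks expected from guessing alone: {expected_by_luck}")
print(f"Chance of 5 or more by guessing: {chance_of_at_least(5, n_items, n_options):.1%}")
```

On these assumptions a student who knows nothing still averages 2.5 marks out of 10, and has roughly an 8% chance of scraping a 'pass' of 5. Fewer options or fewer items make luck count for more.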

MCQs aren't necessarily bad items, but they do require a lot of time and effort to design well, and should be avoided unless you are willing to invest both. Perhaps you are designing a large-scale test that you want to be able to mark quickly, and will build up a test bank of recyclable test items over time. There is lots of good advice out there for MCQ test designers.

This work by gulia.forsythe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic Licence

If we go with subjectively scored items such as essays, there is likely to be rater unreliability. We already know that this can be minimised, though, and I tend to think that time on this is better spent than time on designing good MCQs, in most classroom language assessment situations. Such test items can also be badly designed though. They can be ambiguous in the way they are written, such that a student who may know their stuff doesn't actually give you what you thought you were asking for. Again, getting the help of colleagues to check the items is useful.

Brown and Abeywickrama (2010) offer some other tips for enhancing test reliability. Don't make the test too long, because while tests can be too short to reliably measure proficiency, they can also be so long that they cause fatigue in test takers. They also point out that some people (like me!) don't cope well with the stress of timed tests.

I'll stop here but if you have something to say about test reliability, please tweet it with .

Test administration reliability

2 min read

After a rather technical topic last week, we're looking now at something that's perhaps more mundane: test administration reliability. (In case you are wondering, I'm following the order in Brown & Abeywickrama, 2010, out of sheer convenience.) This will be a super short post.

This reliability is basically about test conditions. Is the room too hot/cold? Clean? Are the tables and chairs of the right height? Comfortable? Is the room well and evenly lit? Are there distractions, like noise?

If it's a pen and paper test, are the question papers printed clearly? If media is used, is the audio/video clear and of good quality? Generally, does the technology used work as it should? Do the devices run smoothly? Is the projector in good condition (i.e. image not dim or distorted)? Can everyone see/hear the media equally well? Is the internet connection fast and stable? Is there a backup plan should something fail to work?

Essentially, is there anything about the test conditions that would prevent students from doing their best?

These are issues that may take a bit of time to iron out, but are actually relatively easy to take care of. As always, if you have comments or questions, please tweet with .

This work by Moving Mountains Trust is licensed under a Creative Commons Attribution 2.0 Generic Licence.

Rater reliability

5 min read

Last week I blogged about student-related reliability. This week we are tackling something a bit more technical, but I know also of great interest to many teachers: rater reliability. I'm going to cover the less technical aspects of this first, i.e. nothing involving stats. But if stats is your thing, read till the end.

In everyday terms, rater reliability is something that we are concerned with when two (or more) markers mark the same test and one is inevitably stricter than the other (=inter-rater reliability). It's possible then that the same script marked by different markers would get different scores. Not only is this unfair to the student, but it becomes difficult to get an accurate picture of how the cohort is doing as a whole. Rater reliability can also be problematic when only one marker is involved, because we are not always consistent in the way we mark (=intra-rater reliability). For instance, marker's fatigue can affect the consistency of our judgement if we mark too many scripts in one sitting.

Obviously this is not a problem if the test is multiple-choice, true/false, or any other item type where the answer is either right or wrong (i.e. dichotomous). But in language assessment it is generally considered less than valid to assess proficiency in this way only, especially when assessing productive skills (speaking and writing). This is a classic case of the tension between validity and reliability: while there is no validity without reliability, it is possible to sacrifice validity if we pursue perfect reliability. As this is undesirable, the solution is to try and maximise rater reliability, and be always conscious of the fact that this cannot be absolutely consistent as long as human judgement is involved. We might for instance want to give students the benefit of the doubt if their scores are borderline.

However, as responsible teachers, we should try to maximise rater reliability within what is practically possible. Here are some things we can do: 


  • Read through half the scripts without awarding scores, then go back to the beginning to mark for real. If there are multiple items in the test, mark item 1 for all the scripts before marking item 2, etc. When marking electronically, I like to keep a record of comments next to student names and award grades/scores only when I finish the lot.
  • Use an analytic rubric to mark if practical/valid. 
  • Hold standardisation meetings to make sure everyone is interpreting the rubric the same way. Pick a few scripts that exemplify a range of performances and get everyone to mark them so as to check how well they are aligned in their judgement. Markers can mark them first at home before the meeting if scripts are electronic/scanned.
  • Before marking starts, the teacher-in-charge can pull out a few scripts across a range of performances from each teacher's pile, copy them and mark them herself. When the teachers finish marking them, the 2 sets of scores can be compared. Moderation of scores might be necessary if there's a major discrepancy. Obviously this is a lot easier if the scripts are electronic.


The above strategies can easily be applied to speaking tests, given the ease of recording and copying audio these days. They are obviously not problematic if we are marking digital artefacts of any kind (e.g. a video, a blog post). If you know of any other good strategies, please share them on Twitter with .

Okay now for the stats. So I'm by no means a statistician, merely a user of quantitative methods. If you have the time and interest to investigate reliability (any sort, not just rater) statistically, you might like to give this a go because it really isn't very difficult even if you have an aversion to numbers, like me :)

The easiest and most accessible way I think to check reliability is to examine correlation between 2 sets of scores using Excel or similar. There are correlation calculators online too but they can be awkward to use if your dataset is big. Of course if you have a statistics package like SPSS, that is very convenient, and you can even use it to calculate Cronbach's alpha, which is an alternative to correlation. In both cases, the higher the figure you get, the better.

The handy video below shows you how to calculate the correlation statistic Spearman's rho with Excel (check out the creator's site for his Excel files). Due to the nature of rater scores, I think Spearman's rho is more likely to be suitable than the alternative Pearson r, but the PEARSON function is built into Excel so it's even easier to calculate if you want to.
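If you'd rather not use Excel, the same two statistics can be sketched in a few lines of Python. This is just an illustration, not a replacement for a proper stats package: the function names are my own, the scores are made up, and Spearman's rho is computed here as Pearson r on the ranks, with tied scores sharing the average of their ranks.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson r calculated on the ranks of the scores."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up scores from two markers for the same eight scripts.
marker_a = [12, 15, 9, 18, 14, 11, 16, 13]
marker_b = [11, 16, 10, 17, 15, 10, 14, 13]
print(f"Spearman's rho: {spearman_rho(marker_a, marker_b):.2f}")
print(f"Pearson r:      {pearson_r(marker_a, marker_b):.2f}")
```

Bear in mind that a high correlation only tells you the two markers rank the scripts similarly. One marker could still be consistently stricter than the other, so it's worth eyeballing the actual scores as well.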

Maybe you are already familiar with correlation and Cronbach's alpha. You might like to know then that calculating, in fact conceptualising, test reliability this way has its problems. However, given that I am writing this for teachers rather than test developers, I'm not going to go there in this post. If you geek out on this kind of stuff, you might like to read this paper I wrote as part of my PhD coursework. If you want a practical textbook on stats for language testing, I recommend Statistical Analyses for Language Testers by Rita Green.
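For those without SPSS, Cronbach's alpha can also be sketched in Python using the standard formula: k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the total scores), where k is the number of items. The function name and data layout below are my own assumptions, and the scores are invented.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: one row per student, one column per test item
    (needs at least two items and two students).
    """
    k = len(scores[0])
    item_vars = [variance(column) for column in zip(*scores)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Invented scores: five students, three items each marked out of 5.
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```

As with correlation, a higher figure is better, and the same caveats about conceptualising reliability this way apply.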

As always, I welcome questions and comments (use ). (Just don't ask me about formulae, please, because they make my head spin...)

Student-related reliability

2 min read

As promised, I'm starting my regular blogging about language testing in the classroom. I'm starting with the new year, as part of 's weekly thematic tweets project.

This work by Marcin Wichary is licensed under a Creative Commons Attribution 2.0 Generic Licence.

I'm starting with the concept of reliability, but not as psychometricians and statisticians see it since that will be of limited usefulness in the classroom. Instead, we're going to look at the different aspects of reliability that teachers will be able to apply (more) easily.

We'll start with student-related reliability (Brown & Abeywickrama, 2010). If you've ever had to take a test when you're ill, tired, unhappy or otherwise having a bad day, you'll have a good idea of what this is. Such things are likely to make your test results an inaccurate measure of your true proficiency, since you are not performing at your best. Conversely, if you are an experienced test-taker, have taken many practice tests and can apply good test-taking strategies, you might well do better than a classmate who hasn't, even if you are both equally proficient.

In working to minimise student-related unreliability, it's worth thinking about the last time you had a test. What were your sources of anxiety? How can we make sure that the test is 'biased for best' and that each student performs optimally? One thing that we should absolutely avoid is trying to be tricky or scary. Some teachers are fond of including 'trick questions' in their tests to make them more challenging, but in fact such questions often don't measure what they've set out to measure at all (i.e. answering them correctly doesn't draw on language proficiency). This affects the validity of the test (more on validity in later posts).

As teachers, we can also ensure that students are aware of the test format and have good test-taking strategies. This levels the playing field, promotes student confidence, and helps to ensure we are obtaining reliable information about student ability.

What are some tips and strategies you have for minimising student-related unreliability? Do you have any questions and comments regarding this post? Share them on Twitter with the hashtag .

Blogging about language assessment

3 min read

"Measure a thousand times, cut once" by Sonny Abesamis is licensed under CC BY 2.0

As I approach a major milestone in my PhD, I've been increasingly bothered by the likelihood that I won't be making use of my language assessment knowledge in my career, PhD or no. This is for various reasons, most of which have to do with living in Singapore as a Singaporean (but we won't go there now).

One way around this would be to be freelance in this area instead. To get this off the ground, I figured I'd need to 1. let people know who I am and what I do, and 2. find out what their needs are. With that in mind I disseminated a questionnaire for local language teachers via all my usual social media channels. After posting it enough times to get folks sick of me, I got a grand total of TWO responses.

Since I've been told repeatedly by various teachers just how much they need/want this, I was both puzzled and dismayed. Whatever the reasons might be for the poor response (probably similar to reasons I alluded to in the first para), there's no real reason why I shouldn't plough ahead and see where it takes me.

So I'm starting what I hope will be a regular blog on assessment issues. I'll be focusing on language assessment, in particular as it applies to the (local) classroom context. There are plenty of resources out there if you're interested in large scale testing (and you're probably not). What I want to do in this blog is to speak as a teacher, who happens to be more knowledgeable about this stuff, to other teachers. This matters, I believe, because like it or not, the education conversation inevitably comes back to assessment.

My PhD still takes priority so this is very much a side project. I think it's good to keep writing though, especially to hone my skills of (social) science communication. Assessment does tend to come off as arcane (as if the usual academic writing wasn't bad enough) -- I don't plan for this to be more of the same. I'd like to tackle a topic a month, with a few short pieces on it spaced out. A modest ambition for somebody who's not particularly productive I think.

What should I start with? I think I'll be traditional and go with reliability. Can't go wrong right?

Talking formative assessment at e-Fiesta 2014

4 min read

This was originally posted on 8 April 2014 (Wordpress).

I was really pleased and excited when the NIE Centre for eLearning (CeL) invited me to speak at e-Fiesta 2014. CeL suggested the topic, I guess based on what they know about my research interests.

If you're interested in how assessment can be done with social media, watch the video of my talk below, courtesy of CeL. The slides can be viewed below as well.

I see the talk as an exercise in formative assessment with social media in itself, and I want to explain the thinking behind it here.

The invitation came at a time when I was planning my PhD coursework essay on digital literacies and I had thoughts of rehashing some of the stuff that was going to go into my essay. Eventually, though, I realised that my goal should be to make both assessment and social (media) learning accessible to a crowd which might be ambivalent on these topics. I also wanted to make it hands-on to some degree, because there's nothing like making people give something a go while they are your captive audience. This is of course harder to manage in a lecture theatre, but also actually helps me to make my case for using social media.

I only had a maximum of 30 minutes to work with, including 10 minutes for Q & A. I decided not to try and be clever about it; it would have a bit on formative assessment, a bit on social learning, before we check out what they look like together.

There were a couple of important considerations. I had to practise some audience awareness, tap on what MOE teachers already know (activate some schema?) and work in some MOE buzzwords. I realised in hindsight that this makes the gross assumption that everyone in the audience would be MOE teachers, but I think there were enough on the day to make it work.

I also had to make sure the tech worked as frictionlessly as possible. This meant keeping the tools simple and mobile friendly, and making sure the audience could access what I wanted them to access as quickly and easily as possible. I started with customised links, and added QR codes when Rachel from CeL reminded me that those on their phones and tablets could take advantage of them. I also scheduled tweets that outlined my talk and provided links as I went along. The tweets weren't totally in sync with my talk (I should have rehearsed more), but they ensured that the audience was never totally lost and that folks 'at home' could follow along as well. They also kept my backchannel presence active while I was speaking, perhaps working to pull those monitoring the hashtag into the conversation.

The one thing I wish I had done better was managing my time. I have a tendency to go 'off script', which might engage the audience more, but also results in some messiness when time is tight. But I think I succeeded in delivering a session that was engaging without being 'fluffy'. I wanted the audience to go away thinking the issues were worth mulling over further and taking action on, but I didn't want the typical academic conference 'snoozefest' presentation (not that I've ever delivered one, ahem). I think the balance I struck was ok for the crowd I had. In fact, I think I actually managed to talk seriously about assessment without inserting too much impenetrable jargon LOL. Naturally, there were a million other things I wished I could have worked into the talk. Thankfully, plenty of questions and comments came in the backchannel (as I'd asked for), and I was able to stay on my soapbox for much longer than 30 minutes!

I hope I managed to demonstrate in that short space of time how formative assessment can be integrated into teaching and learning, and how this can be very effectively achieved with social media. I also hope that in experiencing it for themselves as learners, the audience are more inclined to put it into practice as teachers. Lastly, I hope my session sparked some interest in assessment issues. Assessment literacy issues bother me a lot, and every time I 'talk assessment' I hope I'm helping to raise awareness, provoke important questions or otherwise plug the gap in some small way.