

2 min read

Even if you are not familiar with the term, you are probably familiar with the concept of washback (commonly called backwash in educational assessment). It refers to the effects of assessment on teaching and learning, and anyone who's studied in an exam-oriented system would have experienced this.

We tend to think poorly of washback because we often think of negative washback, e.g. ignoring what's in the syllabus in favour of what will be in the exam, even if we think that the syllabus has more worthy learning outcomes. While washback can be very problematic, I think we do need to consider two things.

First, as long as high-stakes exams determine a person's educational prospects, it's pretty unfair to blame teachers (and parents and learners) for their preoccupation with preparing students for exams. I don't mean to say that teachers and others should willingly let exams lead them by the nose, and I applaud those who can look beyond exams to think and act with true education in mind. However, we would be doing our students a disservice if we didn't prepare them adequately for exams (think face validity and student-related reliability). The point is not to obsess over them and let them overrun the curriculum.

Second, washback can be positive, and we should try to leverage this. While national exams are not within our control (though we may be able to exert some subtle influence), classroom assessments are -- make sure these are aligned with our intended learning outcomes. I believe that real learning will serve students well in their exams, and that obsessive exam prepping is unnecessary.

How do you deal with washback? Let us know on Twitter with


1 min read

Authenticity is about the closeness of your assessment task to a real-world task. This seems quite straightforward, until you consider that in the real world, few tasks demand only language proficiency, and not also non-language related knowledge and competencies.

So how authentic can a language test get? Brown and Abeywickrama (2010) list a few qualities:

  • language that is as natural as possible
  • contextualised items
  • meaningful, relevant, interesting topics (although it's worth considering that meaningful, relevant, interesting to us may not be meaningful, relevant, interesting to students)
  • some thematic organisation to items, e.g. through a storyline
  • 'real-world' tasks (which could also be questionable -- do language teachers necessarily have an accurate sense of the authenticity of tasks?) 

It's possible that what we need for optimal authenticity are 'integrated' assessment tasks that combine different subjects in the curriculum, instead of language on its own.

What do you think? What sort of authentic assessment tasks do you use? Let me know on Twitter.


2 min read

If you've followed this series so far, you might be thinking: wow, it's hard to make a test reliable and valid -- too hard!

Well actually the first principle of language assessment discussed in Brown and Abeywickrama (2010) is 'practicality'. You could design the most reliable and valid test in the world, but if it's not practical to carry out, you know it isn't going to happen the way you planned it. My take on this is that we can try our best to be reliable and valid in our assessment, but also be realistic about what is achievable given limited resources.

For instance, an elaborate rubric might be more reliable to mark with, but if it's too complex to use easily and you have a big group of students, you might not use it the way it's intended, because marking one script just takes too much time. Reliability then suffers, because different teachers end up handling the complexity of the rubric in different ways.

Another example: we know that double marking is more reliable, but we also recognise that double marking every script of every test is just not feasible. In such a case, we have to make other efforts at maximising reliability.

Having said this, I think we can sometimes think of creative ways to maximise reliability and validity while still being realistic about what is doable. Take for instance standardisation meetings, which can be a drag because they take up so much valuable time. As I mentioned before, markers can be given the scripts prior to the meeting to mark at home, or they might even discuss the scripts online (e.g. by annotating them on shared Google Docs). I believe that technology offers more effective and efficient ways to make test administration reliable and valid, so we should not immediately discard a possible measure because of its perceived impracticality.

Have you got tips and strategies to maximise reliability and validity more efficiently? Please share on !

Face validity

1 min read

So far we haven't considered the test-taker's point of view. Face validity refers to exactly this: does the test look right and fair to the student?

Of course, one might argue that students are not usually the best judge of validity. But their opinion, however flawed, can affect their performance. You want students to be confident and low in anxiety when taking a test, because you want to maximise student-related reliability, as mentioned in an earlier post.

Brown and Abeywickrama (2010) advise teachers to use:

  • a well-constructed, expected format with familiar tasks
  • tasks that can be accomplished within an allotted time limit
  • items that are clear and uncomplicated
  • directions that are crystal clear
  • tasks that have been rehearsed in their previous course work
  • tasks that relate to their course work (content validity)
  • a difficulty level that presents a reasonable challenge

(p. 35)

As always, please share your thoughts on .

Criterion validity

3 min read

This work by frankleleon is licensed under a Creative Commons Attribution 2.0 Generic Licence

Okay, so you've designed a test and you've decided that if the students reach a certain mark or grade (or meet certain criteria), they have achieved the learning outcomes you're after. But are you really sure? How can you know? This is essentially the question we aim to answer when we consider criterion validity.

We can consider two aspects of criterion validity: concurrent validity and predictive validity.

To establish concurrent validity, we assess students in another way for the same outcomes, to see for example if those who performed well in the first assessment really have that level of proficiency. In my previous post on content validity, I gave the example of an MCQ grammar test vs an oral interview speaking test, to measure grammatical accuracy in speaking. To check the concurrent validity of the MCQ test, you could administer both tests to the same group of students, and see how well the two sets of scores correlate. (This does assume you are confident of the validity of the speaking test!) In a low stakes classroom testing situation, you might not have the time to administer another test, but you could for instance call up a few students for a short talk, and check their grammatical accuracy that way. You might pick the students who are borderline passes -- this could show you whether your pass mark is justified.

As for predictive validity, this is really more important when the test scores determine the placement of the student. Singapore schools typically practise streaming and/or banding to place students with others of the same level. If the test we use to determine their placement does not have predictive validity, that means there is a good chance the student would not be successful in that group. Which kind of defeats the purpose of streaming/banding! We can't predict the future, but we can compare past and future performances. We could for instance compare the test scores of students a few months into their new placement with the test scores we used to determine their placement. If there are students who perform much better or poorer than you would reasonably expect, it's time to re-examine the original test, and probably move the students to a more suitable class too.

That's about it for criterion validity. As always, tweet your comments and questions with .

Content validity

2 min read

This work by Nevit Dilmen is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported Licence.

This week we turn to validity. It can be a tricky concept and in fact was something that took me some time to 'get' at first. The easiest approach I've found to interpreting 'validity' is to ask the question: does the test measure what it's supposed to measure?

What do we want it to measure? Do we know what we want to measure, in the first place? So often I think we plan assessment without being perfectly clear about our purpose. (Sometimes it's because our learning outcomes aren't very clear to begin with.) Brown and Abeywickrama (2010) list other qualities of a valid test but I think the above definition is enough to work with for now.

Following the same book, I'm starting with content-related validity. This is pretty straightforward: is what you want to test in the test? This might seem kind of 'duh' but it's actually a trap that's quite easy to fall into. For instance, our purpose might be to test learners' grammatical accuracy when speaking, but instead of actually getting them to speak, we set an MCQ grammar test. The former would be a direct test, while the latter (arguably) an indirect test of the same thing. Indirect tests are often used for reasons of practicality and reliability; obviously it's much easier to mark a class's MCQ test (it could even be done automatically) than to administer an individual oral test for each student.

If it really isn't possible to achieve high content validity, then we've got to look into the other validities of our test. More on those in the coming weeks. In the meantime, keep your questions and comments coming on Twitter.

Test reliability

3 min read

This post is for those of you who set your own assessments. Which I guess we all have to sooner or later!

There are all sorts of 'best practices' you can read about for test reliability in large-scale, 'standardised' tests (I use 'standardised' here in the true sense, i.e. not exclusively multiple-choice questions). As usual, though, I will concentrate on what is practical to do within the context of the classroom.

I want to start with MCQs in fact, because we usually think of them, or any other sort of dichotomously scored items (T/F, matching, etc.), as being the most reliable. However, if you've read my post on rater reliability, you'll recall that the validity of such items can be questionable (i.e. do they really test what we want to test?). They can also be unreliable in unexpected ways. For instance, it's quite common to find MCQ items with more than one correct answer, so students who choose the 'unofficial' correct answer end up being marked wrong. This might not always be apparent to us as test writers, so it's a good idea to get colleagues to check. Having been test takers ourselves, we know too that it's all too tempting to tikam (i.e. guess) when we don't know the right answer to an MCQ. So your student might get the right answer by sheer luck. Sometimes they guess the right answer because of irrelevant clues (e.g. it's longer/shorter than the other options).

MCQs aren't necessarily bad items, but they do require a lot of time and effort to design well, and should be avoided unless you are willing to invest both -- perhaps because you are designing a large-scale test that you want to be able to mark quickly, and will build up a bank of recyclable test items over time. There is plenty of good advice out there for MCQ test designers.

This work by gulia.forsythe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic Licence

If we go with subjectively scored items such as essays, there is likely to be rater unreliability. We already know how this can be minimised, though, and in most classroom language assessment situations I tend to think that time is better spent on this than on designing good MCQs. Such test items can also be badly designed, though: they can be ambiguous in the way they are written, such that a student who knows their stuff doesn't actually give you what you thought you were asking for. Again, getting the help of colleagues to check the items is useful.

Brown and Abeywickrama (2010) offer some other tips for enhancing test reliability. Don't make the test too long: while tests can be too short to reliably measure proficiency, they can also be so long that they cause fatigue in test takers. They also point out that some people (like me!) don't cope well with the stress of timed tests.

I'll stop here but if you have something to say about test reliability, please tweet it with .

Test administration reliability

2 min read

After a rather technical topic last week, we're looking now at something that's perhaps more mundane: test administration reliability. (In case you are wondering, I'm following the order in Brown & Abeywickrama, 2010, out of sheer convenience.) This will be a super short post.

This reliability is basically about test conditions. Is the room too hot/cold? Clean? Are the tables and chairs of the right height? Comfortable? Is the room well and evenly lit? Are there distractions, like noise?

If it's a pen and paper test, are the question papers printed clearly? If media is used, is the audio/video clear and of good quality? Generally, does the technology used work as it should? Do the devices run smoothly? Is the projector in good condition (i.e. image not dim or distorted)? Can everyone see/hear the media equally well? Is the internet connection fast and stable? Is there a backup plan should something fail to work?

Essentially, is there anything about the test conditions that would prevent students from doing their best?

These are issues that may take a bit of time to iron out, but are actually relatively easy to take care of. As always, if you have comments or questions, please tweet with .

This work by Moving Mountains Trust is licensed under a Creative Commons Attribution 2.0 Generic Licence.

Rater reliability

5 min read

Last week I blogged about student-related reliability. This week we are tackling something a bit more technical, but I know also of great interest to many teachers: rater reliability. I'm going to cover the less technical aspects of this first, i.e. nothing involving stats. But if stats is your thing, read till the end.

In everyday terms, rater reliability is something that we are concerned with when two (or more) markers mark the same test and one is inevitably stricter than the other (=inter-rater reliability). It's possible then that the same script marked by different markers would get different scores. Not only is this unfair to the student, but it becomes difficult to get an accurate picture of how the cohort is doing as a whole. Rater reliability can also be problematic when only one marker is involved, because we are not always consistent in the way we mark (=intra-rater reliability). For instance, marker fatigue can affect the consistency of our judgement if we mark too many scripts in one sitting.

Obviously this is not a problem if the test is multiple-choice, true/false, or any other item type where the answer is either right or wrong (i.e. dichotomous). But in language assessment it is generally considered less than valid to assess proficiency in this way only, especially when assessing productive skills (speaking and writing). This is a classic case of the tension between validity and reliability: while there is no validity without reliability, it is possible to sacrifice validity if we pursue perfect reliability. As this is undesirable, the solution is to try and maximise rater reliability, while staying conscious of the fact that marking cannot be absolutely consistent as long as human judgement is involved. We might for instance want to give students the benefit of the doubt if their scores are borderline.

However, as responsible teachers, we should try to maximise rater reliability within what is practically possible. Here are some things we can do: 


  • Read through half the scripts without awarding scores, then go back to the beginning to mark for real. If there are multiple items in the test, mark item 1 for all the scripts before marking item 2, etc. When marking electronically, I like to keep a record of comments next to student names and award grades/scores only when I finish the lot.
  • Use an analytic rubric to mark if practical/valid. 
  • Hold standardisation meetings to make sure everyone is interpreting the rubric the same way. Pick a few scripts that exemplify a range of performances and get everyone to mark them so as to check how well they are aligned in their judgement. Markers can mark them first at home before the meeting if scripts are electronic/scanned.
  • Before marking starts, the teacher-in-charge can pull out a few scripts across a range of performances from each teacher's pile, copy them and mark them herself. When the teachers finish marking them, the 2 sets of scores can be compared. Moderation of scores might be necessary if there's a major discrepancy. Obviously this is a lot easier if the scripts are electronic.


The above strategies can easily be applied to speaking tests, given the ease of recording and copying audio these days. They are even more straightforward if we are marking digital artefacts of any kind (e.g. a video, a blog post). If you know of any other good strategies, please share them on Twitter with .

Okay now for the stats. So I'm by no means a statistician, merely a user of quantitative methods. If you have the time and interest to investigate reliability (any sort, not just rater) statistically, you might like to give this a go because it really isn't very difficult even if you have an aversion to numbers, like me :)

The easiest and most accessible way I think to check reliability is to examine correlation between 2 sets of scores using Excel or similar. There are correlation calculators online too but they can be awkward to use if your dataset is big. Of course if you have a statistics package like SPSS, that is very convenient, and you can even use it to calculate Cronbach's alpha, which is an alternative to correlation. In both cases, the higher the figure you get, the better.
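If you'd rather script this than fire up Excel or SPSS, Cronbach's alpha is only a few lines of Python. This is a minimal sketch rather than a substitute for a proper stats package, and the scores below are invented purely for illustration:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores_by_item):
    """Cronbach's alpha. scores_by_item holds one list per test item,
    each containing that item's scores across all students."""
    k = len(scores_by_item)
    # Each student's total score across all items
    totals = [sum(student) for student in zip(*scores_by_item)]
    sum_item_var = sum(variance(item) for item in scores_by_item)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Three items, four students (made-up scores):
scores = [[2, 3, 4, 3],
          [1, 3, 4, 2],
          [2, 4, 5, 3]]
print(round(cronbach_alpha(scores), 2))  # → 0.97: the items hang together well
```

As with correlation, the closer the figure is to 1, the more consistently the items seem to be measuring the same underlying thing.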

The handy video below shows you how to calculate the correlation statistic Spearman's rho with Excel (check out the creator's site for his Excel files). Due to the nature of rater scores, I think Spearman's rho is more likely to be suitable than the alternative Pearson r, but the PEARSON function is built into Excel so it's even easier to calculate if you want to.
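If a spreadsheet isn't your thing, Spearman's rho is also easy to script: rank each marker's scores (averaging the ranks of any ties), then take the ordinary Pearson correlation of the ranks. A minimal Python sketch, with made-up scores from two markers:

```python
def rank(scores):
    """1-based ranks, lowest score first; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied scores
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

marker_1 = [14, 12, 18, 9, 16]  # invented scores for five scripts
marker_2 = [13, 14, 17, 10, 15]
print(round(spearman_rho(marker_1, marker_2), 2))  # → 0.9: the markers rank scripts very similarly
```

A rho near 1 means the two markers order the scripts in much the same way, even if one is systematically stricter than the other.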

Maybe you are already familiar with correlation and Cronbach's alpha. You might like to know then that calculating, in fact conceptualising, test reliability this way has its problems. However, given that I am writing this for teachers rather than test developers, I'm not going to go there in this post. If you geek out on this kind of stuff, you might like to read this paper I wrote as part of my PhD coursework. If you want a practical textbook on stats for language testing, I recommend Statistical Analyses for Language Testers by Rita Green.

As always, I welcome questions and comments (use ). (Just don't ask me about formulae, please, because they make my head spin...)

Student-related reliability

2 min read

As promised, I'm starting my regular blogging about language testing in the classroom. I'm starting with the new year, as part of 's weekly thematic tweets project.

This work by Marcin Wichary is licensed under a Creative Commons Attribution 2.0 Generic Licence.

I'm starting with the concept of reliability, but not as psychometricians and statisticians see it since that will be of limited usefulness in the classroom. Instead, we're going to look at the different aspects of reliability that teachers will be able to apply (more) easily.

We'll start with student-related reliability (Brown & Abeywickrama, 2010). If you've ever had to take a test when you're ill, tired, unhappy or otherwise having a bad day, you'll have a good idea of what this is. Such things are likely to make your test results an inaccurate measure of your true proficiency, since you are not performing at your best. Conversely, if you are an experienced test-taker, have taken many practice tests and can apply good test-taking strategies, you might well do better than a classmate who hasn't, even if you are both equally proficient.

In working to minimise student-related unreliability, it's worth thinking about the last time you took a test. What were your sources of anxiety? How can we make sure that the test is 'biased for best' and that each student performs optimally? One thing that we should absolutely avoid is trying to be tricky or scary. Some teachers are fond of including 'trick questions' in their tests to make them more challenging, but in fact such questions often don't measure what they set out to measure at all (i.e. answering them correctly doesn't actually tap language proficiency). This affects the validity of the test (more on validity in later posts).

As teachers, we can also ensure that students are aware of the test format and have good test-taking strategies. This levels the playing field, promotes student confidence, and helps to ensure we are obtaining reliable information about student ability.

What are some tips and strategies you have for minimising student-related unreliability? Do you have any questions and comments regarding this post? Share them on Twitter with the hashtag .