

Authenticity

1 min read

Authenticity is about the closeness of your assessment task to a real-world task. This seems quite straightforward, until you consider that in the real world, few tasks demand only language proficiency; most also call on non-language knowledge and competencies.

So how authentic can a language test get? Brown and Abeywickrama (2010) list a few qualities:

  • language that is as natural as possible
  • contextualised items
  • meaningful, relevant, interesting topics (though what is meaningful, relevant and interesting to us may not be so to students)
  • some thematic organisation to items, e.g. through a storyline
  • 'real-world' tasks (which could also be questionable -- do language teachers necessarily have an accurate sense of the authenticity of tasks?) 

It's possible that what we need for optimal authenticity are 'integrated' assessment tasks that combine different subjects in the curriculum, instead of language on its own.

What do you think? What sort of authentic assessment tasks do you use? Let me know on Twitter.


Practicality

2 min read

If you've followed this series so far, you might be thinking: wow, it's hard to make a test reliable and valid -- too hard!

Well actually the first principle of language assessment discussed in Brown and Abeywickrama (2010) is 'practicality'. You could design the most reliable and valid test in the world, but if it's not practical to carry out, you know it isn't going to happen the way you planned it. My take on this is that we can try our best to be reliable and valid in our assessment, but also be realistic about what is achievable given limited resources.

For instance, an elaborate rubric might be more reliable to mark with, but if it's too complex to use easily and you have a big group of students, you might not use it the way it's intended, because it just takes too much time to mark one script. As a result, reliability suffers, because different teachers end up handling the complexity of the rubric in different ways.

Another example: we know that double marking is more reliable, but we also recognise that double marking every script of every test is just not feasible. In such a case, we have to make other efforts at maximising reliability.

Having said this, I think we can sometimes find creative ways to maximise reliability and validity while still being realistic about what is doable. Take for instance standardisation meetings, which can be a drag because they take up so much valuable time. As I mentioned before, markers can be given the scripts prior to the meeting to mark at home, or they might even discuss the scripts online (e.g. by annotating them on shared Google Docs). I believe technology offers ways to make test administration more reliable and valid, efficiently, so we should not immediately dismiss a possible measure because of its perceived impracticality.

Have you got tips and strategies to maximise reliability and validity more efficiently? Please share on Twitter!

Face validity

1 min read

So far we haven't considered the test-taker's point of view. Face validity refers to exactly this: does the test look right and fair to the student?

Of course, one might argue that students are not usually the best judge of validity. But their opinion, however flawed, can affect their performance. You want students to be confident and low in anxiety when taking a test, because you want to maximise student-related reliability, as mentioned in an earlier post.

Brown and Abeywickrama (2010) advise teachers to use:

  • a well-constructed, expected format with familiar tasks
  • tasks that can be accomplished within an allotted time limit
  • items that are clear and uncomplicated
  • directions that are crystal clear
  • tasks that have been rehearsed in their previous course work
  • tasks that relate to their course work (content validity)
  • a difficulty level that presents a reasonable challenge

(p. 35)

As always, please share your thoughts on Twitter.

Construct validity

3 min read

This post is a bit challenging to write, partly because the concept of 'construct' is hard to explain (for me), and partly because construct validity is so central to discussions of validity in the literature.

When I started blogging about validity, I wrote that we can take the concept to mean asking the question 'does the test measure what it's supposed to measure?' We can now think a bit further as to what is actually being measured by tests. A test can only measure things that can be observed.

Say we are attempting to figure out a student's writing ability (maybe your typical school composition kind of writing). We can't directly measure the construct -- that mysterious, abstract ability called 'writing' -- but we do have an idea of what it looks like. To assess it as fully as we can, we might look at all the things that make up the ability we know as 'writing'. These are the kinds of things you will find in your marking rubric (they are there because we think they are signs a person is good or bad at writing): organisation, grammar, vocabulary, punctuation, spelling, etc.

So we look at what we are measuring when we assess writing, and ask ourselves if these things do indeed comprehensively make up the ability we know as 'writing'. Is anything missing (construct underrepresentation)? Is there anything there that shouldn't be there because it has nothing to do with writing per se (construct irrelevance)? Imagine a writing test that didn't include marking for 'grammar', or one that required you to do a lot of difficult reading before you write. Certainly you can test writing in either of these ways, but you'd need to be clear as to what your construct is, how it differs from the more commonly understood construct of 'writing' and why. You could argue for a construct of reading+writing based on research findings, for example.

What I've written above is probably a gross over-simplification (maybe more so than usual). If you'd like a more technical explanation, I recommend JD Brown's article for JALT. It isn't long, and I love that it includes, even if briefly, Messick's model of validity. This model is so important to our understanding of testing that I'm going to include here McNamara and Roever's (2006) interpretation of the model, in the hope that it might give you some food for thought over the long LNY weekend ;-)

Source: McNamara (2010)

Criterion validity

3 min read

This work by frankleleon is licensed under a Creative Commons Attribution 2.0 Generic Licence

Okay, so you've designed a test and you've decided that if the students reach a certain mark or grade (or meet certain criteria), they have achieved the learning outcomes you're after. But are you really sure? How can you know? This is essentially the question we aim to answer when we consider criterion validity.

We can consider two aspects of criterion validity: concurrent validity and predictive validity.

To establish concurrent validity, we assess students in another way for the same outcomes, to see for example if those who performed well in the first assessment really have that level of proficiency. In my previous post on content validity, I gave the example of an MCQ grammar test vs an oral interview speaking test, to measure grammatical accuracy in speaking. To check the concurrent validity of the MCQ test, you could administer both tests to the same group of students, and see how well the two sets of scores correlate. (This does assume you are confident of the validity of the speaking test!) In a low stakes classroom testing situation, you might not have the time to administer another test, but you could for instance call up a few students for a short talk, and check their grammatical accuracy that way. You might pick the students who are borderline passes -- this could show you whether your pass mark is justified.
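If you do end up with two sets of scores for the same students, checking how well they correlate needn't involve special software. Here's a minimal Python sketch; the function name `pearson_r` and the sample scores are my own illustration, not from Brown and Abeywickrama:

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    # Pearson correlation between two equal-length lists of scores
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical scores for five students on the two tests
mcq_scores = [14, 9, 17, 11, 6]    # MCQ grammar test
oral_scores = [7, 5, 9, 6, 4]      # oral interview, grammatical accuracy
print(round(pearson_r(mcq_scores, oral_scores), 2))
```

A coefficient close to 1 suggests the two assessments rank students similarly; a low one is a cue to re-examine the indirect test (or your pass mark).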

As for predictive validity, this is really more important when the test scores determine the placement of the student. Singapore schools typically practise streaming and/or banding to place students with others of the same level. If the test we use to determine their placement does not have predictive validity, that means there is a good chance the student would not be successful in that group. Which kind of defeats the purpose of streaming/banding! We can't predict the future, but we can compare past and future performances. We could for instance compare the test scores of students a few months into their new placement with the test scores we used to determine their placement. If there are students who perform much better or poorer than you would reasonably expect, it's time to re-examine the original test, and probably move the students to a more suitable class too.

That's about it for criterion validity. As always, tweet your comments and questions.

Content validity

2 min read

This work by Nevit Dilmen is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported Licence.

This week we turn to validity. It can be a tricky concept and in fact was something that took me some time to 'get' at first. The easiest approach I've found to interpreting 'validity' is to ask the question: does the test measure what it's supposed to measure?

What do we want it to measure? Do we know what we want to measure, in the first place? So often I think we plan assessment without being perfectly clear about our purpose. (Sometimes it's because our learning outcomes aren't very clear to begin with.) Brown and Abeywickrama (2010) list other qualities of a valid test but I think the above definition is enough to work with for now.

Following the same book, I'm starting with content-related validity. This is pretty straightforward: is what you want to test in the test? This might seem kind of 'duh' but it's actually a trap that's quite easy to fall into. For instance, our purpose might be to test learners' grammatical accuracy when speaking, but instead of actually getting them to speak, we set an MCQ grammar test. The former would be a direct test, while the latter (arguably) an indirect test of the same thing. Indirect tests are often used for reasons of practicality and reliability; obviously it's much easier to mark a class's MCQ test (it could even be done automatically) than to administer an individual oral test for each student.

If it really isn't possible to achieve high content validity, then we've got to look into the other validities of our test. More on those in the coming weeks. In the meantime, keep your questions and comments coming on Twitter.

Test reliability

3 min read

This post is for those of you who set your own assessments. Which I guess we all have to sooner or later!

There are all sorts of 'best practices' you can read about for achieving test reliability in large-scale, 'standardised' tests (I use 'standardised' here in the true sense, i.e. not exclusively Multiple Choice Questions). As usual, though, I will concentrate on what is practical to do within the context of the classroom.

I want to start with MCQs in fact, because we usually think of them, or any other sort of dichotomously scored items (T/F, matching, etc), as being the most reliable. However, if you've read my post on rater reliability, you'll recall that the validity of such items can be questionable (i.e. do they really test what we want to test?) They can also be unreliable in unexpected ways. For instance, it's quite common to find MCQ items with more than one correct answer, and so students who choose the 'unofficial' correct answer end up being marked wrong. This might not always be apparent to us as test writers, so it's a good idea to get colleagues to check. Having been test takers ourselves, we must know too that it's all too tempting to tikam (i.e. guess) when we don't know the right answer to an MCQ. So your student might get the right answer by sheer luck. Sometimes they guess the right answer because of irrelevant clues (e.g. it's longer/shorter than the other options).

MCQs aren't necessarily bad items, but they do require a lot of time and effort to design well, and should be avoided unless you are willing to invest both. Perhaps you are designing a large-scale test that you want to be able to mark quickly, and will build up a bank of recyclable test items over time. There is lots of good advice out there for MCQ test designers.

This work by gulia.forsythe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic Licence

If we go with subjectively scored items such as essays, there is likely to be rater unreliability. We already know this can be minimised, though, and in most classroom language assessment situations I tend to think time is better spent on this than on designing good MCQs. Such test items can be badly designed too: they can be worded so ambiguously that a student who knows their stuff doesn't actually give you what you thought you were asking for. Again, getting colleagues to help check the items is useful.

Brown and Abeywickrama (2010) offer some other tips for enhancing test reliability. Don't make the test too long, because while tests can be too short to reliably measure proficiency, they can also be so long that they cause fatigue in test takers. They also point out that some people (like me!) don't cope well with the stress of timed tests.

I'll stop here, but if you have something to say about test reliability, please tweet it.

Test administration reliability

2 min read

After a rather technical topic last week, we're looking now at something that's perhaps more mundane: test administration reliability. (In case you are wondering, I'm following the order in Brown & Abeywickrama, 2010, out of sheer convenience.) This will be a super short post.

This reliability is basically about test conditions. Is the room too hot/cold? Clean? Are the tables and chairs of the right height? Comfortable? Is the room well and evenly lit? Are there distractions, like noise?

If it's a pen and paper test, are the question papers printed clearly? If media is used, is the audio/video clear and of good quality? Generally, does the technology used work as it should? Do the devices run smoothly? Is the projector in good condition (i.e. image not dim or distorted)? Can everyone see/hear the media equally well? Is the internet connection fast and stable? Is there a backup plan should something fail to work?

Essentially, is there anything about the test conditions that would prevent students from doing their best?

These are issues that may take a bit of time to iron out, but are actually relatively easy to take care of. As always, if you have comments or questions, please tweet them.

This work by Moving Mountains Trust is licensed under a Creative Commons Attribution 2.0 Generic Licence.

Rater reliability

5 min read

Last week I blogged about student-related reliability. This week we are tackling something a bit more technical, but I know also of great interest to many teachers: rater reliability. I'm going to cover the less technical aspects of this first, i.e. nothing involving stats. But if stats is your thing, read till the end.

In everyday terms, rater reliability is something that we are concerned with when two (or more) markers mark the same test and one is inevitably stricter than the other (=inter-rater reliability). It's possible then that the same script marked by different markers would get different scores. Not only is this unfair to the student, but it becomes difficult to get an accurate picture of how the cohort is doing as a whole. Rater reliability can also be problematic when only one marker is involved, because we are not always consistent in the way we mark (=intra-rater reliability). For instance, marker's fatigue can affect the consistency of our judgement if we mark too many scripts in one sitting.

Obviously this is not a problem if the test is multiple-choice, true/false, or any other item type where the answer is either right or wrong (i.e. dichotomous). But in language assessment it is generally considered less than valid to assess proficiency in this way only, especially when assessing productive skills (speaking and writing). This is a classic case of the tension between validity and reliability: while there is no validity without reliability, it is possible to sacrifice validity if we pursue perfect reliability. As this is undesirable, the solution is to maximise rater reliability while staying conscious of the fact that scoring cannot be absolutely consistent as long as human judgement is involved. We might for instance want to give students the benefit of the doubt if their scores are borderline.

However, as responsible teachers, we should try to maximise rater reliability within what is practically possible. Here are some things we can do: 


  • Read through half the scripts without awarding scores, then go back to the beginning to mark for real. If there are multiple items in the test, mark item 1 for all the scripts before marking item 2, etc. When marking electronically, I like to keep a record of comments next to student names and award grades/scores only when I finish the lot.
  • Use an analytic rubric to mark if practical/valid. 
  • Hold standardisation meetings to make sure everyone is interpreting the rubric the same way. Pick a few scripts that exemplify a range of performances and get everyone to mark them so as to check how well they are aligned in their judgement. Markers can mark them first at home before the meeting if scripts are electronic/scanned.
  • Before marking starts, the teacher-in-charge can pull out a few scripts across a range of performances from each teacher's pile, copy them and mark them herself. When the teachers finish marking them, the 2 sets of scores can be compared. Moderation of scores might be necessary if there's a major discrepancy. Obviously this is a lot easier if the scripts are electronic.


The above strategies can easily be applied to speaking tests, given the ease of recording and copying audio these days. They are even easier to apply if we are marking digital artefacts of any kind (e.g. a video, a blog post). If you know of any other good strategies, please share them on Twitter.

Okay now for the stats. So I'm by no means a statistician, merely a user of quantitative methods. If you have the time and interest to investigate reliability (any sort, not just rater) statistically, you might like to give this a go because it really isn't very difficult even if you have an aversion to numbers, like me :)

The easiest and most accessible way I think to check reliability is to examine correlation between 2 sets of scores using Excel or similar. There are correlation calculators online too but they can be awkward to use if your dataset is big. Of course if you have a statistics package like SPSS, that is very convenient, and you can even use it to calculate Cronbach's alpha, which is an alternative to correlation. In both cases, the higher the figure you get, the better.
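For the curious, Cronbach's alpha is simple enough to compute yourself: it compares the variance of students' total scores with the summed variances of the individual items (or raters). A minimal Python sketch, with a made-up data layout of one list of scores per item; the function name is my own:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    # item_scores: one list per item (or rater),
    # each holding one score per student
    k = len(item_scores)
    totals = [sum(student) for student in zip(*item_scores)]
    sum_item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Two raters scoring the same four students identically
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))  # perfectly consistent
```

Values closer to 1 indicate more consistent scoring across items or raters, mirroring the 'higher is better' rule of thumb above.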

The handy video below shows you how to calculate the correlation statistic Spearman's rho with Excel (check out the creator's site for his Excel files). Due to the nature of rater scores, I think Spearman's rho is more likely to be suitable than the alternative Pearson r, but the PEARSON function is built into Excel so it's even easier to calculate if you want to.
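If you'd rather skip Excel too, Spearman's rho is straightforward to compute directly: convert each set of scores to ranks (averaging the ranks of tied scores) and then take the Pearson correlation of the ranks. A sketch of this, with function names of my own invention:

```python
from statistics import mean, pstdev

def average_ranks(values):
    # 1-based ranks; tied values share the average of their rank positions
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson correlation of the rank vectors
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = mean((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / (pstdev(rx) * pstdev(ry))

# Two raters' scores for the same four scripts
print(spearman_rho([10, 20, 30, 40], [55, 70, 62, 81]))
```

Because rho works on rank order rather than raw scores, it tolerates a strict marker and a lenient one using different parts of the scale, which is exactly the situation rater-reliability checks tend to involve.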

Maybe you are already familiar with correlation and Cronbach's alpha. You might like to know then that calculating, in fact conceptualising, test reliability this way has its problems. However, given that I am writing this for teachers rather than test developers, I'm not going to go there in this post. If you geek out on this kind of stuff, you might like to read this paper I wrote as part of my PhD coursework. If you want a practical textbook on stats for language testing, I recommend Statistical Analyses for Language Testers by Rita Green.

As always, I welcome questions and comments on Twitter. (Just don't ask me about formulae, please, because they make my head spin...)

Student-related reliability

2 min read

As promised, I'm starting my regular blogging about language testing in the classroom, beginning with the new year as part of a weekly thematic tweets project.

This work by Marcin Wichary is licensed under a Creative Commons Attribution 2.0 Generic Licence.

I'm starting with the concept of reliability, but not as psychometricians and statisticians see it since that will be of limited usefulness in the classroom. Instead, we're going to look at the different aspects of reliability that teachers will be able to apply (more) easily.

We'll start with student-related reliability (Brown & Abeywickrama, 2010). If you've ever had to take a test when you're ill, tired, unhappy or otherwise having a bad day, you'll have a good idea of what this is. Such things are likely to make your test results an inaccurate measure of your true proficiency, since you are not performing at your best. Conversely, if you are an experienced test-taker, have taken many practice tests and can apply good test-taking strategies, you might well do better than a classmate who hasn't, even if you are both equally proficient.

In working to minimise student-related unreliability, it's worth thinking about the last time you took a test. What were your sources of anxiety? How can we make sure that the test is 'biased for best' and that each student performs optimally? One thing we should absolutely avoid is trying to be tricky or scary. Some teachers are fond of including 'trick questions' in their tests to make them more challenging, but such questions often don't measure what they set out to measure at all (i.e. answering them correctly doesn't actually tap language proficiency). This affects the validity of the test (more on validity in later posts).

As teachers, we can also ensure that students are aware of the test format and have good test-taking strategies. This levels the playing field, promotes student confidence, and helps to ensure we are obtaining reliable information about student ability.

What are some tips and strategies you have for minimising student-related unreliability? Do you have any questions or comments about this post? Share them on Twitter.