
Content validity

2 min read

This work by Nevit Dilmen is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported Licence.

This week we turn to validity. It can be a tricky concept and in fact was something that took me some time to 'get' at first. The easiest approach I've found to interpreting 'validity' is to ask the question: does the test measure what it's supposed to measure?


What do we want it to measure? Do we know what we want to measure, in the first place? So often I think we plan assessment without being perfectly clear about our purpose. (Sometimes it's because our learning outcomes aren't very clear to begin with.) Brown and Abeywickrama (2010) list other qualities of a valid test but I think the above definition is enough to work with for now.


Following the same book, I'm starting with content-related validity. This is pretty straightforward: is what you want to test in the test? This might seem kind of 'duh' but it's actually a trap that's quite easy to fall into. For instance, our purpose might be to test learners' grammatical accuracy when speaking, but instead of actually getting them to speak, we set an MCQ grammar test. The former would be a direct test, while the latter (arguably) an indirect test of the same thing. Indirect tests are often used for reasons of practicality and reliability; obviously it's much easier to mark a class's MCQ test (it could even be done automatically) than to administer an individual oral test for each student.


If it really isn't possible to achieve high content validity, then we've got to look into the other validities of our test. More of those in the coming weeks. In the meantime, keep your questions and comments coming with on Twitter.


Test reliability

3 min read

This post is for those of you who set your own assessments. Which I guess we all have to sooner or later!

There are all sorts of 'best practices' you can read about for test reliability in large-scale, 'standardised' tests (I use 'standardised' here in its true sense, i.e. not as shorthand for multiple-choice questions). As usual, though, I will concentrate on what is practical to do within the context of the classroom.

I want to start with MCQs in fact, because we usually think of them, or any other sort of dichotomously scored items (T/F, matching, etc.), as being the most reliable. However, if you've read my post on rater reliability, you'll recall that the validity of such items can be questionable (i.e. do they really test what we want to test?). They can also be unreliable in unexpected ways. For instance, it's quite common to find MCQ items with more than one correct answer, so students who choose the 'unofficial' correct answer end up being marked wrong. This might not always be apparent to us as test writers, so it's a good idea to get colleagues to check. Having been test takers ourselves, we also know that it's all too tempting to tikam (i.e. guess) when we don't know the right answer to an MCQ. So your student might get the right answer by sheer luck. Sometimes they guess the right answer because of irrelevant clues (e.g. it's longer/shorter than the other options).

MCQs aren't necessarily bad items, but they do require a lot of time and effort to design well, and should be avoided unless you are willing to invest both. Perhaps you are designing a large-scale test that you want to be able to mark quickly, and will build up a bank of recyclable test items over time. There is lots of good advice out there for MCQ test designers.


This work by giulia.forsythe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic Licence

If we go with subjectively scored items such as essays, there is likely to be rater unreliability. We already know that this can be minimised, though, and in most classroom language assessment situations I tend to think time is better spent on that than on designing good MCQs. Such test items can be badly designed too. They can be ambiguous in the way they are written, such that a student who may know their stuff doesn't actually give you what you thought you were asking for. Again, getting the help of colleagues to check the items is useful.

Brown and Abeywickrama (2010) offer some other tips for enhancing test reliability. Don't make the test too long: while tests can be too short to reliably measure proficiency, they can also be so long that they cause fatigue in test takers. They also point out that some people (like me!) don't cope well with the stress of timed tests.

I'll stop here but if you have something to say about test reliability, please tweet it with .

Test administration reliability

2 min read

After a rather technical topic last week, we're looking now at something that's perhaps more mundane: test administration reliability. (In case you are wondering, I'm following the order in Brown & Abeywickrama, 2010, out of sheer convenience.) This will be a super short post.

This reliability is basically about test conditions. Is the room too hot/cold? Clean? Are the tables and chairs of the right height? Comfortable? Is the room well and evenly lit? Are there distractions, like noise?

If it's a pen and paper test, are the question papers printed clearly? If media is used, is the audio/video clear and of good quality? Generally, does the technology used work as it should? Do the devices run smoothly? Is the projector in good condition (i.e. image not dim or distorted)? Can everyone see/hear the media equally well? Is the internet connection fast and stable? Is there a backup plan should something fail to work?

Essentially, is there anything about the test conditions that would prevent students from doing their best?

These are issues that may take a bit of time to iron out, but are actually relatively easy to take care of. As always, if you have comments or questions, please tweet with .

This work by Moving Mountains Trust is licensed under a Creative Commons Attribution 2.0 Generic Licence.

Rater reliability

5 min read

Last week I blogged about student-related reliability. This week we are tackling something a bit more technical, but I know also of great interest to many teachers: rater reliability. I'm going to cover the less technical aspects of this first, i.e. nothing involving stats. But if stats is your thing, read till the end.

In everyday terms, rater reliability is something we are concerned with when two (or more) markers mark the same test and one is inevitably stricter than the other (=inter-rater reliability). It's possible then that the same script marked by different markers would get different scores. Not only is this unfair to the student, but it also becomes difficult to get an accurate picture of how the cohort is doing as a whole. Rater reliability can also be problematic when only one marker is involved, because we are not always consistent in the way we mark (=intra-rater reliability). For instance, marker fatigue can affect the consistency of our judgement if we mark too many scripts in one sitting.

Obviously this is not a problem if the test is multiple-choice, true/false, or any other item type where the answer is either right or wrong (i.e. dichotomous). But in language assessment it is generally considered less than valid to assess proficiency in this way only, especially when assessing productive skills (speaking and writing). This is a classic case of the tension between validity and reliability: while there is no validity without reliability, it is possible to sacrifice validity if we pursue perfect reliability. As this is undesirable, the solution is to try and maximise rater reliability, while always staying conscious that scoring cannot be absolutely consistent as long as human judgement is involved. We might, for instance, want to give students the benefit of the doubt if their scores are borderline.

However, as responsible teachers, we should try to maximise rater reliability within what is practically possible. Here are some things we can do: 

 

  • Read through half the scripts without awarding scores, then go back to the beginning to mark for real. If there are multiple items in the test, mark item 1 for all the scripts before marking item 2, etc. When marking electronically, I like to keep a record of comments next to student names and award grades/scores only when I finish the lot.
  • Use an analytic rubric to mark if practical/valid. 
  • Hold standardisation meetings to make sure everyone is interpreting the rubric the same way. Pick a few scripts that exemplify a range of performances and get everyone to mark them so as to check how well they are aligned in their judgement. Markers can mark them first at home before the meeting if scripts are electronic/scanned.
  • Before marking starts, the teacher-in-charge can pull out a few scripts across a range of performances from each teacher's pile, copy them and mark them herself. When the teachers finish marking them, the 2 sets of scores can be compared. Moderation of scores might be necessary if there's a major discrepancy. Obviously this is a lot easier if the scripts are electronic.

 

The above strategies can easily be applied to speaking tests, given the ease of recording and copying audio these days. They are even more straightforward if we are marking digital artefacts of any kind (e.g. a video, a blog post). If you know of any other good strategies, please share them on Twitter with .

Okay now for the stats. So I'm by no means a statistician, merely a user of quantitative methods. If you have the time and interest to investigate reliability (any sort, not just rater) statistically, you might like to give this a go because it really isn't very difficult even if you have an aversion to numbers, like me :)

The easiest and most accessible way to check reliability, I think, is to examine the correlation between two sets of scores using Excel or similar. There are correlation calculators online too, but they can be awkward to use if your dataset is big. Of course, if you have a statistics package like SPSS, that is very convenient, and you can even use it to calculate Cronbach's alpha, which is an alternative to correlation. In both cases, the higher the figure you get, the better.
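If you'd rather skip the spreadsheet (or don't have SPSS), the same check can be done in a few lines of Python. Below is a minimal sketch of Cronbach's alpha using only numpy; the two sets of rater scores are made up purely for illustration, so substitute your own marks.

```python
# A rough sketch of Cronbach's alpha for two (or more) raters, using numpy.
# The scores below are hypothetical; replace them with your own marks.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2D array with one row per script and one column per rater."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of raters
    rater_vars = scores.var(axis=0, ddof=1)     # variance of each rater's marks
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the summed marks
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

rater_a = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
rater_b = [13, 14, 10, 17, 15, 10, 15, 14, 11, 18]
print(round(cronbach_alpha(np.column_stack([rater_a, rater_b])), 2))
```

The closer the result is to 1, the more consistently the two raters are scoring the same scripts.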

The handy video below shows you how to calculate the correlation statistic Spearman's rho with Excel (check out the creator's site for his Excel files). Due to the nature of rater scores, I think Spearman's rho is more likely to be suitable than the alternative Pearson r, but the PEARSON function is built into Excel so it's even easier to calculate if you want to.
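For anyone who prefers code to Excel, here's the same pair of statistics calculated with scipy; the scores are again invented, just to show what the calculation looks like.

```python
# Spearman's rho and Pearson r for two sets of hypothetical rater scores.
from scipy.stats import spearmanr, pearsonr

rater_a = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
rater_b = [13, 14, 10, 17, 15, 10, 15, 14, 11, 18]

rho, _ = spearmanr(rater_a, rater_b)  # Pearson correlation computed on the ranks of the scores
r, _ = pearsonr(rater_a, rater_b)     # the statistic Excel's PEARSON function gives you
print(f"Spearman's rho = {rho:.2f}, Pearson r = {r:.2f}")
```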


Maybe you are already familiar with correlation and Cronbach's alpha. You might like to know then that calculating, in fact conceptualising, test reliability this way has its problems. However, given that I am writing this for teachers rather than test developers, I'm not going to go there in this post. If you geek out on this kind of stuff, you might like to read this paper I wrote as part of my PhD coursework. If you want a practical textbook on stats for language testing, I recommend Statistical Analyses for Language Testers by Rita Green.

As always, I welcome questions and comments (use ). (Just don't ask me about formulae, please, because they make my head spin...)

Student-related reliability

2 min read

As promised, I'm starting my regular blogging about language testing in the classroom. I'm kicking off with the new year, as part of 's weekly thematic tweets project.


This work by Marcin Wichary is licensed under a Creative Commons Attribution 2.0 Generic Licence.

I'm starting with the concept of reliability, but not as psychometricians and statisticians see it since that will be of limited usefulness in the classroom. Instead, we're going to look at the different aspects of reliability that teachers will be able to apply (more) easily.

We'll start with student-related reliability (Brown & Abeywickrama, 2010). If you've ever had to take a test while ill, tired, unhappy or otherwise having a bad day, you'll have a good idea of what this is. Such things are likely to make your test results an inaccurate measure of your true proficiency, since you are not performing at your best. Conversely, if you are an experienced test-taker, have taken many practice tests and can apply good test-taking strategies, you might well do better than a classmate who hasn't, even if you are both equally proficient.

In working to minimise student-related unreliability, it's worth thinking about the last time you took a test. What were your sources of anxiety? How can we make sure that the test is 'biased for best' and that each student performs optimally? One thing we should absolutely avoid is trying to be tricky or scary. Some teachers are fond of including 'trick questions' in their tests to make them more challenging, but such questions often don't measure what they set out to measure at all (i.e. answering correctly doesn't really depend on language proficiency). This affects the validity of the test (more on validity in later posts).

As teachers, we can also ensure that students are aware of the test format and have good test-taking strategies. This levels the playing field, promotes student confidence, and helps to ensure we are obtaining reliable information about student ability.

What are some tips and strategies you have for minimising student-related unreliability? Do you have any questions and comments regarding this post? Share them on Twitter with the hashtag .