
Educator, applied linguist, language tester.



ATC21s week 4: Teaching collaborative problem solving


This week is the final week of video lectures, so I'll end my reflective posts with today's too (I am not doing the assignment).

Video 4.0 discusses the difference between collaboration and cooperation. Collaboration is also not group work (though I don't see why group work can't sometimes end up being collaborative in their sense of the word). Collaboration is defined very specifically here, specifically enough that I can see why it might be hard to find authentic problems that fit this model perfectly for the classroom.

4.1 discusses inter- and intra-individual differences among learners. Individuals develop different aspects of the CPS construct at different rates. As an example of how the teacher might cater to such a diverse group, a tertiary-level CPS task is described. I'm not very clear about the objective of this task, so I won't go into it here. It's highlighted that the same task can be scaffolded differently for different students depending on their cognitive and social development. But what if, within the group, members are at different stages of development? How can we scaffold the task differently for each person if they are all working together on the same task? In 4.3 the researcher interviewed mentions teachers getting students to read up on different topics so that they each bring something different to the problem solving. I suppose something like this could be done, though it smacks of too much teacher engineering -- is this still collaboration and not cooperation?

4.2 is an interview with Griffin and Care. Care explains that the kind of CPS tasks used in their study only tap into the 'tip of the iceberg' as far as the CPS construct is concerned. This indicates that they do indeed believe that such puzzles (as I call them) measure the same construct as more complex (more authentic?) tasks. I remain rather dubious about this. Reading comprehension is mentioned here for comparison:

whether it's multiple choice or some open item questions and what we get from that assessment is some indication of the skill

But of course MCQ and open-ended items are likely to be differently valid, and this isn't trivial -- one may tap into more of the iceberg-construct than the other. Or tap into a different iceberg! My point is that construct validity cannot be assumed. Of course, it could be that ATC21s has research showing that their CPS and real-life CPS are the same construct, and I've missed an article somewhere.

Griffin then says:

The issue that you spoke about with the reading test, we've managed over a hundred years or so, to become very good at that kind of assessment, and in interpreting that. And we know now that that one piece of text and the two or three questions that are associated with it, are only a sample of what we could do. So, we build more and more complicated texts. Yeah. And more difficult questions and we address higher and higher skills in the test. So over a 40 item test of a reading comprehension we go from very simple match the word to this picture through to judging an author's intention. Yeah. And so behind the multiple choice questions on a piece of paper there's also a lot of complex thinking that goes on and there's, behind that, there's the idea of a developmental progression, or a construct that we're mapping, but the teacher, the student never sees that until after the test has been developed, interpreted and reported.

Well now. This is such an oversimplification of what we know about assessing reading that it's at best a poorly chosen analogy.

It's then claimed that while the learners are enjoying their games, the researchers are actually assessing their CPS development. Putting aside the fact that teachers likely won't have access to such games (as I've already pointed out previously), I'm not sure about the suggestion here that this is a good example of assessing through games. It's been a while since my gaming days, but I would never consider these good games. I really think you need good games for game-based learning, and the same goes for assessment.

a reading test is often read a passage, look at a question, choose an alternative out of four possible alternatives by pressing a button or ticking a box. What we have done enables a much more complex view of that. We can now tap into what's going on in the background behind the student's reading comprehension, what they're thinking while they're trying to work out what alternative they choose.

Hmm. I think they should just stop referring to reading assessment unless they've really got some novel reading assessment along the same lines as their CPS 'games' that somehow has never been disclosed to language testers.

And finally we come to what I see as the crux of it all:

You know, one of the challenges for us still is that, we don't know yet whether the skills that we're picking up will generalize to real life situations. That's one of the big issues. And in part we're, we're hampered and we're constrained, because of the nature of how you pick up the assessment data. You know, because, if we're talking about the sorts of problems to which we want to bring collaborative problem solving, like big problems, or say, global warming, The issue is that in the school context what you typically give to student is well defined problems. Problems that they've given a lot more scaffolding to work through, they're given structure to work through. If we go too far down that path, too much structure and too much scaffolding, they won't learn the particular underlying skills that we need that they can then generalize to take to the big problems. So there's some real issues in our assessments.

I don't know if I understand this correctly. Are they saying that their CPS tasks are well-defined because this is a constraint of school? And that this can be compensated for by providing less structure and scaffolding? IMO schools can definitely do CPS differently if they want to. And they can't do it ATC21s style anyway, with their kind of electronic games. It seems to me that the well-definedness of their games is a constraint of their research design, not of schools -- hence the part about being constrained by 'the nature of how you pick up the assessment data'. But assessment data can be 'picked up' in different ways.

4.3 is an interview with a researcher who is working with teachers on implementing CPS in their schools. This is an interesting account that I think teachers on the course would want to know more about. At the start, the researcher says that the teachers had their students do the online CPS tasks so that they had a baseline to work with. Could all schools do this? What if they couldn't? In 4.4 we hear from the two schools involved in the study, and again, while interesting, it would be even more interesting to hear from schools that implement this without the technological support of the research team.

4.4 is a recap, with some future directions. It's pointed out that teachers have to be effective at collaborating themselves if they want to teach it to their students, and I wholeheartedly agree. That said, if they are taught this in pre-service in the same tip-of-the-iceberg way, I'm not sure if they would be prepared for real-world collaboration in the school.

There are also some extra videos available, I think just for this week, under Resources. One of them is called 'Learning in digital networks', and it suggests that this sort of CPS task gives learners a start to their ICT literacy or learning in digital networks. I really don't know about this. Given the rich digitally networked environment kids live in (at least in developed countries), do they really need to start with something like this? Do we have to train them on a toy 'internet' before they know how to learn on the real one? Chances are many are already learning and collaborating on the real internet.

This highlights the difference between this course's orientation to 21st century competencies and mine. ATC21s takes a more cognitive, more skills-based, more measurement-centric approach that, while contributing a great deal to our understanding of such competencies, may be of limited usefulness in transforming learning in the classroom. I like that the ATC21s team are clearly more interested in learning and development, but I think a more social practice approach (to competencies, to assessment) is better aligned with formative aims and better able to achieve them. This is probably my research bias talking, so I'll stop here.

I hope you've enjoyed my reflective posts on the ATC21s MOOC! Next week, something new.

#rhizo15 week 2: Learning is uncountable, so what do we count?


This isn't one of my scheduled posts for thematic tweets, and has nothing to do with them as such. It's a little something for me to get my feet wet with #rhizo15. I've been hesitant to get started with it because I doubted my ability to contribute something. Given my issues with the much easier ATC21s, though, I thought I should try harder with #rhizo15, and balance my first real xMOOC experience with a cMOOC one.

As I type this, week 3 has already started, but I'll post my week 2 contribution anyway -- it was hard enough to come up with! Here's Dave's week 2 prompt. You'll note that it's conveniently right up my assessment alley. I don't know if I can respond to week 3's the same way!

Warning: my response is a rough, incomplete thing but maybe this is par for the course for rhizo learning. (I should confess here that I am ambivalent about rhizomatic learning as a theory, and hope that this experience helps to sort out my ideas about it.)

Okay. So we can't count learning. But I've always accepted this. Psychometricians working with Item Response Theory talk about latent traits: 'latent is used to emphasize that discrete item responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes, not directly observed, but which must be inferred from the manifest responses' (Wikipedia). 
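To make that 'inferred, not observed' point concrete, here is the simplest IRT model, the dichotomous Rasch model (my own illustration, not something from Dave's prompt): the probability that person n answers item i correctly depends only on the gap between the person's latent ability and the item's difficulty.

    P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}

Both theta_n (the ability) and b_i (the difficulty) are estimated from patterns of responses; neither is ever observed directly.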

So when we assess, we are not measuring actual traits (or abilities) but the quality of evidence of such. It's all inferred and indirect, so we can't measure learning in the sense of holding a ruler up to it ('let's look at how much you've grown!').

Also, learning happens continuously -- we can't start and stop it at will. We can't measure it, even indirectly, the way you might measure temperature in real time. By the time the test finishes and is marked or feedback given, learning has already moved on.

So we never measure learning per se. As Lisa says, it's only symbolic. It's just a useful fiction.

But perhaps Dave's question is not about measuring the quality of such tangible evidence? At least not the conventional kind?

If it isn't about product, is it about process, as some teachers already do assess?

Are we talking about measuring 21st century 'skills' like CPS (see previous post)? ATC21s has very cleverly broken down CPS into more easily measurable bits, but when a construct is broken down like that, its integrity tends to suffer (something I forgot to include in my previous post). Is it about measuring literacies (situated social practices), as I'm attempting to tackle in my study? Learning dispositions?

But tangible evidence is also required to 'see' all the above. Are we talking of true 'unmeasurables', if psychometricians admit to any? What might they be?

Maybe it's about assessment that isn't externally imposed -- self assessment? How do we activate learners as owners of their own learning, as per Wiliam's framework of formative assessment? How do we make reflective learning second nature?

How can we give self assessment currency, given stakeholders' obsession with reliability of measurement and 'fairness'? How can we give it validity? And have people understand and accept that validity?

Which leads to heutagogy. We have to be good at it to cultivate it in others; our education ministry says teachers should cultivate Self-directed Learning capabilities in our learners, but how do they cultivate it in themselves? How can we be self-directed about SDL?

How about we throw out quantitative measures? No counting! Maybe that's how we throw out the comparing and ranking of norm-referenced assessment that people tend to default to (I'm not sure how many participants really got criterion-referencing.)

How about we become ethnographers of learning? Help learners become autoethnographers of their own learning? The kind that's mostly, if not 100%, qualitative. (Before you say that the average teacher has too much to do, recall that she has an entire class of potential research assistants.) I'm sure this is (as usual) not an original idea. Do you know anyone who's tried it?

'Not everything that can be counted counts, and not everything that counts can be counted.' - William Bruce Cameron

ATC21s week 2: A closer look at 21st century skills: collaborative problem solving


This week I'm somewhat distracted by an upcoming trip to Bangkok to present at the 2nd Annual Asian Association for Language Assessment Conference. This is the first time I am formally presenting on my study, so I'm quite nervous! Fortunately I was able to squeeze in some time for week 2 of ATC21s.

Here's a quick summary of this week's lesson:

1. What is collaborative problem solving (CPS)? There are existing problem solving models (cited are Polya, 1973, and PISA, 2003/2012), but they do not include the collaborative component. Therefore ATC21S has come up with their own:

  • Collect and share information about the collaborator and the task
  • Check links and relationships, organise and categorize information
  • Rule use: set up procedures and strategies to solve the problem using an “If, then..” process
  • Test hypotheses using a “what if” process and check process and solutions

The CPS construct is made up of social skills and cognitive skills.

2. Social skills are participation, perspective taking and social regulation skills. These can be further unpacked:

  • Participation: action, interaction and task completion
  • Perspective taking: responsiveness and audience awareness
  • Social regulation: Metamemory (own knowledge, strengths and weaknesses), transactive memory (those of partners), negotiation and responsibility initiative

There are behavioural indicators associated with each of these elements. (At this point, I was pretty sure that Care and Griffin don't mean to suggest that teachers conduct Rasch analysis themselves, but rather use already developed developmental progressions.)

3. Cognitive skills are task regulation, and knowledge building and learning skills:

  • Task regulation: problem analysis, goal setting, resource management, flexibility and ambiguity management, information collection, and systematicity
  • Knowledge building and learning: relationships, contingencies and hypotheses

Again, each element has associated indicators.

4. We come back to the developmental approach that integrates the work of Rasch, Glaser and Vygotsky. Teachers need a framework that they can use to judge where their students are in their CPS development. There are existing ones (such as the ubiquitous Bloom's), but none are suited to measuring CPS skills. So what we need is a new empirically derived framework that allows teachers to observe students in CPS action and judge where they are.

5. Empirical progressions are explained, with examples such as PISA and TIMSS given. We are then presented with the progression that ATC21S has developed for CPS. The table is too large to reproduce here, but essentially it expands all the elements in 2 and 3 above into progressions, so that you end up with five scales.

 

Impressive, right? Except I'm not quite sure about the tasks they used to develop this. The example they showed was of two students connected by the internet and chatting by typing, attempting to solve what appears to be more of a puzzle than a problem. That is, the sort of problem teachers cook up to test students' intellectual ability (shades of ?) The 2nd volume of the book series actually has a chapter that discusses this in more detail and seems to confirm that they used puzzles of this sort. I understand of course that doing it in this way makes it easier to collect the sort of data they wanted. But given that the tasks aren't very authentic, to what extent are they representative of the target domain? Are there issues of construct validity? I will need to read further, if there is available literature, before I make up my mind. It would be interesting, if not already done, to conduct a qualitative study using more authentic problems, more students per team, observation, artefact collection, (retrospective) interviews, and so on. You won't get the same quantity of data as with their study, but this sort of rich data could help us check the validity of the framework. It could also be of more practical value to teachers who actually have to teach and assess this without fancy software and a team of assistants.

I won't deny that I'm rather disappointed that Rasch measurement is really 'behind the scenes' here, though I'm not surprised. I can't help but wonder if it's really necessary to make Rasch appear so central in this course, especially since some of my classmates seem to misunderstand its nature. This is not surprising -- Rasch is not the sort of thing you can 'touch and go' with. There is some confusion about criterion referencing too (IMO it's hard to make sense of it without comparing it to norm referencing and explaining how the two are typically used in assessment). ZPD is faring a little better, probably since it's familiar to most teachers. I am, however, surprised to see it occasionally referred to rather off-handedly, as if it's something that's easy to identify.

Would it make more sense to focus more on the practicalities of using an established developmental progression? It's too early to say I guess, but already quite a few of my classmates are questioning the practicality of monitoring the progress of large classes. This is where everyday ICT-enabled assessment strategies can come into play. I also hope to see more on how to make assessments really formative. I learnt from the quiz this week (if it was mentioned elsewhere I must have missed it) that assessments that are designed to measure developmental progression are meant to be both formative and summative. Okay, great, but IMO it's all too easy to miss the formative part completely without even realising it -- remember that an assessment is only formative if there's a feedback loop. The distinction between the two uses cannot be taken lightly, and there really is no point harping on development and ZPD and learning if we ignore how assessment actually works to make progress happen.

Which brings me to the assessment on this course. If you're happy with the quizzes so far you might want to stop reading here.

 

Diligent classmates may have noticed from my posts that I REALLY do not like the quizzes. Initially it was the first so-called self-assessment that I took issue with. Briefly, its design made it unfit for purpose, at least as far as I'm concerned. After doing another 'self-assessment' for week 2 and the actual week 2 quiz, I'm ever more convinced that the basic MCQ model is terrible for assessing something so complex. It's quite ironic that a course on teaching and assessing 21C skills should utilise assessments that are assuredly not 21C. Putting what could be a paper MCQ quiz online is classic 'old wine in new bottles', and really we cannot assess 21C skills in 19C or 20C ways. I have written (to explain my own study) that:

... digital literacies cannot be adequately assessed if the assessment does not reflect the nature of learning in the digital age. An assessment that fails to fully capture the complexity of a construct runs the risk of construct under-representation; that is, being ‘too narrow and [failing] to include important dimensions or facets of focal constructs’ (Messick, 1996, p. 244).

Surely we cannot claim that the understanding of assessing and learning 21C skills is any less complex than 21C skills themselves? Of my initial findings, I wrote that:

We may be able to draw the conclusion that assessing digital literacies are 21st century literacies twice over, in that both digital literacies and the assessment thereof are new practices that share similar if not identical constituents.

Telling me that the platform can't do it differently is an unsatisfactory answer that frankly underlines the un-21C approach taken by this course. 21C educators don't allow themselves to be locked in by platforms. It seems that the course designers have missed out on a great opportunity to model 21C assessment for us. I'm not saying that it would be easy, mind you. But is it really possible that the same minds who developed an online test of CPS can't create something better than this very average xMOOC?

Okay, I should stop here before this becomes an awful rant that makes me the worst student I never had. I am learning, really, even if sometimes the learning isn't what's in the LOs. And I will continue to persevere and maybe even to post my contrary posts despite the threat of being downvoted by annoyed classmates :P

Washback


Even if you are not familiar with the term, you are probably familiar with the concept of washback (commonly called backwash in educational assessment). It refers to the effects of assessment on teaching and learning, and anyone who's studied in an exam-oriented system would have experienced this.

We tend to think poorly of washback because we often think of negative washback, e.g. ignoring what's in the syllabus in favour of what will be in the exam, even if we think that the syllabus has more worthy learning outcomes. While washback can be very problematic, I think we do need to consider two things.

First, as long as high-stakes exams determine a person's educational prospects, it's pretty unfair to blame teachers (and parents and learners) for their preoccupation with preparing students for exams. I don't mean to say that teachers etc. should willingly let exams lead them by the nose, and I applaud those who can look beyond exams to think and act with true education in mind. However, we would be doing our students a disservice if we didn't prepare them adequately for exams (think face validity and student-related reliability). The point is not to obsess over them and let them overrun the curriculum.

Second, washback can be positive, and we should try to leverage this. While national exams are not within our control (though we may be able to exert some subtle influence), classroom assessments are -- make sure these are aligned with our intended learning outcomes. I believe that real learning will serve students well in their exams, and that obsessive exam prepping is unnecessary.

How do you deal with washback? Let us know on Twitter with

Practicality


If you've followed this series so far, you might be thinking that wow it's hard to make a test reliable and valid -- too hard!

Well actually the first principle of language assessment discussed in Brown and Abeywickrama (2010) is 'practicality'. You could design the most reliable and valid test in the world, but if it's not practical to carry out, you know it isn't going to happen the way you planned it. My take on this is that we can try our best to be reliable and valid in our assessment, but also be realistic about what is achievable given limited resources.

For instance, an elaborate rubric might be more reliable to mark with, but if it's too complex to use easily and you have a big group of students, you might not use it the way it's intended, because it just takes too much time to mark one script. As a result, reliability suffers, because different teachers end up handling the complexity of the rubric in different ways.

Another example: we know that double marking is more reliable, but we also recognise that double marking every script of every test is just not feasible. In such a case, we have to make other efforts at maximising reliability.
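One compromise (my own suggestion, not something from the book) is to double mark just a sample of scripts and check how closely the two markers agree before the rest are marked singly. A rough sketch in Python, with invented band scores:

    # Invented example: two markers' band scores (1-6) for the same ten sample scripts.
    marker_a = [4, 3, 5, 2, 6, 4, 3, 5, 4, 2]
    marker_b = [4, 4, 5, 2, 5, 3, 3, 5, 4, 3]

    exact = sum(a == b for a, b in zip(marker_a, marker_b)) / len(marker_a)
    within_one = sum(abs(a - b) <= 1 for a, b in zip(marker_a, marker_b)) / len(marker_a)

    print(f"Exact agreement: {exact:.0%}")
    print(f"Agreement within one band: {within_one:.0%}")
    # Low agreement on the sample is a signal to re-standardise before marking the rest.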

Having said this, I think we can sometimes think of creative ways to maximise reliability and validity while still being realistic about what is doable. Take for instance standardisation meetings, which can be a drag because they take up so much valuable time. As I mentioned before, markers can be given the scripts prior to the meeting to mark at home, or they might even discuss the scripts online (e.g. by annotating them on shared Google Docs). I believe that technology can make test administration more reliable and valid in more effective and efficient ways, so we should not immediately discard a possible measure because of its perceived impracticality.

Have you got tips and strategies to maximise reliability and validity more efficiently? Please share on !

Face validity


So far we haven't considered the test-taker's point of view. Face validity refers to exactly this: does the test look right and fair to the student?

Of course, one might argue that students are not usually the best judge of validity. But their opinion, however flawed, can affect their performance. You want students to be confident and low in anxiety when taking a test, because you want to maximise student-related reliability, as mentioned in an earlier post.

Brown and Abeywickrama (2010) advise teachers to use:

  • a well-constructed, expected format with familiar tasks
  • tasks that can be accomplished within an allotted time limit
  • items that are clear and uncomplicated
  • directions that are crystal clear
  • tasks that have been rehearsed in their previous course work
  • tasks that relate to their course work (content validity)
  • a difficulty level that presents a reasonable challenge

(p. 35)

As always, please share your thoughts on .

Construct validity


This post is a bit challenging to write, partly because the concept of 'construct' is hard to explain (for me), and partly because construct validity is so central to discussions of validity in the literature.

When I started blogging about validity, I wrote that we can take the concept to mean asking the question 'does the test measure what it's supposed to measure?' We can now think a bit further as to what is actually being measured by tests. A test can only measure things that can be observed.

Say we are attempting to figure out a student's writing ability (maybe your typical school composition kind of writing). We can't actually directly measure the construct -- that mysterious, abstract ability called 'writing' -- but we do have an idea as to what it looks like. To try to fully assess it we might look at all the things that make up the ability we know as 'writing'. These are the kinds of things that you will find in your marking rubric (they are there because we think they are signs that a person is good or bad at writing): organisation, grammar, vocabulary, punctuation, spelling, etc.

So we look at what we are measuring when we assess writing, and ask ourselves if these things do indeed comprehensively make up the ability we know as 'writing'. Is anything missing (construct underrepresentation)? Is there anything there that shouldn't be there because it has nothing to do with writing per se (construct irrelevance)? Imagine a writing test that didn't include marking for 'grammar', or one that required you to do a lot of difficult reading before you write. Certainly you can test writing in either of these ways, but you'd need to be clear as to what your construct is, how it differs from the more commonly understood construct of 'writing' and why. You could argue for a construct of reading+writing based on research findings, for example.

What I've written above is probably a gross over-simplification (maybe more so than usual). If you'd like a more technical explanation, I recommend JD Brown's article for JALT. It isn't long, and I love that it even includes, if briefly, Messick's model of validity. This model is so important to our understanding of testing that I'm going to include here McNamara and Roever's (2006) interpretation of the model, in the hope that it might give you some food for thought over the long LNY weekend ;-)


Source: McNamara (2010)

Criterion validity



Okay, so you've designed a test and you've decided that if the students reach a certain mark or grade (or meet certain criteria), they have achieved the learning outcomes you're after. But are you really sure? How can you know? This is essentially the question we aim to answer when we consider criterion validity.

We can consider two aspects of criterion validity: concurrent validity and predictive validity.

To establish concurrent validity, we assess students in another way for the same outcomes, to see for example if those who performed well in the first assessment really have that level of proficiency. In my previous post on content validity, I gave the example of an MCQ grammar test vs an oral interview speaking test, to measure grammatical accuracy in speaking. To check the concurrent validity of the MCQ test, you could administer both tests to the same group of students, and see how well the two sets of scores correlate. (This does assume you are confident of the validity of the speaking test!) In a low stakes classroom testing situation, you might not have the time to administer another test, but you could for instance call up a few students for a short talk, and check their grammatical accuracy that way. You might pick the students who are borderline passes -- this could show you whether your pass mark is justified.
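If you do have both sets of scores, checking the correlation takes only a few lines. A minimal sketch in Python, with invented scores (scipy is just my tool of choice here, purely for illustration):

    # Invented scores for the same eight students on the two tests.
    from scipy.stats import pearsonr

    mcq_scores = [12, 18, 9, 15, 20, 7, 14, 16]    # MCQ grammar test (out of 20)
    oral_scores = [10, 17, 8, 13, 19, 9, 12, 15]   # oral interview rating (out of 20)

    r, p = pearsonr(mcq_scores, oral_scores)
    print(f"r = {r:.2f}, p = {p:.3f}")
    # A strong positive r suggests the MCQ test ranks students similarly to the
    # oral test, which supports (but doesn't prove) its concurrent validity.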

As for predictive validity, this is really more important when the test scores determine the placement of the student. Singapore schools typically practise streaming and/or banding to place students with others of the same level. If the test we use to determine their placement does not have predictive validity, that means there is a good chance the student would not be successful in that group. Which kind of defeats the purpose of streaming/banding! We can't predict the future, but we can compare past and future performances. We could for instance compare the test scores of students a few months into their new placement with the test scores we used to determine their placement. If there are students who perform much better or poorer than you would reasonably expect, it's time to re-examine the original test, and probably move the students to a more suitable class too.
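If you want to do that comparison a bit more systematically, here's a rough sketch (invented numbers again): fit a simple line predicting later scores from placement scores, then flag the biggest surprises.

    import numpy as np

    # Invented example: placement test scores and scores a few months into the new class.
    placement = np.array([45, 60, 72, 55, 80, 38, 65, 70])
    later = np.array([50, 62, 75, 30, 78, 42, 68, 90])

    # Simple linear prediction of later performance from the placement score.
    slope, intercept = np.polyfit(placement, later, 1)
    residuals = later - (slope * placement + intercept)

    # Flag students whose later performance is far from what the placement test predicted.
    threshold = 1.5 * residuals.std()
    for i, r in enumerate(residuals, start=1):
        if abs(r) > threshold:
            print(f"Student {i}: placement {placement[i - 1]}, now {later[i - 1]} ({r:+.1f} vs. prediction)")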

That's about it for criterion validity. As always, tweet your comments and questions with .

Content validity



This week we turn to validity. It can be a tricky concept and in fact was something that took me some time to 'get' at first. The easiest approach I've found to interpreting 'validity' is to ask the question: does the test measure what it's supposed to measure?


What do we want it to measure? Do we know what we want to measure, in the first place? So often I think we plan assessment without being perfectly clear about our purpose. (Sometimes it's because our learning outcomes aren't very clear to begin with.) Brown and Abeywickrama (2010) list other qualities of a valid test but I think the above definition is enough to work with for now.


Following the same book, I'm starting with content-related validity. This is pretty straightforward: is what you want to test in the test? This might seem kind of 'duh' but it's actually a trap that's quite easy to fall into. For instance, our purpose might be to test learners' grammatical accuracy when speaking, but instead of actually getting them to speak, we set an MCQ grammar test. The former would be a direct test, while the latter (arguably) an indirect test of the same thing. Indirect tests are often used for reasons of practicality and reliability; obviously it's much easier to mark a class's MCQ test (it could even be done automatically) than to administer an individual oral test for each student.
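Just to show how trivially MCQ marking can be automated, here's a toy sketch with an invented answer key:

    # Invented answer key and one student's responses for a five-item MCQ grammar test.
    key = ["B", "D", "A", "C", "B"]
    responses = ["B", "D", "C", "C", "B"]

    score = sum(given == correct for given, correct in zip(responses, key))
    print(f"Score: {score}/{len(key)}")  # marks itself in one line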


If it really isn't possible to achieve high content validity, then we've got to look into the other validities of our test. More of those in the coming weeks. In the meantime, keep your questions and comments coming with on Twitter.


Test reliability


This post is for those of you who set your own assessments. Which I guess we all have to sooner or later!

There are all sorts of 'best practices' you can read about test reliability in large-scale, 'standardised' tests (I use 'standardised' here in the true sense, i.e. not exclusively Multiple Choice Questions). As usual, though, I will concentrate on what is practical to do within the context of the classroom.

I want to start with MCQs in fact, because we usually think of them, or any other sort of dichotomously scored items (T/F, matching, etc), as being the most reliable. However, if you've read my post on rater reliability, you'll recall that the validity of such items can be questionable (i.e. do they really test what we want to test?) They can also be unreliable in unexpected ways. For instance, it's quite common to find MCQ items with more than one correct answer, and so students who choose the 'unofficial' correct answer end up being marked wrong. This might not always be apparent to us as test writers, so it's a good idea to get colleagues to check. Having been test takers ourselves, we must know too that it's all too tempting to tikam (i.e. guess) when we don't know the right answer to an MCQ. So your student might get the right answer by sheer luck. Sometimes they guess the right answer because of irrelevant clues (e.g. it's longer/shorter than the other options).

MCQs aren't necessarily bad items, but they do require a lot of time and effort to design well, and should be avoided unless you are willing to invest both. Perhaps you are designing a large-scale test that you want to be able to mark quickly, and will build up a test bank of recyclable test items over time. There is lots of good advice out there for MCQ test designers.
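If you do go down the MCQ route, a basic item analysis after the first administration will catch many of the problems above (items everyone guesses correctly, items with ambiguous keys). A rough sketch with invented response data; a fuller analysis would exclude the item itself from the total when computing discrimination:

    import numpy as np

    # Invented example: rows = students, columns = items; 1 = correct, 0 = wrong.
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ])

    totals = responses.sum(axis=1)        # each student's total score
    facility = responses.mean(axis=0)     # proportion correct per item

    # Discrimination: correlation between getting the item right and the total score.
    discrimination = [np.corrcoef(responses[:, i], totals)[0, 1]
                      for i in range(responses.shape[1])]

    for i, (f, d) in enumerate(zip(facility, discrimination), start=1):
        print(f"Item {i}: facility = {f:.2f}, discrimination = {d:.2f}")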



If we go with subjectively scored items such as essays, there is likely to be rater unreliability. We already know that this can be minimised, though, and in most classroom language assessment situations I tend to think that time is better spent on this than on designing good MCQs. Such test items can also be badly designed, though. They can be ambiguous in the way they are written, such that a student who knows their stuff doesn't actually give you what you thought you were asking for. Again, getting the help of colleagues to check the items is useful.

Brown and Abeywickrama (2010) offer some other tips for enhancing test reliability. Don't make the test too long: while tests can be too short to reliably measure proficiency, they can also be so long that they cause fatigue in test takers. They also point out that some people (like me!) don't cope well with the stress of timed tests.

I'll stop here but if you have something to say about test reliability, please tweet it with .