
Alphabet soup: AfL, AaL, LOA

2 min read

Last week, my post on formative assessment (and a subsequent tweet asking for suggestions) sparked a short conversation on Twitter with @ashley about Assessment for Learning and Assessment as Learning, as well as Learning Oriented Assessment. I'm still looking for suggestions for this blog (let me know!); in the meantime, here's my attempt at sorting out these concepts.

Assessment for Learning (AfL) is for all intents and purposes formative assessment. It's useful here to revisit Dylan Wiliam @dylanwiliam's table:

Assessment as Learning was originally proposed by Lorna Earl @lmearl. While often differentiated from AfL, if we accept Wiliam's definition of AfL, AaL is more accurately a subset of AfL:

Learning Oriented Assessment is the 'new' kid on the assessment block:

Figure from
Carless (2007)

Originally proposed by David Carless @carlessdavid and his colleagues, the concept should ring a bell for those of you who are familiar with the backward design approach to curriculum. This approach includes Understanding by Design (Wiggins @grantwiggins & McTighe @jaymctighe), popular in K-12:

(taken from here; original source unknown)

And also Biggs's Constructive Alignment (well-known in HE):

Diagram by UCD Teaching & Learning

I see LOA as a model that not only employs backward design, but does so in a way that foregrounds formative assessment (including AaL). It also de-emphasises the distinction between summative and formative assessment in a way that might actually be constructive -- the key is to make summative assessment serve learning, in addition to institutional purposes. I say constructive because treating the two kinds of assessment as a dichotomy (i.e. as mutually exclusive) could put teachers and learners in a bind -- we can't do away with summative assessments because of institutional demands, and positioning them as the 'bad guys' doesn't necessarily eliminate negative washback. IMO, the distinction between formative and summative is still important, but the gap can be narrowed, and an assessment could be thoughtfully designed to serve both purposes, perhaps especially if it is an 'alternative' assessment rather than a traditional timed test. By aligning all assessments with the learning outcomes (LOs), we can ideally ensure that both kinds -- summative and formative -- pull stakeholders in the same direction rather than in opposing ones, and promote positive washback.

I've really only just started thinking about these concepts (and what they mean in relation to my own research), so any thoughts you might have on this are very welcome :)

Formative assessment

3 min read

What is assessment? While we often use “test” and “assessment” interchangeably, it’s important to differentiate the two. A test is an assessment, but an assessment isn’t necessarily a test. Tests are usually timed and result in marks or grades. Assessments can take many other forms, however.

Hill and McNamara (2012) talk about assessment opportunities, which they define as ‘any actions, interactions or artifacts... which have the potential to provide information on the qualities of a learner’s... performance’. It’s important to note that these can be unplanned, unconscious and embedded, and therefore can take place anytime in class, and these days, out of class as well.

Assessment opportunities are particularly useful for formative assessment. Black and Wiliam, who have written extensively on this topic, say that assessment is formative only if the evidence about student achievement obtained is actually used to make decisions about the next steps in instruction.

Formative assessment is often known as Assessment for Learning. The Assessment Reform Group came up with this diagram (above) to illustrate the importance of formative assessment. I think it shows the different dimensions of formative assessment very well. I particularly like the point about developing the capacity for self-assessment, which is critical to the development of self-directed learners. In their definition of AfL, the 3 aims are to find out where the learners are, where they need to go, and how best to get there.

Wiliam usefully unpacks formative assessment in the chart above, which shows us the respective roles of teacher, peer and learner in achieving the 3 aims I’ve just mentioned. As you can see, formative assessment, done right, ought to cultivate active and collaborative learners.

So what’s the difference between formative assessment and its opposite, summative assessment? In a nutshell, they have different functions and result in different things. Summative assessment is used to rank or certify, and for accountability purposes, while formative assessment is actually used to meet learner needs. Summative assessment typically ends with grades or marks, while formative assessment produces feedback for the learner instead.

Black and Wiliam have noted that when students are given both, they tend to ignore feedback and focus solely on their grades or marks. This is a habit that’s hard to break, and makes marks and grades doubly un-useful for learners.

What are some other reasons formative assessment is important? Black and Wiliam have reported significant learning gains as a result, noting that it helps low achievers in particular.

So often, however, teachers think of formative assessments as little tests that result in marks or grades, which tell neither the teachers nor the students much about the learning that’s going on, or what to do next.

Formative assessment can be embedded into our class activities. Take a look at this page by the Northwest Evaluation Association for some ideas.

What formative assessment activities do you use? How do you and your students use them to inform teaching and learning? Please share with us on Twitter.

Designing tests

1 min read

I'm cheating a bit this week by posting a set of slides adapted from the one I used for my class. (This cycle will be a bit different if designing alternative and/or formative assessments.)


2 min read

Even if you are not familiar with the term, you are probably familiar with the concept of washback (commonly called backwash in educational assessment). It refers to the effects of assessment on teaching and learning, and anyone who's studied in an exam-oriented system would have experienced this.

We tend to think poorly of washback because we often think of negative washback, e.g. ignoring what's in the syllabus in favour of what will be in the exam, even if we think that the syllabus has more worthy learning outcomes. While washback can be very problematic, I think we do need to consider two things.

First, as long as high-stakes exams determine a person's educational prospects, it's pretty unfair to blame teachers (and parents and learners) for their preoccupation with preparing students for exams. I don't mean to say that teachers etc. should willingly let exams lead them by the nose, and I applaud those who can look beyond exams to think and act with true education in mind. However, we would be doing our students a disservice if we didn't prepare them adequately for exams (think face validity and student-related reliability). The point is not to obsess over exams or let them overrun the curriculum.

Second, washback can be positive, and we should try to leverage this. While national exams are not within our control (though we may be able to exert some subtle influence), classroom assessments are, so we should make sure these are aligned with our intended learning outcomes. I believe that real learning will serve students well in their exams, and that obsessive exam prepping is unnecessary.

How do you deal with washback? Let us know on Twitter.


1 min read

Authenticity is about the closeness of your assessment task to a real-world task. This seems quite straightforward, until you consider that in the real world, few tasks demand only language proficiency, and not also non-language related knowledge and competencies.

So how authentic can a language test get? Brown and Abeywickrama (2010) list a few qualities:

  • language that is as natural as possible
  • contextualised items
  • meaningful, relevant, interesting topics (although it's worth considering that meaningful, relevant, interesting to us may not be meaningful, relevant, interesting to students)
  • some thematic organisation to items, e.g. through a storyline
  • 'real-world' tasks (which could also be questionable -- do language teachers necessarily have an accurate sense of the authenticity of tasks?) 

It's possible that what we need for optimal authenticity are 'integrated' assessment tasks that combine different subjects in the curriculum, instead of language on its own.

What do you think? What sort of authentic assessment tasks do you use? Let me know on Twitter.


2 min read

If you've followed this series so far, you might be thinking that wow it's hard to make a test reliable and valid -- too hard!

Well, actually, the first principle of language assessment discussed in Brown and Abeywickrama (2010) is 'practicality'. You could design the most reliable and valid test in the world, but if it's not practical to carry out, you know it isn't going to happen the way you planned it. My take on this is that we can try our best to be reliable and valid in our assessment, but also be realistic about what is achievable given limited resources.

For instance, an elaborate rubric might be more reliable to mark with, but if it's too complex to use easily and you have a big group of students, you might not use it the way it's intended, because it just takes too much time to mark one script. As a result, reliability suffers, because different teachers end up handling the complexity of the rubric in different ways.

Another example: we know that double marking is more reliable, but we also recognise that double marking every script of every test is just not feasible. In such a case, we have to make other efforts at maximising reliability.
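One possible "other effort" (my suggestion, not something from Brown and Abeywickrama) is to double mark only a small sample of scripts and check how closely the two markers agree. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for chance agreement. The band grades and the function name `cohens_kappa` are made up for illustration.

```python
from collections import Counter

def cohens_kappa(marker_a, marker_b):
    """Agreement between two markers on the same scripts, corrected for chance."""
    if len(marker_a) != len(marker_b) or not marker_a:
        raise ValueError("Both markers must grade the same non-empty set of scripts")
    n = len(marker_a)
    # Proportion of scripts on which the two markers gave the same band
    observed = sum(a == b for a, b in zip(marker_a, marker_b)) / n
    # Agreement expected by chance, from how often each marker uses each band
    count_a, count_b = Counter(marker_a), Counter(marker_b)
    expected = sum(count_a[band] * count_b[band] for band in count_a) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when both markers use a single band")
    return (observed - expected) / (1 - expected)

# Hypothetical bands awarded to a sample of eight double-marked scripts
marker_a = ["A", "B", "B", "C", "A", "B", "C", "C"]
marker_b = ["A", "B", "C", "C", "A", "B", "B", "C"]

kappa = cohens_kappa(marker_a, marker_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa close to 1 suggests the markers are applying the rubric consistently. A low value on the sample might be a cue to hold a standardisation discussion before marking the rest.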

Having said this, I think we can sometimes think of creative ways to maximise reliability and validity while still being realistic about what is doable. Take for instance standardisation meetings, which can be a drag because they take up so much valuable time. As I mentioned before, markers can be given the scripts prior to the meeting to mark at home, or they might even discuss the scripts online (e.g. by annotating them on shared Google Docs). I believe that technology can offer more effective and efficient ways to make test administration reliable and valid, and we should therefore not immediately discard a possible measure because of its perceived impracticality.

Have you got tips and strategies to maximise reliability and validity more efficiently? Please share on Twitter!

Face validity

1 min read

So far we haven't considered the test-taker's point of view. Face validity refers to exactly this: does the test look right and fair to the student?

Of course, one might argue that students are not usually the best judge of validity. But their opinion, however flawed, can affect their performance. You want students to be confident and low in anxiety when taking a test, because you want to maximise student-related reliability, as mentioned in an earlier post.

Brown and Abeywickrama (2010) advise teachers to use:

  • a well-constructed, expected format with familiar tasks
  • tasks that can be accomplished within an allotted time limit
  • items that are clear and uncomplicated
  • directions that are crystal clear
  • tasks that have been rehearsed in their previous course work
  • tasks that relate to their course work (content validity)
  • a difficulty level that presents a reasonable challenge

(p. 35)

As always, please share your thoughts on Twitter.

Construct validity

3 min read

This post is a bit challenging to write, partly because the concept of 'construct' is hard to explain (for me), and partly because construct validity is so central to discussions of validity in the literature.

When I started blogging about validity, I wrote that we can take the concept to mean asking the question 'does the test measure what it's supposed to measure?' We can now think a bit further as to what is actually being measured by tests. A test can only measure things that can be observed.

Say we are attempting to figure out a student's writing ability (maybe your typical school composition kind of writing). We can't actually measure the construct directly -- that mysterious, abstract ability called 'writing' -- but we do have an idea as to what it looks like. To try to fully assess it we might look at all the things that make up the ability we know as 'writing'. These are the kinds of things that you will find in your marking rubric (they are there because we think that they are signs a person is good or bad at writing): organisation, grammar, vocabulary, punctuation, spelling, etc.

So we look at what we are measuring when we assess writing, and ask ourselves if these things do indeed comprehensively make up the ability we know as 'writing'. Is anything missing (construct underrepresentation)? Is there anything there that shouldn't be there because it has nothing to do with writing per se (construct irrelevance)? Imagine a writing test that didn't include marking for 'grammar', or one that required you to do a lot of difficult reading before you write. Certainly you can test writing in either of these ways, but you'd need to be clear as to what your construct is, how it differs from the more commonly understood construct of 'writing' and why. You could argue for a construct of reading+writing based on research findings, for example.
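The two questions above can be pictured as a simple comparison of two lists. This is only a toy sketch: the construct components come from the rubric example earlier in this post, while the rubric contents and the function name `compare_rubric_to_construct` are invented for illustration. Real construct definition is a matter of judgement and research, not set arithmetic.

```python
def compare_rubric_to_construct(construct, rubric):
    """Return (components missing from the rubric, rubric criteria outside the construct)."""
    missing = sorted(construct - rubric)      # possible construct underrepresentation
    irrelevant = sorted(rubric - construct)   # possible construct irrelevance
    return missing, irrelevant

# What we believe makes up 'writing' (taken from the example in this post)
construct = {"organisation", "grammar", "vocabulary", "punctuation", "spelling"}

# A hypothetical rubric that drops grammar and rewards reading comprehension
rubric = {"organisation", "vocabulary", "punctuation", "spelling", "reading comprehension"}

missing, irrelevant = compare_rubric_to_construct(construct, rubric)
print("Not assessed:", missing)
print("Assessed but outside the construct:", irrelevant)
```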

What I've written above is probably a gross over-simplification (maybe more so than usual). If you'd like a more technical explanation, I recommend JD Brown's article for JALT. It isn't long, and I love how it even includes, even if briefly, Messick's model of validity. This model is so important to our understanding of testing that I'm going to include here McNamara and Roever's (2006) interpretation of the model, in the hope that it might give you some food for thought over the long LNY weekend ;-)

Source: McNamara (2010)

Criterion validity

3 min read

This work by frankleleon is licensed under a Creative Commons Attribution 2.0 Generic Licence

Okay, so you've designed a test and you've decided that if the students reach a certain mark or grade (or meet certain criteria), they have achieved the learning outcomes you're after. But are you really sure? How can you know? This is essentially the question we aim to answer when we consider criterion validity.

We can consider two aspects of criterion validity: concurrent validity and predictive validity.

To establish concurrent validity, we assess students in another way for the same outcomes, to see for example if those who performed well in the first assessment really have that level of proficiency. In my previous post on content validity, I gave the example of an MCQ grammar test vs an oral interview speaking test, to measure grammatical accuracy in speaking. To check the concurrent validity of the MCQ test, you could administer both tests to the same group of students, and see how well the two sets of scores correlate. (This does assume you are confident of the validity of the speaking test!) In a low stakes classroom testing situation, you might not have the time to administer another test, but you could for instance call up a few students for a short talk, and check their grammatical accuracy that way. You might pick the students who are borderline passes -- this could show you whether your pass mark is justified.
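To make "see how well the two sets of scores correlate" a little more concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two sets of scores. The scores are invented, and `pearson_r` is just an illustrative helper name. In practice you might use a spreadsheet or a statistics package instead.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two equal-length lists with at least two scores each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when all scores in a list are equal")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six students, in the same order
mcq_scores = [12, 15, 9, 18, 14, 7]      # MCQ grammar test (out of 20)
speaking_scores = [3, 4, 2, 5, 4, 2]     # oral interview band (1 to 5)

r = pearson_r(mcq_scores, speaking_scores)
print(f"Correlation between the two tests: {r:.2f}")
```

A coefficient near 1 would suggest the MCQ test ranks students much as the speaking test does. With a class-sized group the figure is only a rough indication, so treat it as one piece of evidence rather than proof.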

As for predictive validity, this is really more important when the test scores determine the placement of the student. Singapore schools typically practise streaming and/or banding to place students with others of the same level. If the test we use to determine their placement does not have predictive validity, that means there is a good chance the student would not be successful in that group. Which kind of defeats the purpose of streaming/banding! We can't predict the future, but we can compare past and future performances. We could for instance compare the test scores of students a few months into their new placement with the test scores we used to determine their placement. If there are students who perform much better or poorer than you would reasonably expect, it's time to re-examine the original test, and probably move the students to a more suitable class too.
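One rough way to spot students who perform "much better or poorer than you would reasonably expect" is to compare each student's standing in the group on the two occasions. The sketch below converts both sets of scores to z-scores and flags large shifts. The names, scores and the threshold of 1.0 are all assumptions for illustration, not a recommended cut-off.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Standardise scores so tests with different scales can be compared."""
    m, sd = mean(scores), pstdev(scores)
    if sd == 0:
        raise ValueError("Cannot standardise when all scores are identical")
    return [(s - m) / sd for s in scores]

def flag_unexpected(placement, later, threshold=1.0):
    """Return students whose relative standing shifted by more than `threshold` SDs."""
    if set(placement) != set(later):
        raise ValueError("Both tests must cover the same students")
    names = sorted(placement)
    z_before = z_scores([placement[name] for name in names])
    z_after = z_scores([later[name] for name in names])
    return [name for name, before, after in zip(names, z_before, z_after)
            if abs(after - before) > threshold]

# Hypothetical scores: the placement test, then a class test a few months later
placement = {"A": 50, "B": 60, "C": 70, "D": 80, "E": 90}
later = {"A": 52, "B": 61, "C": 95, "D": 78, "E": 88}

flagged = flag_unexpected(placement, later)
print("Worth a second look:", flagged)
```

A flagged student is a prompt to look more closely, both at the original test and at whether the student is in the right class. It is not a verdict in itself.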

That's about it for criterion validity. As always, tweet your comments and questions.

Content validity

2 min read

This work by Nevit Dilmen is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported Licence.

This week we turn to validity. It can be a tricky concept and in fact was something that took me some time to 'get' at first. The easiest approach I've found to interpreting 'validity' is to ask the question: does the test measure what it's supposed to measure?

What do we want it to measure? Do we know what we want to measure, in the first place? So often I think we plan assessment without being perfectly clear about our purpose. (Sometimes it's because our learning outcomes aren't very clear to begin with.) Brown and Abeywickrama (2010) list other qualities of a valid test but I think the above definition is enough to work with for now.

Following the same book, I'm starting with content-related validity. This is pretty straightforward: is what you want to test in the test? This might seem kind of 'duh' but it's actually a trap that's quite easy to fall into. For instance, our purpose might be to test learners' grammatical accuracy when speaking, but instead of actually getting them to speak, we set an MCQ grammar test. The former would be a direct test, while the latter is (arguably) an indirect test of the same thing. Indirect tests are often used for reasons of practicality and reliability; obviously it's much easier to mark a class's MCQ test (it could even be done automatically) than to administer an individual oral test for each student.

If it really isn't possible to achieve high content validity, then we've got to look into the other validities of our test. More on those in the coming weeks. In the meantime, keep your questions and comments coming on Twitter.