Book Review: Language assessment in practice: Developing language assessments and justifying their use in the real world

Abstract

Language Assessment in Practice, by Lyle Bachman and Adrian Palmer, presents a comprehensive framework for evaluating the usefulness of assessment systems. The core concept in the book, the Assessment Use Argument (AUA), is quite general and provides a framework for the development and evaluation, or justification, of essentially any assessment program. The examples and projects involve language assessment, but the framework could easily be extended to a wide range of assessments.

My discussion of this book (like Caesar’s Gaul) is divided into three parts. In the first part of my review, I will discuss three separate conceptual traditions that the book draws on. First, the authors take test use and the consequences of use as their core concern, rather than the interpretations of test scores per se, or the technical qualities of the assessments. Second, the book adopts an argument-based framework for its development. Third, although the framework allows for a wide range of test characteristics, score interpretations and uses, the book reflects a preference for the use of assessment tasks that reflect the complexity of the language-use tasks that are of interest.

In the second part of my review, I will make a few remarks about the book’s structure and content. This section represents the common core of all book reviews, and discusses the design and potential uses of the book. The authors seem to have had at least three general goals in mind in writing the book. It can serve as a monograph that develops a framework for language assessment development and justification, as a practical guide for assessment developers, and as an advanced textbook. It does pretty well on all three fronts, but as is usually the case for projects with multiple potential uses, the fit is not perfect for any of them.

In the third part of my review, I will mention two related problems that probably can’t be solved, but perhaps can be ameliorated, and in any case, are worth acknowledging and talking about from time to time. The first is the tendency for different but related research areas to diverge over time, in response to different contexts and different problems or goals. The second is the resulting differences in terminology, which are both a result of the separation and a cause of further separation. As a result, as suggested in Cool Hand Luke, we may have ‘a failure to communicate.’

Some conceptual antecedents

In the interest of full disclosure, I should mention that my background has been mainly in educational assessment and in licensure and certification assessment, with limited experience in language assessment per se. My background necessarily limits my perspective to some extent, but divergent views can be informative.

As I read the book, I noted echoes from three distinct traditions in assessment: a focus on uses and consequences, an argument-based framework, and extensive reliance on the analyses of tasks and context to buttress the claims to be made about assessment uses. The book provides a new framework for evaluating language assessment programs, but a recognition of these conceptual antecedents can put the work in a historical context.

First, the authors make the uses of assessments and the consequences resulting from these uses the central organizing principle for the book and for their framework for assessment development and justification. The recognition that assessments should be evaluated in terms of how well they work in practice, and not just in terms of technical characteristics or the plausibility of proposed interpretations, goes back a long way, but Bachman and Palmer make it the centerpiece of their framework.

Assessments have generally been evaluated in terms of their reliability and validity, with reliability providing an indication of the consistency or stability of assessment scores, and validity addressing the meaning and utility of the scores. For the first half of the twentieth century, validity tended to emphasize the usefulness of the scores for particular purposes (e.g. predicting some criterion). In the first edition of Educational Measurement, Cureton (1951) defined validity in terms of test use:

The essential question of test validity is how well a test does the job it is employed to do. The same test may be used for several different purposes, and its validity may be high for one, moderate for another and low for a third. (p. 621)

Cronbach and Meehl’s (1955) watershed article on construct validity focused on interpretations. Messick (1981, 1988, 1989, 1994) and Cronbach (1971, 1988) redressed the balance; they gave a lot of attention to interpretations, but also emphasized the consequences resulting from various uses of assessment.

To the extent that validity is defined in terms of how well an assessment program achieves its goals, it necessarily includes some attention to consequences, positive and negative. The role of consequences in validity is controversial. Some authors (Borsboom, Mellenbergh, & van Heerden, 2004; Popham, 1997; Sackett, 1998) have argued for a focus on a semantic interpretation of scores, to the exclusion of most consequences. Others (Messick, 1989, 1994; Linn, 1997; Shepard, 1997; Kane, 2006) have advocated for a conception of validity involving both the meaning of assessment scores and the consequences of their use.

There is no ambiguity about where Bachman and Palmer stand on this issue. On page 2, they make it clear that, ‘All of the various components of this book are tied in to a structure for justifying assessment use’. They go on to say that:

At the core of our approach is an Assessment Use Argument (AUA), which guides the assessment development process. The AUA consists of a set of claims that specify the conceptual links between a test taker’s performance on an assessment, an assessment record, which is the score or qualitative description we obtain from the assessment, an interpretation about the ability we want to assess, the decisions that are to be made, and the consequences of using the assessment and of the decisions that are made. (p. 30)

Bachman and Palmer subsume traditional notions of reliability and validity under the AUA, and maintain that: ‘Assessment justification consists of articulating an Assessment Use Argument (AUA) and collecting evidence to support this’ (p. 30).

They are decisive in opting for a framework that emphasizes the evaluation of assessment uses in terms of their consequences. Given past experience, it seems unlikely that everyone will hop on their bandwagon, but they provide a clear path to follow in this regard.

Second, in organizing their discussion around the role of AUAs, the authors adopt an argument-based approach to developing their framework. Following Toulmin’s analysis of practical reasoning, the AUA is to ‘include the following elements: data, claims, warrants, backing, rebuttals, and rebuttal backing’ (p. 99).

Argument-based approaches to validation (Cronbach, 1971, 1988; House, 1980; Kane, 1992, 2006; Shepard, 1993) suggest that the claims being made in a score interpretation and use be stated clearly (e.g. in the form of an explicit interpretive argument) and that these claims be critically evaluated.

Bachman and Palmer follow in this tradition and extend it in several ways. They base their approach to language assessment development and use fundamentally on ‘the need for a clearly articulated and coherent Assessment Use Argument (AUA)’ and on ‘the provision of evidence to support the statements in the AUA’ (p. 31). They adopt the framework and terminology of an argument-based approach, but they do not emphasize the validity of a proposed interpretation per se. Rather, they are concerned with the general question of the justification for assessment uses, with the justification of proposed interpretations constituting one of several major claims in the AUA.

Argument-based approaches to validity have generally attended to both the interpretations and the uses of assessment scores (Kane, 2006; Shepard, 1993), and, as a result, have attended to the consequences of particular uses as well as to interpretive frameworks. Bachman and Palmer go further and make score uses and the consequences of score uses the centerpiece of their discussion: ‘An AUA provides the conceptual framework for linking a claim about a particular set of consequences to the performance of individuals on a language assessment’ (p. 156). Given the high stakes of many emerging uses of assessment systems (e.g. in school and teacher accountability, in employment and immigration decisions), the analysis of consequences in justifying assessment programs is becoming increasingly important.

The third antecedent is the growing interest in performance, or authentic, assessment over the last 30 years; the authors devote a lot of attention to the fidelity of assessment tasks to the language use tasks that are of primary interest given the proposed use of the assessment. Their framework emphasizes the development of assessments that are authentic in the sense that they require the kinds of performance that are of interest.

Bachman and Palmer treat language use as a complex and highly interactive activity, involving the ‘negotiation of intended meanings between two or more individuals in a particular situation’ (p. 34). They define language ability as a ‘capacity that enables language users to create and interpret discourse’ (p. 33) and suggest that it includes two components: language knowledge and strategic competence, but they also recognize a need to consider ‘personal attributes, topical knowledge, affective schemata, and cognitive strategies’ (p. 33).

Bachman and Palmer define language use tasks as elemental activities of language use, and they define target language use (TLU) tasks as language use tasks in some specific setting outside of the assessment itself. They suggest that:

Language use can be viewed as performance of a series of language use tasks that are interrelated, in terms of the setting, the communicative goal to be achieved, and the participants. A language assessment task can be thought of as a procedure for eliciting responses from which inferences can be made about an individual’s language ability. It therefore follows that in order for such inferences to be made, a language assessment should consist of language use tasks. In designing language assessments whose use we can justify, it is important to include tasks whose characteristics correspond to those of TLU tasks. (p. 63)

To the extent possible, it seems that Bachman and Palmer would design the assessment as a sample of performances from the TLU domain.

This approach has some clear advantages, which Bachman and Palmer discuss at some length, but it can also entail some significant disadvantages (Widdowson, 2001). Rich performance tasks tend to require a significant period of time to complete, and therefore the sampling of tasks from the TLU domain is likely to be small and potentially not representative of the domain as a whole. The art of assessment development is to balance the need for adequate sampling of the domain and consistency in scores across replications of the assessment with the need for tasks that are as authentic as possible.

Structure and content of the book

This book could serve at least three general functions. First, it is basically a monograph in which a framework for the development and justification of language assessments is developed in some detail. The first five chapters lay out the framework and criteria for justifying assessment programs. Chapters 6 to 12 provide a description of how to develop language assessments and associated AUAs. The third part of the book, Chapters 13 to 21, discusses the development and use of language assessments in real-world contexts. As discussed above, the framework has roots in educational assessment and in language assessment, and it provides a rich basis for assessment development and justification with a strong focus on language assessment use and on consequences.

Second, the monograph can serve as a guide to assessment development. It provides a number of examples of language use that are carried through the book, and it discusses the claims made in language assessments in considerable detail. It does not provide a cookbook approach for assessment development, but it does include lists of claims, warrants, and rebuttals that are likely to come up in developing language assessments, and provides detailed advice on how to deal with the many issues that are likely to arise in assessment development. In some cases, the detail, which would be useful for assessment development, slows the development of the argument down a bit, but readers always get to pick and choose how they will read a book; in a section of the first chapter, the authors say of the book that they ‘cannot imagine anyone sitting down and reading it cover to cover’ (p. 15) and then give some advice on how to get the most out of their work.

Third, the book could provide the basis for an advanced course in language assessment. In addition to the development and discussion of the framework, the book includes extended examples and a set of project materials on a website. The book is set up to facilitate its use in a language assessment course.

Communication across disciplines and sub-disciplines

There is a natural tendency for distinct but related research areas to diverge over time, developing differences in terminology and methodology as a result of differences in contexts and goals. Most of this differentiation is necessary and desirable, as new models and methods are developed to meet specific needs in different contexts, but some of the differences may cause unnecessary confusion. The differences are particularly problematic if the same term is used to describe different concepts or methods, or different terms are used to label a common concept or method.

I would have preferred that Bachman and Palmer make different choices for one set of terms. They define generalizability as ‘the degree of correspondence between a given language assessment task and a TLU task in their task characteristics’ (p. 117), and they define consistency as the extent to which ‘the assessment records are consistent across different characteristics of the assessment situation, such as different assessment tasks, different forms of the assessment, different assessors, or different times of assessment’ (p. 126).

However, there is a well-established theoretical framework, developed by Lee Cronbach and his associates, for the kinds of precision/reliability/generalizability issues considered under the heading of consistency in the AUA framework. This framework, Generalizability Theory (Brennan, 2001), goes back over 40 years, and I think that the difference between Bachman and Palmer’s usage of the term ‘generalizability’ and the usage of this term in Generalizability Theory invites confusion. Bachman and Palmer have proposed a new framework, which necessarily requires new terminology, but I suggest that if they revise the book or write a sequel, they consider changing their usage of the term ‘generalizability’.

A second labeling issue, which could also generate some ambiguity, involves the term ‘warrant’. Bachman and Palmer (2010) define warrants as ‘statements that elaborate the qualities of a claim’ (p. 161), and they associate many warrants with some claims. As described by Kane (1992, 2006) and Mislevy, Steinberg, and Almond (2003), Toulmin (1958) treats warrants as rules of inference that justify the adoption of a claim, based on data, or as ‘general, hypothetical statements, which can act as bridges, and authorize the sort of step to which our particular argument commits us’ (p. 98). Toulmin’s usage may be less widespread in educational measurement than the usage of ‘generalizability’, but the reader of various argument-based analyses should be aware of the variability.

Concluding remarks

I expect that a wide range of individuals who are involved in language assessment, and more generally, in educational assessment, will find this book thought-provoking and helpful in the practice of assessment development and justification. It is an important contribution to the development of argument-based approaches to assessment development and justification.

References

Bachman (2002). Alternative interpretations of alternative assessments: Some validity issues in educational performance assessments. Educational Measurement: Issues and Practice, 21(3), 5–18.

Borsboom, Mellenbergh, & van Heerden (2004). The concept of validity. Psychological Review, 111, 1061–1071.

Brennan (2001). Generalizability theory. New York: Springer-Verlag.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement, 2nd ed. (pp. 443–507). Washington, DC: American Council on Education.

Cronbach, L. J. (1988). Five perspectives on validity argument. In Wainer & Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.

Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington, DC: American Council on Education.

House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage Publications.

Kane (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.

Kane (2006). Validation. In Brennan (Ed.), Educational measurement, 4th ed. (pp. 17–64). Westport, CT: American Council on Education and Praeger.

Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 1–14). Cambridge: Cambridge University Press.

Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14–16.

Linn, R. L. (1998). Partitioning responsibility for the evaluation of the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 28–30.

Messick (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10, 9–20.

Messick (1988). The once and future issues of validity. Assessing the meaning and consequences of measurement. In Wainer & Braun (Eds.), Test validity (pp. 33–45). Hillsdale, NJ: Lawrence Erlbaum.

Messick (1989). Validity. In R. L. Linn (Ed.), Educational measurement, 3rd ed. (pp. 13–103). New York: American Council on Education and Macmillan.

Messick (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23.

Mislevy, Steinberg, & Almond (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.

Popham, W. J. (1997). Consequential validity: Right concern – wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13.

Sackett, P. R. (1998). Performance assessment in education and professional certification: Lessons for personnel selection? In M. D. Hakel (Ed.), Beyond multiple choice: Evaluating alternatives for traditional testing for selection (pp. 113–129). Mahwah, NJ: Lawrence Erlbaum.

Shepard, L. A. (1993). Evaluating test validity. In Darling-Hammond (Ed.), Review of research in education, Vol. 19 (pp. 405–450). Washington, DC: American Educational Research Association.

Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–8, 13, 24.

Toulmin (1958). The uses of argument. Cambridge: Cambridge University Press.

Widdowson, H. G. (2001). Communicative language assessment: The art of the possible. In Elder, Brown, Grove, Hill, Iwashita, Lumley, McNamara, & O’Loughlin (Eds.), Experimenting with uncertainty: Essays in honor of Alan Davies (pp. 12–21). Cambridge: Cambridge University Press.