Towards a new sophistication in vocabulary assessment

Abstract

Published work on vocabulary assessment has grown substantially in the last 10 years, but it is still somewhat outside the mainstream of the field. There has been a recent call for those developing vocabulary tests to apply professional standards to their work, especially in validating their instruments for specified purposes before releasing them for widespread use. A great deal of work on vocabulary assessment can be seen in terms of the somewhat problematic distinction between breadth and depth of vocabulary knowledge. Breadth refers to assessing vocabulary size, based on a large sample of words from a frequency list. New research is raising questions about the suitability of word frequency norms derived from large corpora, the choice of the word family as the unit of analysis, the selection of appropriate test formats, and the role of guessing in test-taker performance. Depth of knowledge goes beyond the basic form-meaning link to consider other aspects of word knowledge. The concept of word association has played a dominant role in the design of such tests, but there is a need to create test formats to assess knowledge of word parts as well as a range of multi-word items apart from collocation.

Keywords

Depth of vocabulary knowledge, vocabulary assessment, vocabulary size, vocabulary test validation, word frequency

In my book Assessing vocabulary (Read, 2000), I observed that there was a gap between the mainstream field of language assessment, with its focus on large-scale proficiency and achievement tests composed of skills-based communicative tasks, and work on the design and administration of vocabulary tests, with its preoccupation with declarative knowledge of discrete lexical elements of the language. The latter could be seen as an anachronistic holdover from the discrete-point approach to language testing associated with the pioneering work of Robert Lado, but on the contrary, there was a continuing interest in developing and using vocabulary tests, particularly among researchers interested in tracking the acquisition of vocabulary knowledge, and among language teachers, who saw the expansion of vocabulary as one of the most important challenges facing their learners.

I proposed a simple framework to look at these two approaches in an integrated way. It characterised conventional vocabulary tests as being based on a discrete construct of vocabulary knowledge; a selective focus on particular vocabulary items; and test item formats that were relatively context-independent. By contrast, such tests could be complemented by more communicatively oriented assessments that were embedded in a larger construct, such as reading comprehension or academic speaking ability; were comprehensive in the sense of involving all the lexical content of input and output texts; and were context-dependent to the extent that test-takers were required to take account of relevant context in their responses. There was no expectation that the second approach would eventually displace the first one, and in fact both approaches are thriving more than ever. Vocabulary assessment is still seen primarily in terms of the development and use of tests that are discrete, selective, and context-independent in nature, and I will focus on issues related to such tests here. However, I note in passing the substantial advances in measuring the lexical richness of texts (Kyle, 2020) and the central role of lexical statistics in automated scoring systems for writing and speaking tasks—both representing a new level of progress in applying embedded vocabulary assessment.

Considering the state of the art in discrete vocabulary tests, there are still relatively few language testers who can be regarded primarily as specialists in vocabulary assessment. More common are researchers on second-language vocabulary who have written tests either for use as post-tests for projects on vocabulary acquisition or as more generic measures for specific aspects of vocabulary knowledge. In addition, a small number of vocabulary tests have tended to dominate the attention of both researchers and practitioners, notably Paul Nation’s Vocabulary Levels Test and his subsequent Vocabulary Size Test, among others. These tests have tended to be tried out and evaluated on a relatively small scale and then released for widespread use with limited evidence of reliability and validity. The implication has been that they could be used as generic instruments, with little reference to the learners’ L1, their educational context, or the specific uses of the test results. Two vocabulary research manuals published a decade ago (Nation & Webb, 2010; Schmitt, 2010) offer valuable advice on the selection and design of tests for use in vocabulary research studies.

However, it is only recently that Schmitt et al. (2020) have set out an agenda for the future in terms of ensuring the quality and appropriateness of second-language vocabulary tests. This includes an explicit specification of the purpose of the test, as well as appropriate contexts and types of learners; systematic procedures for test development; the formulation of a structured case for test validity within a framework such as an assessment use argument; and not releasing a test for general use before it has been adequately validated for specific uses. Many of the points these authors make are simply current best practices from the perspective of language assessment specialists, but their agenda is a measure of how much vocabulary testing has lagged behind the mainstream of the field in that such guidelines need to be given.

Breadth of knowledge: Vocabulary size

An enduring metaphorical distinction in vocabulary studies is between breadth and depth of word knowledge. Let us look first at assessing breadth: it is typically glossed as vocabulary size, meaning some estimate of how many words are known by a learner—or any other user of the language. Given the open-ended and dynamic nature of vocabulary acquisition for anyone who is actively using a language like English, it is almost impossible to define with any accuracy the domain of vocabulary knowledge in absolute terms, especially for those who have progressed to an intermediate level of proficiency in the language or beyond. There are too many academic disciplines, technical fields, everyday spheres of interest, regional variants, colloquial terms, and so on. Advances in corpus analysis have made available a range of corpora reflecting this lexical diversity, but in practice, vocabulary size tests are mostly based on word frequency lists derived from very large computer corpora, such as the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).

Size tests are developed by taking a sample of words from the list and using a simple multiple-choice item format to obtain evidence as to whether each word is known or not. This was the basis for the design of Nation’s Vocabulary Size Test, which was launched around 2007 and subsequently developed into a family of tests, with versions sampling different frequency ranges and providing word definitions for particular L1s (see some of Nation’s work compiled here: https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-tests). It was intended to serve as a measure of vocabulary size both for native speakers and for learners, but has proved to be somewhat problematic for less advanced learners, who are confronted with a high proportion of unknown words, some of which they may happen to know on a hit-or-miss basis.
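The scaling logic behind a sampled size test can be sketched as follows. This is a minimal illustration and not Nation's actual scoring procedure: it assumes that the frequency list is divided into equal bands, that the same number of items is sampled from each band, and that a correct multiple-choice response counts as a known word. The function name, its parameters, and the scores are hypothetical.

```python
def estimate_vocabulary_size(correct_per_band, items_per_band, band_size):
    """Extrapolate from sampled items to an estimated number of known words.

    correct_per_band: correct responses in each frequency band, in order.
    items_per_band:   number of items sampled from each band (assumed equal).
    band_size:        number of words each band represents (assumed equal).
    """
    if items_per_band <= 0 or band_size <= 0:
        raise ValueError("items_per_band and band_size must be positive")
    estimate = 0.0
    for correct in correct_per_band:
        if not 0 <= correct <= items_per_band:
            raise ValueError("score outside the possible range for a band")
        # Each sampled item stands for band_size / items_per_band words.
        estimate += correct * band_size / items_per_band
    return round(estimate)


# Hypothetical learner: 10 items sampled from each of five 1000-word bands.
scores = [10, 9, 7, 4, 2]
print(estimate_vocabulary_size(scores, items_per_band=10, band_size=1000))  # 3200
```

Because each item stands for many words (100 in this example), a single lucky guess on an unknown word shifts the estimate by a whole step, which is one reason why hit-or-miss responses matter for less advanced learners.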

Thus, a distinction has come to be recognised between a size test and a levels test. The latter type of test narrows the focus to a certain range of vocabulary, normally the higher-frequency words that learners are most likely to know and need. A good current example is the Updated Vocabulary Levels Test (Webb et al., 2017), which covers the first 5000 words of English. The high-frequency vocabulary represents a more realistic domain of vocabulary for beginning to intermediate learners and it means that a larger sample of words from the targeted vocabulary range can be included in the test. Thus, a levels test has more potential as a pedagogical measure for purposes such as planning classroom vocabulary activities and selecting materials that are suitable for the learners’ vocabulary level. In keeping with such purposes, it makes more sense to analyse and report the scores on a criterion-referenced basis, in terms of whether the learner has achieved mastery at each of the frequency levels in the test.
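Criterion-referenced reporting of this kind can be sketched in a few lines. The example assumes a fixed number of items at each frequency level and a single cut score for mastery; the cut score of 26, the level labels, and the learner's scores are all hypothetical and are not taken from the Updated Vocabulary Levels Test documentation.

```python
def mastery_profile(scores_by_level, cut_score):
    """Report mastery (True/False) for each frequency level.

    scores_by_level: mapping of level label to number of correct items.
    cut_score:       minimum score counted as mastery (hypothetical value).
    """
    return {level: score >= cut_score
            for level, score in scores_by_level.items()}


def first_unmastered_level(scores_by_level, cut_score):
    """Return the first level, in the order given, not yet mastered."""
    for level, mastered in mastery_profile(scores_by_level, cut_score).items():
        if not mastered:
            return level
    return None  # every level in the test has been mastered


# Hypothetical scores out of 30 items at each 1000-word level.
learner = {"1000": 30, "2000": 28, "3000": 21, "4000": 15, "5000": 9}
print(mastery_profile(learner, cut_score=26))
print(first_unmastered_level(learner, cut_score=26))  # 3000
```

Reporting a profile rather than a single total points the teacher to the level where instruction should start, which is the pedagogical use described above.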

One enduring assumption in relation to word frequency is that the more frequent a word is, the more likely it is to be known by learners, but it needs to be qualified in various ways. It has long been recognised that a learner’s first language has an influence on English L2 vocabulary knowledge, particularly for speakers of European languages that share with English an extensive stock of cognates and words borrowed in both directions historically. With the modern dominance of English as an international language, loanwords from English are pervasive in the world’s languages. In addition, learners are exposed to English words on the web through social media, gaming, streamed films and television series, international travel, and numerous other channels. There is something of a mismatch, then, between word frequency norms based on large corpora compiled from sources written by native speakers, and those that reflect learners’ actual knowledge of words.

This disparity has been investigated in two recent projects. The first one, sponsored by the British Council (2022), involved a large-scale study of learners who were speakers of Spanish, German and Chinese. They were tested according to a fairly demanding criterion: being able to recall the form of an English word with the correct spelling, when prompted by a simple definition of its meaning. Not only were the resulting word lists for the three language groups quite different among themselves, but also these knowledge-based frequencies of individual words often diverged quite substantially from their frequency in the COCA. A similar finding came from a study by Brysbaert et al. (2021), which focussed on what they call “word prevalence norms” (p. 210). They adopted a crowdsourcing approach by inviting learners to respond on the web to a short test of randomly selected English words from a huge database of more than 60,000 words. The test consisted of a simple Yes/No lexical decision task, with a control for guessing. The researchers identified 445 words that were known by virtually all the respondents and classified them on a bottom-up basis to show a variety of influences on word prevalence beyond simple frequency in the language. Another element in their analysis was reaction time: a measure of fluency of access to known words that has received little attention until recently in mainstream L2 vocabulary studies, but which needs much further investigation.
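One common way to control for guessing in a Yes/No format is to mix nonwords in with the real words and to discount the learner's "yes" rate on real words by the "yes" rate on nonwords. The sketch below applies a standard correction-for-guessing formula, (h - f) / (1 - f), where h is the hit rate and f the false-alarm rate. It is offered as an illustration under that assumption and is not a description of the scoring used by Brysbaert et al. (2021); all names and figures are hypothetical.

```python
def corrected_known_proportion(hits, real_words, false_alarms, nonwords):
    """Estimate the proportion of real words known, adjusted for guessing.

    hits:         "yes" responses to real words.
    real_words:   number of real words presented.
    false_alarms: "yes" responses to nonwords.
    nonwords:     number of nonwords presented.
    """
    if real_words <= 0 or nonwords <= 0:
        raise ValueError("the test must contain real words and nonwords")
    hit_rate = hits / real_words
    false_alarm_rate = false_alarms / nonwords
    if false_alarm_rate >= 1.0:
        # The learner said "yes" to every nonword, so nothing can be inferred.
        return 0.0
    corrected = (hit_rate - false_alarm_rate) / (1.0 - false_alarm_rate)
    return max(0.0, corrected)  # never report a negative proportion


# Hypothetical learner: "yes" to 40 of 50 real words and 5 of 25 nonwords.
score = corrected_known_proportion(40, 50, 5, 25)
print(round(score, 2))  # 0.75
```

The raw "yes" rate of 0.80 falls to about 0.75 once the false alarms are taken into account, so the more a learner overclaims, the larger the discount.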

There are other aspects of vocabulary size testing that are coming under critical scrutiny, as highlighted in a recent exchange between Stoeckel et al. (2021) and Webb (2021). One issue involves the inferences that can be drawn from test results according to the choice of test format. Size tests have typically used selected-response items, such as yes/no (as in the Brysbaert et al. study above), multiple-choice, and multiple matching, whose practicality is offset by being susceptible to the effects of guessing. This raises concerns about the reliability of vocabulary size estimates, and also what level of word knowledge is required for particular purposes, such as reading comprehension or writing. Clearly, the form recall task used in the British Council (2022) study, which requires a constructed response, is more demanding than a yes/no judgement, but the question is whether that level of knowledge is needed for reading. There is also debate about the extent to which a sample of words can represent a particular word frequency level, both in terms of reliability and the diversity of the individual words involved. And a further issue—what we mean by a “word” in this context—is taken up below.

In short, vocabulary size tests continue to play a significant role in language teaching research and practice, and the questioning of established procedures for designing and interpreting them is to be welcomed as a prelude to the development of more sophisticated measures.

Depth of knowledge

Let us move now to the construct of depth of knowledge, which I have written about extensively. The word associates format that I created in 1990 is still widely regarded as a standard way to measure depth. However, the concept remains rather fuzzy and it is difficult to say that our thinking about it has developed much beyond what I wrote in a book chapter nearly 20 years ago (Read, 2004). Its starting point is the observation that vocabulary size testing focusses on the ability of learners to make a link between the form of a word and its meaning, but that represents just one aspect of word knowledge. Nation’s (2013) classification of what it means to know a word is the most widely cited specification, encompassing knowledge of spelling and pronunciation, morphology, meanings and uses, associations with other words, collocational possibilities, and contexts of use. There is too much here to assess in a single test, at least if any significant number of target words are to be covered. Thus, it is necessary to be selective. The word associates format focussed just on the concept of word association, with particular reference to semantic relations between words and their collocations.

The format has figured prominently in the numerous studies that have set out to compare measures of vocabulary size and depth. In a comprehensive review of the research, Schmitt (2014) concluded that there was little difference between the two for high-frequency words and for learners with limited vocabulary knowledge, but there could be a gap as learners acquired more low-frequency words. It remains an open question whether for most learners knowledge of words naturally deepens as their vocabulary size and overall language proficiency increase. Another issue noted by Schmitt (2014) was the lack of validated instruments to measure what the researchers understood by the concept of depth. Too often those who see vocabulary depth as a worthwhile focus for research rely on one of the two word associates tests that I published in the 1990s. Even though there have been extensive investigations of this test format since then (see Zhang & Koda, 2017, for a review), no standardised test has emerged that could be used for pedagogical purposes (as distinct from research), although, as noted at the outset of this article, it would not be desirable to promote a single generic test of this kind.

One component of word knowledge which in a sense bridges the gap between size and depth is knowledge of word parts (morphology), and more particularly inflectional and derivational suffixes. Underlying both the Vocabulary Levels Test and the Vocabulary Size Test is the concept of the word family, consisting of a head word (or stem form) together with its inflected forms and relatively frequent, regular, and transparent derivatives. Thus, for example, the compose word family consists of the inflected forms composed and composing as well as the derivatives composer, composers, composition, compositions, and compositional. Although the stem form is normally the target item in a vocabulary test, it is assumed that, with basic morphological awareness, learners can be credited with knowledge of the other members of the family as well. However, there is increasing research evidence that often learners are unable to connect word family members receptively or to generate the derived forms productively. In addition, the head word is not necessarily the most familiar or frequent member. For these reasons, many scholars now prefer to use the lemma—the stem form plus its inflections—as the basic unit of analysis and as the basis for sampling words for vocabulary tests. In addition, assessing knowledge of word family relationships is a priority for vocabulary depth testing.
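The practical difference between the two units of analysis can be made concrete with the compose example. The sketch below is a hand-built illustration, not an implementation of any published word list: the grouping of forms into lemmas is an assumption made for this example, and the function name is hypothetical. It shows which forms a learner would be credited with after answering a single test item correctly.

```python
# The compose word family from the text, grouped by lemma
# (stem form plus its inflections). The grouping is illustrative.
WORD_FAMILY = {
    "compose": ["compose", "composed", "composing"],
    "composer": ["composer", "composers"],
    "composition": ["composition", "compositions"],
    "compositional": ["compositional"],
}


def credited_forms(tested_form, unit):
    """List the forms assumed known when tested_form is answered correctly.

    unit: "family" credits every member of the word family;
          "lemma" credits only the forms sharing the tested form's lemma.
    """
    lemma_forms = None
    for forms in WORD_FAMILY.values():
        if tested_form in forms:
            lemma_forms = forms
            break
    if lemma_forms is None:
        raise KeyError(f"{tested_form!r} is not in the word family")
    if unit == "lemma":
        return sorted(lemma_forms)
    if unit == "family":
        return sorted(f for forms in WORD_FAMILY.values() for f in forms)
    raise ValueError("unit must be 'family' or 'lemma'")


print(len(credited_forms("compose", "family")))  # 8
print(credited_forms("compose", "lemma"))  # ['compose', 'composed', 'composing']
```

Under the word family assumption, one correct response to compose credits the learner with eight forms, including derivatives such as compositional; under the lemma assumption it credits three. Size estimates based on word families will therefore overstate knowledge for learners who cannot yet connect the derived forms.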

Another aspect of depth goes beyond individual words to consider knowledge of multi-word units (MWUs), encompassing compound forms, phrasal verbs, collocations, lexical phrases, idioms, formulaic expressions, and lexical bundles, among other categories. Some scholars use the term “collocation” to refer to MWUs collectively, but it is common to apply it more restrictively to adjective–noun, verb–noun, adverb–adjective, noun–noun, and a few other lexical combinations. One reason undoubtedly is that these units can be accommodated most easily into existing formats for testing individual words, such as multiple-choice and gap-filling. Corpus analysis is delivering increasing numbers of lists of MWUs of various kinds, selected by frequency and (often) judgements of pedagogical value, but there is a lack of creative thinking about how to assess them. One constraint is that many MWUs perform discourse functions and thus require more than a single-sentence context to be meaningfully assessed. Thus, more embedded, rather than discrete, assessment formats may be appropriate once we move beyond collocations in the narrow sense.

Concluding comments

This has been a somewhat selective account of the state of the art in vocabulary assessment. Whereas previously this area was dominated by a small number of scholars who were not primarily language testers, researchers old and new are now questioning conventional principles of vocabulary test design, drawing on new tools available from corpus analysis and computer-based testing, and applying new rigour to the validation of tests. This involves in principle a move from generic tests promoted as useful for a variety of vaguely defined purposes to tests tailored for specific uses with particular populations of learners. In practice, the development of a good vocabulary test requires a high level of competence in the language along with modern technical skills in test design and analysis, which may not be available locally. At the least, however, a wider range of tests should be available, and it should be standard practice to try them out and obtain validating evidence before introducing them into new educational contexts.

Footnotes

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

John Read

References

British Council. (2022). Knowledge-based vocabulary lists (KVL). https://www.britishcouncil.org/exam/aptis/research/knowledge-based-vocabulary-lists-kvl

Brysbaert, Keuleers, & Mandera (2021). Which words do English non-native speakers know? New supranational levels based on yes/no decision. Second Language Research, 37(2), 207–231. https://doi.org/10.1177/0267658320934526

Kyle (2020). Measuring lexical richness. In Webb (Ed.), The Routledge handbook of vocabulary studies (pp. 454–476). Routledge.

Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge.

Nation, I. S. P., & Webb (2010). Researching and analyzing vocabulary. Heinle.

Read (2000). Assessing vocabulary. Cambridge.

Read (2004). Plumbing the depths: How should the construct of vocabulary knowledge be defined? In Bogaards & Laufer (Eds.), Vocabulary in a second language: Selection, acquisition and testing (pp. 209–227). John Benjamins.

Schmitt (2010). Researching vocabulary: A vocabulary research manual. Palgrave Macmillan.

Schmitt (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning, 64(4), 913–951. https://doi.org/10.1111/lang.1207

Schmitt, Nation, & Kremmel (2020). Moving the field of vocabulary assessment forward: The need for more rigorous test development and validation. Language Teaching, 53(1), 109–120. https://doi.org/10.1017/S0261444819000326

Stoeckel, McLean, & Nation (2021). Limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 43(1), 181–203. https://doi.org/10.1017/S027226312000025X

Webb (2021). A different perspective on the limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 43(2), 454–461. https://doi.org/10.1017/S0272263121000449

Webb, Sasao, & Ballance (2017). The updated Vocabulary Levels Test: Developing and validating two new forms of the VLT. ITL - International Journal of Applied Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web

Zhang & Koda (2017). Assessing L2 vocabulary depth with word associates format tests: Issues, findings, and suggestions. Asian-Pacific Journal of Second and Foreign Language Education, 2(1), 1–30. https://doi.org/10.1186/s40862-017-0024-0