How corpus linguistics can inform L2 vocabulary instruction: The use of frequency levels

Roger Gee
Holy Family University

Corpus linguistics has contributed to the field of education in numerous ways. The use of word frequency levels in language education has been especially important. Frequency levels are useful in second language (L2) instruction because it is assumed that the more frequent words are the most immediately useful and are learned first, while the less frequent words are learned later. Rather than relying on intuition, L2 materials developers and educators have used corpus linguistics to obtain reliable information about frequency levels.

This presentation will report the results of a corpus-based study of the vocabulary used for instruction by a freely available, game-like site for vocabulary learning that is itself corpus-based (Abrams & Walsh, 2014). Abrams and Walsh suggest that the game-like features of the site promote its afterschool use for vocabulary learning. It logically follows that for words to be learned, the vocabulary used for instruction should not be less frequent than the target words. However, initial inspection of the definitions, usage notes, and sample sentences indicates that these materials may contain a significant percentage of words of a lower frequency than the target word.
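The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the rank values in `TOY_RANK` are invented for the example and are not COCA ranks, and words missing from the rank table are treated as rarer than any listed word.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def share_rarer_than_target(target, defining_text, rank):
    """Return the proportion of tokens in defining_text whose frequency
    rank is higher (i.e. the word is rarer) than the target word's rank.

    rank maps a word to its frequency rank (1 = most frequent).
    Assumption: a word absent from rank counts as rarer than the target.
    """
    target_rank = rank[target]
    # The target word itself is excluded from the count.
    tokens = [t for t in tokenize(defining_text) if t != target]
    if not tokens:
        return 0.0
    rarer = [t for t in tokens if rank.get(t, float("inf")) > target_rank]
    return len(rarer) / len(tokens)


# Invented ranks for illustration only; these are NOT taken from COCA.
TOY_RANK = {
    "to": 7, "or": 32, "up": 50, "give": 98, "something": 200,
    "abandon": 3500, "relinquish": 9000,
}

share = share_rarer_than_target(
    "abandon", "to give up or relinquish something", TOY_RANK
)
print(f"{share:.2f}")
```

In this toy case, one of the six defining tokens ("relinquish") is rarer than the target word. An actual analysis would also need to decide how to treat lemmas or word families, which this sketch ignores.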

It has been argued that for L2 readers, at least 98% of the words (tokens) must be known for adequate comprehension (Nation, 2013). Laufer (2013) reviewed corpus-based research and research involving reading comprehension tests and determined two lexical threshold levels for reading academic, nonfiction texts. One, an optimal threshold level of 8,000 words, would provide about 98% coverage of a text’s vocabulary and allow for unassisted reading. The other, a minimal lexical threshold level of “around 5,000” words (p. 809), would not allow for unassisted reading of unsimplified academic texts.
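To make the notion of token coverage concrete, the following minimal sketch computes the share of running words in a text that appear in a learner's known-word list and compares it with the 98% figure. The function names, the sample sentence, and the known-word set are hypothetical, and the sketch counts surface forms rather than word families.

```python
import re


def token_coverage(text, known_words):
    """Return the proportion of tokens in text found in known_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in known_words)
    return known / len(tokens)


def meets_threshold(text, known_words, threshold=0.98):
    """Check coverage against a threshold (default: the 98% figure)."""
    return token_coverage(text, known_words) >= threshold


# Hypothetical learner vocabulary and sample sentence.
known = {"the", "cat", "sat", "on"}
sentence = "The cat sat on the mat"

coverage = token_coverage(sentence, known)
print(f"coverage = {coverage:.1%}, adequate = {meets_threshold(sentence, known)}")
```

Here five of the six tokens are known, so coverage falls well short of 98%. Because coverage is counted over tokens, a frequent word such as "the" contributes every time it occurs.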

The research reported in this presentation focuses on the 3000-5000 word levels. Words in the 3000-5000 word frequency levels are part of mid-frequency vocabulary (Nation, 2013) and represent a fairly rigorous goal for most language learners. As Webb and Sasao (2013) note, “mastery of the 5000 word level may be challenging for all but advanced learners … the five most frequent levels may represent the greatest range in vocabulary learning for the majority of L2 learners” (p. 266).

A corpus was constructed of the defining material for every 20th word of the 3000-5000 word lists from COCA (Davies, 2008- ). That is, the sample begins with word 2001 and ends with word 4981, for a total of 150 words, with 50 from each 1000-word level. The corpus contains the definitions, usage notes, and sample sentences for each of the 150 words. The frequency levels of the defining material for these words were determined using Text Lex Compare and COCA frequency lists.
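The sampling arithmetic can be checked with a short sketch. The function names are hypothetical, and the code works only with rank numbers, not with the COCA word lists themselves.

```python
from collections import Counter


def sample_ranks(start=2001, stop=5000, step=20):
    """Return every step-th frequency rank from start through stop."""
    return list(range(start, stop + 1, step))


def band_of(rank, band_size=1000):
    """Return the 1000-word level a rank falls in (2001-3000 -> 3000)."""
    return ((rank - 1) // band_size + 1) * band_size


ranks = sample_ranks()
per_band = Counter(band_of(r) for r in ranks)

print(len(ranks), ranks[0], ranks[-1])
print(dict(per_band))
```

Stepping by 20 from rank 2001 gives 150 ranks, the last of which is 4981, with 50 ranks in each of the 3000, 4000, and 5000 word levels, which matches the counts reported above.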

The presenter will give an introduction to the site, followed by a description of the COCA frequency lists, details of the construction of the corpus, and the method of analysis. To focus on the “why,” the results will be contrasted with Laufer’s (2013) vocabulary threshold levels. The session will end with time for questions.

Abrams, S. S., & Walsh, S. (2014). Gamified vocabulary: Online resources and enriched language learning. Journal of Adolescent & Adult Literacy, 58, 49-58. doi:10.1002/jaal.315
Davies, M. (2008- ). The Corpus of Contemporary American English: 450 million words, 1990-present. Available online at
Laufer, B. (2013). Lexical thresholds for reading comprehension: What they are and how they can be used for teaching purposes. TESOL Quarterly, 47, 867-872. doi:10.1002/tesq.140
Nation, I. S. P. (2013). Learning Vocabulary in Another Language. Cambridge, UK: Cambridge University Press.
Webb, S. A., & Sasao, Y. (2013). New directions in vocabulary testing. RELC Journal, 44, 263-277. doi:10.1177/0033688213500582