Book review: Eniko Csomay and William J. Crawford, Doing Corpus Linguistics

Abstract

In Doing Corpus Linguistics, Eniko Csomay and William J. Crawford present a comprehensive overview of linguistics, corpus linguistics, and language variation. Chapter 1 begins by highlighting the systematic nature of language and the implicit “rules” governing its usage, shedding light on the descriptive and prescriptive aspects of linguistic analysis. The discussion then explores the role of corpus linguistics in uncovering variation in language use, exemplified by the difference between “equal” and “identical”. The two words are often considered synonyms, but corpora such as the Corpus of Contemporary American English (COCA) show that they are used differently. The chapter sets out the main characteristics of corpus linguistics, emphasising its empirical, computer-aided nature and its combination of quantitative and qualitative analysis. Additionally, it distinguishes between corpus-based and corpus-driven research: “corpus-based” approaches are driven by questions or hypotheses that exist before the analysis, while “corpus-driven” approaches are more open-ended and driven by patterns discovered in the corpus itself.

The second chapter discusses the concept of register analysis, which focuses on describing and interpreting linguistic differences across different contexts. The chapter outlines the importance of register analysis in understanding language variation and highlights the key steps involved in this approach: describing situational characteristics, identifying linguistic variables, and providing a functional interpretation of the relationship between context and language use. The text also offers practical exercises to deepen readers’ understanding of the material. For example, readers are asked to study speeches by Martin Luther King Jr. in order to understand how his “style” might differ from that of speeches given by other people, such as Winston Churchill.

Chapter 3 uses the Corpus of Contemporary American English (COCA) to describe the most common units of language that can be searched for in a corpus, through four kinds of analysis: words, collocations, n-grams, and part-of-speech (POS) tags. The section on words introduces key word in context (KWIC) searches using one example word, “say”, in spoken discourse, newspapers, and academic prose. It appears most frequently in spoken discourse (1937.05 times per million words) under “SPOKEN,” followed by newspapers (831.94) under “NEWSPAPER,” and finally academic prose (216.88) under “ACADEMIC” (pp. 36–37). Keyword analysis is then used to identify the words that set one text apart from another. Next, the chapter turns to collocation, a term coined by Firth (1951) to refer to two words that go together. An n-gram is a sequence of words treated as a unit, where n is the number of words in the sequence. Finally, POS tags make it possible to search for grammatical patterns without relying on the actual words.
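To make the n-gram idea concrete, the following is a minimal sketch in Python that is not taken from the book. It counts contiguous word sequences of length n in a toy sentence. The function name `ngram_counts`, the naive whitespace tokenisation, and the sample sentence are all assumptions made for illustration; COCA's own search interface performs such searches without any programming.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count contiguous n-word sequences in a text (naive whitespace tokens)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text, one token at a time.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Hypothetical sample text, invented for this illustration.
sample = "the boy saw a ghost and the boy ran"
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

# The most frequent two-word unit in the sample and its raw count.
print(bigrams.most_common(1))
```

A real study would use a proper tokeniser and would norm the resulting counts before comparing texts of different lengths.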

The fourth chapter contains projects that use publicly available corpora. The first set consists of word- and phrase-based projects and the second of grammar-based projects. Examples are given in the book, such as the active sentence “the boy saw a ghost”, which is contrasted with its passive counterpart (p. 72).

The fifth chapter describes the process of constructing a do-it-yourself corpus, which involves several key steps and considerations. Deciding on a corpus project requires careful thought regarding motivation, research goals, and the broader relevance of the chosen topic, framed within the context of register analysis and corpus linguistics methodology. Building the corpus entails selecting relevant texts, ensuring balance, formatting files for compatibility with analysis software, and providing clear labelling for easy identification and retrieval. Finally, software programs such as AntWordProfiler and AntConc, both by Laurence Anthony, facilitate lexical and grammatical analyses, enabling researchers to conduct comprehensive corpus studies and contribute meaningfully to the field of corpus linguistics.

Chapter 6 explains that register analyses aim to identify patterns of language use and their association with the situational characteristics of texts, relying on empirical and quantitative measures to understand these associations. The exercises at the end of the chapter reinforce this understanding. In one example, readers are asked to determine the dependent and independent variables, to decide whether they are interval, nominal, or ordinal scores, and to state the research question(s) and the null and alternative hypotheses for the following scenario. Mary was curious to find out whether the instructor’s gender or the course’s level of instruction has a greater effect on informational focus in university classroom talk. She built a small but balanced corpus of randomly selected 30 class sessions from multiple corpora (MICASE, BASE, and others) where male and female instructors and levels of instruction (undergraduate, graduate) were both represented. She tagged the texts with a grammatical tagger, counted the appropriate part of speech tags (see Chapter 9 about tagging), and normed the feature counts to 1,000 words each. She defined informational focus by adding the number of normed counts for nouns, attributive adjectives, nominalizations, and prepositions in each course she included in her corpus.
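The informational focus score in this exercise can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the book's own procedure: the function names, the dictionary keys, and the counts for the single class session are hypothetical, and only the four feature types and the norming basis of 1,000 words come from the exercise.

```python
def normed(raw_count: int, text_length: int, basis: int = 1000) -> float:
    """Convert a raw count to a rate per `basis` words."""
    return raw_count * basis / text_length


def informational_focus(tag_counts: dict, text_length: int) -> float:
    """Sum the normed counts of the four features named in the exercise."""
    features = (
        "nouns",
        "attributive_adjectives",
        "nominalizations",
        "prepositions",
    )
    return sum(normed(tag_counts[f], text_length) for f in features)


# Hypothetical tag counts for one class session of 5,000 words (invented data).
session = {
    "nouns": 900,
    "attributive_adjectives": 150,
    "nominalizations": 100,
    "prepositions": 450,
}
score = informational_focus(session, 5000)
print(score)
```

In the exercise's terms, a score like this would be the dependent variable, with instructor gender and level of instruction as the independent variables.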

Chapter 7 of the book provides an in-depth discussion of various statistical tests, such as ANOVA, the chi-square test, and Pearson’s correlation, with detailed guidelines for conducting them. It underlines the importance of interpreting effect sizes, particularly Cohen’s d, which measures the size of the difference between two groups, with higher values indicating a larger difference. It also covers the practical application and interpretation of statistical results through a data example. If you are investigating the use of first-person pronouns across three disciplines (Business, Humanities, and Natural Sciences), you can calculate the normed count of first-person pronouns for each text in these disciplines and use a one-way ANOVA to determine whether there are significant differences in pronoun use. The F score is calculated by comparing the variance across groups to the variance within each group, with a significant result indicating substantial across-group variation. Cohen’s d is then used to measure the effect size, illustrating how large the differences between the groups are.
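The two statistics in this example can be sketched with Python's standard library. The code below is a minimal illustration, not the book's procedure: the function names and the normed pronoun counts are invented, and Cohen's d is computed for one pair of disciplines because it is a two-group measure. The F statistic is the between-group mean square divided by the within-group mean square, and d is the difference in means divided by the pooled standard deviation.

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return F: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def cohens_d(a, b):
    """Difference between two group means in pooled-standard-deviation units."""
    pooled_var = (
        (len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)
    ) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical normed first-person pronoun counts per text (invented data).
business = [12.0, 14.0, 16.0]
humanities = [20.0, 22.0, 24.0]
natural_sciences = [4.0, 6.0, 8.0]

f_score = one_way_anova_f([business, humanities, natural_sciences])
d = cohens_d(humanities, natural_sciences)
print(f_score, d)
```

Judging whether the F score is significant additionally requires the F distribution with the appropriate degrees of freedom, which a statistics package would supply.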

Chapter 8 describes the analysis and interpretation of corpus data, covering the comparison of individual and collaborative texts, word lists, and n-grams, as well as the functional interpretation of results and the reporting of research findings. A corpus study is presented in terms of its research context, data description, methodology, results, discussion, and conclusions.

Chapter 9 deals with corpus-based research methods, in which frequency counts must be compared with care to ensure that the results are accurate and reliable. Normalization helps to account for variations in corpus size, text length, and sampling differences, thus enabling the comparison of linguistic features across different texts or corpora.
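Normalization of this kind is usually expressed as a simple proportion. The notation and the worked figures below are assumptions chosen for illustration and are not taken from the book: the raw frequency of a feature is divided by the number of words in the text and multiplied by a common basis B, such as 1,000 or 1,000,000 words.

```latex
\[
  f_{\text{normed}} = \frac{f_{\text{raw}}}{N_{\text{words}}} \times B
\]
% Hypothetical example: a feature occurring 50 times in a 20,000-word text,
% normed to a basis of B = 1{,}000 words.
\[
  f_{\text{normed}} = \frac{50}{20{,}000} \times 1{,}000 = 2.5
\]
```

The same feature occurring 50 times in a 40,000-word text would norm to 1.25 per 1,000 words, which shows why identical raw counts from texts of different lengths cannot be compared directly.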

This book discusses the process of corpus-based research with an emphasis on the technical aspects of statistical analysis, register studies, and keyword analysis in corpus linguistics. It will be very helpful for learners of corpus linguistics because of the many data examples it presents, and it can serve as a reference for future research of a similar kind.

Acknowledgements

The authors are very grateful to LPDP (Lembaga Pengelola Dana Pendidikan) for providing complete financial support for our Master’s degree and this publication.

References

Firth, J. R. (1951). Modes of meaning. Essays and Studies of the English Association [NS 4], 118–149.