Here are eleven of the most widely used statistical methods in #CorpusLinguistics.
#CorpusStatistics
#Frequency Analysis: This is the cornerstone of corpus linguistics: counting how often words, phrases, or syntactic structures occur, usually normalized (for example per million words) so that corpora of different sizes can be compared. Biber’s “Variation Across Speech and Writing” (1988) is a classic study that built on frequency counts of linguistic features to describe variation across spoken and written registers.
#FrequencyAnalysis
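To make the idea concrete, here is a minimal sketch of raw and normalized frequency counts. The tokenizer (a simple regex) and the function name `relative_frequencies` are assumptions for illustration only; real corpus tools tokenize far more carefully.

```python
import re
from collections import Counter

def relative_frequencies(text, per=1_000_000):
    """Return raw counts and frequencies normalized per `per` tokens."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    normalized = {word: count / total * per for word, count in counts.items()}
    return counts, normalized

counts, per_million = relative_frequencies("The cat sat on the mat. The dog sat too.")
print(counts.most_common(2))
print(per_million["the"])
```

Normalizing per million tokens is only a convention; for small corpora a base of 1,000 or 10,000 is often easier to read.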
#Collocation Analysis: This method identifies words that tend to appear together more often than would be expected by chance. Stefanowitsch and Gries’ “Collostructions” (2003) is a notable study in this area.
#CollocationAnalysis
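As a rough sketch of the first step in collocation analysis, the snippet below counts which words occur within a fixed window around a node word. The function name `cooccurrences` and the toy sentence are hypothetical; these raw counts would normally be fed into an association measure before drawing conclusions.

```python
from collections import Counter

def cooccurrences(tokens, node, window=2):
    """Count words appearing within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = "strong tea and strong coffee but powerful computer and strong tea".split()
neighbours = cooccurrences(tokens, "strong", window=1)
print(neighbours.most_common())
```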
#Concordance Analysis: This involves examining all the occurrences of a particular word or phrase within its context. John Sinclair’s work, particularly in “Corpus, Concordance, Collocation” (1991), has been foundational.
#ConcordanceAnalysis
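A concordance is typically shown in key-word-in-context (KWIC) format. The sketch below is one simple way to produce such lines, assuming a pre-tokenized text; the helper name `kwic` and the window size are illustrative choices, not a standard API.

```python
def kwic(tokens, node, window=3):
    """Return (left context, hit, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

tokens = "the cat sat on the mat and the dog sat on the rug".split()
hits = kwic(tokens, "sat", window=2)
for left, hit, right in hits:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>12} | {hit} | {right}")
```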
#Keyness Analysis: This method identifies words that are unusually frequent (or infrequent) in a corpus compared to a reference corpus, typically using a significance statistic such as log-likelihood. Paul Rayson’s “Matrix: A Statistical Method and Software Tool for Linguistic Analysis through Corpus Comparison” (2003) is a key reference.
#KeynessAnalysis
#Cluster Analysis: This is used to group similar items in a corpus, often revealing patterns or themes. Douglas Biber’s “University Language” (2006) employed cluster analysis to study academic registers.
#ClusterAnalysis
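#Mutual Information: This measures the strength of association between two words by comparing how often they actually co-occur with how often they would co-occur by chance. It tends to favor rare word pairs, so it is usually combined with a minimum frequency threshold. Church and Hanks’ “Word Association Norms, Mutual Information, and Lexicography” (1990) is the seminal paper that brought this measure into lexicography.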
#MutualInformation
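A minimal sketch of the (pointwise) mutual information score, assuming the usual formulation MI = log2(O / E), where O is the observed pair frequency and E = f(x) · f(y) / N is the frequency expected by chance. The function name and the example counts are made up for illustration.

```python
import math

def pointwise_mi(pair_count, x_count, y_count, n_tokens):
    """MI = log2(observed / expected), with expected = f(x) * f(y) / N."""
    expected = x_count * y_count / n_tokens
    return math.log2(pair_count / expected)

# Hypothetical counts: the pair occurs 16 times, the words 100 and 80 times,
# in a corpus of 1,000 tokens, so the chance expectation is 8.
mi = pointwise_mi(pair_count=16, x_count=100, y_count=80, n_tokens=1000)
print(mi)
```

A score of 0 means the pair occurs exactly as often as chance predicts; each additional point doubles the observed-to-expected ratio.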
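#Log-Likelihood Ratio: This tests whether the difference between two observed frequencies is statistically significant, most often when comparing a word’s frequency in two corpora. Dunning’s “Accurate Methods for the Statistics of Surprise and Coincidence” (1993) is the key paper here; it showed that the measure stays reliable for rare events, where tests that assume a normal distribution tend to break down.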
#LogLikelihoodRatio
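The sketch below uses the two-term form of the statistic that is common in corpus comparison, G2 = 2 · Σ O · ln(O / E), where the expected values come from pooling both corpora. This is an assumption about which variant to show; Dunning’s paper works with the full contingency table, and the function name is hypothetical.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 for one word's frequency in corpus A versus corpus B."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: 30 hits in one 1,000-token corpus, 10 in another.
g2 = log_likelihood(30, 1000, 10, 1000)
print(g2)
```

With one degree of freedom, values above 3.84 correspond to p < 0.05.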
Principal Component Analysis (#PCA): This reduces the dimensionality of the data while retaining most of the original variance. Baayen’s “Analyzing Linguistic Data: A Practical Introduction to Statistics Using R” (2008) covers PCA for linguistic data.
#PrincipalComponentAnalysis
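#Chi-Square Test: This tests whether two categorical variables are independent, for example whether a word’s frequency differs between two corpora. It becomes unreliable when expected frequencies are very small. It is covered in McEnery, Xiao, and Tono’s “Corpus-Based Language Studies” (2006).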
#ChiSquareTest
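For a 2×2 table, the statistic can be computed by hand as the sum of (O − E)² / E over the four cells. The following is a minimal sketch with made-up counts; the function name is hypothetical, and it assumes no row or column total is zero.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical table: rows = two corpora, columns = feature present / absent.
chi2 = chi_square_2x2(20, 10, 10, 20)
print(chi2)
```

As with log-likelihood, a value above 3.84 indicates p < 0.05 at one degree of freedom.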
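#T-Score: This measures how confident we can be that two words in a collocation genuinely co-occur more often than chance, rather than how strong the “bond” between them is. Unlike Mutual Information, it tends to favor frequent pairs. “Collocations in Use” (Hill and Lewis, 1997) is a study that employed T-Scores.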
#TScore
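A minimal sketch of the t-score as it is usually computed for collocations, t = (O − E) / √O, with E = f(x) · f(y) / N. The function name and the counts are hypothetical.

```python
import math

def t_score(pair_count, x_count, y_count, n_tokens):
    """t = (observed - expected) / sqrt(observed)."""
    expected = x_count * y_count / n_tokens
    return (pair_count - expected) / math.sqrt(pair_count)

# Hypothetical counts: pair seen 16 times, chance expectation 100 * 80 / 1000 = 8.
t = t_score(pair_count=16, x_count=100, y_count=80, n_tokens=1000)
print(t)
```

A common rule of thumb treats scores of about 2 or higher as worth a closer look.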
#Log-Dice Statistics: This score is based on the Dice coefficient and is often preferred over Mutual Information for large corpora. Because it depends only on the frequencies of the two words and of the pair, not on the size of the corpus, scores can be compared across different datasets. Rychlý’s “A Lexicographer-Friendly Association Score” (2008) is the paper that introduced it.
#LogDiceStatistics
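Here is a small sketch of the score, assuming the formulation logDice = 14 + log2(2 · f(x, y) / (f(x) + f(y))). The function name and counts are illustrative only.

```python
import math

def log_dice(pair_count, x_count, y_count):
    """logDice = 14 + log2(Dice), with Dice = 2 * f(x, y) / (f(x) + f(y))."""
    dice = 2 * pair_count / (x_count + y_count)
    return 14 + math.log2(dice)

# Two words that always occur together reach the theoretical maximum of 14.
print(log_dice(50, 50, 50))
# Halving the pair frequency lowers the score by exactly one point.
print(log_dice(25, 50, 50))
```

Note that corpus size does not appear anywhere in the formula, which is what makes the scores comparable across corpora.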
Each of these methods answers a different question and has its own strengths and limitations, which is why they sit side by side in the toolkit of any corpus linguist. Whether you’re exploring lexical trends or syntactic structures, choosing the measure that fits your data and research question is what makes the results trustworthy.