Corpus Linguistics

Corpus Linguistics: A General Introduction


Table of Contents

  1. Basics
  2. Accessing and Analyzing Methods
  3. Corpus annotation
  4. Corpus-based Studies
  5. [Lab] Using available corpora

What is corpus linguistics?

Language researchers do not have to rely on their own or other native speakers’ intuition, or even on made-up examples.

Corpus Typology

Exercise: A brand-new corpus type?


Three ways to access the corpora

What software is available for corpus-based linguistic analysis, and what can these tools do?

Corpus Design: Key considerations


How can we know that the sample we are using is representative of the language or language variety?

Are there any objective ways to balance a corpus or to measure its representativeness?

What can corpus tools offer?








Google Books for Culturomics

Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of 'culturomics,' focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. (Science, 331(6014): 176–82, 2011)
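Investigating a trend "quantitatively" in this sense usually comes down to tracking how often a word or n-gram occurs per year, relative to everything printed in that year. The Python sketch below illustrates the idea on a tiny invented corpus. The documents, the function name and the whitespace tokenization are all assumptions made for illustration; this is not the pipeline used for the Google Books data.

```python
from collections import Counter

# Hypothetical toy "corpus": (publication year, text) pairs, invented for illustration.
TOY_DOCS = [
    (1900, "the telephone is a new invention"),
    (1900, "the horse and the carriage"),
    (1950, "the telephone rang and the telephone rang again"),
    (1950, "a radio and a telephone"),
]

def yearly_relative_frequency(docs, target):
    """Return {year: occurrences of target / total tokens in that year}."""
    hits = Counter()    # occurrences of the target word per year
    totals = Counter()  # all tokens per year
    for year, text in docs:
        tokens = text.lower().split()  # naive whitespace tokenization
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

freqs = yearly_relative_frequency(TOY_DOCS, "telephone")
for year, freq in freqs.items():
    print(year, round(freq, 4))
```

Dividing by the yearly total matters because far more text is printed in later years, so raw counts alone would rise even for a word whose share of the language stays the same.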

Google Books Ngram (semi-)corpus






Corpus Visualization

Linguistic Motion Chart

Data Science Analytics (makes advances like never before)

Modal <- gvisMotionChart(convdata, 

Build Your Own Corpus

Corpus Statistics: Counting

Corpus Statistics: Unit

New Methodological Issues [1]: Size

Do we really need (more than) 500 billion words for linguistics?

Are there any hard rules regarding how large a corpus ought to be?


For the study of prosody (i.e. the rhythm, stress and intonation of speech), a corpus of 100,000 words will usually be big enough to make generalizations, whereas the analysis of verb-form morphology (i.e. the use of endings such as -ed, -ing and -s to express verb tenses) would require half a million words (Kennedy 1998: 68). Biber (1993) suggests that a million words would be enough for grammatical studies.

It depends on your research topic!
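One way to see why the answer depends on the topic is to estimate how many examples of a feature a corpus of a given size can be expected to contain. The Python sketch below does this arithmetic for two features. Both rates are hypothetical numbers chosen for illustration and are not taken from Kennedy (1998) or Biber (1993).

```python
# Hypothetical rates (occurrences per million words), invented for illustration only.
RATES_PER_MILLION = {"frequent feature": 5000, "rare feature": 20}
CORPUS_SIZES = [100_000, 500_000, 1_000_000]

def expected_count(rate_per_million, corpus_size):
    """Expected number of occurrences in a corpus of corpus_size words."""
    return rate_per_million * corpus_size / 1_000_000

# Expected occurrences for every (feature, corpus size) combination.
expected = {
    (feature, size): expected_count(rate, size)
    for feature, rate in RATES_PER_MILLION.items()
    for size in CORPUS_SIZES
}

for (feature, size), count in expected.items():
    print(f"{feature:>16} in {size:>9,} words: about {count:g} occurrences")
```

Under these assumed rates, a 100,000-word corpus yields hundreds of examples of the frequent feature but only a couple of the rare one, which is too few to generalize from.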

Homework (20150925)