Corpus Linguistics

Corpus Linguistics: Pre-Processing


Shu-Kai Hsieh

GIL, National Taiwan University

Table of Contents

  1. Review of accessing existing corpora
  2. Building your own corpus [1]: pre-processing task
  3. Seminar presentation
  4. [Lab] tools to create a corpus / Word Sketch Engine

Corpus linguistics is firmly rooted in empirical, inductive forms of analysis, relying on real-world instances of language use in order to derive rules or explore trends about the ways in which people actually produce language (as opposed to models of language that rely on made-up examples or introspection).

Two approaches

Three ways to access the corpora

What software is available for performing linguistic analyses on the basis of corpora, and what can this software do?

Build your own corpus from the web

Data Collection: General Architecture

What are the tools that are used when compiling and annotating a corpus?

Data Collection: For non-programmer

Data Collection: For programmer


Data Collection: special issues

What principles should be taken into consideration when compiling and annotating a corpus?

Size of Corpus

Are there any hard rules regarding how large a corpus ought to be?

For the study of prosody (i.e. the rhythm, stress and intonation of speech), a corpus of 100,000 words will usually be big enough to make generalizations, whereas the analysis of verb-form morphology (i.e. the use of endings such as -ed, -ing and -s to express verb tenses) would require half a million words (Kennedy 1998: 68). Biber (1993) suggests that a million words would be enough for grammatical studies.
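As a rough illustration of these figures, the following R sketch counts the word tokens in a text and checks which of the quoted size guidelines a corpus reaches. The function names, the toy corpus and the whitespace-based token count are assumptions made for this example only.

```r
# Size guidelines quoted above, in words (vector names are illustrative only)
size_guidelines <- c(prosody = 100000,
                     verb_morphology = 500000,
                     grammar = 1000000)

# Count whitespace-separated tokens in a character vector of lines
count_tokens <- function(lines) {
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  sum(nzchar(tokens))
}

# For a corpus of n_tokens words, report which guideline figures it reaches
meets_guideline <- function(n_tokens) {
  n_tokens >= size_guidelines
}

# Invented two-line toy corpus
toy_corpus <- c("This is a tiny made-up corpus .", "It has two lines .")
print(count_tokens(toy_corpus))
print(meets_guideline(250000))
```

A corpus of 250,000 words would reach the figure quoted for prosody but not those for verb-form morphology or grammatical studies. These are guidelines from the literature, not hard rules.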

Size of Corpus: A rule of thumb


Build your own corpus from the web

Data Preprocessing: general considerations

The almost limitless information obtainable from the world around us needs to be reduced to make it manageable.
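One common way of reducing raw text is to normalise it and split it into word tokens. The base R sketch below shows one possible sequence of steps. The particular choices (lower-casing, replacing punctuation with spaces, splitting on spaces) are assumptions for illustration and would need adjusting for a real project.

```r
# Normalise raw text and split it into word tokens (base R only).
# The cleaning steps chosen here are illustrative assumptions.
preprocess <- function(text) {
  text <- tolower(text)                                # case-fold
  text <- gsub("[^[:alnum:][:space:]']", " ", text)    # punctuation -> space
  text <- gsub("\\s+", " ", trimws(text))              # collapse whitespace
  tokens <- unlist(strsplit(text, " ", fixed = TRUE))  # split on single spaces
  tokens[nzchar(tokens)]                               # drop empty strings
}

raw <- "The  corpus, once cleaned, is EASIER to search!"
tokens <- preprocess(raw)
print(tokens)
print(table(tokens))
```

Each step discards information (case, punctuation, spacing), so which steps are appropriate depends on the research question.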


SGML example

An example taken from the start of a text in the FLOB (Freiburg Lancaster-Oslo/Bergen) corpus of early 1990s British English.
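To illustrate how such markup can be separated from the running text, here is a minimal base R sketch. The fragment is invented for this example and is not FLOB data. Its element names and the simple regular expression for tags are assumptions.

```r
# An invented SGML-like fragment; NOT taken from FLOB, element names are made up
sgml <- c("<text id=DEMO01>",
          "<head>A made-up headline</head>",
          "<p>First sentence of the text. Second sentence.</p>",
          "</text>")

# Collect all tags (anything between < and >)
tags <- unlist(regmatches(sgml, gregexpr("<[^>]+>", sgml)))

# Remove the tags to obtain plain text, dropping lines that become empty
plain <- trimws(gsub("<[^>]+>", "", sgml))
plain <- plain[nzchar(plain)]

print(tags)
print(plain)
```

A regular expression of this kind is only adequate for simple, well-behaved markup. A proper SGML or XML parser is the safer choice for real corpus files.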


Data Preprocessing: special issues

An example of a morpho-syntactically tagged sentence (using the C5 tagset) taken from the British National Corpus.
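Once a text carries part-of-speech tags, the tags can be queried directly. The sketch below assumes a simple word_TAG layout, which is not necessarily the format in which the BNC is distributed. The sentence is invented, and only the tag labels (AT0, NN1, VVD, PRP, PUN) are taken from the C5 tagset.

```r
# Invented sentence in an assumed word_TAG layout; tag labels are C5 tags
tagged <- "The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 mat_NN1 ._PUN"

# Split into items, then split each item into word and tag
items <- unlist(strsplit(tagged, " ", fixed = TRUE))
pairs <- strsplit(items, "_", fixed = TRUE)

tagged_df <- data.frame(
  word = vapply(pairs, `[`, character(1), 1),
  tag  = vapply(pairs, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

print(tagged_df)
print(table(tagged_df$tag))                    # tag frequencies
print(tagged_df$word[tagged_df$tag == "NN1"])  # all singular common nouns
```

Holding words and tags in separate columns makes it straightforward to count tags or to retrieve every word with a given tag.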


Data Annotation

Data Analysis

We need TOOLS!

We need MORE TOOLS!!



‘c’est le point de vue qui crée l’objet’ (it is the viewpoint which creates the object) (Saussure).

'Corpus linguistics as a ‘methodology’ rather than a traditional branch of linguistics like semantics, grammar, phonetics or sociolinguistics'. (McEnery and Wilson (1996))

What we have witnessed in the development of corpus linguistics as a discipline is that our chosen methodological standpoint has progressively determined both the object and the aim of the enquiry.

Methodological call in the context of Web as Corpus

Lab session: Word Sketch Engine




Text Analysis with R

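library(zipfR)        # provides the ItaRi.spc and ItaRi.emp.vgc example data sets
data(ItaRi.spc)
data(ItaRi.emp.vgc)
summary(ItaRi.spc)    # assumed call that produces the summary shown below
## zipfR object for frequency spectrum
## Sample size:     N  = 1399898
## Vocabulary size: V  = 1098
## Class sizes:     Vm = 346 105 74 43 39 25 27 15 ...
par(mfrow = c(2, 2))  # arrange the following plots in a 2 x 2 grid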

VGC and more


# Frequency spectrum with a logarithmic x-axis (frequency class m)
plot(ItaRi.spc, log = "x")

# The same spectrum with an explicit plot title
plot(ItaRi.spc, main = "Frequency Spectrum")

# Empirical vocabulary growth curve; add.m = 1 also draws the curve for V1 (hapax legomena)
plot(ItaRi.emp.vgc, add.m = 1)
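The quantities behind a frequency spectrum and a vocabulary growth curve (VGC) can also be derived by hand from a vector of tokens. The base R sketch below uses a small invented token vector. It is meant only to show what N, V, Vm and V(N) stand for, not to reproduce how zipfR computes them internally.

```r
# Toy token vector (invented); real data would come from a tokenised corpus
tokens <- c("a", "b", "a", "c", "a", "b", "d", "e", "a", "f")

# Frequency of each type, then the frequency spectrum:
# Vm = number of types that occur exactly m times
type_freq <- table(tokens)
spectrum  <- table(as.integer(type_freq))

N <- length(tokens)     # sample size (number of tokens)
V <- length(type_freq)  # vocabulary size (number of types)

# Empirical vocabulary growth curve: number of distinct types seen
# after each successive token
vgc <- cumsum(!duplicated(tokens))

print(spectrum)
print(c(N = N, V = V))
print(vgc)
```

In this toy sample four types occur once, one type occurs twice and one type occurs four times. The growth curve rises by one each time a previously unseen type appears and stays flat otherwise.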