Corpus Linguistics

Review of accessing existing corpora
Building your own corpus [1]: pre-processing task
Seminar presentation
[Lab] tools to create a corpus / Word Sketch Engine

Corpus linguistics is firmly rooted in empirical, inductive forms of analysis, relying on real-world instances of language use in order to derive rules or explore trends about the ways in which people actually produce language (as opposed to models of language that rely on made-up examples or introspection).

Two approaches

Corpus-driven research tends to use a corpus inductively in order to form hypotheses about language, without making reference to existing linguistic frameworks.
Corpus-based research tends to use corpora in order to test or refine existing hypotheses taken from other sources.

Three ways to access the corpora

What software is available for performing linguistic analyses on corpora, and what can it do?

Stand-alone (PC) software (WordSmith, AntConc, etc.)
Web interface/service (BYU, Word Sketch Engine, Just The Word, etc.)
Programming (Python, R, etc.)
- Advanced queries with a corpus query language.
- Advanced analyses through scripting.

Build your own corpus from the web

Data collection
Data preprocessing
Data annotation
Data analysis
- Exploratory data analysis
- Forming hypotheses and statistical testing

Data Collection: General Architecture

What are the tools that are used when compiling and annotating a corpus?

Data Collection: For non-programmers

TextSTAT.
ICEWeb (32-bit Windows version only): a small and simple utility for compiling, downloading and analysing web corpora. It is faster than TextSTAT at retrieving web pages, but does not explicitly save your settings.

Data Collection: For programmers

Python (NLTK)
R

Data Collection: special issues

What principles should be taken into consideration when compiling and annotating a corpus?

Size of Corpus (Sampling)
Representation
Question of Nativity
Identification of Target Users

Size of Corpus

Are there any hard rules regarding how large a corpus ought to be?

For the study of prosody (i.e. the rhythm, stress and intonation of speech), a corpus of 100,000 words will usually be big enough to make generalizations, whereas the analysis of verb-form morphology (i.e. the use of endings such as -ed, -ing and -s to express verb tenses) would require half a million words (Kennedy 1998: 68). Biber (1993) suggests that a million words would be enough for grammatical studies.

Size of Corpus: A rule of thumb

The more varied the linguistic phenomenon, the larger the corpus required.

Representation

A corpus ought to be representative of a particular language, language variety, or topic. The texts within it must therefore be chosen and balanced carefully, so that no group of texts skews the corpus as a whole.
But how do we ensure that a corpus consists of a sample that is ‘maximally representative of the variety under examination’?
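
One simple (and by no means sufficient) way to approach balance is to draw the same number of texts from each text category. The Python sketch below assumes a catalogue of (genre, text id) pairs; the genre labels, the catalogue and the function name `balanced_sample` are all hypothetical, and an equal quota per genre is only one possible notion of balance.

```python
import random
from collections import defaultdict

def balanced_sample(texts, per_genre, seed=0):
    """Pick the same number of texts from every genre.

    `texts` is a list of (genre, text_id) pairs; the pair layout and the
    genre labels are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_genre = defaultdict(list)
    for genre, text_id in texts:
        by_genre[genre].append(text_id)

    sample = {}
    for genre, ids in sorted(by_genre.items()):
        if len(ids) < per_genre:
            raise ValueError(f"not enough texts for genre {genre!r}")
        sample[genre] = sorted(rng.sample(ids, per_genre))
    return sample

# Hypothetical catalogue in which news is heavily over-represented.
catalogue = [("news", f"news_{i:02d}") for i in range(40)]
catalogue += [("fiction", f"fic_{i:02d}") for i in range(8)]
catalogue += [("academic", f"acad_{i:02d}") for i in range(5)]

sample = balanced_sample(catalogue, per_genre=5)
for genre, ids in sample.items():
    print(genre, len(ids))
```

Without the quota, a random draw from this catalogue would be dominated by news texts, which is exactly the kind of skew the slide warns about.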

Build your own corpus from the web

Data collection
Data preprocessing
Data annotation
Data analysis
- Exploratory data analysis
- Forming hypotheses and statistical testing

Data Preprocessing: general considerations

The almost limitless information obtainable from the world around us needs to be reduced to make it manageable.

Clean up, e.g. CNN.
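
As a rough illustration of the clean-up step, the Python sketch below strips scripts, styles and navigation blocks from an HTML page and keeps the running text. The list of skipped tags and the sample page are assumptions made for the example; they do not reflect the markup of CNN or any other real site, and real pages usually need more careful handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring content inside boilerplate tags."""

    # Tags whose content is treated as boilerplate (an assumption).
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # normalise whitespace
            if text:
                self.chunks.append(text)

def clean_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Hypothetical page, invented for this example.
page = """<html><head><style>p {color: red}</style></head>
<body><nav><a href="/">Home</a> | <a href="/world">World</a></nav>
<p>Corpus data   must be cleaned.</p>
<script>trackVisit();</script>
<p>Only the article text is kept.</p>
<footer>Terms of use</footer></body></html>"""

print(clean_html(page))
```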

Data Preprocessing: general considerations

Markup: Corpus data always come with meta-information, e.g. individual texts within a corpus are often stored as separate files, and each one can contain a 'header' which gives information about the text such as its author, date of publication, genre, etc.
- This information can be useful in allowing researchers to focus on particular types of texts (e.g. just newspaper articles) or carry out comparisons between different types (e.g. male vs female authors).
- Such annotation often employs a standardized mark-up language or framework (SGML, XML, LMF, etc.)
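
To show how header information can be used to select texts, here is a minimal Python sketch that reads a small XML document with a header. The element names (`text`, `header`, `author`, `date`, `genre`, `body`), the `sex` attribute and the sample content are hypothetical and are not taken from any actual corpus format.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus file with a header, invented for this example.
doc = """<text id="sample01">
  <header>
    <author sex="female">A. Writer</author>
    <date>1991</date>
    <genre>newspaper</genre>
  </header>
  <body>The quick brown fox jumps over the lazy dog.</body>
</text>"""

def read_header(xml_string):
    """Return (metadata dict, body text) for one corpus file."""
    root = ET.fromstring(xml_string)
    header = root.find("header")
    # One metadata entry per header element, keyed by tag name.
    meta = {child.tag: (child.text or "").strip() for child in header}
    meta["id"] = root.get("id")
    meta["author_sex"] = header.find("author").get("sex")
    body = (root.findtext("body") or "").strip()
    return meta, body

meta, body = read_header(doc)

# Keep only newspaper texts, as in the example on the slide.
if meta["genre"] == "newspaper":
    print(meta["id"], meta["date"], len(body.split()), "words")
```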

SGML example

An example taken from the start of a text in the FLOB (Freiburg Lancaster-Oslo/Bergen) corpus of early 1990s British English.

Data Preprocessing: special issues

Word segmentation (in Chinese, Japanese, Thai, Lao, etc.) as a special case of tokenization.
Morpho-syntactic tagging
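
For the word segmentation issue above, a common baseline is forward maximum matching: at each position, take the longest string found in a word list. The Python sketch below is a toy version under strong assumptions (a tiny hand-made lexicon and a maximum word length of four characters); it is not how any particular segmenter is implemented, and real systems typically rely on statistical models.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation using the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy lexicon, invented for this example.
lexicon = {"语料库", "语言学", "语言", "研究"}

print(forward_max_match("语料库语言学研究", lexicon))
```

Because the method is greedy, the result depends entirely on the lexicon; characters that are not covered by any entry come out as single-character tokens.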

An example of a morpho-syntactically tagged sentence (using the C5 tagset) taken from the British National Corpus.

Data Annotation

Corpora are often annotated (or tagged) with additional linguistic information, allowing more complex calculations to be performed on them.
- sense tagging
- discourse tagging
- pragmatic tagging
- emotion tagging
- (... your imagination/theory here ...)

Data Analysis

The QUAL-QUAN contrast : Statistics versus Researcher Sensitivity
- Categorizing the world ([QUAN]: predetermined numerical category system; [QUAL]: emergent, flexible verbal coding).
- Perceiving individual diversity ([QUAN]: using large samples to iron out any individual idiosyncrasies; [QUAL]: focusing on the unique meaning carried by individual organisms).
- Analyzing data ([QUAN]: relying on the formalized system of statistics; [QUAL]: relying on the researcher's individual sensitivity).

We need TOOLS!

A corpus on its own is not particularly useful for linguistic enquiry.
Corpora are normally used in conjunction with analysis software or web-based platforms, which carry out the counting, sorting and presentation of language features (the results of which must then be interpreted by humans).
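
The counting, sorting and presentation mentioned above can be illustrated with a small Python sketch that builds a frequency list and a keyword-in-context (KWIC) concordance. The tokenizer is a deliberately crude assumption (lower-cased runs of letters), and the sample text is invented; dedicated tools do considerably more.

```python
import re
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lower-cased runs of ASCII letters."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text.
text = ("The corpus is a collection of texts. A corpus can be small, "
        "but a large corpus shows more patterns.")

tokens = tokenize(text)
freq = Counter(tokens)           # counting
print(freq.most_common(3))       # sorting by frequency

for left, node, right in kwic(tokens, "corpus"):   # presentation
    print(f"{left:>20} [{node}] {right}")
```

The output still has to be read and interpreted by a person, which is the point made on the slide.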

We need MORE TOOLS!!

corpus collection and cleaning
corpus preprocessing (tokenization, segmentation)
corpus (automatic) tagging
corpus (manual) annotation
corpus pattern extraction
corpus statistical analysis
AND MORE !

‘c’est le point de vue qui crée l’objet’ (it is the viewpoint which creates the object), (Saussure).

If the dimensions of the viewpoint change as they did, the object created is substantially different from before.
Human input is required at almost every stage, from corpus building (deciding what should go in the corpus) to corpus analysis (what research questions should be asked, what should be looked for, what analytical procedures should be carried out, how the results can be interpreted).

'Corpus linguistics as a ‘methodology’ rather than a traditional branch of linguistics like semantics, grammar, phonetics or sociolinguistics' (McEnery and Wilson 1996).

What we have witnessed in the development of corpus linguistics as a discipline is that our chosen methodological standpoint has progressively determined both the object and the aim of the enquiry.

Methodological call in the context of Web as Corpus

The explosion of information that affects corpus building and theorizing
The change in the quality of evidence is now obvious to most scholars, and observations about instances of language use systematically affect statements about the language system in general.
The problem for the linguist has shifted from accessing large enough quantities of data to elaborating a reliable methodology to describe and take into account this type of unprecedented evidence.

Lab session: Word Sketch Engine

Thanks

Text Analysis with R

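library(zipfR)        # load the zipfR package for lexical statistics
data(ItaRi.spc)       # frequency spectrum data set shipped with zipfR
data(ItaRi.emp.vgc)   # empirical vocabulary growth curve shipped with zipfR
summary(ItaRi.spc)    # print sample size, vocabulary size and class sizes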

## zipfR object for frequency spectrum
## Sample size:     N  = 1399898 
## Vocabulary size: V  = 1098 
## Class sizes:     Vm = 346 105 74 43 39 25 27 15 ...
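
To make the three summary lines above concrete: N is the number of tokens, V the number of distinct types, and Vm the number of types that occur exactly m times. The Python sketch below computes the same quantities for a toy token list; it is only an illustration of the definitions (in Python rather than R), not a description of how zipfR works internally, and the function name `frequency_spectrum` is made up for the example.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Return (N, V, spectrum), where spectrum maps m to V_m."""
    type_freq = Counter(tokens)               # type -> frequency
    spectrum = Counter(type_freq.values())    # frequency m -> number of types
    n = sum(type_freq.values())               # sample size N (tokens)
    v = len(type_freq)                        # vocabulary size V (types)
    return n, v, dict(sorted(spectrum.items()))

# Toy sample: each letter stands for one word type.
tokens = "a b a c a b d e e f".split()

n, v, vm = frequency_spectrum(tokens)
print("N =", n, "V =", v, "Vm =", vm)
```

In this toy sample three types occur once (the hapax legomena), two occur twice and one occurs three times, so the class sizes add up to V and the frequency-weighted class sizes add up to N.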

par(mfrow = c(2, 2))  # arrange subsequent plots in a 2 x 2 grid

VGC and more

plot(ItaRi.spc)             # frequency spectrum: number of types per frequency class

plot(ItaRi.spc, log = "x")  # same spectrum with a logarithmic x-axis

VGC and more

plot(ItaRi.spc, main = "Frequency Spectrum")  # spectrum with an explicit plot title

plot(ItaRi.emp.vgc, add.m = 1)                # vocabulary growth curve, plus the growth of hapax legomena (V1)
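
A vocabulary growth curve records how the number of types V (and, optionally, the number of hapax legomena V1) develops as more tokens are read. The Python sketch below computes such a curve for a toy token list; it illustrates the idea only, and the function name `vocabulary_growth` and the `step` parameter are assumptions made for the example.

```python
from collections import Counter

def vocabulary_growth(tokens, step=1):
    """Return a list of (N, V, V1) points sampled every `step` tokens."""
    counts = Counter()
    hapaxes = 0
    curve = []
    for n, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1      # new type, seen exactly once so far
        elif counts[tok] == 2:
            hapaxes -= 1      # type is no longer a hapax
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(counts), hapaxes))
    return curve

# Toy sample: each letter stands for one word type.
tokens = "a b a c a b d e e f".split()

for n, v, v1 in vocabulary_growth(tokens, step=5):
    print(f"N={n:2d}  V={v}  V1={v1}")
```

Plotting V and V1 against N gives the kind of curve drawn by the plotting call above, with vocabulary growth slowing as the sample gets larger.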
