Corpus Linguistics

Corpus-based Analysis [1]


Shu-Kai Hsieh

GIL, National Taiwan University

Today's Topics

  1. Corpus-based Analysis [1]
  2. Seminar presentation
  3. [Lab] GATE platform

Corpus Annotation and Corpus-based Linguistic Analysis

Corpus annotation is NOT a must.

Corpus annotation is needed only when there is no other choice.

Corpus-based Empirical Methods

Distribution: toward a Unified Empirical Linguistics [1], where evidence of all kinds (textual, psychological and neurological) is, as a matter of course, used in concert to uncover the nature of language. In such a context, corpus linguistics will reach its full potential as a methodology.

Review: The chain of tasks that corpus data science involves

  1. Pre-processing (cleaning, tokenizing, segmentation, etc.)
  2. Data annotation: (semi-automatic) labeling (e.g. POS tagging) and management
  3. Exploratory Data Analysis (with a working knowledge of statistics)
  4. Hypothesis testing
  5. Prediction and statistical modeling, etc.
  6. Presentation and Web application (Demo: Shiny-LexicoR)
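
Step 4 can be illustrated with a small keyness-style test. The sketch below assumes the log-likelihood (G2) statistic, one common way of checking whether a word is significantly more frequent in one corpus than in another; the function name and all counts are hypothetical, and other tests (e.g. chi-square) would serve equally well.

```python
import math

def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 statistic for one word's frequency in corpus A versus corpus B."""
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: a word occurs 100 times in 10,000 tokens of corpus A
# and 50 times in 10,000 tokens of corpus B.
score = log_likelihood(100, 10_000, 50, 10_000)
# With 1 degree of freedom, values above 3.84 are significant at p < 0.05.
print(round(score, 2), score > 3.84)
```

A word with the same relative frequency in both corpora scores 0; the larger the score, the stronger the evidence that the difference is not due to chance.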

Corpus-based Analysis: Basics

A corpus does not contain new information about language, but the software offers us a new perspective on the familiar.

  • Two related processes: the production of frequency lists (in rank order, sorted alphabetically, or ordered by keyness) and the generation of concordances (examples of particular items in context). Nota bene: to compare frequency counts across corpora of different sizes, the counts must be normalised.
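
A minimal sketch of a frequency list and of normalisation, assuming a deliberately crude regex tokenizer and a per-million-words base (both are illustrative choices, not part of the lecture):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (crude on purpose)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs in rank order, most frequent first."""
    return Counter(tokens).most_common()

def normalised_frequency(raw_count, corpus_size, base=1_000_000):
    """Scale a raw count to occurrences per `base` tokens."""
    return raw_count * base / corpus_size

tokens = tokenize("The cat and the dog")
print(frequency_list(tokens)[0])

# Hypothetical counts: 50 hits in a 200,000-token corpus versus
# 120 hits in a 1,000,000-token corpus.
print(normalised_frequency(50, 200_000), normalised_frequency(120, 1_000_000))
```

In the hypothetical comparison, the smaller corpus has the lower raw count but the higher normalised frequency, which is exactly why raw counts should not be compared directly.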

Now, what unknowns can concordances and frequency lists reveal?


In only 4 lines (lines 5, 13–15) is "now" clause-medial and acting as a temporal adverb; in the remaining lines it is clause-initial and has discourse-level functions, signalling a change of focus or topic.
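
The kind of observation made about "now" can be approximated with a keyword-in-context (KWIC) concordancer. The sketch below is only a rough illustration: the sample text is invented, and "sentence-initial" (nothing, or sentence-final punctuation, before the node word) is used as a crude stand-in for clause-initial position, which would really need syntactic analysis.

```python
import re

def kwic(text, node, width=30):
    """Return (left context, match, right context) for each occurrence of node."""
    pattern = r"\b%s\b" % re.escape(node)
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

def is_sentence_initial(left):
    """Crude proxy: the node starts the text or follows '.', '!' or '?'."""
    stripped = left.rstrip()
    return stripped == "" or stripped[-1] in ".!?"

# Invented sample text, for illustration only.
sample = "Now, let us move on. I am busy now. Now what shall we do?"
for left, node, right in kwic(sample, "now"):
    position = "initial" if is_sentence_initial(left) else "medial/final"
    print(f"{left:>30} [{node}] {right:<30} {position}")
```

Sorting or grouping the output by the position label gives a quick first pass before reading the concordance lines by hand.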

Corpus-based Analysis: Pattern recognition

Digging out patterns: An Example

the black experience (taken from Obama’s speech, from a 450-million-word Bank of English corpus)

Why are patterns difficult to spot?

  1. Repetition in naturally occurring conversation is transient and fleeting; it may have no perceptible effect, or its effects may not be ascribed to the repetition itself.
  2. Patterning involves the repetition of ‘things’, but those ‘things’ may be of many different kinds.


Patterns can also be interpreted as a co-occurrence of a language form and a particular context.

Collocation system demo: Just-the-word
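
Co-occurrence patterns of this kind are often quantified with an association measure. The sketch below assumes pointwise mutual information (PMI) over adjacent word pairs; the function name, the frequency threshold, and the toy token list are hypothetical, and this is not a description of how any particular collocation tool works.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI, highest first.

    Pairs seen fewer than min_count times are skipped, because PMI
    overrates rare pairs.
    """
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy token list, for illustration only.
toy = "strong tea and strong tea but powerful computer and powerful computer".split()
for pair, score in bigram_pmi(toy):
    print(pair, round(score, 2))
```

A positive score means the two words occur together more often than their individual frequencies would predict.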

Semantic prosody and semantic preference

Corpus-based Studies: Dimensions

  1. Function: Functional Linguistics
  2. Computation: Computational Linguistics
  3. Process: Psycholinguistics and Language Acquisition
  4. Application: Lexicography and Language Teaching

Functional Linguistics

Computational Linguistics

Psycholinguistics and Language Acquisition

Lexicography and Language Teaching


[1] Write an essay: How to use corpora (or corpus linguistics) for (writing dictionaries | the study of health communication | language teaching and learning | ...)

[2] Annotate a text with (word sense | POS | polarity | discourse marker | named entity | ...)

Review: Annotation projects involve:

  1. Defining the target phenomena and deciding on the annotation level
  2. Drafting guidelines and annotation schemas
  3. Choosing annotators (Alternative)
  4. Annotation tools
  5. Evaluation
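
For step 5, one widely used check is inter-annotator agreement. The sketch below assumes two annotators labeling the same items and computes Cohen's kappa, which corrects raw agreement for agreement expected by chance; the labels are invented, and the slides do not say which evaluation measure is intended.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented POS labels from two hypothetical annotators.
annotator_1 = ["N", "V", "N", "N", "V", "N"]
annotator_2 = ["N", "V", "V", "N", "V", "N"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Low kappa values usually point back to steps 1 and 2: the target phenomenon or the guidelines may need to be sharpened before annotating more data.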

Lab session: Annotation tools


Lab session: GATE


Lab session: Some terminology

(Prevor, 2014)


