Corpus Linguistics

Corpus Linguistics: Annotation [1]


Shu-Kai Hsieh

GIL, National Taiwan University

Today's Topics

  1. Linguistic Annotation [1]
  2. Seminar presentation: morpho-syntactic tagging
  3. [Lab] Word Sketch Engine

Annotation: a major step toward the interoperability of language resources

Annotation involves a methodology for adding information to a document at some level: a word or phrase, a paragraph or section, or the entire document. ... The ultimate goal is to enable interoperability among annotations for different linguistic phenomena for the same language, together with linguistic annotations applied to different languages and modalities.

Annotation Science

Annotation: what to annotate?

Annotation: standardized framework?

Annotation: quality control

Inter-annotator agreement

Intra-annotator agreement

How to measure agreement between annotators?
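
One common answer is a chance-corrected agreement coefficient such as Cohen's kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is only an illustration, not taken from the slides: the function name cohen_kappa and the toy POS labels are assumptions made for the example, and kappa is just one of several possible measures.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both pick the same label,
    # estimated from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: two annotators tagging the same six tokens.
annotator_1 = ["N", "V", "N", "ADJ", "N", "V"]
annotator_2 = ["N", "V", "ADJ", "ADJ", "N", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

The same function can be applied to intra-annotator agreement by passing two passes of one annotator over the same items.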

How good is a given annotation? Is it correct? Is it consistent? How can you check this for thousands of sentences? The annotation manual can easily run to 50 or 100 pages, and annotation takes a lot of time. Example: SALSA, 20,000 sentences, about 4 years.

Is there any way we can speed this up?

Conducting a corpus annotation project

Things that must be taken into consideration (Palmer and Xue, 2009):

Annotation schemes: Morpho-Syntactic Tagging

Annotation schemes: Syntactic structure (e.g., treebanking)

parser demo

Exercise [1]

For advanced use:

Exercise [2]


Prepare for the quiz (AntConc, BNCweb, and WSE).

Lab session: Word Sketch Engine



