Recap: Levels of Annotation

• Linguistic levels of corpus annotation:
• Phonetic/Prosodic/POS/Syntactic/Semantic/Discoursal/Pragmatic/Stylistic
• Paralinguistic levels of corpus annotation:
• Emotion/Affect/Personality
• Conceptual levels
• Ontological class/Common sense

More on Levels of Annotation

• More complex than straightforward class-labeling or relation-labeling.
• Coreference annotation
• References to entities in a document are identified as mentions (e.g., pronouns and named entities such as Person, Location, Organization), and mentions of the same entity are linked as coreferent.
• Temporal relations: identification of (nominal or verbal) events and their participants (cf. the TimeBank corpus).
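
A minimal sketch of why coreference annotation goes beyond class labeling: mentions are stored as character spans and then grouped into chains, one chain per entity. The `Mention` class, the `chains` layout, the entity ids, and the example sentence are all illustrative assumptions, not the format of any particular corpus.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    kind: str   # hypothetical label, e.g. "named_entity" or "pronoun"

# Invented example text, for illustration only.
text = "Mary Smith moved to Paris. She likes it there."

def find(surface, kind):
    """Locate the first whole-word occurrence of `surface` in `text`."""
    match = re.search(r"\b" + re.escape(surface) + r"\b", text)
    if match is None:
        raise ValueError(f"{surface!r} not found")
    return Mention(match.start(), match.end(), kind)

# Each chain links all mentions of one entity (hypothetical ids e1, e2).
chains = {
    "e1": [find("Mary Smith", "named_entity"), find("She", "pronoun")],
    "e2": [find("Paris", "named_entity"), find("it", "pronoun")],
}

def surface_forms(chain):
    """Recover the text covered by every mention in a chain."""
    return [text[m.start:m.end] for m in chain]

for entity_id, chain in chains.items():
    print(entity_id, surface_forms(chain))
```

Storing spans rather than strings keeps the annotation tied to exact positions in the document, so two occurrences of the same word can belong to different chains.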

More on Levels of Annotation

Units (crossing the sentence boundary) reflect the communicative function of the sentence

• Topic-focus articulation/rhetorical structures and discourse connectives/anaphora and coreference

Topic-Focus Articulation (TFA)

• Topic: What is the sentence about?
• The topic (or theme) is the part of the proposition that is being talked about (predicated). Once stated, the topic is therefore old news, i.e. the things already mentioned and understood.
• Focus: What information about the topic is asserted?
• The focus determines which part of the sentence contributes the most important information. The focus may be highlighted either prosodically or syntactically or both, depending on the language.
• An important role is played by the position of the intonation marker.

TFA: Example

Prague Dependency Treebank / [source: Schulte im Walde & Zinsmeister, 2006]

Rhetorical Structure

• Rhetorical Structure Theory (RST): a theory of discourse structure which offers an explanation of the coherence of texts.
• Two (adjacent) spans of text are related such that one of them has a specific role relative to the other.
• The more central span (e.g., a claim) is called the nucleus, and the supporting span (e.g., the evidence for that claim) is called the satellite.

Discourse Connectives

• Connectives: subordinating, coordinating, adverbial, and implicit.
• The Penn Discourse TreeBank (PDTB) is a large-scale corpus annotated with discourse connectives and their arguments (with associated semantic roles).

Discourse Connectives: Subordinating conjunctions

• Clauses that syntactically depend on the main clause:
• temporal (such as when, as soon as)
• causal (such as because)
• concessive (such as although, even though)
• purpose (such as so that, in order that)
• conditional (such as if, unless)

Because [the drought reduced U.S. stockpiles], [they have more than enough storage space for their new crop], and that permits them to wait for prices to rise.
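
As a rough sketch, the bracketed example above can be stored as one record holding the connective and its two arguments. The `ConnectiveAnnotation` class and its field names are hypothetical; the `arg1`/`arg2` naming follows the PDTB convention that Arg2 is the argument syntactically attached to the connective.

```python
from dataclasses import dataclass

@dataclass
class ConnectiveAnnotation:
    connective: str  # surface form of the connective
    conn_type: str   # "subordinating", "coordinating", "adverbial" or "implicit"
    arg1: str        # the other argument
    arg2: str        # the argument syntactically attached to the connective

sentence = (
    "Because the drought reduced U.S. stockpiles, they have more than "
    "enough storage space for their new crop, and that permits them "
    "to wait for prices to rise."
)

ann = ConnectiveAnnotation(
    connective="Because",
    conn_type="subordinating",
    arg1="they have more than enough storage space for their new crop",
    arg2="the drought reduced U.S. stockpiles",
)

def spans_in_sentence(a, s):
    """Check that the connective and both arguments occur verbatim in s."""
    return all(part in s for part in (a.connective, a.arg1, a.arg2))

print(spans_in_sentence(ann, sentence))
```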

Discourse Connectives: Coordinating conjunctions

• Coordinating conjunctions are ones such as and, but, and or.
• Coordination of nominal and other non-clausal constituents, as well as VP coordination, is excluded.

[William Gates and Paul Allen in 1975 developed an early language-housekeeper system for PCs], and [Gates became an industry billionaire six years after IBM adapted one of these versions in 1981].

Opinion annotation

• Covers opinions, evaluations, emotions, sentiments, and other private states expressed in texts.
• Typically involves filling in several different feature values rather than simply assigning class labels.
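
To illustrate what "filling in several feature values" can look like, here is a minimal sketch of an opinion frame. The feature names (`source`, `target`, `polarity`, `intensity`) and the example values are assumptions chosen for illustration, not the inventory of a specific annotation scheme.

```python
from dataclasses import dataclass, asdict

@dataclass
class OpinionFrame:
    expression: str  # text span that carries the opinion
    source: str      # who holds the private state
    target: str      # what the private state is about
    polarity: str    # hypothetical values: "positive", "negative", "neutral"
    intensity: str   # hypothetical values: "low", "medium", "high"

# Invented annotation for the sentence "The critic really hates the sequel."
frame = OpinionFrame(
    expression="really hates",
    source="The critic",
    target="the sequel",
    polarity="negative",
    intensity="high",
)

def filled_features(f):
    """List the features the annotator has filled in (non-empty values)."""
    return [name for name, value in asdict(f).items() if value]

print(filled_features(frame))
```

A single class label such as "negative" would discard the source, target, and intensity that the frame keeps.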

Emotional Chunks (Hsieh and Lu)

• Linguistic units of affective expressions: the way we identify the expressive units of emotion influences how researchers conceive of their nature and functioning.
• The formal treatment of language prevalently assumed in linguistics requires sound- and meaning-bearing linguistic units to be discretely distributed and governed by syntax. Under such a framework, the analysis of emotional expressions is restricted by predefined grammatical boundaries, such as lexical or phrasal categories.

Annotation Quality

Crucial issue: are the annotations correct?

• A machine learner trained on the data learns to make the same mistakes as the human annotators, which results in a misleading evaluation of its performance.
• Linguistic analyses based on the data yield inconclusive or misleading results.

Validity vs. Reliability

(Artstein and Poesio, 2008)

• We are interested in the validity of the manual annotation, i.e., whether the annotated categories are correct, but there is no "ground truth":
• Linguistic categories are determined by human judgment
• Consequence: we cannot measure correctness directly

• Instead, measure the reliability of the annotation,
• i.e., whether human annotators consistently make the same decisions $\rightarrow$ they have internalized the scheme.
• Assumption: high reliability implies validity
• How can reliability be determined?

Cases

• each item is annotated by a single annotator, with random checks (≈ second annotation)
• some of the items are annotated by two or more annotators
• each item is annotated by two or more annotators - followed by reconciliation
• each item is annotated by two or more annotators - followed by final decision by superannotator (expert)

In all cases, reliability is measured by calculating coefficients of agreement.
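
For the cases with two or more annotators per item, a minimal sketch of picking out the items that would go to reconciliation or to the superannotator. The function name, the dictionary layout, and the labels are hypothetical.

```python
def items_needing_adjudication(annotations):
    """Return the ids of items whose annotators did not all agree.

    annotations: dict mapping an item id to the list of labels that
    the different annotators assigned to that item.
    """
    return sorted(
        item for item, labels in annotations.items() if len(set(labels)) > 1
    )

# Invented labels from two or three annotators per item.
annotations = {
    "tok1": ["N", "N"],
    "tok2": ["V", "N"],
    "tok3": ["ADJ", "ADJ", "ADJ"],
    "tok4": ["N", "ADJ", "N"],
}

disputed = items_needing_adjudication(annotations)
print(disputed)
```

Items not returned here already carry a unanimous label and need no further decision.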

Rare Case

In some rare cases, there exists a "correct" annotation (gold standard).

• Recall: measures the quantity of found annotations

$\text{Recall} = \frac{\text{number of correct found annotations}}{\text{number of expected annotations}}$

• Precision: measures the quality of found annotations

$\text{Precision} = \frac{\text{number of correct found annotations}}{\text{total number of found annotations}}$

• F1-score: the harmonic mean of precision and recall (the balanced F-score)

$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$
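
The three measures can be sketched in a few lines. This assumes each annotation is a hashable item (here an invented `(start, end, label)` triple) and that a found annotation counts as correct only if it matches a gold annotation exactly. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(found, gold):
    """Compare found annotations against a gold standard.

    Both arguments are collections of hashable annotations; a found
    annotation is correct only if it is identical to a gold one.
    """
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: 4 expected annotations, 3 found, 2 of them correct.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "PER")}
found = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}

p, r, f1 = precision_recall_f1(found, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here precision is 2/3 and recall is 2/4, so the harmonic mean is 4/7, which is lower than the arithmetic mean of the two.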

What if no gold standard exists?

Use chance-corrected coefficients of agreement between annotators: the $S$, $\kappa$, and $\pi$ measures.
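
A minimal two-annotator sketch of the three coefficients, following the presentation in Artstein and Poesio (2008): each has the form (A_o - A_e) / (1 - A_e), where A_o is the observed agreement, and they differ only in how the expected agreement A_e is estimated. The function name, the labels, and the choice to return NaN when A_e equals 1 are assumptions of this sketch.

```python
from collections import Counter

def agreement_coefficients(labels_a, labels_b, categories):
    """Compute S, pi and kappa for two annotators over the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two equally long, non-empty label lists")

    # Observed agreement: share of items with identical labels.
    a_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)

    # S: all categories are assumed equally likely.
    e_s = 1 / len(categories)
    # pi: one category distribution pooled over both annotators.
    e_pi = sum(((count_a[c] + count_b[c]) / (2 * n)) ** 2 for c in categories)
    # kappa: a separate category distribution per annotator.
    e_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    def corrected(a_e):
        # Undefined when chance agreement is already perfect.
        return float("nan") if a_e == 1 else (a_o - a_e) / (1 - a_e)

    return {"S": corrected(e_s), "pi": corrected(e_pi), "kappa": corrected(e_kappa)}

# Invented labels: the annotators disagree on exactly one of ten items.
ann_a = ["pos"] * 6 + ["neg"] * 4
ann_b = ["pos"] * 5 + ["neg"] * 5

scores = agreement_coefficients(ann_a, ann_b, ["pos", "neg"])
print(scores)
```

With 9 of 10 items in agreement, S and kappa both come out at 0.8 here, while pi is slightly lower because the pooled distribution gives a higher expected agreement (0.505 instead of 0.5).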

