2017 National Chiayi University

Programming for Corpus Linguistics

Shu-Kai Hsieh (謝舒凱), Graduate Institute of Linguistics, National Taiwan University

Outline

  1. Background
  2. Corpus Linguistics
  3. Challenges for Corpus Linguistics
  4. Conclusion

Historical Context

History of Corpus Linguistics

A debate that no longer seems all that interesting

Corpus Linguistics: Definition

Long-established definition/applications of a corpus

as a collection of authentic (naturally occurring) language, either written or spoken, which has been compiled for a particular purpose (Sinclair 1991; Stubbs 1996; Hunston 2002).

Corpus Linguistics

(Figure: Candlin and Hall (eds.) 2012)

Corpus Linguistics: Methodology

The need for a ‘sizeable sample of real-life usage’ to ensure there exists adequate evidence for generating or testing hypotheses about the language (Sampson, 2001).

Corpus Linguistics: Methodology

Comparing Corpora (i.e. language samples)
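
One common way to compare two language samples is a keyness score such as log-likelihood (G2), which flags words that are unusually frequent in one corpus relative to the other. The sketch below is an illustration only: the function names `log_likelihood` and `keywords` and the toy sentences are assumptions, not something taken from the slides.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a word seen `a` times in a corpus of `c` tokens
    and `b` times in a corpus of `d` tokens."""
    # Expected frequencies if the word were equally likely in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(tokens_a, tokens_b, top=5):
    """Rank every word in either corpus by its log-likelihood score."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    c, d = len(tokens_a), len(tokens_b)
    scored = [(w, log_likelihood(fa[w], fb[w], c, d)) for w in set(fa) | set(fb)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Toy data (hypothetical); real corpora would be tokenized text files
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log".split()
for word, score in keywords(corpus_a, corpus_b):
    print(f"{word}\t{score:.3f}")
```

Words shared equally by both samples score zero; the score says nothing about which corpus a word favors, so the raw frequencies still need to be inspected.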

Corpus Linguistics: Tools

Please set aside your pride as a linguist

When you are holding a hammer, everything looks like a nail

Corpus Tools

Concordance (and how it ruined corpus linguistics)

How many pages of Google search results do you actually read?
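
For readers who have not built one, a concordance is essentially a keyword-in-context (KWIC) listing. The following is a minimal sketch, assuming whitespace-tokenized text; the function name `kwic` and the sample sentence are made up for illustration.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, node, right context) for each hit of `keyword`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample text
text = "the corpus is a collection of texts and the corpus is compiled for a purpose"
for left, node, right in kwic(text.split(), "corpus", window=3):
    # Right-align the left context so the node words line up in a column
    print(f"{left:>20} [{node}] {right}")
```

Even this toy version shows the limitation raised above: the output grows linearly with the number of hits, and someone still has to read it.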

Corpus Tools

word and ngram frequency
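
Word and n-gram frequency lists need very little code. A minimal sketch using only the Python standard library follows; `ngrams` and `ngram_freq` are illustrative names, and the input is assumed to be already tokenized (for Chinese this presupposes a word segmenter).

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-grams as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def ngram_freq(tokens, n=2):
    """Count how often each n-gram occurs."""
    return Counter(ngrams(tokens, n))

tokens = "to be or not to be".split()
print(ngram_freq(tokens, 1).most_common(2))  # most frequent words
print(ngram_freq(tokens, 2).most_common(1))  # most frequent bigram
```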

Corpus Tools

colligation and collocation (network)

(GraphColl, Brezina et al. 2015)
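
To make the idea of a collocate concrete, the sketch below ranks words that co-occur with a node word inside a fixed window, using a mutual-information style score, log2(observed / expected). This is not GraphColl's implementation; the function name `collocates`, the window-size correction in the expected frequency, and the toy sentence are assumptions for illustration.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2, min_count=2):
    """Rank words occurring within +/- `window` tokens of `node` by log2(O/E)."""
    freq = Counter(tokens)
    n = len(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            # Count every token in the window except the node position itself
            co.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    scores = {}
    for word, observed in co.items():
        if observed >= min_count:
            # Expected co-occurrences by chance, scaled by the window span
            expected = freq[node] * freq[word] * 2 * window / n
            scores[word] = math.log2(observed / expected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample text
tokens = "strong tea and strong coffee but powerful computer and strong tea".split()
print(collocates(tokens, "strong", window=1))
```

Running the same function for each returned collocate, and linking the results, gives the edges of a simple collocation network.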

Corpus tools + Data Science

Profile, Dashboard, visualization

Corpus Tools + Data Visualization

motion chart for dynamic visualization of language change

Corpus Tools + Natural Language Processing

Corpus Linguistics in a Post-concordancer Era [Wang, 2017].

Corpus Tools: Summary

Summary of the main limitations of 'corpus linguistics' (Candlin and Hall, 2012)

How social media is changing Language/Corpus linguistics/NLP

Corpus as Social Sensor in Taiwan: Plurk

Corpus as Social Sensor in Taiwan: PTT

http://lopen.linguistics.ntu.edu.tw/pttcorp

PTT dynamic crawler

Outline

  1. Background
  2. Corpus Linguistics
  3. Challenges and *Programmable* Corpus Linguistics
  4. Conclusion

Challenges

A corpus presents decontextualised language data divorced from its original context. (Aston 1995; Widdowson 1998)

WaC (Web as Corpus) is in

Why Chinese WaC is Wacky

Frequency Spectrum (Hsieh, 2013)
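
A frequency spectrum counts how many word types occur exactly m times (how many hapax legomena, how many types seen twice, and so on). A minimal sketch follows, assuming a pre-tokenized corpus; the name `frequency_spectrum` and the toy tokens are illustrative and are not the data from Hsieh (2013).

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency m to the number of types that occur exactly m times."""
    type_freq = Counter(tokens)              # type -> token count
    spectrum = Counter(type_freq.values())   # m -> number of types with count m
    return dict(sorted(spectrum.items()))

tokens = "a a a b b c d e".split()
print(frequency_spectrum(tokens))
```

A web corpus with noisy segmentation tends to inflate the low end of the spectrum, since every segmentation error can create a new rare "type".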

Why Chinese WaC is Wacky

Vocabulary Growth Curve (Hsieh, 2013)
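
A vocabulary growth curve tracks the number of distinct types seen after reading the first N tokens. The sketch below is one simple way to compute the points of such a curve; `vocabulary_growth` and its `step` parameter are hypothetical names, and plotting is left out.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens read, types seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Always record the final point so the curve reaches the corpus size
        if n % step == 0 or n == len(tokens):
            curve.append((n, len(seen)))
    return curve

tokens = "a b a c b d".split()
print(vocabulary_growth(tokens, step=2))
```

In a clean corpus the curve flattens as N grows; a curve that keeps climbing steeply is one symptom of the noise discussed here.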

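Even a manual check is not that easy

(plurk_word_list: the personal name 朱學恆 appears segmented and tagged as 朱學(Na)恆(D))

Issues with a more linguistic flavor are easily overlooked

And that is before we get to the complexity of language variation in Chinese

http://lopen.linguistics.ntu.edu.tw/diffseg/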

Developing Linguistic Annotation for Machine Learning Algorithms

Annotation

Annotation is the linguist's modern-day comeback

Lopotator

We need flexible tools

What else can a corpus do? Integrating ontological knowledge

What else can a corpus do? Organizing commonsense knowledge

Social and historical dimensions

From the dance and musicality of discourse to sentiment analysis

“How can we know the dancer from the dance?” (William Butler Yeats)

From Corpus to Knowledge

Diachronic character/word embeddings

Outline

  1. Background
  2. Corpus Linguistics
  3. Challenges and *Programmable* Corpus Linguistics
  4. Conclusion

Programming for Corpus Linguistics

(a.k.a.) Corpus Linguistics with Python/R: first get to know the tools already out there

Programming for Corpus Linguistics

Standalone / System / Programming
Crawling: bootcat, wordsketch, :)
Preprocessing: antconc, :)
Indexing: CWB (CQL), :)
Analysis: wordsketch etc., :)

WordHoard, GraphColl, Coquery, ......

Programming for Corpus Linguistics

Understand the existing tools first, then understand your own (research/application) needs

Lab session

Outline

  1. Background
  2. Corpus Linguistics
  3. Challenges and Programmable Corpus Linguistics
  4. Conclusion

My two cents

This is a good time to become a corpus linguist.

Will chatbots eventually rule the whole field?

Thank you

References

[1] J. Pennebaker. 2011. The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury Press.

[2] Candlin and Hall (eds.) 2012. Corpora and language education. Lynne Flowerdew.

[3] Wang S.H. (2017) Text Analysis of Corpus Linguistics in a Post-concordancer Era. In: Wu TT., Gennari R., Huang YM., Xie H., Cao Y. (eds) Emerging Technologies for Education. SETE 2016. Lecture Notes in Computer Science, vol 10108. Springer.