Linguistic Analysis and Data Science

Lecture 08

謝舒凱 Graduate Institute of Linguistics, NTU



  1. Statistics and Machine Learning 101
  2. Text classification using RTextTools


  • Preparing / Preprocessing text and data. Text is unstructured or partially structured data that must be prepared for analysis. We extract features from text. We define measures. Quantitative data are often messy or missing. They may require transformation prior to analysis. Data preparation consumes much of a data scientist’s time.

  • Exploratory data analysis and infographics: data visualization for the purpose of discovery. We look for groups in data, find outliers, and identify common dimensions, patterns, and trends.

  • Prediction models (regression, classification, and clustering) and evaluation. Recommender systems, collaborative filtering, association rules, optimization methods based on linguistic heuristics, and a myriad of methods for regression, classification, and clustering all fall under the rubric of machine learning.
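As a rough illustration of "extracting features from text", the sketch below turns raw sentences into a bag-of-words count matrix. The helper names (tokenize, bag_of_words) and the two toy sentences are made up for this example, and the tokenizer only handles lowercase ASCII words, so it is a starting point rather than a full preprocessing pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocab, matrix), where matrix[i][j] counts vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Toy documents (hypothetical, for illustration only)
docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab, matrix = bag_of_words(docs)

for term, column in zip(vocab, zip(*matrix)):
    print(f"{term:8s} {column}")
```

Each row of the matrix is one document and each column one term, which is the shape most classifiers and clustering methods expect as input.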


Textual Statistics

  • Background knowledge in textual statistics
  • Examples: similarity and association
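Two common ways to make "similarity" and "association" concrete are cosine similarity between term-count vectors and pointwise mutual information (PMI) between words. The sketch below is one possible formulation, not necessarily the one used in class: PMI is computed at the document level (a word counts once per document), and the toy documents are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pmi(word_x, word_y, docs):
    """Document-level PMI: log2( P(x, y) / (P(x) * P(y)) )."""
    n = len(docs)
    p_x = sum(word_x in d for d in docs) / n
    p_y = sum(word_y in d for d in docs) / n
    p_xy = sum(word_x in d and word_y in d for d in docs) / n
    if p_xy == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(p_xy / (p_x * p_y))


# Similarity: two tiny "documents" as term counts
doc_a = Counter("a a b".split())
doc_b = Counter("a b b".split())
similarity = cosine(doc_a, doc_b)

# Association: each toy document is a set of word types
docs = [{"new", "york", "city"}, {"new", "york"}, {"new", "car"}, {"old", "car"}]
pmi_new_york = pmi("new", "york", docs)
pmi_new_car = pmi("new", "car", docs)
print(similarity, pmi_new_york, pmi_new_car)
```

A positive PMI means the two words co-occur more often than chance would predict; a negative value means less often.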


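  • Machine learning: a subfield of AI. (See the online course by Prof. 林軒田.)
  • Supervised vs. unsupervised
    • Chinese word segmentation is a useful problem for thinking this through
    • An illustrated introduction: basic concepts and decision trees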
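To make the decision-tree idea a little more concrete: a supervised learner uses the labels to decide which feature to split on first. One common criterion (assumed here; the lecture does not say which one it uses) is information gain, the drop in label entropy after a split. The features has_link and all_caps and the four toy messages are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical labeled messages: the labels are what makes this supervised
rows = [
    {"has_link": True, "all_caps": True},
    {"has_link": True, "all_caps": False},
    {"has_link": False, "all_caps": True},
    {"has_link": False, "all_caps": False},
]
labels = ["spam", "spam", "ham", "ham"]

gains = {f: information_gain(rows, labels, f) for f in ("has_link", "all_caps")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

In this toy data has_link separates the classes perfectly while all_caps tells us nothing, so a tree would split on has_link first. An unsupervised method would have to group the messages without seeing the labels at all.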


Annotation and Feature Engineering

  • Study of recorded human communication
  • Summary and quantitative analysis of communicated messages
  • Researcher looks for patterns/themes in text; develops code frame to categorize text.
  • Essentially, variables are extracted from text. The approach follows the scientific method and establishes objectivity via inter-coder reliability.
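Inter-coder reliability is often reported as Cohen's kappa, which corrects the raw agreement between two coders for the agreement expected by chance. The sketch below assumes exactly two coders and nominal categories; the ten codes are invented for illustration.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same code
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: from each coder's marginal category frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two annotators for ten text snippets
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Here the coders agree on 8 of 10 items, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw 80% agreement.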

Annotation and Feature Engineering: Pros and Cons


Pros

  • flexible; theoretically motivated annotation/code-frame efforts
  • can apply to texts, speech, video, etc.
  • can address the high-precision, low-recall problem of typical machine learning systems by uncovering latent meaning and sentiment

Cons

  • manually intensive
  • thus can be expensive


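  • The simplest option is to use Excel:

    • One (or more) column(s) for text data; one column for the topic label (as gold standard)
    • Typically there are more than 3,000 labeled documents.
  • Large projects need to consider sustainability, compatibility, and data interchange, so an annotation system is recommended.

    • Corpus and language processing community: GATE
    • Qualitative research community: CAT (Coding Analysis Toolkit)
    • lopetator
  • The difference between labeling and annotation will be discussed later.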
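A spreadsheet in the layout described above can be exported as CSV and read with a few lines of code. The sketch below assumes the columns are named text and label (hypothetical names; use whatever the sheet actually calls them) and reads from an in-memory sample so that it runs without a file.

```python
import csv
import io
from collections import Counter

# Hypothetical export of a hand-coded spreadsheet: one text column, one label column
SAMPLE_CSV = """text,label
"Parliament passed the budget bill",politics
"The team won the final, 3-1",sports
"New tax rules take effect",politics
"""


def load_labeled(handle, text_col="text", label_col="label"):
    """Read (text, label) pairs, skipping rows where either field is empty."""
    rows = []
    for record in csv.DictReader(handle):
        text = (record.get(text_col) or "").strip()
        label = (record.get(label_col) or "").strip()
        if text and label:
            rows.append((text, label))
    return rows


data = load_labeled(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in data)
print(len(data), dict(label_counts))
```

For a real file, replace the io.StringIO(...) argument with open("your_file.csv", newline="", encoding="utf-8"). Checking the label counts early is a cheap way to spot typos in the coding scheme and badly unbalanced classes.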


  • [create_matrix] Import your hand-coded data into R
  • [create_corpus] Remove irrelevant data and build the training dataset and the test dataset
  • [train model(s)] Choose machine learning algorithm(s) to train a model
  • [build classification model(s)] Test on the (out-of-sample) test data; establish accuracy criteria to gauge how well the model performs.
  • [apply classification model(s)] Use model to classify novel data
  • [create analytics] Find the data that were misclassified automatically; manually label data that do not meet the accuracy criteria
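The workflow above can be sketched end to end without any particular toolkit. The code below is not RTextTools: it swaps in a small hand-written multinomial Naive Bayes classifier in plain Python, with invented toy data and a hypothetical confidence threshold of 0.7, just to show how the steps fit together (train, test out of sample, classify novel data, flag uncertain cases for manual labeling).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, text):
        # Words never seen in training are ignored
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1
        top = max(log_scores.values())
        exps = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exps.values())
        return {k: v / z for k, v in exps.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        label = max(probs, key=probs.get)
        return label, probs[label]


# Hand-coded training data and held-out test data (toy examples)
train = [
    ("the election results and the new parliament", "politics"),
    ("the minister announced a tax reform", "politics"),
    ("voters back the government budget", "politics"),
    ("the team won the match last night", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the players after the game", "sports"),
]
test = [
    ("parliament debates the budget", "politics"),
    ("a late goal decided the match", "sports"),
]

# Train, then measure accuracy on the out-of-sample test data
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
accuracy = sum(model.predict(t)[0] == y for t, y in test) / len(test)

# Classify novel data; send low-confidence cases back for manual labeling
THRESHOLD = 0.7  # hypothetical accuracy criterion
novel = ["the government announced a new budget", "the new coach"]
needs_review = [t for t in novel if model.predict(t)[1] < THRESHOLD]
print(accuracy, needs_review)
```

With so little data the accuracy figure means nothing in itself; the point is the loop. Documents the model is unsure about go back to a human coder, and their labels can then be added to the training set.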


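  • RTextTools can automate some annotation work and perform supervised automatic text classification. It is simple to use, but it has memory problems and its Chinese support is problematic.
  • "One-stop-shop for conducting supervised machine learning with textual data": try it out as you read this article. See the sample code.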

Kaggle for Midterm.Mini-Hackathon

  • Kaggle: the home of data science (link)
  • This week, take another look on your own at how Kaggle works.