- Statistics and Machine Learning 101
- Text classification using
謝舒凱 Graduate Institute of Linguistics, NTU
Preparing / Preprocessing text and data. Text is unstructured or partially structured data that must be prepared for analysis. We extract features from text. We define measures. Quantitative data are often messy or missing. They may require transformation prior to analysis. Data preparation consumes much of a data scientist’s time.
Exploratory data analysis and infographics. This is data visualization for the purpose of discovery: we look for groups in data, find outliers, and identify common dimensions, patterns, and trends.
Prediction models (regression, classification, and clustering) and evaluation. Recommender systems, collaborative filtering, association rules, optimization methods based on linguistic heuristics, and a myriad of methods for regression, classification, and clustering all fall under the rubric of machine learning.
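As an illustration of the "extract features from text" step in the preparation stage above, here is a minimal sketch of a bag-of-words document-term matrix. It is written in plain Python rather than R, and the function names (`tokenize`, `bag_of_words`) and the two example documents are assumptions made for illustration only; they are not part of the course material.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is every distinct token seen in any document, sorted.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical example documents
docs = ["The movie was great, great fun!", "The plot was dull."]
vocabulary, matrix = bag_of_words(docs)

for doc, row in zip(docs, matrix):
    print(row, doc)
```

Each row of `matrix` is one document and each column one word, so unstructured text becomes quantitative data that a model can consume. Real pipelines usually add further steps such as stop-word removal or stemming.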
Define a code frame to categorize text.
The simplest way to do this is in Excel:
CAT (Coding Analysis Toolkit)
The difference between labeling and annotation will be discussed later.
[create_matrix] Import your hand-coded data into R.
[create_corpus] Remove "irrelevant" data, then build the training dataset and the test dataset.
[train model(s)] Choose machine learning algorithm(s) to train a model.
[build classification model(s)] Test on the (out-of-sample) test data; establish accuracy criteria to assess performance.
[apply classification model(s)] Use the model to classify novel data.
[create analytics] Identify the data that were automatically misclassified; manually label data that do not meet the accuracy criteria.
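The workflow above can be sketched end to end as follows. This is an illustrative Python version, not the R code used in the course: the classifier is a small multinomial Naive Bayes with add-one smoothing chosen only for brevity, and the hand-coded examples, the function names, and the 0.8 confidence threshold are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(examples):
    """Train a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def classify(model, text):
    """Return (best_label, probability_of_best_label) for one text."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, n_docs in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:  # skip words never seen in training
                # Add-one (Laplace) smoothing
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        log_scores[label] = score
    # Convert log scores into normalized probabilities.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / sum(exp_scores.values())


# Hand-coded data (hypothetical), already cleaned of irrelevant items
hand_coded = [
    ("great acting and a wonderful story", "positive"),
    ("wonderful music, great fun", "positive"),
    ("a great and moving film", "positive"),
    ("dull plot and terrible acting", "negative"),
    ("terrible pacing, a boring film", "negative"),
    ("boring and dull story", "negative"),
    ("great fun and wonderful acting", "positive"),
    ("a dull and terrible story", "negative"),
]

# Split into training data and out-of-sample test data.
train, test = hand_coded[:6], hand_coded[6:]
model = train_naive_bayes(train)

# Evaluate on the test data.
predictions = [classify(model, text)[0] for text, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print("test accuracy:", accuracy)

# Classify novel data; send low-confidence items back for manual labeling.
THRESHOLD = 0.8  # hypothetical accuracy criterion
novel = ["a great and wonderful film", "acting and story"]
needs_manual_label = [t for t in novel if classify(model, t)[1] < THRESHOLD]
print("needs manual labeling:", needs_manual_label)
```

Items flagged in the last step would be coded by hand and added to the training data, which is the feedback loop the final step of the workflow describes.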