MultiMoCo
MultiMoCo NTU
A pioneering large-scale multimodal corpus for languages in Taiwan that integrates video, dialogue, caption, and gesture layers with human annotation and multimodal machine learning workflows.

Ph.D. Student // Lab IT Manager
Da-Chen Lian is a Ph.D. student at the Graduate Institute of Linguistics at National Taiwan University, advised by Prof. Shu-Kai Hsieh. His work sits at the intersection of computational linguistics, large language models, and data-intensive approaches to language analysis. At LOPE he has contributed to MultiMoCo, a multimodal corpus of languages in Taiwan, and has led LLM pretraining work, including a Taiwan-law LLM trained on NVIDIA DGX H100 nodes, while also examining how tokenization, multilingual pretraining, and interpretability shape what language models actually learn about linguistic structure. He has served as LOPE's lab system administrator since 2017.
Journal of Library and Information Studies 2025
Large language models (LLMs) have in recent years spurred research across various sectors, owing to their remarkable zero-shot and few-shot performance. This capability has become indispensable for practitioners seeking to integrate language models into their workflows effectively. In this paper, based on in-depth linguistic analyses, we explore the application of an LLM, specifically GPT-4, to generating Chinese language textbooks tailored for grade school students, encompassing the creation of main lesson texts alongside accompanying Chinese character exercises. Experimental results suggest that LLM-generated textbook lessons are a viable research direction: the initial outcomes demonstrate the LLM's ability to generate texts of satisfactory quality, appropriate for a specified grade level. The contributions of this work include pioneering the quantitative analysis of Chinese language textbooks for native speakers in Taiwan and leveraging an LLM to automatically generate textbook content and accompanying Chinese character exercises for native Chinese speakers, a novel approach enabled by prompts tailored to different language learning levels. The study also conducts quantitative and qualitative comparisons between machine-generated lessons and those developed by educational professionals in Taiwan.
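The level-tailored prompting described above can be sketched as a small prompt builder. The level descriptions, function name, and topic below are invented for illustration; the paper's actual prompts are not reproduced here.

```python
# Hypothetical grade-level style guides; not the paper's actual prompt text.
LEVEL_GUIDES = {
    1: "use only high-frequency characters and short, simple sentences",
    3: "introduce compound sentences and a few common idioms",
    6: "allow longer paragraphs and more abstract vocabulary",
}

def build_lesson_prompt(grade: int, topic: str) -> str:
    """Compose a prompt asking an LLM for a lesson text plus character exercises."""
    guide = LEVEL_GUIDES[grade]
    return (
        f"Write a Chinese textbook lesson for grade {grade} students "
        f"on the topic of {topic}. Style constraint: {guide}. "
        "Then list five new characters from the lesson, each with a "
        "stroke-order exercise and an example sentence."
    )

prompt = build_lesson_prompt(3, "night markets in Taiwan")
```

The same template could then be sent to any chat-completion API; only the grade key changes between learner levels.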
arXiv preprint arXiv:2504.13603 2025
The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.
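Mechanically, DACP amounts to continuing a model's pre-training objective on in-domain text. A toy stand-in, replacing the LLM with a Laplace-smoothed unigram model (all corpora and names here are invented for illustration), shows the intended effect: perplexity on held-out legal text drops after the domain-adaptive update.

```python
import math
from collections import Counter

def perplexity(counts, total, vocab_size, eval_tokens):
    # Laplace-smoothed unigram perplexity: exp(mean negative log-likelihood).
    nll = 0.0
    for tok in eval_tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        nll += -math.log(p)
    return math.exp(nll / len(eval_tokens))

general = "the cat sat on the mat while the dog ran home".split()
legal = "the court held that the statute applies to the contract".split()
vocab = set(general) | set(legal)

# Base model: fit on general text only.
base = Counter(general)
# DACP stand-in: continue fitting the *same* model on domain text.
adapted = base + Counter(legal)

eval_legal = "the court held that the contract applies".split()
ppl_base = perplexity(base, sum(base.values()), len(vocab), eval_legal)
ppl_adapted = perplexity(adapted, sum(adapted.values()), len(vocab), eval_legal)
print(ppl_adapted < ppl_base)  # domain adaptation lowers legal-text perplexity
```

The paper's finding is that this gain in domain likelihood does not automatically translate into better performance on every downstream legal task.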
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 2024
Compressibility is closely related to the predictability of texts from the information-theoretic viewpoint. As large language models (LLMs) are trained to maximize the conditional probabilities of upcoming words, they may capture the subtlety and nuance of the semantic constraints underlying texts, and texts aligning with the encoded semantic constraints are more compressible than those that do not. This paper systematically tests whether and how LLMs can act as compressors of semantic pairs. Using semantic relations from English and Chinese Wordnet, we empirically demonstrate that texts with correct semantic pairings are more compressible than incorrect ones, as measured by the proposed compression advantage index. We also show, with the Pythia model suite and a model fine-tuned on Chinese Wordnet, that compression capacity is modulated by the data the model has seen. These findings are consistent with the view that LLMs encode semantic knowledge as underlying constraints learned from texts and can act as compressors of semantic information, or potentially of other structured knowledge.
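The compression-advantage idea rests on the Shannon link between probability and code length: under an ideal (arithmetic) coder, a string costs about -log2 p bits, so a model that assigns higher probability to correct pairings compresses them into fewer bits. A minimal sketch, substituting an add-one-smoothed bigram model for the LLM (the corpus and the `compression_advantage` name are invented here, not the paper's index), illustrates the mechanism:

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for LLM training data: correct hypernym pairs.
corpus = ["dog animal", "cat animal", "car vehicle", "bus vehicle"]

# Fit an add-one-smoothed bigram model (playing the role of the LLM's
# conditional next-word probabilities).
bigrams = defaultdict(Counter)
vocab = set()
for pair in corpus:
    words = pair.split()
    vocab.update(words)
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def bits(text):
    """Idealized compressed length: -log2 p(text), the Shannon code length
    an arithmetic coder would approach."""
    words = text.split()
    total = 0.0
    for prev, nxt in zip(words, words[1:]):
        p = (bigrams[prev][nxt] + 1) / (sum(bigrams[prev].values()) + len(vocab))
        total += -math.log2(p)
    return total

def compression_advantage(correct, incorrect):
    """Positive when the correct semantic pairing needs fewer bits."""
    return bits(incorrect) - bits(correct)

print(compression_advantage("dog animal", "dog vehicle"))  # 1.0 bit
```

The correctly paired "dog animal" costs one bit less than "dog vehicle" under this toy model; the paper performs the analogous comparison with actual LLM probabilities over Wordnet relation pairs.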
Frontiers in Language Sciences 2024
Formosan languages, spoken by the indigenous peoples of Taiwan, play a unique role in the reconstruction of Proto-Austronesian. This paper presents a real-world Formosan language speech dataset, comprising 144 h of news footage covering 16 Formosan languages, and uses self-supervised models to obtain and analyze their speech representations. From the news footage, 13 h of validated Formosan-language speech are selected, and a language classifier based on XLSR-53 is trained to classify the 16 Formosan languages with an accuracy of 86%. We extract and analyze the speech vector representations learned by the model and compare them with 152 manually coded linguistic typological features. The comparison shows that the speech vectors reflect the phonological and morphological properties of Formosan languages. Furthermore, the speech vectors and linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping accords with previous literature. These results suggest that current real-world language usage can be investigated through the speech model, and the dataset opens a window onto the Formosan languages in vivo.
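A common way to turn frame-level self-supervised features into the utterance-level "speech vectors" compared above is mean pooling followed by cosine similarity. The sketch below uses synthetic vectors in place of real XLSR-53 features (the language names, dimensions, and noise model are all invented), just to show the pooling-and-comparison step:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clip(center, frames=50, dim=16, noise=0.2):
    # Synthetic frame-level features: a shared per-language direction plus noise,
    # standing in for XLSR-53 outputs of shape (num_frames, feature_dim).
    return center + noise * rng.standard_normal((frames, dim))

def speech_vector(clip):
    # Mean-pool frame features into a single utterance-level vector.
    return clip.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two invented "languages", each with its own latent direction.
centers = {"lang_a": rng.standard_normal(16), "lang_b": rng.standard_normal(16)}
clips = {name: [make_clip(c) for _ in range(2)] for name, c in centers.items()}

same = cosine(speech_vector(clips["lang_a"][0]), speech_vector(clips["lang_a"][1]))
diff = cosine(speech_vector(clips["lang_a"][0]), speech_vector(clips["lang_b"][0]))
print(same > diff)  # same-language utterances sit closer in vector space
```

A distance matrix built from such pairwise similarities is the kind of input a hierarchical clustering step could turn into the phylogeny described in the abstract.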
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023) 2023
In this research, we comprehensively analyze the potential biases inherent in large language models (LLMs), using meticulously curated input data to ascertain the extent to which such data sway machine-generated responses toward prejudiced outcomes. Notwithstanding recent strides in mitigating bias in LLM-based NLP, our findings underscore the continued susceptibility of these models to data-driven bias. We use the PTT NTU board as the primary data source for this investigation. Moreover, our study shows that, in certain contexts, machines may manifest biases without supplementary prompts, yet they can be guided toward rendering impartial responses when provided with richer contextual nuance.
2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI) 2020
Modern conversational agents require high-quality datasets, which are often the bottleneck when building models. This paper introduces MatDC, an entirely human-produced Chinese dialogue dataset with full semantic annotations. The dataset features linguistic variations of users' intents and fully annotated semantic slots. Its curation comprises two stages. In the first, template design stage, domain editors construct schemas and compose ten dialogues between agents and users based on the back-end database. In the second, dialogue rewrite stage, rewriters generate sentential variations for each template under the constraint that the normalized slot values remain unchanged. This methodology makes MatDC easy to extend and adaptable to different domains. To demonstrate the dataset's applicability, we build a dialogue agent with a conventional pipeline architecture. We expect MatDC to provide additional training data and a testing ground for dialogue-agent studies.
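The rewrite-stage constraint, that sentential variations may change wording freely but must keep normalized slot values unchanged, can be enforced with a simple validation check. The record layout and function names below are invented for illustration, not MatDC's actual annotation schema:

```python
def slot_values(utterance):
    """Extract the normalized slot name -> value mapping from an annotation."""
    return {s["name"]: s["value"] for s in utterance["slots"]}

def rewrite_is_valid(template, rewrite):
    """A rewrite may vary the surface text but must preserve every slot value."""
    return slot_values(template) == slot_values(rewrite)

template = {
    "text": "Book a table for two at seven",
    "slots": [{"name": "party_size", "value": "2"},
              {"name": "time", "value": "19:00"}],
}
rewrite = {
    "text": "I'd like a 7 pm reservation for two people",
    "slots": [{"name": "time", "value": "19:00"},
              {"name": "party_size", "value": "2"}],
}
print(rewrite_is_valid(template, rewrite))  # True: wording differs, slots match
```

Running such a check over every rewritten variation is one way a curation pipeline could guarantee that the normalized annotations stay consistent across all surface forms of a template.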
// FRONTIER_RESEARCH