MultiMoCo
MultiMoCo NTU
A pioneering large-scale multimodal corpus for languages in Taiwan that integrates video, dialogue, caption, and gesture layers with human annotation and multimodal machine learning workflows.

Alumni // Former M.A. Student
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation 2023
Contextualized embeddings have proven to be powerful tools in various NLP tasks. However, their interpretability and how they encode lexical semantics remain challenging issues. In this paper, we tackle this problem by using definition modeling, a technique that aims to generate human-readable definitions for words, as a means to evaluate and understand high-dimensional semantic vectors. We introduce the Vec2Gloss model, which generates glosses from the contextualized embeddings of target words. The systematic gloss patterns provided by Chinese Wordnet enable us to examine the mechanism behind the model’s gloss generation. To delve deeper into this mechanism, we devise two dependency indices to measure the semantic and contextual dependencies of the generated glosses. These indices allow us to analyze the generated texts at both the gloss and token levels. Our results demonstrate that the proposed Vec2Gloss model enhances our understanding of lexical semantics in contextualized embeddings.
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021) 2021
The rapid flow of information and the abundance of text data on the Internet have created an urgent demand for monitoring resources and techniques that serve various purposes. Extracting facets of information useful for particular domains from such large and dynamically growing corpora requires unsupervised yet transparent ways of analyzing the textual data. This paper proposes a hybrid collocation analysis as a potential method to retrieve and summarize Taiwan-related topics posted on Weibo and PTT. By grouping collocates of 臺灣 ‘Taiwan’ into clusters of topics via either word embedding clustering or Latent Dirichlet Allocation, lists of collocates can be converted to probability distributions such that distances and similarities can be defined and computed. With this method, we conduct a diachronic analysis of the similarity between Weibo and PTT, providing a way to pinpoint when and how the topic similarity between the two rises or falls. A fine-grained view of the grammatical behavior and political implications is also attempted. This study thus sheds light on alternative explainable routes for future social media listening methods aimed at understanding the cross-strait relationship.
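As a rough illustration of the idea in this abstract, the sketch below turns two collocate lists into probability distributions over topic clusters and compares them. It is only a sketch under stated assumptions: the abstract does not say which distance the paper uses, so Jensen-Shannon divergence is swapped in here as one common choice, and the collocates, topic labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def topic_distribution(collocates, topic_of, topics):
    """Convert a list of collocates into a probability distribution over topics.

    `topic_of` maps each collocate to its topic cluster (assumed to come from
    a prior clustering step such as embedding clustering or LDA).
    """
    counts = Counter(topic_of[w] for w in collocates if w in topic_of)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no collocate could be assigned to a topic")
    return [counts[t] / total for t in topics]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy data: collocates of the target word on two platforms.
topics = ["politics", "travel"]
topic_of = {"election": "politics", "president": "politics",
            "travel": "travel", "food": "travel"}
weibo = topic_distribution(["election", "president", "travel"], topic_of, topics)
ptt = topic_distribution(["travel", "food", "election"], topic_of, topics)
print(js_divergence(weibo, ptt))
```

Computing this value per time slice would give the kind of diachronic similarity curve the abstract describes; lower values mean the two platforms discuss more similar topics.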
2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI) 2020
Modern conversational agents require high-quality datasets, which are often the bottleneck when building models. This paper introduces MatDC, an entirely human-produced dialogue dataset with full semantic annotations in Chinese. The dataset features linguistic variations given users' intents and fully annotated semantic slots. The MatDC dataset was completely human-edited, and the curation comprises two stages. First, in the template design stage, domain editors construct schemas and compose ten dialogues between the agents and the users based on the back-end database. Second, in the dialogue rewrite stage, rewriters generate sentential variations for each template, under the constraint that the normalized slot values are kept unchanged. The underlying methodology of MatDC is open to extension and adaptable to different domains. To demonstrate the applicability of the dataset, we build a dialogue agent with a conventional pipeline architecture. We expect the MatDC dataset to provide additional training data and a testing ground for dialogue agent studies.
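The rewrite-stage constraint (normalized slot values must stay unchanged) could be checked mechanically along the following lines. This is a minimal sketch, not MatDC's actual tooling: the inline `[slot=value]` annotation format, the example utterances, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical inline annotation format: [slot_name=normalized value]
SLOT = re.compile(r"\[(\w+)=([^\]]+)\]")


def slot_values(utterance):
    """Return the sorted (slot, value) pairs annotated in an utterance."""
    return sorted(SLOT.findall(utterance))


def is_valid_rewrite(template, rewrite):
    """A rewrite is valid if it keeps exactly the template's slot values."""
    return slot_values(template) == slot_values(rewrite)


template = "Book a table at [restaurant=Din Tai Fung] for [time=19:00]"
good = "For [time=19:00], I need a table at [restaurant=Din Tai Fung]"
bad = "Book a table at [restaurant=Din Tai Fung] for [time=20:00]"

print(is_valid_rewrite(template, good))
print(is_valid_rewrite(template, bad))
```

Comparing sorted pairs lets rewriters reorder slots freely while still rejecting any variation that alters, drops, or adds a slot value.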