MultiMoCo
MultiMoCo NTU
A pioneering large-scale multimodal corpus for languages in Taiwan that integrates video, dialogue, caption, and gesture layers with human annotation and multimodal machine learning workflows.

Alumni // Former Postdoctoral Researcher
Yu-Hsiang Tseng develops multimodal corpus workflows at LOPE, focusing on annotation pipelines, gesture recognition, and integrating machine learning insights with Taiwanese language data.
Postdoctoral Researcher
University of Tübingen, Department of General and Computational Linguistics
Proceedings of the 29th Conference on Computational Natural Language Learning 2025
Legal citations require correctly recalling the law references of complex law article names and article numbering, which large language models typically treat as multi-token sequences. Motivated by the form-meaning pair of constructionist approaches, we explore treating these multi-token law references as a single holistic law token and examining the implications for legal citation accuracy and differences in model interpretability. We train and compare two types of models: LawToken models, which encode the legal citations as a single law token, and LawBase models, which treat them as multi-token compounds. The results show that LawToken models outperform LawBase models on legal citation tasks, primarily due to fewer errors in the article numbering components. Further model representation analysis reveals that, while both models achieve comparable semantic representation quality, the multi-token-based LawBase suffers from degraded representations in multistep decoding, leading to more errors. Taken together, these findings suggest that form-meaning pairing can operate in a larger context, and this larger unit may offer advantages in future modeling of legal reasoning. In practice, this approach can significantly reduce the likelihood of hallucinations by anchoring legal citations as discrete, holistic tokens, thereby minimizing the risk of generating nonexistent or incorrect legal references.
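The contrast between multi-token and single-token law references can be illustrated with a toy tokenizer. This is a minimal sketch, not the LawToken or LawBase implementation: the greedy longest-match `tokenize` function, both vocabularies, and the example citation string are hypothetical and chosen only for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown characters become single-character tokens."""
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabularies: the second adds the whole citation as one entry.
base_vocab = {"Civil", "Code", "Article", " ", "1", "8", "4"}
law_vocab = base_vocab | {"Civil Code Article 184"}

citation = "Civil Code Article 184"  # illustrative citation string
base_tokens = tokenize(citation, base_vocab)  # many pieces, digits split one by one
law_tokens = tokenize(citation, law_vocab)    # one holistic token

print(len(base_tokens), base_tokens)
print(len(law_tokens), law_tokens)
```

In the multi-token case, each piece (including every digit of the article number) is a separate decoding step, which is where the abstract locates most of the LawBase errors. In the single-token case the whole reference is produced in one step.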
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 2024
Compressibility is closely related to the predictability of the texts from the information theory viewpoint. As large language models (LLMs) are trained to maximize the conditional probabilities of upcoming words, they may capture the subtlety and nuances of the semantic constraints underlying the texts, and texts aligning with the encoded semantic constraints are more compressible than those that do not. This paper systematically tests whether and how LLMs can act as compressors of semantic pairs. Using semantic relations from English and Chinese Wordnet, we empirically demonstrate that texts with correct semantic pairings are more compressible than incorrect ones, measured by the proposed compression advantages index. We also show that, with the Pythia model suite and a fine-tuned model on Chinese Wordnet, compression capacities are modulated by the model’s seen data. These findings are consistent with the view that LLMs encode the semantic knowledge as underlying constraints learned from texts and can act as compressors of semantic information or potentially other structured knowledge.
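The link between predictability and compressibility can be sketched with a toy example: under an ideal entropy coder, an item with model probability p costs about -log2(p) bits. The sketch below is an assumption-laden illustration, not the paper's method. A small count-based conditional model stands in for an LLM, the word pairs are invented, and the paper's compression advantages index is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def train_pair_model(pairs, alpha=1.0):
    """Return P(target | word) with add-alpha smoothing over the observed targets."""
    targets = sorted({t for _, t in pairs})
    counts = defaultdict(Counter)
    for word, target in pairs:
        counts[word][target] += 1

    def prob(word, target):
        total = sum(counts[word].values())
        return (counts[word][target] + alpha) / (total + alpha * len(targets))

    return prob

def code_length_bits(prob, word, target):
    """Ideal code length in bits for the target given the word: -log2 P(target | word)."""
    return -math.log2(prob(word, target))

# Hypothetical training pairs (word, hypernym); not taken from any wordnet.
pairs = [
    ("dog", "animal"), ("cat", "animal"), ("dog", "animal"),
    ("rose", "flower"), ("tulip", "flower"), ("rose", "flower"),
]
prob = train_pair_model(pairs)

bits_correct = code_length_bits(prob, "rose", "flower")    # pairing the model has seen
bits_incorrect = code_length_bits(prob, "rose", "animal")  # pairing the model has not seen

print(f"correct pairing:   {bits_correct:.3f} bits")
print(f"incorrect pairing: {bits_incorrect:.3f} bits")
```

The pairing that matches the model's learned constraints receives a higher probability and therefore a shorter code, which is the sense in which correct semantic pairings are "more compressible".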
Frontiers in Language Sciences 2024
Formosan languages, spoken by the indigenous peoples of Taiwan, have unique roles in the reconstruction of Proto-Austronesian Languages. This paper presents a real-world Formosan language speech dataset, including 144 h of news footage for 16 Formosan languages, and uses self-supervised models to obtain and analyze their speech representations. Among the news footage, 13 h of the validated speech data of Formosan languages are selected, and a language classifier, based on XLSR-53, is trained to classify the 16 Formosan languages with an accuracy of 86%. We extracted and analyzed the speech vector representations learned from the model and compared them with 152 manually coded linguistic typological features. The comparison shows that the speech vectors reflect Formosan languages' phonological and morphological aspects. Furthermore, the speech vectors and linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping corresponds with previous literature. These results suggest that we can investigate the current real-world language usages through the speech model, and the dataset opens a window to look into the Formosan languages in vivo.
arXiv preprint arXiv:2401.09758 2024
Word sense disambiguation primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Conversely, proper names are usually considered to denote an ad-hoc real-world referent. Once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguities through appellativization, i.e., they act like common words and may denote different aspects of their referents. We propose to address the ambiguities of proper names in the light of regular polysemy, which we formalize as dot objects. This paper introduces a combined word sense disambiguation (WSD) model for disambiguating common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based model architecture, which takes advantage of the glosses and example sentences of CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource.
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation 2023
Contextualized embeddings have proven to be powerful tools in various NLP tasks. However, their interpretability and how they encode lexical semantics remain challenging issues. In this paper, we tackle this problem by using definition modeling, a technique that aims to generate human-readable definitions for words, as a means to evaluate and understand high-dimensional semantic vectors. We introduce the Vec2Gloss model, which generates glosses from the contextualized embeddings of target words. The systematic gloss patterns provided by Chinese Wordnet enable us to examine the mechanism behind the model’s gloss generation. To delve deeper into this mechanism, we devise two dependency indices to measure the semantic and contextual dependencies of the generated glosses. These indices allow us to analyze the generated texts at both the gloss and token levels. Our results demonstrate that the proposed Vec2Gloss model enhances our understanding of lexical semantics in contextualized embeddings.
Proceedings of the 4th Conference on Language, Data and Knowledge 2023
Multimodal corpora have become an essential language resource for language science and grounded natural language processing (NLP) systems due to the growing need to understand and interpret human communication across various channels. In this paper, we first present our efforts in building the first Multimodal Corpus for Languages in Taiwan (MultiMoco). Based on the corpus, we conduct a case study investigating the Lexical Retrieval Hypothesis (LRH), specifically examining whether the hand gestures co-occurring with speech constants facilitate lexical retrieval or serve other discourse functions. With detailed annotations on eight parliamentary interpellations in Taiwan Mandarin, we explore the co-occurrence between speech constants and non-verbal features (i.e., head movement, face movement, hand gesture, and function of hand gesture). Our findings suggest that while hand gestures do serve as facilitators for lexical retrieval in some cases, they also serve the purpose of information emphasis. This study highlights the potential of the MultiMoco Corpus to provide an important resource for in-depth analysis and further research in multimodal communication studies.
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation 2023
This paper explores the grounding issue regarding multimodal semantic representation from a computational cognitive-linguistic view. We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA), and examine their association with textual elements in the image captions. Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance. Perceptual Salience, Object Number, and ENA are also associated with the choice of linguistic expressions. Our study demonstrates that comprehensive understanding of objects or events requires cognitive attention, semantic nuances in language, and integration across multiple modalities. We highlight the vital importance of situated meaning and affordance grounding in natural language understanding, with the potential to advance human-like interpretation in various scenarios.
Proceedings of the thirteenth language resources and evaluation conference 2022
Constructions are direct form-meaning pairs with possible schematic slots. These slots are simultaneously constrained by the embedded construction itself and the sentential context. We propose that the constraint could be described by a conditional probability distribution. However, as this conditional probability is inevitably complex, we utilize language models to capture this distribution. Therefore, we build CxLM, a deep learning-based masked language model explicitly tuned to constructions’ schematic slots. We first compile a construction dataset consisting of over ten thousand constructions in Taiwan Mandarin. Next, an experiment is conducted on the dataset to examine to what extent a pretrained masked language model is aware of the constructions. We then fine-tune the model specifically to perform a cloze task on the open slots. We find that the fine-tuned model predicts masked slots more accurately than baselines and generates both structurally and semantically plausible word samples. Finally, we release CxLM and its dataset as publicly available resources and hope they will serve as new quantitative tools for studying construction grammar.
2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI) 2020
Modern conversational agents require high-quality datasets, which are often the bottleneck when building models. This paper introduces MatDC, an entirely human-produced dialogue dataset with full semantic annotations in Chinese. The dataset features linguistic variations given users' intents and fully annotated semantic slots. The MatDC dataset was entirely human-edited, and its curation comprises two stages. First, in the template design stage, domain editors construct schemas and compose ten dialogues between the agents and the users based on the back-end database. Second, in the dialogue rewrite stage, rewriters generate sentential variations for each template, under the constraint that the normalized slot values are kept unchanged. The underlying methodology of MatDC is readily extensible and adaptable to different domains. To demonstrate the applicability of the dataset, we build a dialogue agent with a conventional pipeline architecture. We expect the MatDC dataset to provide additional training data and a testing ground for dialogue agent studies.
Proceedings of the 33rd Pacific Asia conference on language, information and computation 2019
This paper proposes a computational model of idiomaticity for Chinese quadra-syllabic idiomatic expressions based on variation, compoundness, and compositeness measures. Two classification experiments are conducted to test the model, together with a linguistic analysis of the connection to wordnet. The results are promising, and we believe that they will shed more light on our understanding of the cognitive dynamics that underlie the processing of multiword expressions.
Proceedings of the 10th Global Wordnet Conference 2019
Constructing semantic relations in WordNet has been a labour-intensive task, especially in a dynamic and fast-changing language environment. Building on recent advances in contextualized embeddings, this paper proposes the concept of morphology-guided sense vectors, which can be used to semi-automatically augment semantic relations in Chinese Wordnet (CWN). This paper (1) builds sense vectors with pre-trained contextualized embedding models; (2) demonstrates that the computed sense vectors are consistent with the sense distinctions made in CWN; and (3) predicts potential semantically related sense pairs with high accuracy using the sense vector model.
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) 2018
This paper presents a novel word granularity-aware annotation framework for Chinese. Anchored in current functionalist linguistics, this model rearranges the boundary between word segmentation and linguistic annotation, and is geared toward a deeper understanding of lexical units and their behavior. The web-based annotation UI also supports flexible annotation tasks for various linguistic and affective phenomena.
Workshop on Chinese Lexical Semantics 2017
With the development of Teaching Chinese as an International Language and the trend toward professionalization in Chinese learning, legal Chinese is becoming increasingly important. To support legal Chinese teaching and provide Chinese learners, Chinese teachers, and legal workers with authentic data, this paper constructs a legal corpus containing 35 legal texts from Mainland China. The texts are automatically segmented into words, and all segmentation results are manually checked. Using quantitative and qualitative methods, the paper analyzes the common vocabulary and features of legal Chinese, compares the common vocabulary of legal Chinese with that of international Chinese teaching, and compares the common meanings of legal Chinese words with those of common words in the international Chinese vocabulary syllabus. The study also draws on the classification of Chinese word levels in The Syllabus of Chinese Vocabulary and Characters Levels [18] to classify the words in the legal corpus, and explores the application of this corpus in international Chinese teaching. It finds many differences between legal Chinese and general Chinese in both common vocabulary and common word meanings. Legal vocabulary therefore has its own particularities in teaching, and existing vocabulary teaching methods cannot be applied directly to legal Chinese vocabulary. The paper puts forward several solutions to this problem.