Publications

Balancing accuracy and efficiency: Evaluating encoder- and decoder-based models for word sense disambiguation and regular polysemy detection

Pin-Er Chen, Da-Chen Lian, Shu-Kai Hsieh

Natural Language Processing (First View) 2026

This study investigates the nuanced challenges of fine-grained word sense disambiguation (WSD) tasks with regular polysemy detection (RPD) of the named entity, focusing on evaluating the trade-offs between encoder and decoder-based model performance and computational efficiency. The datasets, including Chinese Wordnet 2.0 (CWN) as sense inventory, the Social Media Corpus (PTT) for user-generated content, and the Academia Sinica Balanced Corpus (ASBC) for formal linguistic data, were chosen to provide a diverse and representative framework for evaluating both common nouns and proper nouns with regular polysemy in Taiwan Mandarin. This analysis evaluated ten encoder- and decoder-based models, assessing their performance on two tasks. The encoder-based models demonstrate comparable accuracy to the decoder-based models on WSD tasks (77.5% vs. 78.5%), and similarly strong performance in RPD tasks (84.2% vs. 83.8%). On a large-scale all-words WSD task, the encoder model not only outperformed the decoder model but also generated substantially lower carbon emissions – an eight-fold reduction. These differences underscore the trade-offs between model architecture and task-specific performance, highlighting the necessity for balancing performance and energy efficiency in the design and application of language models, advocating for sustainable and eco-friendly practices in natural language processing development.

CorPilot: An Agentic Framework for Corpus Linguistics

Da-Chen Lian, Mao-Chang Ku, Shu-Kai Hsieh

Intelligent Computing (Computing Conference 2026) 2026

This study introduces CorPilot, an innovative multi-agent framework that integrates Large Language Models (LLMs) to function as a form of collaborative AI, designed to automate and streamline corpus linguistic research. By structuring interactions between specialized agents within this agentic framework, CorPilot assists with tasks such as querying, annotation, semantic classification, and data analysis. Demonstrated through an empirical case study of Mandarin Chinese constructions, CorPilot replicates existing manual annotation results and reveals finer semantic distinctions and novel linguistic patterns previously unattainable through traditional methods. Our findings illustrate that CorPilot represents a significant methodological advancement, addressing challenges in scalability, replicability, and interpretability of corpus linguistics research. The framework's modular design also facilitates future extensions into various linguistic domains, holding considerable potential for theoretical and practical advancements in linguistics and computational humanities.

When Structure Matters: Cross-Lingual Hyperbolic Embeddings for Chinese and English Wordnets

Mao-Chang Ku, Da-Chen Lian, Pin-Er Chen, Po-Ya Angela Wang, Wei-Ling Chen, Shu-Kai Hsieh

Language Resources and Evaluation Conference (LREC) 2026

Hyperbolic embeddings such as the Poincaré model effectively represent lexical hierarchies with low distortion, yet their cross-lingual generalizability remains largely unexplored. This study investigates cross-lingual transfer by training 20-dimensional Poincaré embeddings exclusively on Open English WordNet (OEWN) hypernymy relations and evaluating on aligned Chinese Wordnet (CWN) synsets under a vocabulary-constrained transfer setting, where CWN-relevant synsets appear in OEWN training data but no Chinese-language supervision is used. We report robust statistical evidence based on the final 10 training checkpoints: Poincaré embeddings achieve 2.57× higher Mean Reciprocal Rank (MRR) than Euclidean embeddings on CWN (0.030 ± 0.001 vs 0.012 ± 0.000, p < 0.001, Cohen’s d = 34.48) and 5.61× higher on OEWN (0.016 ± 0.000 vs 0.003 ± 0.000, p < 0.001, d = 42.48). Furthermore, hierarchical filtering leveraging the radial dimension of hyperbolic space provides substantial additional gains: +74.6% MRR improvement on CWN and +25.8% on OEWN (both p < 0.001). The model achieves higher absolute performance on the zero-shot CWN test set (MRR = 0.052 ± 0.002) than on the in-domain OEWN test set (MRR = 0.020 ± 0.001). We attribute this to structural alignment: CWN’s broader branching factor (4.32 vs 1.10) and moderate depth naturally suit hyperbolic geometry’s capacity to compactly represent hierarchies. Our findings demonstrate that geometric properties learned from English hypernymy transfer robustly across languages when semantic structures align. We release the aligned CWN–OEWN hypernymy evaluation dataset and complete evaluation framework to facilitate future research on geometry-based cross-lingual semantic modeling.

Vigesimality on an Implicational Scale: a Case Study of the Decimal-Vigesimal Continuum in Tibeto-Burman

Tung-Le Pan

The 35th Annual Meeting of the Southeast Asian Linguistics Society (SEALS 35), Nanyang Technological University, Singapore 2026

Using perspectival words is harder than vocabulary words for humans—and even more so for multimodal language models

Dota Tianai Dong, Yifan Luo, Po-Ya Angela Wang, Asli Özyürek, Paula Rubio-Fernández

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), San Diego, California 2026

Multimodal language models (MLMs) increasingly demonstrate human-like communication, yet their use of everyday perspectival words remains poorly understood. To address this gap, we compare humans and MLMs in their use of three word types that impose increasing cognitive demands: vocabulary (for example, "boat" or "cup"), possessives (for example, "mine" versus "yours"), and demonstratives (for example, "this one" versus "that one"). Testing seven MLMs against human participants, we find that perspectival words are harder than vocabulary words for both groups. The gap is larger for MLMs: while models approach human-level performance on vocabulary, they show clear deficits with possessives and even greater difficulty with demonstratives. Ablation analyses indicate that limitations in perspective-taking and spatial reasoning are key sources of these gaps. Instruction-based prompting reduces the gap for possessives but leaves demonstratives far below human performance. These results show that, unlike vocabulary, perspectival words pose a greater challenge in human communication, and this difficulty is amplified in MLMs, revealing a shortfall in their pragmatic and social-cognitive abilities.

Multimodal Lexical Item

Po-Ya Angela Wang, Shu-Kai Hsieh

Reference Module in Social Sciences 2026

Multimodal lexical items (MLIs) are conceptual entities whose meanings emerge from the dynamic integration of multiple communicative channels (e.g., text, gesture, imagery, prosody). This chapter provides an interdisciplinary, data-driven overview of how researchers from corpus linguistics, computational linguistics, neuro-psycholinguistics, and cognitive science collectively explore MLIs. We survey frameworks related to MLI semantic representation (e.g., Multimodal Distributional Semantics and Multimodal Semantics for Affordances and Actions), multimodal context of MLIs (e.g., Multimodal Construction Grammar), and multimodal lexical resources (e.g., Frame2). By examining theoretical constructs, like embodiment and affordances, and empirical methodologies, such as corpus annotation, machine learning, and experimental research, we underscore the multifaceted nature of lexical meaning in a richly multimodal world. Ultimately, we propose future directions for a more holistic, context-sensitive approach that unites traditional linguistic structures with perceptual, embodied, and dynamic environmental cues to investigate MLIs.

Capturing Ancient Chinese Sense Induction with Automatic Pipelines

Guan-Yu Tseng, Chunki Lim, Chih-Han Lin, Tung-Le Pan, Yu-Chieh Wang, Lang-Ching Yeh, Shu-Kai Hsieh

Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) 2026

LawToken: a single token worth more than its constituents

Yu-Hsiang Tseng, Hsin-Yu Chou, Shu-Kai Hsieh

Proceedings of the 29th Conference on Computational Natural Language Learning 2025

Legal citations require correctly recalling the law references of complex law article names and article numbering, which large language models typically treat as multi-token sequences. Motivated by the form-meaning pair of constructionist approaches, we explore treating these multi-token law references as a single holistic law token and examining the implications for legal citation accuracy and differences in model interpretability. We train and compare two types of models: LawToken models, which encode the legal citations as a single law token, and LawBase models, which treat them as multi-token compounds. The results show that LawToken models outperform LawBase models on legal citation tasks, primarily due to fewer errors in the article numbering components. Further model representation analysis reveals that, while both models achieve comparable semantic representation quality, the multi-token-based LawBase suffers from degraded representations in multistep decoding, leading to more errors. Taken together, these findings suggest that form-meaning pairing can operate in a larger context, and this larger unit may offer advantages in future modeling of legal reasoning. In practice, this approach can significantly reduce the likelihood of hallucinations by anchoring legal citations as discrete, holistic tokens, thereby minimizing the risk of generating nonexistent or incorrect legal references.

Empowering Elementary Learning: Utilizing Large Language Models to Craft Tailored Textbooks with Expert Insight

Da-Chen Lian, Mao-Chang Ku, Po-Ya Angela Wang, Wei-Ling Chen, Shu-Kai Hsieh

Journal of Library and Information Studies 2025

Large language models (LLMs) have in recent years spurred research across various sectors, owing to their remarkable zero-shot or few-shot performance. This capability has become indispensable for individuals seeking to integrate these language models into their workflows effectively. In this paper, based on in-depth linguistic analyses, we explore the application of an LLM, specifically GPT-4, in generating Chinese language textbooks tailored for grade school students. This encompasses the creation of main lesson texts alongside accompanying Chinese character exercises. Experimental results suggest that the LLM-generated textbook lessons are a viable research direction. The initial outcomes demonstrate the ability of the LLM to generate texts of satisfactory quality appropriate for a specified grade level. The contributions of this work include pioneering the quantitative analysis of Chinese language textbooks for native speakers in Taiwan and leveraging an LLM to automatically generate textbook content and accompanying Chinese character exercises targeted at native Chinese speakers, which is a novel approach facilitated by the development of prompts tailored to different language learning levels. The study also conducts quantitative and qualitative comparisons between machine-generated lessons and those developed by educational professionals in Taiwan.

Continual Pre-Training is (not) What You Need in Domain Adaption

A Corpus-based Study of Causative Constructions in Paiwan

The semantic relations in LLMs: An information-theoretic compression approach

Self-supervised learning for Formosan speech representation and linguistic phylogeny

Resolving regular polysemy in named entities

Building a Semantic Search Platform for Exploring Historical Chinese Corpora

Vec2Gloss: definition modeling leveraging contextualized vectors with Wordnet gloss

Solving linguistic olympiad problems with tree-of-thought prompting

Prompt-based translation of Chinese into Taiwanese Mandarin Braille

Lexical Retrieval Hypothesis in Multimodal Context

Incorporating structural topic modeling into short text analysis

Exploring affordance and situated meaning in image captions: A multimodal analysis

Evaluating interfaced LLM bias

DeepLEX

The extreme poverty of affixation in Chinese: rarely derivational and hardly affixational

Religion, cognition, and emotion: What can automated text analysis tell us about culture?

CxLM: A construction and context-aware language model

Character Jacobian: Modeling Chinese Character Meanings with Deep Learning Model

Analyzing Discourse Functions with Acoustic Features and Phone Embeddings: Non-lexical Items in Taiwan Mandarin

Neuro-cognitive differences in semantic processing between native speakers and proficient learners of Mandarin Chinese

Keyword-centered Collocating Topic Analysis

Exploring sentiment constructions: connecting deep learning models with linguistic construction

Mitigating Impacts of Word Segmentation Errors on Collocation Extraction in Chinese

MatDC: A Multi-turn Multi-domain Annotated Task-oriented Dialogue Dataset in Chinese

Lectal variation of the two Chinese causative auxiliaries

From Sense to Action: A Word-Action Disambiguation Task in NLP

Exploring Discourse on Same-sex Marriage in Taiwan: A Case Study of Near-Synonym of HOMOSEXUAL in Opposing Stances

Do you believe it happened? Assessing Chinese readers’ veridicality judgments

Computational Representation of Chinese Characters: Comparison Between Singular Value Decomposition and Variational Autoencoder

Computational modeling of affixoid behavior in Chinese morphology

An analysis of multimodal document intent in Instagram posts

The secret to popular Chinese web novels: A corpus-driven study

Modeling the idiomaticity of Chinese Quadra-syllabic idiomatic expressions

Extracting Semantic Representations of Sexual Biases from Word Vectors

Eigencharacter: An embedding of Chinese character orthography

Augmenting Chinese WordNet semantic relations with contextualized embeddings

Sinitic Wordnet: Laying the Groundwork with Chinese Varieties Written in Traditional Characters

Multiple scaffolding mechanisms for L2 syntactic processing: An Event-Related Potential study

Linking Basic Lexicon to Shared Ontology for Endangered Languages: A Linked Data Approach toward Formosan Languages

Fluid annotation: A granularity-aware annotation tool for Chinese word fluidity

Filtered collocations as features in verbal polysemy disambiguation: A case study of the Chinese verb kao ‘bake’

Mandarin Chinese words and parts of speech: A corpus-based study

Exploring Lavender Tongue from Social Media Texts [In Chinese]

Entrenchment and creativity in Chinese quadrasyllabic idiomatic expressions

ClassifierGuesser: A Context-based Classifier Prediction System for Chinese Language Learners

A corpus-based study of the recurrent lexical bundle ka li kong ‘let (me) tell you’ in Taiwanese Southern Min conversations

Yet Another Resource to Sketch Word Behavior in Chinese Variation

Word dependency sketch for Chinese language learning

Sentiment detection in micro-blogs using unsupervised chunk extraction

Sarcasm detection in Chinese using a crowdsourced corpus

Mismatches in verb complements: A corpus-based study of the complement coercion operation in Chinese

Evaluative Pattern Extraction for Automated Text Generation

Crowdsourcing Experiment Designs for Chinese Word Sense Annotation

Linguistic linked data in Chinese: The case of Chinese Wordnet

Degree Modification in Mandarin: A Case Study of Creative Degree Modifier 各種 [Gezhong]

Chinese lexical semantics

An Arguing Lexicon for Stance Classification on Short Text Comments in Chinese

Why Chinese web-as-corpus is wacky? Or: How big data is killing Chinese corpus linguistics

Skillex: a graph-based lexical score for measuring the semantic efficiency of used verbs by human subjects describing actions

Skillex, an action labelling efficiency score: the case for French and Mandarin

Sketching the Dependency Relations of Words in Chinese

Public opinion toward CSSTA: A text mining approach

Leveraging morpho-semantics for the discovery of relations in Chinese Wordnet

Latent semantic distance between Chinese basic words and non-basic words

A multilingual lexico-semantic database and ontology

To coerce or not to coerce: A corpus-based exploration of some complement coercion verbs in Chinese

Qualia Modification in Mandarin Neologism: A Case Study on Prefix "Wéi 微"

Observing features of PTT neologisms: A corpus-driven study with N-gram model

Features of Verb Complements in Co-composition: A case study of Chinese baking verb using Weibo corpus