MultiMoCo
MultiMoCo NTU
A pioneering large-scale multimodal corpus for languages in Taiwan that integrates video, dialogue, caption, and gesture layers with human annotation and multimodal machine learning workflows.

Ph.D. Student // Lab IT Manager
Da-Chen Lian is a Ph.D. student at the Graduate Institute of Linguistics at National Taiwan University, advised by Prof. Shu-Kai Hsieh. His work sits at the intersection of computational linguistics, large language models, and data-intensive approaches to language analysis. At LOPE he has contributed to MultiMoCo, a multimodal corpus of languages in Taiwan, and has led LLM pretraining work, including a Taiwan-law LLM trained on NVIDIA DGX H100 nodes, while also examining how tokenization, multilingual pretraining, and interpretability shape what language models actually learn about linguistic structure. He has served as LOPE's lab system administrator since 2017.
Journal of Library and Information Studies 2025
Large language models (LLMs) have in recent years spurred research across various sectors, owing to their remarkable zero-shot and few-shot performance. This capability has become indispensable for practitioners seeking to integrate language models into their workflows effectively. In this paper, based on in-depth linguistic analyses, we explore the application of an LLM, specifically GPT-4, to generating Chinese language textbooks tailored for grade school students, encompassing the creation of main lesson texts alongside accompanying Chinese character exercises. Experimental results suggest that LLM-generated textbook lessons are a viable research direction: the initial outcomes demonstrate the LLM's ability to generate texts of satisfactory quality, appropriate for a specified grade level. The contributions of this work include pioneering the quantitative analysis of Chinese language textbooks for native speakers in Taiwan and leveraging an LLM to automatically generate textbook content and accompanying Chinese character exercises for native Chinese speakers, a novel approach enabled by prompts tailored to different language learning levels. The study also conducts quantitative and qualitative comparisons between machine-generated lessons and those developed by educational professionals in Taiwan.
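The level-tailored prompting described above can be sketched as a small prompt builder. The level descriptions, function name, and topic below are invented for illustration; the paper's actual prompts are not reproduced here.

```python
# Hypothetical grade-level style guides; not the paper's actual prompt text.
LEVEL_GUIDES = {
    1: "use only high-frequency characters and short, simple sentences",
    3: "introduce compound sentences and a few common idioms",
    6: "allow longer paragraphs and more abstract vocabulary",
}

def build_lesson_prompt(grade: int, topic: str) -> str:
    """Compose a prompt asking an LLM for a lesson text plus character exercises."""
    guide = LEVEL_GUIDES[grade]
    return (
        f"Write a Chinese textbook lesson for grade {grade} students "
        f"on the topic of {topic}. Style constraint: {guide}. "
        "Then list five new characters from the lesson, each with a "
        "stroke-order exercise and an example sentence."
    )

prompt = build_lesson_prompt(3, "night markets in Taiwan")
```

The same template could then be sent to any chat-completion API; only the grade key changes between learner levels.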
arXiv preprint arXiv:2504.13603 2025
The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.
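Mechanically, DACP amounts to continuing a model's pre-training objective on in-domain text. A toy stand-in, replacing the LLM with a Laplace-smoothed unigram model (all corpora and names here are invented for illustration), shows the intended effect: perplexity on held-out legal text drops after the domain-adaptive update.

```python
import math
from collections import Counter

def perplexity(counts, total, vocab_size, eval_tokens):
    # Laplace-smoothed unigram perplexity: exp(mean negative log-likelihood).
    nll = 0.0
    for tok in eval_tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        nll += -math.log(p)
    return math.exp(nll / len(eval_tokens))

general = "the cat sat on the mat while the dog ran home".split()
legal = "the court held that the statute applies to the contract".split()
vocab = set(general) | set(legal)

# Base model: fit on general text only.
base = Counter(general)
# DACP stand-in: continue fitting the *same* model on domain text.
adapted = base + Counter(legal)

eval_legal = "the court held that the contract applies".split()
ppl_base = perplexity(base, sum(base.values()), len(vocab), eval_legal)
ppl_adapted = perplexity(adapted, sum(adapted.values()), len(vocab), eval_legal)
print(ppl_adapted < ppl_base)  # domain adaptation lowers legal-text perplexity
```

The paper's finding is that this gain in domain likelihood does not automatically translate into better performance on every downstream legal task.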
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge)@ LREC-COLING-2024 2024
Compressibility is closely related to the predictability of texts from the information-theoretic viewpoint. As large language models (LLMs) are trained to maximize the conditional probabilities of upcoming words, they may capture the subtlety and nuance of the semantic constraints underlying texts, and texts aligning with the encoded semantic constraints are more compressible than those that do not. This paper systematically tests whether and how LLMs can act as compressors of semantic pairs. Using semantic relations from English and Chinese Wordnet, we empirically demonstrate that texts with correct semantic pairings are more compressible than incorrect ones, as measured by the proposed compression advantage index. We also show, with the Pythia model suite and a model fine-tuned on Chinese Wordnet, that compression capacity is modulated by the data the model has seen. These findings are consistent with the view that LLMs encode semantic knowledge as underlying constraints learned from texts and can act as compressors of semantic information, or potentially of other structured knowledge.
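The compression-advantage idea rests on the Shannon link between probability and code length: under an ideal (arithmetic) coder, a string costs about -log2 p bits, so a model that assigns higher probability to correct pairings compresses them into fewer bits. A minimal sketch, substituting an add-one-smoothed bigram model for the LLM (the corpus and the `compression_advantage` name are invented here, not the paper's index), illustrates the mechanism:

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for LLM training data: correct hypernym pairs.
corpus = ["dog animal", "cat animal", "car vehicle", "bus vehicle"]

# Fit an add-one-smoothed bigram model (playing the role of the LLM's
# conditional next-word probabilities).
bigrams = defaultdict(Counter)
vocab = set()
for pair in corpus:
    words = pair.split()
    vocab.update(words)
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def bits(text):
    """Idealized compressed length: -log2 p(text), the Shannon code length
    an arithmetic coder would approach."""
    words = text.split()
    total = 0.0
    for prev, nxt in zip(words, words[1:]):
        p = (bigrams[prev][nxt] + 1) / (sum(bigrams[prev].values()) + len(vocab))
        total += -math.log2(p)
    return total

def compression_advantage(correct, incorrect):
    """Positive when the correct semantic pairing needs fewer bits."""
    return bits(incorrect) - bits(correct)

print(compression_advantage("dog animal", "dog vehicle"))  # 1.0 bit
```

The correctly paired "dog animal" costs one bit less than "dog vehicle" under this toy model; the paper performs the analogous comparison with actual LLM probabilities over Wordnet relation pairs.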
Frontiers in Language Sciences 2024
Formosan languages, spoken by the indigenous peoples of Taiwan, play a unique role in the reconstruction of Proto-Austronesian. This paper presents a real-world Formosan language speech dataset, comprising 144 h of news footage covering 16 Formosan languages, and uses self-supervised models to obtain and analyze their speech representations. From the news footage, 13 h of validated Formosan-language speech are selected, and a language classifier based on XLSR-53 is trained to classify the 16 Formosan languages with an accuracy of 86%. We extract and analyze the speech vector representations learned by the model and compare them with 152 manually coded linguistic typological features. The comparison shows that the speech vectors reflect the phonological and morphological properties of Formosan languages. Furthermore, the speech vectors and linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping accords with previous literature. These results suggest that current real-world language usage can be investigated through the speech model, and the dataset opens a window onto the Formosan languages in vivo.
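A common way to turn frame-level self-supervised features into the utterance-level "speech vectors" compared above is mean pooling followed by cosine similarity. The sketch below uses synthetic vectors in place of real XLSR-53 features (the language names, dimensions, and noise model are all invented), just to show the pooling-and-comparison step:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clip(center, frames=50, dim=16, noise=0.2):
    # Synthetic frame-level features: a shared per-language direction plus noise,
    # standing in for XLSR-53 outputs of shape (num_frames, feature_dim).
    return center + noise * rng.standard_normal((frames, dim))

def speech_vector(clip):
    # Mean-pool frame features into a single utterance-level vector.
    return clip.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two invented "languages", each with its own latent direction.
centers = {"lang_a": rng.standard_normal(16), "lang_b": rng.standard_normal(16)}
clips = {name: [make_clip(c) for _ in range(2)] for name, c in centers.items()}

same = cosine(speech_vector(clips["lang_a"][0]), speech_vector(clips["lang_a"][1]))
diff = cosine(speech_vector(clips["lang_a"][0]), speech_vector(clips["lang_b"][0]))
print(same > diff)  # same-language utterances sit closer in vector space
```

A distance matrix built from such pairwise similarities is the kind of input a hierarchical clustering step could turn into the phylogeny described in the abstract.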
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023) 2023
In this research, we comprehensively analyze the potential biases inherent in large language models (LLMs), using meticulously curated input data to ascertain the extent to which such data sway machine-generated responses toward prejudiced outcomes. Notwithstanding recent strides in mitigating bias in LLM-based NLP, our findings underscore the continued susceptibility of these models to data-driven bias. We use the PTT NTU board as the primary data source for this investigation. Moreover, our study shows that, in certain contexts, machines may manifest biases without supplementary prompts, yet they can be guided toward rendering impartial responses when provided with richer contextual nuance.
2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI) 2020
Modern conversational agents require high-quality datasets, which are often the bottleneck when building models. This paper introduces MatDC, an entirely human-produced Chinese dialogue dataset with full semantic annotations. The dataset features linguistic variations of users' intents and fully annotated semantic slots. Its curation comprises two stages. In the first, template design stage, domain editors construct schemas and compose ten dialogues between agents and users based on the back-end database. In the second, dialogue rewrite stage, rewriters generate sentential variations for each template under the constraint that the normalized slot values remain unchanged. This methodology makes MatDC easy to extend and adaptable to different domains. To demonstrate the dataset's applicability, we build a dialogue agent with a conventional pipeline architecture. We expect MatDC to provide additional training data and a testing ground for dialogue-agent studies.
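The rewrite-stage constraint, that sentential variations may change wording freely but must keep normalized slot values unchanged, can be enforced with a simple validation check. The record layout and function names below are invented for illustration, not MatDC's actual annotation schema:

```python
def slot_values(utterance):
    """Extract the normalized slot name -> value mapping from an annotation."""
    return {s["name"]: s["value"] for s in utterance["slots"]}

def rewrite_is_valid(template, rewrite):
    """A rewrite may vary the surface text but must preserve every slot value."""
    return slot_values(template) == slot_values(rewrite)

template = {
    "text": "Book a table for two at seven",
    "slots": [{"name": "party_size", "value": "2"},
              {"name": "time", "value": "19:00"}],
}
rewrite = {
    "text": "I'd like a 7 pm reservation for two people",
    "slots": [{"name": "time", "value": "19:00"},
              {"name": "party_size", "value": "2"}],
}
print(rewrite_is_valid(template, rewrite))  # True: wording differs, slots match
```

Running such a check over every rewritten variation is one way a curation pipeline could guarantee that the normalized annotations stay consistent across all surface forms of a template.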
// FRONTIER_RESEARCH