When Structure Matters: Cross-Lingual Hyperbolic Embeddings for Chinese and English Wordnets
Language Resources and Evaluation Conference (LREC) 2026 2026
Academic Output
Browse peer-reviewed papers, books, and conference contributions connected to LOPE research.
Language Resources and Evaluation Conference (LREC) 2026 2026
The 35th Annual Meeting of the Southeast Asian Linguistics Society (SEALS 35), Nanyang Technological University, Singapore 2026
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), San Diego, California 2026
International Encyclopedia of Language and Linguistics, 3rd Edition 2026
14th Computing Conference 2026 2026
Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) 2026
Proceedings of the 29th Conference on Computational Natural Language Learning 2025
Legal citations require correctly recalling the law references of complex law article names and article numbering, which large language models typically treat as multi-token sequences. Motivated by the form-meaning pair of constructionist approaches, we explore treating these multi-token law references as a single holistic law token and examining the implications for legal citation accuracy and differences in model interpretability. We train and compare two types of models: LawToken models, which encode the legal citations as a single law token, and LawBase models, which treat them as multi-token compounds. The results show that LawToken models outperform LawBase models on legal citation tasks, primarily due to fewer errors in the article numbering components. Further model representation analysis reveals that, while both models achieve comparable semantic representation quality, the multi-token-based LawBase suffers from degraded representations in multistep decoding, leading to more errors. Taken together, these findings suggest that form-meaning pairing can operate in a larger context, and this larger unit may offer advantages in future modeling of legal reasoning. In practice, this approach can significantly reduce the likelihood of hallucinations by anchoring legal citations as discrete, holistic tokens, thereby minimizing the risk of generating nonexistent or incorrect legal references.
Journal of Library and Information Studies 2025
Large language models (LLMs) have in recent years spurred research across various sectors, owing to their remarkable zero-shot or few-shot performance. This capability has become indispensable for individuals seeking to integrate these language models into their workflows effectively. In this paper, based on in-depth linguistic analyses, we explore the application of an LLM, specifically GPT-4, in generating Chinese language textbooks tailored for grade school students. This encompasses the creation of main lesson texts alongside accompanying Chinese character exercises. Experimental results suggest that the LLM-generated textbook lessons are a viable research direction. The initial outcomes demonstrate the ability of LLM to generate texts of satisfactory quality appropriate for a specified grade level. The contributions of this work include pioneering the quantitative analysis of Chinese language textbooks for native speakers in Taiwan and leveraging an LLM to automatically generate textbook content and accompanying Chinese character exercises targeted at native Chinese speakers, which is a novel approach facilitated by the development of prompts tailored to different language learning levels. The study also conducts quantitative and qualitative comparisons between machine-generated lessons and those developed by educational professionals in Taiwan.
arXiv preprint arXiv:2504.13603 2025
The recent advances in Legal Large Language Models (LLMs) have transformed the landscape of legal research and practice by automating tasks, enhancing research precision, and supporting complex decision-making processes. However, effectively adapting LLMs to the legal domain remains challenging due to the complexity of legal reasoning, the need for precise interpretation of specialized language, and the potential for hallucinations. This paper examines the efficacy of Domain-Adaptive Continual Pre-Training (DACP) in improving the legal reasoning capabilities of LLMs. Through a series of experiments on legal reasoning tasks within the Taiwanese legal framework, we demonstrate that while DACP enhances domain-specific knowledge, it does not uniformly improve performance across all legal tasks. We discuss the trade-offs involved in DACP, particularly its impact on model generalization and performance in prompt-based tasks, and propose directions for future research to optimize domain adaptation strategies in legal AI.
外國語文研究 2025
This study examines the variation in causative affixes (pa-∅-, pa-ka-, and pa-pe-) in Paiwan using a corpus-based approach. Building on Tang's (295) identification of the ∅-, ka-, and pe- affixes in Paiwan causative constructions, we apply logistic regression analysis to data extracted from corpora. Our research suggests a cognitive distinction among the three causative subtypes (Verhagen & Kemmer). The regression model results support the theory of a direct/indirect causation dichotomy, offering a plausible explanation for the characteristics and lexical meanings of the affixes. Specifically, the affix pa-∅- is associated with "direct causation," typically used in events involving inanimate participants where the cause directly results in the state of the causee. Conversely, the affix pa-ka- is linked to "indirect causation," often found in contexts with animate participants and additional contributing forces. The affix pa-pe- occupies an intermediary position, showing a preference for intransitive effected predicates. Additionally, this study conducts a cross-linguistic comparison of the Paiwan causative affixes with the causative verbs doen and laten in Dutch, and shi and rang in Mandarin. These findings enhance our understanding of Paiwan causative constructions and offer insights into the universality and specificity of causative structures in linguistic typology.
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024 2024
Compressibility is closely related to the predictability of the texts from the information theory viewpoint. As large language models (LLMs) are trained to maximize the conditional probabilities of upcoming words, they may capture the subtlety and nuances of the semantic constraints underlying the texts, and texts aligning with the encoded semantic constraints are more compressible than those that do not. This paper systematically tests whether and how LLMs can act as compressors of semantic pairs. Using semantic relations from English and Chinese Wordnet, we empirically demonstrate that texts with correct semantic pairings are more compressible than incorrect ones, measured by the proposed compression advantages index. We also show that, with the Pythia model suite and a fine-tuned model on Chinese Wordnet, compression capacities are modulated by the model’s seen data. These findings are consistent with the view that LLMs encode the semantic knowledge as underlying constraints learned from texts and can act as compressors of semantic information or potentially other structured knowledge.
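The link between predictability and compressibility described in this abstract can be illustrated with a minimal sketch. It assumes the ideal (Shannon) code length, in which a token with conditional probability p costs -log2(p) bits. The probability values and the simple difference used as an "advantage" are hypothetical illustrations, not the paper's compression advantages index or its data.

```python
import math


def code_length_bits(token_probs):
    """Ideal (Shannon) code length in bits for a token sequence,
    given the conditional probability a model assigns to each token."""
    return -sum(math.log2(p) for p in token_probs)


# Hypothetical per-token probabilities (not taken from the paper):
# the last token completes a correct vs. an incorrect semantic pairing.
correct_pair = [0.5, 0.25, 0.5]
incorrect_pair = [0.5, 0.25, 0.125]

bits_correct = code_length_bits(correct_pair)
bits_incorrect = code_length_bits(incorrect_pair)

# Illustrative "advantage": bits saved when the pairing is correct.
# This simple difference is an assumption, not the index proposed in the paper.
advantage = bits_incorrect - bits_correct

print(bits_correct, bits_incorrect, advantage)
```

Under these assumed probabilities, the text with the more predictable final token needs fewer bits, which is the sense in which better-predicted texts are more compressible.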
Frontiers in Language Sciences 2024
Formosan languages, spoken by the indigenous peoples of Taiwan, have unique roles in the reconstruction of Proto-Austronesian Languages. This paper presents a real-world Formosan language speech dataset, including 144 h of news footage for 16 Formosan languages, and uses self-supervised models to obtain and analyze their speech representations. Among the news footage, 13 h of the validated speech data of Formosan languages are selected, and a language classifier, based on XLSR-53, is trained to classify the 16 Formosan languages with an accuracy of 86%. We extracted and analyzed the speech vector representations learned from the model and compared them with 152 manually coded linguistic typological features. The comparison shows that the speech vectors reflect Formosan languages' phonological and morphological aspects. Furthermore, the speech vectors and linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping corresponds with previous literature. These results suggest that we can investigate the current real-world language usages through the speech model, and the dataset opens a window to look into the Formosan languages in vivo.
arXiv preprint arXiv:2401.09758 2024
Word sense disambiguation primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Conversely, proper names are usually considered to denote an ad-hoc real-world referent. Once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguities through appellativization, i.e., they act like common words and may denote different aspects of their referents. We propose to address the ambiguities of proper names in the light of regular polysemy, which we formalize as dot objects. This paper introduces a combined word sense disambiguation (WSD) model for disambiguating common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based model architecture, which takes advantage of the glosses and example sentences of CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource.
じんもんこん 2024 論文集 2024
This work introduces a historical corpus of the Chinese language spanning approximately 3,000 years and proposes a new corpus search system utilizing word embedding techniques and large language models (LLMs). The system adopts a hybrid search method that combines traditional keyword search with vector-based search based on semantic relationships. This approach enables searches for semantically similar words and visualizations of semantic change, which were challenging with conventional corpus search methods. Additionally, based on the collected corpus data, we implemented a feature to visualize changes in word meanings across specific periods and media types. This interface allows for a multifaceted analysis of language evolution, demonstrating a more effective analytical approach than traditional methods.
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation 2023
Contextualized embeddings have proven to be powerful tools in various NLP tasks. However, their interpretability and how they encode lexical semantics remain challenging issues. In this paper, we tackle this problem by using definition modeling, a technique that aims to generate human-readable definitions for words, as a means to evaluate and understand high-dimensional semantic vectors. We introduce the Vec2Gloss model, which generates glosses from the contextualized embeddings of target words. The systematic gloss patterns provided by Chinese Wordnet enable us to examine the mechanism behind the model’s gloss generation. To delve deeper into this mechanism, we devise two dependency indices to measure the semantic and contextual dependencies of the generated glosses. These indices allow us to analyze the generated texts at both the gloss and token levels. Our results demonstrate that the proposed Vec2Gloss model enhances our understanding of lexical semantics in contextualized embeddings.
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023) 2023
In this study, we delve into the efficacy of the Tree-of-Thought Prompting technique as a mechanism to address linguistic challenges and augment the reasoning capabilities of expansive language models. Specifically, we scrutinize the reasoning prowess of the Generative Pre-trained Transformer (GPT) model, which has garnered significant attention within the research and practitioner community. Utilizing the Tree-of-Thought Prompting methodology, we assess its utility in enhancing both the precision and response latency of the GPT model, especially for Linguistic Olympiad tasks demanding elevated reasoning competencies. Concurrently, we delineate inherent limitations within this approach and proffer avenues for future research to refine and optimize it. Code repo: https://github.com/chrizeroxtwo/ToT-LinguisticProblem
2023 International Conference on Asian Language Processing (IALP) 2023
In automated Braille translation, accommodating linguistic nuances and the rules peculiar to Braille across various languages poses considerable challenges. Mandarin Chinese stands out in this respect because the appropriate pronunciation of a character must be determined from context. Although rule-based algorithms have historically dominated this space, recent empirical evidence highlights the efficacy of statistical approaches and the emergent exploration of Large Language Model (LLM)-based techniques. This paper explores the potential advantages of leveraging a prompt-based strategy for the automated translation from Mandarin Chinese to Taiwanese Mandarin Braille. As a methodology, we devised a script capable of ingesting a Chinese sentence and subsequently generating a prompt that comprises the Zhuyin of unequivocal characters and dictionary definitions for those with multiple readings. Utilizing a set of 103 test sentences, we assessed the precision with which GPT-3.5, GPT-4, and Liblouis (a widely recognized open-source rule-based Braille translator) ascribed readings to polyphonic characters. Our findings revealed that, notwithstanding certain inconsistencies in the GPT-3.5 outputs, the extended GPT-4 model exhibited superior performance compared to Liblouis.
Proceedings of the 4th Conference on Language, Data and Knowledge 2023
Multimodal corpora have become an essential language resource for language science and grounded natural language processing (NLP) systems due to the growing need to understand and interpret human communication across various channels. In this paper, we first present our efforts in building the first Multimodal Corpus for Languages in Taiwan (MultiMoco). Based on the corpus, we conduct a case study investigating the Lexical Retrieval Hypothesis (LRH), specifically examining whether the hand gestures co-occurring with speech constants facilitate lexical retrieval or serve other discourse functions. With detailed annotations on eight parliamentary interpellations in Taiwan Mandarin, we explore the co-occurrence between speech constants and non-verbal features (i.e., head movement, face movement, hand gesture, and function of hand gesture). Our findings suggest that while hand gestures do serve as facilitators for lexical retrieval in some cases, they also serve the purpose of information emphasis. This study highlights the potential of the MultiMoco Corpus to provide an important resource for in-depth analysis and further research in multimodal communication studies.
Concentric 2023
The past few decades have seen the rapid development of topic modeling. So far, research has been more concerned with determining the ideal number of topics or meaningful topic clustering words than with applying topic modeling techniques to evaluate linguistic theories. This study proposes the Structural Topic Model (STM)-led framework to facilitate the interpretation of topic modeling results and standardize text analysis. STM encompasses various model training mechanisms, thereby requiring systematic designs to properly combine language studies. “Structural” in STM refers to the inclusion of metadata structure. Unlike the corpus-based keyness approach, STM can capture contextual cues and meta-information for the interpretation of topical results. Besides, STM can make cross-corpora comparisons via topical contrast, a challenging task for corpus-driven related models such as the Biterm Topic Model (BTM). Stylistic variations in song lyrics are taken as an illustration to show how to use the suggested framework to delve into the linguistic theory proposed by Pennebaker (2013). The topical model and iterable model in the proposed paradigm can clarify how pronouns affect style distinction. We believe the proposed STM-led framework can shed light on text analysis by conducting a reproducible cross-corpora comparison on short texts.
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation 2023
This paper explores the grounding issue regarding multimodal semantic representation from a computational cognitive-linguistic view. We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA), and examine their association with textual elements in the image captions. Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance. Perceptual Salience, Object Number, and ENA are also associated with the choice of linguistic expressions. Our study demonstrates that comprehensive understanding of objects or events requires cognitive attention, semantic nuances in language, and integration across multiple modalities. We highlight the vital importance of situated meaning and affordance grounding in natural language understanding, with the potential to advance human-like interpretation in various scenarios.
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023) 2023
In this research, we comprehensively analyze the potential biases inherent in Large Language Models (LLMs), utilizing meticulously curated input data to ascertain the extent to which such data sway machine-generated responses to yield prejudiced outcomes. Notwithstanding recent strides in mitigating bias in LLM-based NLP, our findings underscore the continued susceptibility of these models to data-driven bias. We use the PTT NTU board as our primary data source for this investigation. Moreover, our study shows that, in certain contexts, machines may manifest biases without supplementary prompts. However, they can be guided toward rendering impartial responses when provided with richer contextual information.
Chinese Language Resources: Data Collection, Linguistic Analysis, Annotation and Language Processing 2023
This chapter will introduce a dynamically integrated lexical resource called DeepLEX. With its modularized architecture, DeepLEX aims to be a fine-grained yet scaled multilingual lexical resource that empowers linguists to pursue a wide array of previously unanswerable research questions. Our approach expands on previous efforts and calls for an open collaboration in which lexical knowledge is semantically founded, symbolically operationalized, and empirically gleaned.
The Cambridge Handbook of Chinese Linguistics 2022
This chapter explores the morphological poverty of Chinese from an empirical perspective. The nature of affixation in Chinese is still not well understood and remains one of the hotly debated topics in Chinese morphology. Based on the CKIP Morphological Database (incl. 4025 “affixes” in Chinese), this chapter addresses the issue of the lack of affixation in Chinese, drawing on a range of linguistic facts and empirical arguments such as lack of productivity and irregularities in word-formation rules.
The Humanistic Psychologist 2022
As cultural conflicts are intensifying locally and internationally in the aftermath of the COVID-19 pandemic, fine-tuned investigation of culture/religion, especially that of the marginalized populations, holds the potential to reduce disparity and suffering in the global village. This study used three textual analysis programs (Topic Modeling, C-LIWC, and SSWC-Chinese) to shed light on the differences in cognition and emotion between two communities with radically different religious beliefs (Bimo and Christianity) among the Yi ethnic minority in Southwest China. Findings from these programs replicated the manual coding results of the previous study, and confirmed the prediction that cultural differences in cognition and emotion between the Yi-Bimo and the Yi-Christian fall along the divide between strong-ties and weak-ties rationality (Sundararajan, 2020a). Demonstrating an advantage over manual coding, this machine-assisted analysis lends convergent validity to the previous study, and presents a more nuanced picture of diversity in emotion and cognition among the Chinese, with practical implications for future research and intervention for the marginalized populations.
Proceedings of the Thirteenth Language Resources and Evaluation Conference 2022
Constructions are direct form-meaning pairs with possible schematic slots. These slots are simultaneously constrained by the embedded construction itself and the sentential context. We propose that the constraint could be described by a conditional probability distribution. However, as this conditional probability is inevitably complex, we utilize language models to capture this distribution. Therefore, we build CxLM, a deep learning-based masked language model explicitly tuned to constructions’ schematic slots. We first compile a construction dataset consisting of over ten thousand constructions in Taiwan Mandarin. Next, an experiment is conducted on the dataset to examine to what extent a pretrained masked language model is aware of the constructions. We then fine-tune the model specifically to perform a cloze task on the open slots. We find that the fine-tuned model predicts masked slots more accurately than baselines and generates both structurally and semantically plausible word samples. Finally, we release CxLM and its dataset as publicly available resources and hope they will serve as new quantitative tools in studying construction grammar.
Proceedings of the 29th International Conference on Computational Linguistics 2022
Compounding, a prevalent word-formation process, presents an interesting challenge for computational models. Indeed, the relations between compounds and their constituents are often complicated. It is particularly so in Chinese morphology, where each character is almost simultaneously bound and free when treated as a morpheme. To model this word-formation process, we propose the Notch (NOnlinear Transformation of CHaracter embeddings) model and the character Jacobians. The Notch model first learns the non-linear relations between the constituents and words, and the character Jacobians further describe the character’s role in each word. In a series of experiments, we show that the Notch model not only predicts the embeddings of real words from their constituents but also helps account for the behavioral data of pseudowords. Moreover, we demonstrate that character Jacobians reflect the characters’ meanings. Taken together, the Notch model and character Jacobians may provide a new perspective on studying the word-formation process and morphology with modern deep learning.
International Journal of Computational Linguistics & Chinese Language Processing, Volume 27, Number 2, December 2022 2022
Non-lexical items are expressive devices used in conversations that are not words but are nevertheless meaningful. These items play crucial roles, such as signaling turn-taking or marking stances in interactions. However, as the non-lexical items do not stably correspond to written or phonological forms, past studies tend to focus on studying their acoustic properties, such as pitches and durations. In this paper, we investigate the discourse functions of non-lexical items through their acoustic properties and the phone embeddings extracted from a deep learning model. Firstly, we create a non-lexical item dataset based on the interpellation video clips from Taiwan’s Legislative Yuan. Then, we manually identify the non-lexical items and their discourse functions in the videos. Next, we analyze the acoustic properties of those items through statistical modeling and build classifiers based on phone embeddings extracted from a phone recognition model. We show that (1) the discourse functions have significant effects on the acoustic features; and (2) the classifiers built on phone embeddings perform better than the ones on conventional acoustic properties. These results suggest that phone embeddings may reflect the phonetic variations crucial in differentiating the discourse functions of non-lexical items.
Frontiers in Psychology 2021
The present study aimed to investigate the neural mechanism underlying semantic processing in Mandarin Chinese adult learners, focusing on the learners who were Indo-European language speakers with advanced levels of proficiency in Mandarin Chinese. We used functional magnetic resonance imaging and a semantic judgment task to test 24 Mandarin Chinese adult learners (L2 group) and 26 Mandarin Chinese adult native speakers (L1 group) as a control group. In the task, participants were asked to indicate whether two-character pairs were related in meaning. Compared to the L1 group, the L2 group had greater activation in the bilateral occipital regions, including the fusiform gyrus and middle occipital gyrus, as well as the right superior parietal lobule. On the other hand, less activation in the bilateral temporal regions was found in the L2 group relative to the L1 group. Correlation analysis further revealed that, within the L2 group, increased activation in the left middle temporal gyrus/superior temporal gyrus (M/STG, BA 21) was correlated with higher accuracy in the semantic judgment task as well as better scores in the two vocabulary tests, the Assessment of Chinese character list for grade 3 to grade 9 (A39) and the Peabody Picture Vocabulary Test-Revised. In addition, functional connectivity analysis showed that connectivity strength between the left fusiform gyrus and left ventral inferior frontal gyrus (IFG, BA 47) was modulated by the accuracy in the semantic judgment task in the L1 group. By contrast, this modulation effect was weaker in the L2 group. Taken together, our study suggests that Mandarin Chinese adult learners rely on greater recruitment of the bilateral occipital regions to process orthographic information to access the meaning of Chinese characters. Also, our correlation results provide convergent evidence that the left M/STG (BA 21) plays a crucial role in the storage of semantic knowledge for readers to access conceptual information. Moreover, the connectivity results indicate that the left ventral pathway (left fusiform gyrus-left ventral IFG) is associated with orthographic-semantic processing in Mandarin Chinese. However, this semantic-related ventral pathway might require more time and language experience to develop, especially for late adult learners of Mandarin Chinese.
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021) 2021
The rapid flow of information and the abundance of text data on the Internet have brought about an urgent demand for monitoring resources and techniques that serve various purposes. Extracting facets of information useful for particular domains from such large and dynamically growing corpora requires an unsupervised yet transparent way of analyzing the textual data. This paper proposes a hybrid collocation analysis as a potential method to retrieve and summarize Taiwan-related topics posted on Weibo and PTT. By grouping collocates of 臺灣 ‘Taiwan’ into clusters of topics via either word embedding clustering or Latent Dirichlet allocation, lists of collocates can be converted to probability distributions such that distances and similarities can be defined and computed. With this method, we conduct a diachronic analysis of the similarity between Weibo and PTT, providing a way to pinpoint when and how the topic similarity between the two rises or falls. A fine-grained view of the grammatical behavior and political implications is attempted, too. This study thus sheds light on alternative explainable routes for future social media listening methods aimed at understanding the cross-strait relationship.
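The step of turning collocate lists into probability distributions and comparing them can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not name the distance measure, so Jensen-Shannon divergence is used here as one common choice, and the topic labels and counts are hypothetical.

```python
import math
from collections import Counter


def to_distribution(topic_labels, topics):
    """Turn the topic labels of a collocate list into a probability
    distribution over a fixed, ordered set of topics."""
    counts = Counter(topic_labels)
    total = sum(counts[t] for t in topics)
    return [counts[t] / total for t in topics]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic assignments for collocates found on two platforms.
topics = ["politics", "travel", "food"]
platform_a = to_distribution(["politics", "politics", "travel", "food"], topics)
platform_b = to_distribution(["travel", "food", "food", "food"], topics)

print(platform_a, platform_b, js_divergence(platform_a, platform_b))
```

Computing such a divergence per time slice would give one way to track when the two distributions move closer together or further apart.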
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation 2021
This paper presents a linguistically motivated novel framework that automatically identifies sentiment constructions in a corpus with only sentiment-annotated sentences. Construction, a crucial concept developed in Construction Grammar, is a form-meaning pair that relates a pattern with a specific communicative function. However, handcrafting constructions is laborious and often leads to sparse coverage in practice. We address the problem with a construction induction framework which includes three components: a deep-learning-based predictive model to capture the sentiment aspects of the text, a dynamic word parser that agglomerates tokens into (multi-)word units, and a score assignment mechanism to weigh those units based on their contributions to predictions. Units that score highly in the last step are the candidate sentiment constructions. They are automatically post-processed with their linguistic contexts to create the final constructions. We experiment with the proposed framework on a sentiment-annotated corpus of online consumer reviews from Taiwan telecom. The proposed framework correctly assigned higher importance to handcrafted constructions. Furthermore, new constructions identified by the framework are validated by annotators’ rating data.
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020) 2020
The prevalence of the web has brought about the construction of many large-scale, automatically segmented and tagged corpora, which inevitably contain errors introduced by automation that are likely to have negative impacts on downstream tasks. Collocation extraction from Chinese corpora is one such task that is profoundly influenced by the quality of word segmentation. This paper explores methods to mitigate the negative impacts of word segmentation errors on collocation extraction in Chinese. In particular, we experimented with a simple model that aims to combine several association measures linearly to avoid retrieving false collocations resulting from word segmentation errors. The results of the experiment show that this simple model could not differentiate between true collocations and false collocations resulting from word segmentation errors. An ad hoc case study incorporating information from FastText word vectors is also conducted. The results show that collocates resulting from correct and erroneous word segmentation have different profiles in terms of the semantic similarities between the collocates. The incorporation of word vector information to differentiate between true and false collocations is suggested for future work.
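A linear combination of association measures, of the kind this abstract reports as insufficient on its own, can be sketched as follows. The abstract does not say which measures or weights were used, so pointwise mutual information and the t-score are assumed here as two standard examples, and the weights and frequency counts are hypothetical.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (bits) from the pair frequency f_xy,
    the word frequencies f_x and f_y, and the corpus size n."""
    return math.log2((f_xy * n) / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected co-occurrence, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def combined_score(f_xy, f_x, f_y, n, weights=(0.5, 0.5)):
    """Weighted linear combination of the two measures.
    The default weights are placeholders, not fitted values."""
    w_pmi, w_t = weights
    return w_pmi * pmi(f_xy, f_x, f_y, n) + w_t * t_score(f_xy, f_x, f_y, n)


# Hypothetical counts for one candidate word pair.
print(combined_score(f_xy=4, f_x=16, f_y=16, n=1024))
```

In practice the weights would be estimated from labeled true and false collocations; the abstract's finding is that such a linear model did not separate the two classes.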
2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI) 2020
The modern conversational agent requires high-quality datasets, which are often the bottleneck when building models. This paper introduces MatDC, an entirely human-produced dialogue dataset with full semantic annotations in Chinese. The dataset features linguistic variations given users' intents and fully annotated semantic slots. The MatDC dataset was completely human-edited, and the curation comprises two stages. First, in the template design stage, domain editors construct schemas and compose ten dialogues between the agents and the users based on the back-end database. Second, in the dialogue rewrite stage, rewriters generate sentential variations for each template, under the constraint that the normalized slot values are kept unchanged. The underlying methodology of MatDC is open to extension and adaptable to different domains. To demonstrate the applicability of the dataset, we build a dialogue agent with a conventional pipeline architecture. We expect the MatDC dataset to provide additional training data and a testing ground for dialogue agent studies.
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020) 2020
This paper aims to investigate the variation between two Chinese causative auxiliaries, shi ‘使’ and rang ‘讓’, from a corpus-based perspective. We conduct a logistic regression analysis on Chinese data extracted from two corpora and propose a direct/indirect distinction (Verhagen and Kemmer 1997) between the two auxiliary verbs. The results of the regression model show that the theory of direct/indirect causation provides a reasonable account of the characteristics and lexical meanings of the verbs. We show that the verb shi is correlated with “direct causation” because it is typically used when inanimate participants are involved in the causing event, in which the force initiated by the causer inevitably and directly leads to the resulting state of the causee. On the other hand, the verb rang should be classified as “indirect causation” because it is typically used in scenarios where both participants are animate, and some extra force besides the causer also plays a role in the effected event.
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation 2020
Words are conventionalized symbols that present the function by which meaning is attached to form. Word Sense Disambiguation, which has been taken as one of the core semantic processing tasks in the pipelined NLP architecture, aims to assign the proper word sense to a lemma form in varied contexts based on a word-sense inventory such as WordNet. However, some of its theoretical assumptions are unattested from a functional linguistic point of view. This paper proposes an alternative by introducing a novel task called the word action disambiguation (WAD) task, which concentrates on the observable pairs between words and actions. The accompanying dataset, which was manually edited and compiled, is composed of 419 multiple-choice questions. We further verified the dataset through item evaluation with human rating data, and the semantic relations within the dataset were annotated automatically. A baseline accuracy of 38.64% was obtained with BERT models, and 43.18% after incorporating paradigmatic knowledge with a semantic graph. We expect that the WAD task and dataset will motivate computational models to incorporate more complex aspects of human language.
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation 2020
This research explores the intense conflict over the legalization of same-sex marriage in Taiwan by studying how the near-synonyms denoting homosexual, 同志 tóngzhì and 同性戀 tóngxìngliàn, are used by the opposing stances. Two research questions related to lexical semantics are addressed: what the semantic difference between these two lexical items is, and how the two are characteristically used by each stance. Collocational analysis with self-compiled corpora is the primary method in this study. For the first question, it is found that the meanings of the pair can be differentiated along an internal-external axis. With regard to the second, it is found that the opponents of same-sex marriage are inclined to externalize the innateness of homosexuality, whereas the supporters tend to reduce the distinctiveness of homosexuality and call for the universality of human rights through the strategies of juxtaposing and compounding.
Proceedings of the Twelfth Language Resources and Evaluation Conference 2020
This work collects and studies Chinese readers’ veridicality judgments of news events (whether an event is viewed as happening or not). For instance, in “The FBI alleged in court documents that Zazi had admitted having a handwritten recipe for explosives on his computer”, do people believe that Zazi had a handwritten recipe for explosives? The goal is to observe the pragmatic behavior of linguistic features in context that affect readers in making veridicality judgments. From the datasets, it is found that features such as event-selecting predicates (ESP), modality markers, adverbs, temporal information, and statistics have an impact on readers’ veridicality judgments. We further found that modality markers with high certainty do not necessarily lead readers to have high confidence that an event happened. Additionally, the source of information introduced by an ESP has little effect on veridicality judgments, even when an event is attributed to an authority (e.g. “The FBI”). A corpus annotated with Chinese readers’ veridicality judgments is released as the Chinese PragBank for further analysis.
Journal of Cognitive Science 2020
Being a notoriously complex problem, writing is generally decomposed into a series of subtasks: idea generation, expression, revision, etc. Given some goal, the author generates a set of ideas (brainstorming), which he integrates into some skeleton (outline, text plan). This leads to a first draft, which is then submitted for revision, possibly yielding changes at various levels (content, structure, form). Having made a draft, authors usually revise, edit, and proofread their documents. We confine ourselves here only to academic writing, focusing on sentence production. While there has been quite some work on this topic, most writing assistance has mainly dealt with grammatical errors, editing and proofreading, the goal being the correction of surface-level problems such as typography, spelling, or grammatical errors. We broaden the scope by also including cases where the entire sentence needs to be rewritten in order to express properly all of the information planned. Hence, Sentence-level Revision (SentRev) becomes part of our writing assistance task. Obviously, systems performing well in this task can be of considerable help for inexperienced authors by producing fluent, well-formed sentences based on the user’s drafts. In order to evaluate our SentRev model, we have built a new, freely available crowdsourced evaluation dataset which consists of a set of incomplete sentences produced by nonnative writers paired with final version sentences extracted from published academic papers. We also used this dataset to establish baseline performance on SentRev.
Proceedings of the 28th International Conference on Computational Linguistics 2020
The morphological status of affixes in Chinese has long been a matter of debate. How one might apply the conventional criteria of free/bound and content/function features to distinguish word-forming affixes from bound roots in Chinese is still far from clear. Issues involving polysemy and diachronic dynamics further blur the boundaries. In this paper, we propose three quantitative features in a computational model of affixoid behavior in Mandarin Chinese. The results show that, except in a very few cases, there are no clear criteria that can be used to identify an affix’s status in an isolating language like Chinese. A diachronic check using contextualized embeddings with the WordNet Sense Inventory also demonstrates the possible role of the polysemy of lexical roots across diachronic settings.
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020) 2020
At present, most messages on social media (e.g., Instagram) combine visual and textual content, reflecting a modern way of communication. Multimodal messages are essential in almost any type of social interaction, especially in the context of social multimedia content online. Hence, effective computational approaches for understanding documents with multiple modalities are needed to identify the relationship between them. This study extends recent advances in author intent classification by putting forward an approach using Image-caption Pairs (ICPs). Several machine learning algorithms, such as the Decision Tree Classifier (DTC) and Random Forest (RF), and encoders such as Sentence-BERT and picture embeddings are used to classify the relationships between multiple modalities, namely 1) contextual relationship, 2) semiotic relationship, and 3) author intent. This study points to two possible results. First, although prior studies hold that incorporating the two synergistic modalities in a combined model will improve accuracy in the relationship classification task, this study found that a simple fusion strategy that linearly projects encoded vectors from both modalities into the same embedding space may not strongly improve on the performance of a single modality. The results suggest that text and image need more effort to complement each other. Second, we show that these text-image relationships can be classified with high accuracy (86.23%) by using only the text modality.
In sum, this study may be essential in demonstrating a computational approach to access multimodal documents as well as providing a better understanding of classifying the relationships between modalities.
2nd Conference on Language, Data and Knowledge (LDK 2019) 2019
What is the secret to writing popular novels? The issue is an intriguing one among researchers from various fields. The goal of this study is to identify the linguistic features of several popular web novels as well as how the textual features found within and the overall tone interact with the genre and themes of each novel. Apart from writing style, non-textual information may also reveal details behind the success of web novels. Since web fiction has become a major industry with top writers making millions of dollars and their stories adapted into published books, determining essential elements of “publishable” novels is of importance. The present study further examines how non-textual information, namely, the number of hits, shares, favorites, and comments, may contribute to several features of the most popular published and unpublished web novels. Findings reveal that keywords, function words, and lexical diversity of a novel are highly related to its genres and writing style while dialogue proportion shows the narration voice of the story. In addition, relatively shorter sentences are found in these novels. The data also reveal that the number of favorites and comments serve as significant predictors for the number of shares and hits of unpublished web novels, respectively; however, the number of hits and shares of published web novels is more unpredictable. 2012 ACM Subject Classification: General and reference → Empirical studies; General and reference. Keywords and phrases: Popular Chinese Web Novels, NLP techniques, Sentiment Analysis, Publication of Web novels. Digital Object Identifier: 10.4230/OASIcs.LDK.2019.24. Category: Short Paper.
Proceedings of the 33rd Pacific Asia conference on language, information and computation 2019
This paper proposes a computational model of idiomaticity for Chinese quadrasyllabic idiomatic expressions based on measures of variation, compoundness, and compositeness. Two classification experiments are conducted to test the model, together with a linguistic analysis of the connection to wordnet. The results are promising, and we believe that they will shed more light on our understanding of the cognitive dynamics that underlie the processing of multiword expressions.
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019) 2019
Sexually biased cyberhate speech has become a fast-growing problem on PTT (a representative online forum in Taiwan). Applications of computational linguistics such as word embeddings would also carry similar biases. This paper analyzed the distribution of word representations of netizens from mu zhu jiao (a cult that often produces misogynistic cyberhate speech). Word vector representations (word2vec) were utilized for scrutinizing semantic representations of texts found on PTT. The findings from the distributed semantic representation of mu zhu jiao implied a sexual bias against them. This paper serves as the first study to investigate the distribution of word representations of abusive language on the PTT forum with an NLP method, taking advantage of both quantitative and qualitative methods.
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN) 2019
Chinese characters are unique in their logographic nature, which inherently encodes world knowledge through thousands of years of evolution. This paper proposes an embedding approach, namely eigencharacter (EC) space, which helps NLP applications easily access the knowledge encoded in Chinese orthography. These EC representations are automatically extracted, encode both structural and radical information, and easily integrate with other computational models. We built EC representations of 5,000 Chinese characters, investigated orthography knowledge encoded in ECs, and demonstrated how these ECs identified visually similar characters with both structural and radical information.
Proceedings of the 10th Global Wordnet Conference 2019
Constructing semantic relations in WordNet has been a labour-intensive task, especially in a dynamic and fast-changing language environment. Combined with recent advancements in contextualized embeddings, this paper proposes the concept of morphology-guided sense vectors, which can be used to semi-automatically augment semantic relations in Chinese Wordnet (CWN). This paper (1) built sense vectors with pre-trained contextualized embedding models; (2) demonstrated that the sense vectors computed were consistent with the sense distinctions made in CWN; and (3) predicted potential semantically related sense pairs with high accuracy using the sense vector model.
Proceedings of the 9th Global Wordnet Conference 2018
The present work seeks to make the logographic nature of Chinese script a relevant research ground in wordnet studies. While wordnets are not so much about words as about the concepts represented in words, synset formation inevitably involves the use of orthographic and/or phonetic representations to serve as headword for a given concept. For wordnets of Chinese languages, if their synsets are mapped with each other, the connection from logographic forms to lexicalized concepts can be explored backwards to, for instance, help trace the development of cognates in different varieties of Chinese. The Sinitic Wordnet project is an attempt to construct such an integrated wordnet that aggregates three Chinese varieties that are widely spoken in Taiwan and all written in traditional Chinese characters.
華語文教學研究 2018
Native-like cognitive-neural mechanisms for syntactic processing have been shown to be less available for L2 learners. To compensate, learners may rely on lexical-semantic processing or the non-dominant hemisphere. To investigate these scaffolding effects, this study combined divided visual-field (VF) and Event-Related Potential (ERP) techniques to assess L2 learners’ brain responses across the left and right hemispheres (LH and RH). Participants judged the grammaticality of Chinese two-word phrases starting with a classifier. Our data showed that, compared to native speakers, L2 learners were less accurate in grammaticality judgments and elicited qualitatively different brain responses even to correct trials. Replicating our previous findings on left-lateralized structural processing in native speakers, native participants in this present study showed a P600 grammaticality effect with RVF/LH presentation only. L2 learners showed remarkable inter-subject variability in brain responses, and as a consequence, showed no statistically reliable ERP grammaticality effects. However, correlational analysis on individual learners' brain responses and behavioral language performance revealed important correlations.
Journal of Chinese Linguistics 2018
This paper proposes an innovative approach to link basic lexicon (e.g. Swadesh list) to upper ontology as the foundation of OntoLex interface to address the challenge of building language resources for endangered languages in the linked data paradigm. A linked data approach to language resources requires existing, and preferably sizable, language resources. For endangered and other less-resourced languages, however, the scarcity of existing resources limits the possibilities and potential benefits of linking. The challenges are, then, how the construction of language resources for endangered languages can continue to thrive in the linked data paradigm, and how the linked data approach can benefit language resources for endangered languages. Our proposal requires the bare minimum of available data and we show with examples from Formosan languages (Austronesian or aboriginal languages of Taiwan; Blust 2013, 20) that 1) this approach is applicable to endangered languages, and that 2) in spite of the restrictions imposed by scarcity of resources, the linked linguistic data consisting of basic lexicon + upper ontology generate important new information. Comparing Swadesh lists from different languages allowed us to build a small shared ontology that reflects direct human experience, and can serve as the cross-lingual conceptual core. In addition, these micro-ontologized lexicons can be used as seeds for developing a fully-grown and more comprehensive documentation of linguistically motivated ontology for each language.
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) 2018
This paper presents a novel word granularity-aware annotation framework for Chinese. Anchored in current functionalist linguistics, this model rearranges the boundary of word segmentation and linguistic annotation, and gears toward a deeper understanding of lexical units and their behavior. The web-based annotation UI also supports flexible annotation tasks for various linguistic and affective phenomena.
Language and Linguistics 2018
In Generative Lexicon Theory (GLT) (Pustejovsky 1995), co-composition is one of the generative devices proposed to explain the cases of verbal polysemous behavior where more than one function application is allowed. The English baking verbs were used as examples to illustrate how their arguments co-specify the verb with qualia unification. Some studies (Blutner 2002; Carston 2002; Falkum 2007) stated that the information of pragmatics and world knowledge need to be considered as well. Therefore, this study would like to examine whether GLT could be practiced in a real-world Natural Language Processing (NLP) application using collocations. We have conducted a fine-grained logical polysemy disambiguation task, taking the open-sourced Leiden Weibo Corpus as resource and computing with Support Vector Machine (SVM) classifier. Within the classifier, we have taken collocated verbs under GLT as main features. In addition, measure words and syntactic patterns are extracted as additional features for comparison. Our study investigates the logical polysemy of the Chinese verb kao ‘bake’. We find that GLT could help in identifying logically polysemous cases; additional features would help the classifier achieve a higher performance.
Routledge 2017
This monograph is a translation of two seminal works on corpus-based studies of Mandarin Chinese words and parts of speech. The original books were published as two pioneering technical reports by the Chinese Knowledge and Information Processing group (CKIP) at Academia Sinica in 1993 and 1996, respectively. Since then, the standard and PoS tagset proposed in the CKIP reports have become the de facto standard in Chinese corpora and computational linguistics, in particular in the context of traditional Chinese texts. This new translation represents and develops the principles and theories originating from these pioneering works. The results can be applied to numerous fields: Chinese syntax and semantics, lexicography, machine translation and other language engineering applications. Suitable for graduate students and scholars in the fields of linguistics and Chinese, Mandarin Chinese Words and Parts of Speech provides a comprehensive survey of the issues around wordhood and PoS.
Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017) 2017
On the issue of gender and Natural Language Processing (NLP), most papers focus on gender-norm language that is spoken by biological males and females with opposite-sex desires. From the point of view of sexual orientation, however, this study presents the first work on the task of Chinese homosexual identification. First, we collect homosexual texts from social media; second, we examine the linguistic behavior found in gay and lesbian texts. In addition, we provide sets of linguistic features to automatically predict homosexual language, using Support Vector Machine (SVM) and Naive Bayes (NB) models with 5-fold cross-validation. The training procedure in the study resulted in a promising F-score of around 70% with the use of a particular lexicon-based feature set.
Workshop on Chinese Lexical Semantics 2017
With the development of Teaching Chinese as an International Language and the professionalization trend of Chinese learning, legal Chinese becomes more and more important. To support legal Chinese teaching and provide Chinese learners, Chinese teachers and other legal workers with authentic data, this paper constructs a legal corpus, which contains 35 legal texts of Mainland China. This study automatically segments the texts into words and manually checks all the segmentation results. Besides, using quantitative and qualitative analysis methods, this paper analyzes the common vocabulary of legal Chinese, analyzes the features of legal Chinese, compares the differences between the common vocabulary of legal Chinese and that of international Chinese teaching, and compares the differences of the common meaning between legal Chinese words and common words in the international Chinese vocabulary syllabus. This study also makes reference to the classification of Chinese word levels in The Syllabus of Chinese Vocabulary and Characters Levels [18] to classify the words in the legal corpus and explores the application of this corpus in international Chinese teaching. This study finds that there are many differences between legal Chinese and general Chinese, in terms of the common vocabulary and the common meaning of words. Legal vocabulary thus poses particular demands in teaching, and existing vocabulary teaching methods cannot be applied directly to the teaching of legal Chinese vocabulary. Therefore, this paper puts forward several solutions to this problem.
Proceedings of the IJCNLP 2017, System Demonstrations 2017
Classifiers are function words that are used to express quantities in Chinese and are especially difficult for language learners. In contrast to previous studies, we argue that the choice of classifiers is highly contextual and train context-aware machine learning models based on a novel publicly available dataset, outperforming previous baselines. We further present use cases for our database and models in an interactive demo system.
Chinese Language and Discourse 2017
This paper investigates the most frequent lexical bundle (LB) ka li kong (to-you-say) (KLK), in an 18.5-hour Taiwanese Southern Min conversation corpus. The analysis focuses on the discourse-pragmatic functions of KLK, the role it plays in the speaker's management of information in talk-in-interaction, and the collocations that are employed. The results show that the speaker utilizes KLK to imply epistemic authority regarding the veracity of the predication. Meanwhile, it expresses the speaker's stance or functions as a discourse organizer to initiate a narrative that is newsworthy. Prosodically, it is always processed as a holistic chunk with great phonological reduction. Along with the low transitivity of the verb kong demonstrated by the type of object it takes, we argue that KLK is developing into a discourse marker. Collocation of KLK with the marker toh further triggers the grammaticalization of the four-word bundle toh ka li kong (TKLK) to encode an extreme stance.
Workshop on Chinese Lexical Semantics 2016
Most corpus-based lexical studies require considerable effort in manually annotating grammatical relations in order to find the collocations of the target word in corpus data. In this paper, we claim that current techniques of natural language processing can facilitate lexical research by automating the annotation of these relations. In addition, word sense disambiguation can provide the sense distribution for the word of interest. We exploit these techniques and report an online open resource for the comparison of lexical behaviors and sense distribution in cross-strait Chinese variations. The proposed resource is evaluated by juxtaposing the results with previous lexical research based on the same corpus data. The results show that our resource could provide more comprehensive and fine-grained grammatical collocation candidates in the case study.
Concentric: Studies in Linguistics 2016
This article describes an approach to constructing a language resource through automatically sketching grammatical relations of words in an untagged corpus based on dependency parses. Compared to the handcrafted, rule-based Word Sketch Engine (Kilgarriff et al. 2004), this approach provides more details about the different syntagmatic usages of each word such as various types of modification a given word can undergo and other grammatical functions it can fulfill. As a way to properly evaluate the approach, we attempt to evaluate the auto-generated result in terms of the distributional thesaurus function, and compare this with items in an existing thesaurus. Our results have been tailored for the purpose of Chinese learning and, to the best of our knowledge, the resulting resource is the first of its kind in Chinese. We believe it will have a great impact on both Chinese corpus linguistics and Teaching Chinese as a Second Language (TCSL).
Lingua Sinica 2016
In this paper, we present a system designed for sentiment detection for micro-blog data in Chinese. Our system, perhaps surprisingly, benefits from the lack of word boundaries in the Chinese writing system and shifts the focus directly to larger and more relevant chunks. We use an unsupervised Chinese word segmentation system and a binomial test to extract specific and endogenous lexicon chunks from the training corpus. We combine the lexicon chunks with other external resources to train a maximum entropy model for document classification. With this method, we obtained an averaged F1 score of 87.2, which outperforms the state-of-the-art approach based on the released data in the second SocialNLP shared task.
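One plausible reading of using a binomial test to pick class-specific chunks is sketched below: a chunk is kept when it appears in one sentiment class more often than a base rate would predict. The exact one-sided test, the function names, the significance threshold, and the toy counts are assumptions for illustration only; the abstract does not describe the paper's actual selection procedure.

```python
from math import comb


def binomial_tail(k, n, p):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def select_chunks(chunk_counts, base_rate, alpha=0.05):
    """Keep chunks that occur in one class more often than base_rate predicts.

    chunk_counts maps a chunk to (occurrences in the class, total occurrences).
    """
    return [
        chunk
        for chunk, (k, n) in chunk_counts.items()
        if binomial_tail(k, n, base_rate) < alpha
    ]


# Toy counts, invented for illustration only.
counts = {"chunk_a": (9, 10), "chunk_b": (5, 10)}
selected = select_chunks(counts, base_rate=0.5)
```

Here `chunk_a` passes the test because nine of its ten occurrences fall in the class, while `chunk_b` is split evenly and is dropped.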
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016) 2016
Based on the assumption that a comment with positive sentiment polarity on a negative issue has a high probability of being sarcastic, we propose a simple yet efficient method to collect sarcastic textual data by crowdsourcing with social media, merged with a game-with-a-purpose approach. Taking advantage of Facebook's reaction button, posts triggering strong negative emotion are collected. Next, by using PTT's search engine, we connect PTT's comments to the collected Facebook posts and build the sarcasm corpus. Based on the corpus data, the performance of sarcasm detection is compared between an SVM with naïve features and Convolutional Neural Network models. An impressive accuracy rate and the great potential of the corpus are demonstrated.
Corpus Linguistics and Linguistic Theory 2016
While much attention has been paid to the complement coercion operation in English (e.g., began a book), the same phenomenon in Chinese is still under-researched. Our study examines twenty coercing verbs in Chinese, creating a coercion profile for each verb and conducting a cluster analysis based on the coercion profiles. The results suggest that semantically related verbs in Chinese tend to have similar coercion profiles. We also identify a diverse range of nouns that can be coerced in Chinese. Finally, it is demonstrated that generative approaches to the complement coercion operation in Chinese can be complemented by cognitive-functional approaches.
Proceedings of the 9th International Natural Language Generation Conference 2016
Getting travel tips from experienced bloggers and online forums has become an important supplement to the travel guidebook in the web era. In this paper we present a novel approach that identifies and extracts evaluative patterns, providing a different, linguistically motivated framework for automated evaluative text generation. We target domain-specific observation in online travel blogs in Chinese. Results suggest that the semantic prosody accompanying the patterns demonstrates that online travel bloggers prefer to employ a tacit pragmatic strategy in presenting their sentiment polarity in comments. The extracted patterns and their differentiation can be beneficial to identifying and characterizing evaluative language for further automated opinion summarization and macro/micro planning in natural language generation (NLG) as well.
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016) 2016
This paper presents our exploratory efforts in tackling the “high accuracy-low quantity” problem of the human word sense annotation task in Chinese, with the ultimate goal of automatic word sense annotation. Our proposed annotation architecture consists of explicit and implicit aspects of the crowdsourcing approach. The explicit method focuses on the general issues of crowdsourcing and makes adjustments to the current MTurk framework. The implicit method concentrates on the idea of Game with a Purpose (GWAP) design, which originates from the well-known video game Super Mario.
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications 2015
The present study describes recent developments of Chinese Wordnet, which has been reformatted using the lemon model and published as part of the Linguistic Linked Open Data Cloud. While lemon suffices for modeling most of the structures in Chinese Wordnet at the lexical level, the model does not allow for finer-grained distinction of a word sense, or meaning facets, a linguistic feature also attended to in Chinese Wordnet. As for the representation of synsets, we use the WordNet RDF ontology for integration’s sake. Also, we use another ontology proposed by the Global WordNet Association to show how Chinese Wordnet as Linked Data can be integrated into the Global WordNet Grid.
Workshop on Chinese Lexical Semantics 2015
This paper aims to investigate degree modification in Mandarin through the case of the creative degree modifier 各種 [gezhong] (all kinds of; very). We provide a theoretical analysis following the Generative Lexicon Theory and show that 各種 [gezhong] not only selects gradable adjectival predicates but also restricts the possible combinations as well as interpretation by means of qualia structure. The restrictions of the modification in turn reflect the pragmatic function of 各種 [gezhong], and mark the distinction between traditional degree modifiers and the creative one.
The Oxford handbook of Chinese linguistics 2015
Lexical semantics was deemed peripheral in formal linguistics’ early pursuit of a rule-based account of language since lexicon is viewed as the repository of idiosyncrasies and meaning is considered fuzzy and difficult to delineate. The papers collected in Levin and Pinker’s (1992) Lexical and Conceptual Semantics, however, reestablished the central place of lexical semantics in linguistics based on the following two observations: lexicon is the repository of all linguistic information as well as the shared interface to all linguistic modules, and conceptual knowledge is represented in language through lexical conventionalization. The study of Chinese lexical semantics shares these two generalizations with a caveat: that the Chinese orthographical system encodes certain conventions of conceptualization. Hence, in this chapter, we pay special attention to the conceptual representation of Chinese characters and its interaction with lexical semantics, in addition to the universal topics of polysemy, semantic relations, and verbal semantics.
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters 2015
With the development of social media and online forums, users have grown accustomed to expressing their agreement and disagreement via short texts. Elements that reveal the user’s stance or subjectivity thus become an important resource in identifying the user’s position on a given topic. In the current study, we observe comments on an online bulletin board in Taiwan to see how people express their stance when responding to other people’s posts in Chinese. A lexicon is built based on linguistic analysis and annotation of the data. We performed a binary classification task using these linguistic features and were able to reach an average accuracy of 71 percent. A linguistic analysis of the confusion arising in the classification task is provided as a basis for future work on improving accuracy.
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) 2014
This paper aims to examine and evaluate the current development of the Web-as-Corpus (WaC) paradigm in Chinese corpus linguistics. I will argue that the unstable notion of wordhood in Chinese, and the resulting diversity of approaches to implementing word segmentation systems, have posed great challenges for those who are keen on building web-scale corpus data. Two lexical measures are proposed to illustrate the issues, and methodological discussions are provided.
Traitement Automatique des Langues 2014
ABSTRACT. Dictionaries are sociocultural objects that can be used as underlying structures for modeling in cognitive science. We first show that lexical networks built from dictionaries, despite surface-level disagreement in their links, share a common topological structure. Assuming that this deep structure reflects the semantic organization of the lexicon shared by the members of a linguistic community, we propose a model based on the exploration of this specific structure to analyze and compare the semantic efficiency of [Children/Adults] productions in an action labeling task. We define a generic score of semantic efficiency, SKILLEX. Assigned to the participants of the APPROX protocol, this score allows us to classify them accurately into the children and adults categories.
Proceedings of the Annual Meeting of the Cognitive Science Society 2014
We propose a model to compute two measurements of semantic efficiency of verbs as action labels. It is based on the exploration of the specific structure of synonymy networks of verbs. We use these measurements to analyse and compare the semantic efficiency of [Children/Adults] productions in action labelling tasks, in French and Mandarin. The combination of these two measurements leads to a generic score of semantic efficiency, Skillex. Assigned to participants of the Approx protocol experiment, this score enables us to accurately classify them into Children and Adults categories, be they French or Mandarin native speakers.
Proceedings of the 26th Conference on Computational Linguistics and Speech Processing (ROCLING 2014) 2014
We propose a language resource that automatically sketches the grammatical relations of words based on dependency parses of untagged texts. Compared to Sketch Engine (Kilgarriff, Rychly, Smrz, & Tugwell, 2004), the advantage of a word sketch based on parsed corpora is that it provides more details about the different usages of each word, such as various types of modification, which is also important in language pedagogy. Although language resources for other languages have attempted to sketch words based on parsed data, in Chinese we have not seen a resource for dependency sketches of words in customized texts. Therefore, we propose such a resource and evaluate it against the Chinese Sketch Engine (Huang et al., 2005) in terms of the corresponding thesaurus function.
International Journal of Computational Linguistics & Chinese Language Processing, Volume 19, Number 4, December 2014-Special Issue on Selected Papers from ROCLING XXVI 2014
Extracting policy positions from social media texts has become an important technique, since it can reveal the public's instant responses to political news and can also help predict electoral behavior. The recent, highly debated Cross-Strait Service Trade Agreement (CSSTA) has generated large amounts of text, giving us an opportunity to test people's stance with text mining methods. We use the keywords of each position to perform binary classification of the texts and compute a score of how positive or negative the attitudes toward CSSTA are. We further conduct a trend analysis to show how the support rate fluctuates with events. This approach saves the human labor of traditional content analysis and increases the objectivity of the judgement standard.
Proceedings of the Seventh Global Wordnet Conference 2014
Semantic relations of different types have played an important role in wordnets, and have been widely recognized in various fields. In recent years, with the growing interest in constructing semantic networks in support of intelligent systems, automatic semantic relation discovery has become an urgent task. This paper aims to extract semantic relations by relying on the in situ morpho-semantic structure of Chinese, which dispenses with outside sources such as corpus or web data. Manual evaluation of thousands of word pairs shows that most relations can be successfully predicted. We believe that the approach can serve as a valuable starting point for complementing other approaches, and that it holds promise for robust acquisition of lexical relations.
Workshop on Chinese Lexical Semantics 2014
What determines the “basicness” of words still remains a challenging question in creating basic lexicons and basic wordlists. Since frequency and dispersion seem to be the most dominant criteria, the question arises whether contextual factors also help to define the concept of “basicness.” From the perspective of the distributional model, meanings are represented through the interaction between words and their contexts. Hence, this research examines an existing wordlist and tentatively takes it as the standard of “basicness,” seeking the differences between “basic words” and “non-basic words” based on their occurrences in different texts. Two experiments were conducted to answer the research questions. The first calculated the “latent semantic distances” between basic words and non-basic words. The second calculated and examined the “near neighbors” of basic words and non-basic words. We found that basic words tend to occur in more similar texts than non-basic words do; in addition, the near neighbors of basic words tend to be more “basic”, too. This research contributes a more “contextual” perspective on exploring “basicness.”
Towards the Multilingual Semantic Web: Principles, Methods and Applications 2014
We discuss the development of a multilingual lexicon linked to the Suggested Upper Merged Ontology (SUMO) formal ontology. The ontology as well as the lexicon have been expressed in Web Ontology Language (OWL), as well as their original formats, for use on the semantic web and in linked data. We describe the Open Multilingual Wordnet (OMW), a multilingual wordnet with 22 languages and a rich structure of semantic relations. It is made by exploiting links from various monolingual wordnets to the English Wordnet. Currently, it contains 118,337 concepts expressed in 1,643,260 senses in 22 languages. It is available as simple tab-separated files, Wordnet-Lexical Markup Framework (LMF) or lemon, and has been used by many projects including BabelNet and Google Translate. We discuss some issues in extending the wordnets and improving the multilingual representation to cover concepts not lexicalized in English, and how concepts are stated in the formal ontology.
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013) 2013
This study takes a corpus-based approach to examine twenty Chinese verbs that have been found to coerce their NP complements into an event type (cf. Lin et al. 2009), with an aim of creating a coercion profile for each verb. A cluster analysis is further conducted on the coercion profiles. The resulting clusters in our analysis show a bi-directional distribution: the verbs in Cluster 1 are found to coerce their complements more frequently, while the verbs in Cluster 2 are found to coerce more noun types. Moreover, many lexical pairs (e.g., antonyms and near-synonyms) are identified in the two clusters. Our quantitative analysis suggests that semantically related verbs can have similar coercion profiles. The empirical findings of the present study complement intuition-based studies on the complement coercion operation in Chinese (e.g., Lin and Liu 2004, Liu 2003) and shed new light on the theoretical framework of the Generative Lexicon.
Workshop on Chinese Lexical Semantics 2013
Language change is a ubiquitous and inevitable phenomenon in daily usage, represented by both novel interpretations and usages of old words, as well as through the development of entirely new words called neologisms. This study aims to give a theoretical account of the prefix 微 [wéi] in Mandarin, which has recently extended its meanings by combining with modified nouns in varied contexts, for instance, 微電影 [wéi-diànyĭng] (short film), 微環島 [wéi-huándǎo] (riding a bicycle to travel the northern coastline of Taiwan), 微開車 [wéi-kāichē] (riding a motorcycle) and so on. To analyze this phenomenon through the lens of lexical semantics, we follow the Generative Lexicon Theory to explore the selective binding of wéi + noun modification in terms of qualia structure. The result shows that wéi has a high preference for selection of the FORMAL role but excludes TELIC. Possible explanations are given for the underlying reasons for psychological preferences in perceiving FORMAL components of objects, such as shape and color, rather than the specification of function or purpose.
Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013) 2013
PTT (批踢踢) is one of the largest web forums in Taiwan. In the last few years, its importance has been growing rapidly because it has been widely mentioned by most of the mainstream media. Its influence is reflected not only in society but also in novel language use in Taiwan. In this research, a pipeline processing system in Python was developed to collect data from PTT, and an n-gram model with a proposed linguistic filter was adopted in an attempt to capture two-character neologisms that have emerged on PTT. The system's performance was evaluated in a task with 25 subjects, using Fleiss’ kappa. A linguistic discussion is provided, along with a comparison with time series analysis of frequency data. It is hoped that the detection of neologisms on PTT can be improved by observing these features, which may even facilitate the prediction of neologisms in the future.
Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013) 2013
In the Generative Lexicon Theory (GLT), co-composition is one of the generative devices proposed to explain cases of verbal polysemous behavior where more than one function application is allowed. The English baking verbs were used as one of the examples to illustrate how their complements co-specify the verb with qualia unification. In this paper, we begin by exploring the polysemy of the Chinese baking verb, where the first two senses in Chinese Wordnet (CWN) are assumed. Features including linguistic cues and common sense knowledge are used in an experiment with a Weibo corpus and computed with an SVM for closer investigation. The analysis shows that, although there are various cases among the senses of change of state and creation, a coarse but systematic approach combined with certain features could be arranged for disambiguating CWN senses. In addition, we observe that the usage of various instrument cases and classifiers can be harnessed, through underlying background knowledge, to help select an appropriate sense based on the context.
2013
The present study proposes an innovative way of expanding the lexical repository of Chinese Wordnet (CWN). Fine-grained as the senses and sense facets of its entries are, the current status of CWN fails to include such high-frequency words as àiqíng (‘love’) and such high-familiarity words as guānxīn (‘to care’) due to the lack of linguistic manpower. In view of this limited inclusion of words in CWN, we propose to extend its lexical knowledge by constructing a wiki for CWN, or CWIKIN, a collaborative platform on which registered users can contribute to CWN by adding new entries, editing existing ones and rating one another’s contributions to ensure the quality of collective intelligence. What distinguishes CWIKIN from a typical wiki is that it presents synonymous sets of Chinese words that are currently only implicitly represented in CWN, and helps users add those words as well as various lexical semantic relations by suggesting potential equivalents along with their parts-of-speech, definitions and example sentences bootstrapped from Princeton WordNet via Sinica BOW, a bilingual ontological wordnet. We believe that the proposed platform will facilitate the enrichment of Chinese Lexical Resources in the context of Web collective intelligence and contribute to the advancement of Chinese Lexicography.
Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013) 2013
This paper seeks approaches to investigating the relationships among emotion words from a linguistic perspective, rather than devising new algorithms for emotion detection. Emotion words can be categorized into two groups, emotion-inducing words and emotion-describing words, and emotion-inducing words can trigger emotions expressed via emotion-describing words. Hence, this paper takes posts from the social network Plurk as its corpus, with emotion words drawn from the study on Standard Stimuli and Normative Responses of Emotions (SSNRE) in Taiwan and the National Taiwan University Sentiment Dictionary (NTUSD), and combines Principal Component Analysis (PCA) with a subsequent collocation approach to make a preliminary exploration of the interactions between emotion-inducing and emotion-describing words. The results show that although the retrieved Plurk posts contain emotion-inducing words, the polarities of the induced emotion-describing words within those posts are not consistent. In addition, the polarity of a post is influenced not only by emotion words but also by negation words, modal words and certain content words in context.
International Journal of Computational Linguistics & Chinese Language Processing, Volume 18, Number 2, June 2013-Special Issue on Chinese Lexical Resources: Theories and Applications 2013
There has been no consensus as to what constitutes a set of base concepts in the mental landscape. With the aim of exploring base concepts in Chinese, this paper proposes that frequently-occurring words in the glosses of a lexical resource such as the Chinese Wordnet can be seen as a candidate set of base concepts because the glosses use basic words. The present study identified 130 base concepts in Chinese. The Base Concepts in EuroWordNet were adopted as a reference for comparison. While only 44.6% of the base concepts identified in the present study have an equivalent in the set of Base Concepts of EuroWordNet, the other base concepts extracted by our gloss-based approach also reflect a certain degree of basicness. It is hoped that both the overlap and the difference between different sets of base concepts identified in different languages and by different approaches can deepen our understanding of the basic core in the mind. Additionally, it is also hoped that the set of base concepts identified in the present study can have computational as well as pedagogical applications in the future.
International Journal of Computer Processing Of Languages 2012
In Buddhist Digital Archives, there are three core elements (lexicon, content and catalog) that represent the knowledge of Buddhist Scriptures. However, the close relationship among these three core elements has not been explicitly and systematically highlighted. This paper aims to propose a framework for the integration of cross-language Buddhist Scriptures and traditional Buddhist taxonomic knowledge structure by applying current OntoLex (Ontological-Lexicon) techniques. In addition, an innovative attempt to import the concept of an Ontology catalog in building a cross-language Buddhist Tripitaka catalog in Chinese, Pali, Tibetan and Sanskrit is introduced. This paper starts with a portion of textual data from CBETA, a comprehensive Chinese Buddhist Digital Archive. We believe that the ontological and lexical knowledge preserved in CBETA is a treasure to be discovered and explored. The mining of these large-scale historical texts will enhance our understanding not only of Buddhist thought interpreted in varied temporal and geographical contexts, but also of the interplay of human language and cognition in diachronic multilingual contexts.
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012) 2012
This study adopts a corpus-based computational linguistic approach to measure individual differences (IDs) in visual word recognition. Word recognition has been a cardinal issue in the field of psycholinguistics. Previous studies examined IDs by resorting to test-based or questionnaire-based measures. Those measures, however, confined the research to the scope they can evaluate. To extend the research toward IDs in real life, the present study approaches the issue through observations of experiment participants’ daily-life lexical behaviors. Based on participants’ Facebook posts, two types of personal lexical behaviors are computed: the frequency index of personal word usage and personal word frequency. We investigate to what extent each of them accounts for participants’ variances in Chinese word recognition. The data analyses are carried out with mixed-effects models, which can precisely estimate by-subject differences. Results showed that the effects of personal word frequency reached significance; participants responded more rapidly when encountering more frequently used words. People with lower frequency indices of personal word usage had lower accuracy rates than others, which was contrary to our prediction. Comparison and discussion of the results also reveal methodological issues that can provide noteworthy suggestions for future research on measuring personal lexical behaviors.
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 2, June 2012—Special Issue on Selected Papers from ROCLING XXIII 2012
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. Our analysis mainly focuses on a recently constructed elderly speaker corpus that is used to reveal patterns of aging people’s language use. A conversational corpus contributed by speakers in their 20s serves as complementary material. The target words examined are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduct divisive hierarchical clustering analyses based on two different dimensions of corpus data, namely raw frequency distribution and collocation-based vectors. When different dimensions of data were used as the input, results showed that the target terms were clustered in different ways. Analyses based on frequency distributions and collocational patterns are distinct from each other. Specifically, statistically-based collocational analysis generally produces more distinct clustering results that differentiate temporal terms more delicately than do the ones based on raw frequency.
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation 2012
This study aims to propose a novel pipeline architecture for building and analyzing large-scale linguistic data in a cloud-based environment, taking an experimental survey on a Chinese Polarity Lexicon as an example. In this experiment, data are evaluated and tagged by applying a crowdsourcing approach using an online Google Form. All the data processing and analysis procedures are completed on the fly, automatically and dynamically, with free cloud services. The paper shows the advantages of using a cloud-based environment for collecting and processing linguistic data, which can be easily scaled up and efficiently computed. In addition, the proposed pipeline architecture also brings out the potential of merging with mashups from the web for representing and exploring corpus data of various types.
International Journal of Computer Processing Of Languages 2011
The representation of lexical semantic knowledge has been one of the most important research topics in the field of computational lexical semantics. Among relevant lexical resources, the design architecture of Princeton WordNet is the most popular one. In this paper, however, we argue that the current synset scheme requires more extensions when applied to the analysis of deeper sense structure in Chinese Wordnet. Issues involved include the underlying structure of sense, meaning facet and their relations. Based on a large amount of empirical analysis of sense data, this paper proposes a fine-grained framework in representing lexical semantic knowledge for Chinese Wordnet, which we believe will be an important consideration for the envisioned cross-lingual global wordnet grid construction. The systematic polysemy patterns found among meaning facets can also be used as a human gold standard of hand-annotated data for metonymy resolution task.
Handbook of Research on Culturally-Aware Information Technology: Perspectives and Models 2011
KYOTO is an Asian-European project developing a community platform for modeling knowledge and finding facts across languages and cultures. The platform operates as a Wiki system that multilingual and multi-cultural communities can use to agree on the meaning of terms in specific domains. The Wiki is fed with terms that are automatically extracted from documents in different languages. The users can modify these terms and relate them across languages. The system generates complex, language-neutral knowledge structures that remain hidden to the user but that can be used to apply open text mining to text collections. The resulting database of facts will be browseable and searchable. Knowledge is shared across cultures by modeling the knowledge across languages. The system is developed for 7 languages and applied to the domain of the environment, but it can easily be extended to other languages and domains.
Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011) 2011
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. The corpus used in our analysis is an elderly speaker corpus in its early development, and the target words are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduct divisive hierarchical clustering based on two different dimensions of corpus data, namely raw frequency distribution and collocation-based vectors. Results show that when different dimensions of data were used as the input, the target terms were indeed clustered in different ways. Analyses based on frequency distributions and collocational patterns are distinct from each other. Specifically, statistically-based collocational analysis produces more distinct clustering results that differentiate temporal terms more delicately than do the ones based on raw frequency.
Coling 2010: Posters 2010
The aim of this study is to use the word-space model to measure the semantic loads of single verbs, profile verbal lexicon acquisition, and explore the semantic information on Chinese resultative verb compounds (RVCs). A distributional model based on the Academia Sinica Balanced Corpus (ASBC) with Latent Semantic Analysis (LSA) is built to investigate how the semantic space varies with semantic load/specificity. A between-group comparison of age-related changes in verb style is then conducted to suggest the influence of semantic space on verbal acquisition. Finally, the study demonstrates how the meaning of RVCs can be explored with semantic space.
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation 2010
In this paper we define a lexical metrology in graphs of verbal synonymy to compute the flexsemic score of speakers from their verbal productions in action denomination tasks. This flexsemic score is used to automatically categorize young children versus young adults. We show that this score is effective in French and in Mandarin.
ROCLING 2010 Poster Papers 2010
In analyzing the formation of a given compound, both its internal syntactic structure and semantic relations need to be considered. The Generative Lexicon Theory (GL Theory) provides us with an explanatory model of compounds that captures the qualia modification relations in the semantic composition within a compound, which can be applied to natural language processing tasks. In this paper, we primarily discuss the qualia structure of noun-noun compounds found in Chinese as well as a few other languages, namely German, Spanish, Japanese and Italian. We briefly review the construction of compounds and focus on the noun-noun construction. While analyzing the semantic relationship between the words that compose a compound, we use the GL Theory to demonstrate that the proposed qualia structure enables compositional interpretation within the compound. We also examine whether, for each semantic head, its modifier can fit one of the four qualia roles. Finally, our analysis reveals the potential and limits of a qualia-based treatment of the composition of nominal compounds and suggests a path for future work.
Coling 2010: Demonstrations 2010
This presentation introduces a Python module (PyCWN) for accessing and processing Chinese lexical resources. In particular, our focus is on the Chinese Wordnet (CWN) that has been developed and released by the CWN group at Academia Sinica. PyCWN provides access to Chinese Wordnet (sense and relation data) under the Python environment. The presentation further demonstrates how this module applies to a variety of lexical processing tasks, as well as its potential for multilingual lexical processing.
Proceedings of the 5th International Workshop on Semantic Evaluation 2010
This document describes the preliminary release of the integrated Kyoto system for specific domain WSD. The system uses concept miners (Tybots) to extract domain-related terms and produces a domain-related thesaurus, followed by knowledge-based WSD based on wordnet graphs (UKB). The resulting system can be applied to any language with a lexical knowledge base, and is based on publicly available software and resources. Our participation in Semeval task #17 focused on producing running systems for all languages in the task, and we attained good results in all except Chinese. Due to the pressure of the time constraints in the competition, the system is still under development, and we expect results to improve in the near future.
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation 2010
In this paper we describe the data that will be used to compare the semantic structures that emerge from synonymy in French and in Mandarin. We aim at studying these semantic structures at both a global, lexicographic level, using lexicons, synonymy and translation dictionaries and at a more localised, experimental level, using data collected in parallel psycholinguistic experiments in French and Mandarin. After presenting our research project, the data we need to carry it out and the available resources, we analyse several linguistic issues arising from the structural differences between the French and Mandarin lexicons. We then explain the construction of the synonymy and translation networks from the available resources and detail specific choices that will enable us to produce meaningful experimental results based on this prepared data. Two kinds of networks are built: lexicographic networks and smaller movie-based networks extracted from experimental recordings. We conclude by describing how we intend to use this data.
Journées Sémantique et Modélisation Conference on Semantics and Formal Modelling 2010
Classifiers in Mandarin Chinese are required elements of well-formed noun phrases. They must appear between the determiner and the noun, as shown in example (1). The variety of classifiers has been described and analyzed in [9, 4, 6], among others. Classifiers are often separated from measure words by requiring them to hold a [+sortal] attribute [4]. In [6], classifiers are divided into individual, kind and event classifiers on the basis of a corpus-based classifier dictionary [5]. A formal analysis of the distinction between kind and individual classifiers has been proposed in [8]. In this study, however, we focus on individual classifiers and measure words. Huang and Ahrens suggest, without proposing a formal account, that classifiers can coerce the interpretation of the noun they classify, as in (1), taken from [6, p. 361]. (1) 一 朵/株 花 yi4 duo3/zhu1 hua1 'one CL.bud/CL.plant flower' (one flower bud/one flowering plant). In this work, we deal with the interaction between two phenomena: (i) classifier coercion (illustrated in the previous example) and (ii) the anaphoric use of classifiers, exemplified in (2). In this example, the second sentence is missing a noun. As a reviewer pointed out to us, it is tempting to assume that in 'DET+CL' constructions the classifier becomes a noun. However, most classifiers cannot be used as nouns in other contexts: some words, like 碗/wan3 (bowl), can be used both as classifiers and as nouns, but as nouns they require another classifier to precede them. Assuming a category change here would go against the standard view of Mandarin syntax; see for example [10]. (2) 我 買 了 五 粒 水餃 wo3 mai3 le5 wu3 li4 shui3jiao3 'I buy ASP five CL.grain dumplings' (I bought five dumplings). 你 吃 了 四 粒 ni3 chi1 le5 si4 li4 'you eat ASP four CL.grain' (You ate four).
Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010) 2010
In this paper, we present a simple but efficient approach to the automatic mood classification of microblogging messages from the Plurk platform. In contrast with Twitter, Plurk has become the most popular microblogging service in Taiwan and other countries. (According to Alvin, the cofounder of the Plurk website, the number of plurkers in Taiwan had reached approximately 1 million, one-third of all plurkers, as of October 2009. Statistics collected from Google Trends for websites also show that Taiwan ranks first among regions visiting the Plurk website, as of August 2010.) However, no previous research has addressed emotion and mood recognition on this platform, nor are Chinese affective terms or corpora available. Following the line of mashup programming, we therefore construct a dynamic Plurk corpus by pipelining Plurk APIs, Yahoo! Chinese segmentation APIs, and other services to preprocess and annotate the corpus data. Based on the corpus, we conduct experiments combining textual statistics and emoticon data, and our method yields high-performance results. This work can be further extended by combining it with an affective ontology designed according to the appraisal theory of emotion. Keywords: mood classification, plurks, keyness, emotion paradox. Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010), Pages 172-183, Puli, Nantou, Taiwan, September 2010.
Proceedings of Compositionality and Distributional Semantic Models workshop (ESSLLI) 2010
Proceedings of the 2009 workshop on the people's web meets NLP: Collaboratively constructed semantic resources 2009
Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. However, it must be processed before it can be used efficiently as an NLP resource. After describing the aspects of Wiktionary relevant to our purposes, we focus on its structural properties. Then, we describe how we extracted synonymy networks from this resource. We provide an in-depth study of these synonymy networks and compare them to those extracted from traditional resources. Finally, we describe two methods for semi-automatically improving this network by adding missing relations: (i) using a kind of semantic proximity measure; (ii) using the translation relations of Wiktionary itself.
Proceedings of the 7th Workshop on Asian Language Resources (ALR7) 2009
This paper reports a prototype multilingual query expansion system relying on LMF-compliant lexical resources. The system is one of the deliverables of a three-year project aiming at establishing an international standard for language resources that is applicable to Asian languages. Our important contributions to ISO 24613, the standard Lexical Markup Framework (LMF), include its robustness in dealing with Asian languages and its applicability to cross-lingual query tasks, as illustrated by the prototype introduced in this paper.
Concentric: Studies in Linguistics 2009
Lexical semantic relations have played an important role in the recent developments of Natural Language Processing and Computational Lexical Resources as well. This paper reviews the notion of lexical semantic relations in the WordNet-like lexical resources, and proposes a formal modeling of lexical semantic relations using the extended Formal Concept Analysis. I believe that the proposed formalization will be able to highlight problems with regard to lexical and cultural gaps, and serve as a foundation for solutions that support lexical theoretical explorations and applications for multilingual wordnets in the future. Key words: lexical semantics, computational lexicon, Formal Concept Analysis
Language Resources and Evaluation 2009
In this paper we present an application fostering the integration and interoperability of computational lexicons, focusing on the particular case of mutual linking and cross-lingual enrichment of two wordnets, the ItalWordNet and Sinica BOW lexicons. This is intended as a case study investigating the needs and requirements of semi-automatic integration and interoperability of lexical resources, with a view to developing a prototype web application to support the GlobalWordNet Grid initiative.
The 5th International Conference on Generative Approaches to the Lexicon 2009
This study aims to explore Chinese type coercion, a phenomenon which has been refuted by some linguists. Their discussion has been based on the translation of the English sentences discussed in the coercion literature. We point out the inappropriateness of this approach by showing that it does not take account of the lexical semantics of the target language or of real language use. To show that type coercion is pervasive in Chinese, we adopt a corpus-based approach (web as corpus) and focus on one of the generative mechanisms proposed in Pustejovsky (1995), namely, true complement coercion. A handcrafted lexico-syntactic template is used to extract coercion data along with their noncoercive counterparts from the Web. Our preliminary results support the hypothesis that true complement coercion is a universal linguistic mechanism. Our data-extraction algorithm is also likely to be useful in automatically extracting coercion data from corpora for future theoretical and computational studies.
Proceedings of the 7th Workshop on Asian Language Resources (ALR7) 2009
Lexical Markup Framework (LMF, ISO-24613) is the ISO standard that provides a common standardized framework for the construction of natural language processing lexicons. LMF facilitates data exchange among computational linguistic resources, and also promises a convenient uniformity for future applications. This study describes the design and implementation of the WordNet-LMF used to represent lexical semantics in Chinese WordNet. The compiled CWN-LMF will be released to the community for linguistic research.
Chinese Lexical Semantics Workshop (CLSW) 2009
This study proposes an approach to extract domain-specific words and to distinguish their word senses, with the aim of extending the current WordNet architecture for domain applications. The domain-specific lexicon is compiled in a Wordnet-LMF format in compliance with ISO 24613 for the internationally collaborative KYOTO project. The findings and results provide a preliminary resource extension for cross-language domain knowledge exchange and show the benefits for domain-specific applications.
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2 2009
Modeling of semantic space is a new and challenging research topic in both cognitive science and linguistics. Existing approaches can be classified into two types according to how the calculations are done: using either a word-by-word co-occurrence matrix or a word-by-context matrix (Riordan 2007). In this paper, we argue that the popular existing distributional semantic model (the vector space model) does not adequately explain the age-of-acquisition data in Chinese. An alternative measure of semantic proximity called PROX (Gaume et al., 2006) is applied instead. The application of PROX has interesting psycholinguistic implications. Unlike previous semantic space models, PROX can be trained with children's data as well as adult data. This allows us to test the hypothesis that children's semantic space approximates the target of acquisition: the adult semantic space. It also allows us to compare our Chinese experimental results with French results to test the universality of the approximation model.
International Journal of Computational Linguistics & Chinese Language Processing, Volume 14, Number 1, March 2009 2009
Shu-Yen Lin, Cheng-Chao Su, Yu-Da Lai, Li-Chin Yang, Shu-Kai Hsieh; Readability, Prototype Theory, WordNe
Proceedings of the 20th Conference on Computational Linguistics and Speech Processing 2008
The noise robustness of an automatic speech recognition system is one of the most important factors determining its recognition accuracy in a noise-corrupted environment. Among the various approaches, normalizing the statistical quantities of speech features is a very promising direction for creating more noise-robust features. The related feature normalization approaches include cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), etc. In addition, the statistical quantities used in these techniques can be obtained in an utterance-wise manner or a codebook-wise manner. It has been shown that in most cases the latter behaves better than the former. In this paper, we mainly focus on two issues. First, we develop a new procedure for constructing the pseudo-stereo codebook, which is used in the codebook-based feature normalization approaches. The resulting new codebook is shown to provide a better estimate of the feature statistics, enhancing the performance of the codebook-based approaches. Second, we propose a series of new feature normalization approaches, including associative CMS (A-CMS), associative CMVN (A-CMVN) and associative HEQ (A-HEQ). In these approaches, two sources of statistical information for the features, one from the utterance and the other from the codebook, are properly integrated. Experimental results show that these new feature normalization approaches perform significantly better than the conventional utterance-based and codebook-based ones. As a result, the methods proposed in this paper effectively improve the noise robustness of speech features.
Workshop on Linguistic Studies of Ontology, International Congress of Linguistics 2008
Workshop on Linguistic Studies of Ontology, International Congress of Linguistics 2008
International Conference on Large-Scale Knowledge Resources 2008
Sense-tagged corpora play a crucial role in Natural Language Processing, especially in research on word sense disambiguation and natural language understanding. A large-scale Chinese sense-tagged corpus is essential, yet the lack of one remains a critical deficiency at the current stage. This paper aims to design a large-scale Chinese full-text sense-tagged corpus containing over 110,000 words. The Academia Sinica Balanced Corpus of Modern Chinese (also known as the Sinica Corpus) serves as the tagging object, and 56 full texts are extracted from it. Using N-gram statistics and collocation information, the preparation work for automatic sense tagging combines machine learning techniques with a probability model. To achieve highly precise results, the output of automatic sense tagging requires manual revision.
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I 2008
Numerative classifiers are ubiquitous in many Asian languages. This paper proposes a method to construct a taxonomy of numerative classifiers based on a noun-classifier agreement database. The taxonomy defines superordinate-subordinate relations among numerative classifiers and represents the relations in tree structures. Experiments to construct taxonomies were conducted for evaluation using data from three different languages: Chinese, Japanese and Thai. We found that our method was promising for Chinese and Japanese, but inappropriate for Thai. It confirms that there really is no hierarchy among Thai classifiers.
ROCLING 2008 Poster Papers 2008
Lexical knowledge bases built on synsets and semantic relations, such as wordnets, have been well studied and constructed for English and other European languages (EuroWordnet). The Chinese Wordnet (CWN) was launched by Academia Sinica on a similar paradigm. The synset to which each word sense belongs in CWN is manually labeled; however, the lexical semantic relations among synsets are not yet fully constructed. In this paper, we propose a lexical pattern-based algorithm that automatically discovers semantic relations among verbs, especially the troponymy relation. There are many ways in which the structure of a language can indicate the meaning of lexical items. For Chinese verbs, we identify two sets of lexical syntactic patterns denoting the hypernymy-troponymy relation. We describe a method for discovering these syntactic patterns and automatically extracting the target verbs and their corresponding hypernyms. Our system achieves satisfactory results, and we believe it will shed light on the task of automatic acquisition of Chinese lexical semantic relations, as well as on ontology learning.
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08) 2008
Corpus-based approaches and statistical approaches have been the main stream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but there is an insufficient amount of language resources in many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources in new languages. This paper presents the latest development efforts of our project which aims at creating a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability and iii) the evaluation platform used to test the entire architectural framework.
Proceedings of the 20th Conference on Computational Linguistics and Speech Processing 2008
A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high F-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high F-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick preparation of training data by converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and with 1 million vectors achieves comparable and reliable results. In addition, when applied to SIGHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. In conclusion, we discuss linguistic ramifications as well as future implications of the WBD approach.
4th Global WordNet Conference, GWC 2008 2007
Lexical chaining is regarded as a valuable resource for NLP applications, such as automatic text summarization or topic detection. Typically, lexical chainers use a word net to compute semantically motivated partial text representations. However, their output is normally evaluated with respect to an application, since generic evaluation criteria have not yet been determined and systematically applied. This paper presents a new evaluation procedure meant to address this issue and provide insight into the chaining process. Furthermore, the paper demonstrates its application, by way of example, for a lexical chainer using GermaNet as a resource.
1 Project Context and Motivation
Converting linear text documents into documents publishable in a hypertext environment is a complex task requiring methods for segmentation, reorganization, and linking. The HyTex project, funded by the DFG, aims at the development of conversion strategies based on text-grammatical features (see our project web pages). One focus of our work is on topic-based linking strategies using lexical and thematic chains. In contrast to the lexical ones, thematic chains are based on a selection of central words, so-called topic anchors, which are e.g. words able to outline the content of a complete passage, and which, as in lexical chaining, are connected via semantically meaningful edges. An illustration is given in Fig. 1. We intend to use lexical chaining for the construction of thematic chains: on the one hand as a feature for the extraction of topic anchors and on the other hand as a tool for the calculation of thematic structure, as shown in Fig. 1. For this purpose, we implemented a lexical chainer for German corpora based on GermaNet. In order to perform an in-depth analysis and evaluation of this chainer, as well as to gain insight into the whole chaining process, we developed a detailed evaluation procedure. We argue that this procedure is applicable to any lexical chainer regardless of the algorithm or resources used and helps to fine-tune the parameter setting ideal for a specific application. We also present a detailed evaluation of our own lexical chainer and illustrate the issues and challenges we encountered using GermaNet as a resource.
4th Global WordNet Conference, GWC 2008 2007
This paper reports our recent work on using visualization to present semantic relations in Chinese WordNet. We design a visualization interface, named CWN-Viz, based on "TouchGraph". This visualization interface has three important design features. First, visualization is driven by wordform, the most intuitive lexical search unit in Chinese. Second, CWN-Viz allows visualization of bilingual semantic relations by incorporating Sinica BOW (Bilingual Ontological WordNet) information. Third, the semantic distance of each relation is calculated and used in both clustering and visualization.