Word sense disambiguation primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Conversely, proper names are usually considered to denote an ad-hoc real-world referent. Once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguities through appellativization, i.e., they act like common words and may denote different aspects of their referents. We propose to address the ambiguities of proper names in the light of regular polysemy, which we formalize as dot objects. This paper introduces a combined word sense disambiguation (WSD) model for disambiguating common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based model architecture, which takes advantage of the glosses and example sentences of CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource.
@article{hsieh_resolving_2024,
title = {Resolving regular polysemy in named entities},
author = {Shu-Kai Hsieh AND Yu-Hsiang Tseng AND Hsin-Yu Chou AND Ching-Wen Yang AND Yu-Yun Chang},
journal = {arXiv preprint arXiv:2401.09758},
year = {2024},
}
Do you believe it happened? Assessing Chinese readers’ veridicality judgments
Yu-Yun Chang, Shu-Kai Hsieh
Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020
This work collects and studies Chinese readers’ veridicality judgments of news events (whether an event is viewed as happening or not). For instance, in “The FBI alleged in court documents that Zazi had admitted having a handwritten recipe for explosives on his computer”, do people believe that Zazi had a handwritten recipe for explosives? The goal is to observe the pragmatic behavior of linguistic features in context that affect readers in making veridicality judgments. Exploration of the datasets shows that features such as event-selecting predicates (ESP), modality markers, adverbs, temporal information, and statistics have an impact on readers’ veridicality judgments. We further found that modality markers with high certainty do not necessarily lead readers to have high confidence that an event happened. Additionally, the source of information introduced by an ESP has little effect on veridicality judgments, even when an event is attributed to an authority (e.g. “The FBI”). A corpus annotated with Chinese readers’ veridicality judgments is released as the Chinese PragBank for further analysis.
@inproceedings{chang-hsieh-2020-believe,
title = "Do You Believe It Happened? Assessing {C}hinese Readers' Veridicality Judgments",
author = "Chang, Yu-Yun AND Hsieh, Shu-Kai",
editor = "Calzolari, Nicoletta AND B{\'e}chet, Fr{\'e}d{\'e}ric AND Blache, Philippe AND Choukri, Khalid AND Cieri, Christopher AND Declerck, Thierry AND Goggi, Sara AND Isahara, Hitoshi AND Maegaard, Bente AND Mariani, Joseph AND Mazo, H{\'e}l{\`e}ne AND Moreno, Asuncion AND Odijk, Jan AND Piperidis, Stelios",
booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.33/",
pages = "259--267",
language = "eng",
ISBN = "979-10-95546-34-4",
abstract = "This work collects and studies Chinese readers' veridicality judgments to news events (whether an event is viewed as happening or not). For instance, in ``The FBI alleged in court documents that Zazi had admitted having a handwritten recipe for explosives on his computer'', do people believe that Zazi had a handwritten recipe for explosives? The goal is to observe the pragmatic behaviors of linguistic features under context which affects readers in making veridicality judgments. Exploring from the datasets, it is found that features such as event-selecting predicates (ESP), modality markers, adverbs, temporal information, and statistics have an impact on readers' veridicality judgments. We further investigated that modality markers with high certainty do not necessarily trigger readers to have high confidence in believing an event happened. Additionally, the source of information introduced by an ESP presents low effects to veridicality judgments, even when an event is attributed to an authority (e.g. ``The FBI''). A corpus annotated with Chinese readers' veridicality judgments is released as the Chinese PragBank for further analysis."
}
Sentiment detection in micro-blogs using unsupervised chunk extraction
Pierre Magistry, Shu-Kai Hsieh, Yu-Yun Chang
Lingua Sinica, 2016
In this paper, we present a system designed for sentiment detection in Chinese micro-blog data. Our system surprisingly benefits from the lack of word boundaries in the Chinese writing system and shifts the focus directly to larger and more relevant chunks. We use an unsupervised Chinese word segmentation system and a binomial test to extract specific and endogenous lexicon chunks from the training corpus. We combine the lexicon chunks with other external resources to train a maximum entropy model for document classification. With this method, we obtained an averaged F1 score of 87.2, which outperforms the state-of-the-art approach based on the released data in the second SocialNLP shared task.
@article{magistry_sentiment_2016,
title = {Sentiment detection in micro-blogs using unsupervised chunk extraction},
author = {Pierre Magistry AND Shu-Kai Hsieh AND Yu-Yun Chang},
journal = {Lingua Sinica},
year = {2016},
}
CWIKIN: a wiki that helps quicken the development of Chinese Wordnet
The present study proposes an innovative way of expanding the lexical repository of Chinese Wordnet (CWN). Fine-grained as the senses and sense facets of its entries are, CWN currently fails to include such high-frequency words as àiqíng (‘love’) and such high-familiarity words as guānxīn (‘to care’) due to the lack of linguistic manpower. In view of this limited inclusion of words in CWN, we propose to extend its lexical knowledge by constructing a wiki for CWN, or CWIKIN, a collaborative platform on which registered users can contribute to CWN by adding new entries, editing existing ones and rating one another’s contributions to ensure the quality of collective intelligence. What distinguishes CWIKIN from a typical wiki is that it presents synonymous sets of Chinese words that are currently only implicitly represented in CWN, and helps users add those words as well as various lexical semantic relations by suggesting potential equivalents along with their parts-of-speech, definitions and example sentences bootstrapped from Princeton WordNet via Sinica BOW, a bilingual ontological wordnet. We believe that the proposed platform will facilitate the enrichment of Chinese lexical resources in the context of Web collective intelligence and contribute to the advancement of Chinese lexicography.
@inproceedings{lee_cwikin_2013,
title = {CWIKIN: a wiki that helps quicken the development of Chinese Wordnet},
author = {Chih-Yao Lee AND Yu-Yun Chang AND Shu-Kai Hsieh AND Jia-Fei Hong AND Chu-Ren Huang},
year = {2013},
}
Causing emotion in collocation: An exploratory data analysis
Pei-Yu Lu, Yu-Yun Chang, Shu-Kai Hsieh
Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013), 2013
This paper seeks approaches for investigating the relationships among emotion words from a linguistic perspective, rather than developing new algorithms for emotion detection. Emotion words can be categorized into two groups, emotion-inducing words and emotion-describing words, where emotion-inducing words can trigger emotions expressed via emotion-describing words. Hence, this paper takes posts from the social network Plurk as its corpus, with emotion words drawn from the study on Standard Stimuli and Normative Responses of Emotions (SSNRE) in Taiwan and the National Taiwan University Sentiment Dictionary (NTUSD), and combines Principal Component Analysis (PCA) with a subsequent collocation approach to make a preliminary exploration of the interactions between emotion-inducing and emotion-describing words. The results show that although the retrieved Plurk posts contain emotion-inducing words, the polarities of the induced emotion-describing words within the posts are not consistent. In addition, the polarity of a post is influenced not only by emotion words but also by negation words, modal words, and certain content words in context.
@inproceedings{lu_causing_2013,
title = {Causing emotion in collocation: An exploratory data analysis},
author = {Pei-Yu Lu AND Yu-Yun Chang AND Shu-Kai Hsieh},
booktitle = {Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013)},
year = {2013},
}
Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in Two Conversational Corpora
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 2, June 2012, Special Issue on Selected Papers from ROCLING XXIII, 2012
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. Our analysis mainly focuses on a recently constructed elderly speaker corpus that is used to reveal patterns of aging people’s language use. A conversational corpus contributed by speakers in their 20s serves as complementary material. The target words examined are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduct divisive hierarchical clustering analyses based on two different dimensions of corpus data, namely raw frequency distribution and collocation-based vectors. When different dimensions of data were used as the input, results showed that the target terms were clustered in different ways. Analyses based on frequency distributions and collocational patterns are distinct from each other. Specifically, statistically-based collocational analysis generally produces more distinct clustering results that differentiate temporal terms more delicately than do the ones based on raw frequency.
Acknowledgement: Thanks to Wang Chun-Chieh, Liu Chun-Jui, Anna Lofstrand, and Hsu Chan-Chia for their involvement in the construction of the elderly speakers’ corpus and the early development of this paper.
Affiliations: Graduate Institute of Linguistics, National Taiwan University, 3F, Le-Xue Building, No. 1, Sec. 4, Roosevelt Rd., Taipei, Taiwan, 106 (E-mail: {sftwang0416; flower75828; june06029}@gmail.com; shukaihsieh@ntu.edu.tw); Department of English, National Taiwan Normal University, No. 162, He-ping East Road, Section 1, Taipei, Taiwan, 106 (E-mail: Yw_L7@hotmail.com)
@article{wang_frequency_2012,
title = {Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in Two Conversational Corpora},
author = {Sheng-Fu Wang AND Jing-Chen Yang AND Yu-Yun Chang AND Yu-Wen Liu AND Shu-Kai Hsieh},
journal = {International Journal of Computational Linguistics \& Chinese Language Processing},
volume = {17},
number = {2},
month = jun,
note = {Special Issue on Selected Papers from ROCLING XXIII},
year = {2012},
}
Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in an Elderly Speaker Corpus
Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011), 2011
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. The corpus used in our analysis is an elderly speaker corpus in its early development, and the target words are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduct divisive hierarchical clustering based on two different dimensions of corpus data, namely raw frequency distribution and collocation-based vectors. Results show that when different dimensions of data were used as the input, the target terms were indeed clustered in different ways. Analyses based on frequency distributions and collocational patterns are distinct from each other. Specifically, statistically-based collocational analysis produces more distinct clustering results that differentiate temporal terms more delicately than do the ones based on raw frequency.
@inproceedings{wang_frequency_2011,
title = {Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in an Elderly Speaker Corpus},
author = {Sheng-Fu Wang AND Jing-Chen Yang AND Yu-Yun Chang AND Yu-Wen Liu AND Shu-Kai Hsieh},
booktitle = {Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011)},
year = {2011},
}