ADDRESSING INFORMALITY IN PROCESSING CHINESE MICROTEXT
COM1 Level 3
MR1, COM1-03-19
ABSTRACT
In this thesis, I tackle the problem of processing Chinese microtext, with the goal of building natural language processing (NLP) tools for the microtext domain. I find that informal words and named entities formed in a free-style manner are key reasons why microtext is difficult for conventional NLP tools to understand and process. I therefore study three key areas to address informality in processing Chinese microtext:
1. informal word recognition and word segmentation,
2. informal word normalization, and
3. named entity recognition.
The first area allows us to identify unknown, informal words formed from ordinary Chinese characters, resulting in improved word segmentation. Leveraging my observation that informal word recognition and word segmentation are mutually dependent, I formulate the problem as a two-layer sequential labeling problem, in which a factorial conditional random field performs both tasks jointly. This joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.
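As a rough illustration of the two-layer formulation, the sketch below encodes a segmented sentence as two aligned per-character label sequences: one layer for word boundaries and one for informality. The BIES segmentation tags, the F/IF informality tags, the function name `two_layer_labels`, and the example sentence are all assumptions made for illustration, not necessarily the tag sets or data used in the thesis. The sketch covers only the label encoding that a joint model would predict, not the factorial conditional random field itself.

```python
from typing import List, Tuple


def two_layer_labels(words: List[Tuple[str, bool]]) -> List[Tuple[str, str, str]]:
    """Turn (word, is_informal) pairs into per-character rows of
    (character, segmentation tag, informality tag).

    Segmentation tags (assumed): B = begin, I = inside, E = end, S = single.
    Informality tags (assumed): IF = informal, F = formal.
    """
    rows = []
    for word, is_informal in words:
        n = len(word)
        for i, ch in enumerate(word):
            if n == 1:
                seg = "S"
            elif i == 0:
                seg = "B"
            elif i == n - 1:
                seg = "E"
            else:
                seg = "I"
            rows.append((ch, seg, "IF" if is_informal else "F"))
    return rows


# Hypothetical example: "神马" is a widely used informal spelling of "什么" ("what").
example = [("这", False), ("是", False), ("神马", True), ("情况", False)]

for ch, seg, inf in two_layer_labels(example):
    print(ch, seg, inf)
```

Each character carries one label per layer, so a model that predicts both layers together can let evidence about a word boundary inform the informality decision and vice versa, which is the mutual dependence described above.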
The second area links informal words to their formal counterparts, which helps both humans and machines better understand these informal replacements. I formalize the task as a classification problem and propose rule-based and statistical features to model three plausible channels that explain the connection between formal/informal pairs. I evaluate my two-stage selection-classification model on a crowdsourced corpus, achieving a normalization precision of 89.5% across the different channels, a significant improvement over the state of the art.
The third area targets an important class of words common in microtext: named entities. I propose an effective method to obtain annotations for named entities automatically and employ a conditional random field to label named entities in microtext, using features derived from both labeled and unlabeled data. To further improve the performance gains derived from the automatic annotations, my method caters for the time-sensitive nature of named entities, thus keeping the model up to date.