PH.D. DEFENCE - PUBLIC SEMINAR

Linguistically-Inclusive Natural Language Processing

Speaker
Mr Samson Tan Min Rong
Advisor
Dr Kan Min Yen, Associate Professor, School of Computing


Thursday, 12 May 2022, 01:00 PM to 02:30 PM

Zoom presentation

Abstract:
Language is largely a social construct, shaped by each community's lived experiences, culture, and language repertoire. However, current natural language processing (NLP) systems fail to account for sociolinguistic variation: the supervised learning paradigm and common NLP practices implicitly assume that all speakers of a language speak a single, "standard" variety governed by one set of linguistic rules. This assumption is especially damaging to minority language varieties, perpetuating the perception that they are "ungrammatical" and "incorrect".

Failing to address this gap predisposes NLP systems to discriminate against minority language communities, whether through disproportionately poor performance or through the encoding of harmful stereotypes (e.g., classifying colloquial varieties as ungrammatical). Hence, this thesis focuses on sociolinguistic generalization, defined as an NLP system's ability to generalize beyond the language variety it was trained on. In some settings, this can be viewed as robustness to sociolinguistic variation.

We first demonstrate, using a morphological adversarial attack, that text classification, question answering, and machine translation models are not robust to a common form of sociolinguistic variation: inflectional variation in English. This is particularly worrying, given English's status as a world language: a model's (in)ability to reliably process "non-standard" Englishes greatly impacts the inclusiveness of the overall NLP system. Therefore, we propose a sample-efficient modification to the subword tokenizer that significantly improves the model's robustness. This improvement is achieved without the model ever observing any adversarial examples during training, and generalizes to out-of-domain data.
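
To make the idea concrete, the sketch below shows one way a morphological attack of this kind could be organised: greedily re-inflecting each word to whichever surface form most reduces the victim model's confidence. This is a minimal illustration only, not the thesis's actual algorithm; the toy inflection table, the lemma_of lookup, and the confidence callback are invented placeholders.

    from typing import Callable, Dict, List

    # Toy inflection table: lemma -> candidate surface forms. A real attack
    # would derive candidates from a full morphological resource for English.
    INFLECTIONS: Dict[str, List[str]] = {
        "watch": ["watch", "watches", "watched", "watching"],
        "movie": ["movie", "movies"],
        "be": ["is", "are", "was", "were"],
    }

    def lemma_of(token: str) -> str:
        """Map a surface form back to its lemma (toy reverse lookup)."""
        for lemma, forms in INFLECTIONS.items():
            if token in forms:
                return lemma
        return token

    def inflectional_attack(tokens: List[str],
                            confidence: Callable[[List[str]], float]) -> List[str]:
        """Greedily re-inflect each token to minimise the victim model's
        confidence in the correct label."""
        adv = list(tokens)
        for i, token in enumerate(adv):
            candidates = INFLECTIONS.get(lemma_of(token), [token])
            # Keep whichever inflection hurts the model the most.
            adv[i] = min(candidates,
                         key=lambda form: confidence(adv[:i] + [form] + adv[i + 1:]))
        return adv

Perturbations of this kind leave each word's lemma, and hence the sentence's meaning, largely intact, which is precisely what makes models' brittleness to them notable.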

Another common form of sociolinguistic variation in multilingual societies is code-mixing, where multiple languages are used in a single sentence. To expose the inability of current multilingual models to handle extreme code-mixing, we construct two multilingual adversarial attacks. Despite claims of zero-shot cross-lingual transfer, we find that model performance drops substantially even in the bilingual case. We then improve worst-case performance using a variant of adversarial training that requires the same number of steps as conventional fine-tuning. Our adversarial training method not only improves robustness to adversaries constructed with unseen embedded languages, but also achieves this without sacrificing performance on monolingual examples. This contrasts with existing work arguing for the existence of a trade-off between robustness and accuracy. We further demonstrate that these improvements transfer to real code-mixed Twitter data.
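
The following sketch illustrates, under heavy simplification, how such code-mixing probes could be built: a random variant for average-case behaviour and a greedy variant for worst-case behaviour. The toy English-to-Malay lexicon and the confidence callback are placeholders standing in for full bilingual dictionaries and a real victim model, not the attacks used in the thesis.

    import random
    from typing import Callable, Dict, List

    # Toy English -> Malay lexicon; a real attack would draw on full bilingual
    # dictionaries for many candidate embedded languages.
    EN_MS: Dict[str, str] = {"movie": "wayang", "good": "bagus", "very": "sangat"}

    def random_mix(tokens: List[str], p: float = 0.5) -> List[str]:
        """Average-case probe: independently swap each translatable token."""
        return [EN_MS[t] if t in EN_MS and random.random() < p else t
                for t in tokens]

    def adversarial_mix(tokens: List[str],
                        confidence: Callable[[List[str]], float]) -> List[str]:
        """Worst-case probe: greedily keep only the swaps that reduce the
        victim model's confidence in the correct label."""
        adv = list(tokens)
        for i, t in enumerate(adv):
            if t in EN_MS:
                swapped = adv[:i] + [EN_MS[t]] + adv[i + 1:]
                if confidence(swapped) < confidence(adv):
                    adv[i] = EN_MS[t]
        return adv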

We conclude by generalizing the preceding adversarial attacks into a framework for testing NLP system reliability in the presence of language variation. We posit that any sociolinguistic environment can be viewed as an n-dimensional subspace of variation, with each dimension corresponding to a particular type of linguistic variation. Examples can then be randomly sampled from the distributions modeling these dimensions to evaluate average-case performance, or adversarially sampled to measure worst-case performance.

Natural language technology is often hailed as an avenue for improving technological accessibility. This thesis strives for a world in which language technology works not only for the privileged, but for everyone, regardless of social, cultural, and linguistic background.
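
As a rough illustration of the framework sketched above, with every name and distribution invented for the example, one could model each dimension of variation as a set of input transformations, compose random draws to estimate average-case reliability, and take the minimum over draws as a crude proxy for worst-case reliability (a real adversary would search more systematically):

    import random
    from typing import Callable, List

    # A "dimension" of variation is modelled here as a set of candidate
    # transformations over token sequences; all of this is schematic.
    Transform = Callable[[List[str]], List[str]]

    def evaluate(dimensions: List[List[Transform]],
                 inputs: List[List[str]],
                 score: Callable[[List[str]], float],
                 worst_case: bool = False,
                 n_samples: int = 10) -> float:
        """Estimate a system's reliability under composed sociolinguistic
        perturbations: mean over samples (average case) or min (worst case)."""
        results = []
        for tokens in inputs:
            samples = []
            for _ in range(n_samples):
                perturbed = tokens
                # Draw one transformation per dimension and compose them.
                for dim in dimensions:
                    perturbed = random.choice(dim)(perturbed)
                samples.append(score(perturbed))
            # min approximates the adversarial case via random search only.
            results.append(min(samples) if worst_case else
                           sum(samples) / len(samples))
        return sum(results) / len(results)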