PH.D. DEFENCE - PUBLIC SEMINAR

Applied Abstractive Summarization: Multiple Inputs, Controlled Outputs, And Evaluation

Speaker
Ms. Shen Chenhui
Advisor
Dr. You Yang, NUS Presidential Young Professor, School of Computing


19 Aug 2024, Monday, 03:30 PM to 05:00 PM

MR20, COM3-02-59

Abstract:

Automatic summarization is useful in many real-life scenarios. For instance, consumers making online purchases benefit from an accurate and succinct summary of the thousands of reviews on the same product, and readers who wish to stay up to date with news and events benefit from having articles from different news agencies condensed into a clear overview. Writing such summaries manually is time-consuming and expensive, hence the strong need to study automatic summarization.

However, conventional summarization has limited use in the above scenarios. Traditional summarization tasks take in only a single input passage and may not handle the complex interactions amongst multiple input sources. Moreover, they may not be able to customize the output summary according to different user requirements. For instance, while the ordinary public may be more interested in the storyline of an event, an educated reader may be more interested in its aftermath and potential implications; with traditional automatic summarization, both groups receive the same standardized summary. In addition, when several summaries are generated from the same input source, there is a lack of efficient methods to find the one best suited to a specific purpose: automatic evaluation metrics may not capture the wide range of equally good summaries, whereas human evaluations are often costly and time-consuming.

In this thesis, we focus on three sub-tasks of automatic summarization that make it more applicable in practice: handling multiple inputs (i.e., multi-document summarization (MDS)), enabling flexible outputs that suit different requirements (i.e., controllable summarization), and summarization evaluation. We utilize pre-trained language models (PLMs) for these tasks.

This thesis begins with the task of multi-document summarization. Specifically, we explore adapting PLMs for MDS. Although PLMs are strong performers for single-document summarization (SDS), this capability may not transfer sufficiently to MDS because of the complex cross-document interactions inherent to the task. We design a simple yet effective method to hierarchize the attention mechanisms of PLMs without diverging too far from the pre-training architecture. As a result, without the need for additional pre-training, PLMs can be directly fine-tuned on MDS datasets for better performance.
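To give a flavour of the idea, the following is a minimal, hypothetical PyTorch sketch of hierarchized attention for multi-document inputs: tokens first attend within their own document, and pooled document representations then attend across documents. The class name, tensor layout, and pooling choice are assumptions for illustration only, not the implementation described in the thesis.

    # Hypothetical sketch of hierarchized attention for MDS (not the thesis's exact method).
    import torch
    import torch.nn as nn

    class HierarchicalAttentionBlock(nn.Module):
        def __init__(self, hidden_size=768, num_heads=12):
            super().__init__()
            # Within-document (local) attention, reusing the standard PLM layout.
            self.local_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
            # Cross-document (global) attention over pooled document vectors.
            self.doc_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

        def forward(self, hidden, doc_ids):
            # hidden:  (batch, seq_len, hidden_size) token states of the concatenated documents
            # doc_ids: (batch, seq_len) long tensor giving each token's source-document id
            # 1) Mask so that tokens only attend to tokens from the same document.
            same_doc = doc_ids.unsqueeze(2) == doc_ids.unsqueeze(1)        # (batch, seq, seq)
            blocked = (~same_doc).repeat_interleave(self.local_attn.num_heads, dim=0)
            local_out, _ = self.local_attn(hidden, hidden, hidden, attn_mask=blocked)

            # 2) Mean-pool each document into one vector; documents attend to each other.
            num_docs = int(doc_ids.max().item()) + 1
            doc_vecs = torch.stack(
                [(local_out * (doc_ids == d).unsqueeze(-1)).sum(1)
                 / (doc_ids == d).sum(1, keepdim=True).clamp(min=1)
                 for d in range(num_docs)], dim=1)                         # (batch, num_docs, hidden)
            doc_out, _ = self.doc_attn(doc_vecs, doc_vecs, doc_vecs)

            # 3) Broadcast the cross-document signal back to every token of its document.
            gathered = doc_out.gather(1, doc_ids.unsqueeze(-1).expand(-1, -1, doc_out.size(-1)))
            return local_out + gathered

The key design point illustrated here is that both attention stages reuse the standard multi-head attention of the pre-trained model, so the hierarchy is imposed through masking and pooling rather than through a new, pre-training-incompatible architecture.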

Next, this thesis explores the task of controllable summarization. We propose a new task of structure-controlled summarization, which allows fine-grained control over the summary's meta-structure. We annotate a dataset in the peer-review domain and show that simple fine-tuning already equips the PLM with a reasonable degree of controllability. We further design an inference-time method that strengthens the controllability of the PLM by combining various sampling and search algorithms. Ultimately, we achieve a very high degree of control without compromising summary quality.
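As an illustration of what such an inference-time control scheme can look like, here is a minimal, hypothetical sketch: candidate summaries produced by different decoding strategies (e.g., sampling and beam search) are reranked by how well their sentence-level structure labels match the requested meta-structure. The generate and label_sentences callables, and the example labels, are placeholders rather than the actual models or algorithm from the thesis.

    # Hypothetical sketch of inference-time structure control via generate-then-rerank.
    from typing import Callable, List, Sequence

    def structure_match(candidate_labels: Sequence[str], target: Sequence[str]) -> float:
        """Fraction of positions where the candidate's sentence labels follow the target plan."""
        if not target:
            return 1.0
        hits = sum(1 for got, want in zip(candidate_labels, target) if got == want)
        return hits / max(len(target), len(candidate_labels))

    def controlled_generate(
        generate: Callable[[str], List[str]],          # returns candidates from sampling and beam search
        label_sentences: Callable[[str], List[str]],   # maps a summary to per-sentence structure labels
        source: str,
        target_structure: Sequence[str],               # e.g. ["strength", "weakness", "suggestion"]
    ) -> str:
        candidates = generate(source)
        # Keep the candidate whose meta-structure follows the control signal most closely.
        return max(candidates, key=lambda c: structure_match(label_sentences(c), target_structure))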

Finally, we analyze evaluation alternatives for summarization. Existing evaluation approaches often require "gold" reference summaries to compare against, or rely on black-box neural models. However, gold references are often unavailable in real-life scenarios, and neural models may be insufficient for the increasingly complex evaluation criteria required by different user demands. With the recent advances in large language models (LLMs), it is tempting to use LLMs for direct evaluation. We adopt settings inspired by actual human evaluation procedures and analyze the performance and reliability of LLMs in evaluating machine-generated summaries. While LLMs are generally better aligned with expert annotators than other automatic metrics, we discover that they have significant limitations. As summary quality increases, LLMs become less aligned with humans; moreover, the alignment is not equal across summarization models, indicating that using LLMs to directly evaluate and compare different models may be unfair. LLMs also have varying capabilities across evaluation dimensions: even if an LLM is suitable for one dimension, it may fail on others.
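To make the setting concrete, the following is a hypothetical sketch of LLM-based evaluation: an LLM is prompted to rate a summary on a single dimension (coherence, in this toy example), and its ratings are then correlated with expert annotations. The prompt wording and the ask_llm callable are assumptions for illustration, not the exact protocol used in the thesis.

    # Hypothetical sketch of using an LLM as a summary evaluator and checking its alignment.
    from typing import Callable, List
    from scipy.stats import spearmanr

    PROMPT = (
        "Source article:\n{source}\n\n"
        "Candidate summary:\n{summary}\n\n"
        "Rate the coherence of the summary on a scale of 1 (incoherent) to 5 (highly coherent). "
        "Answer with a single integer."
    )

    def llm_score(ask_llm: Callable[[str], str], source: str, summary: str) -> int:
        # ask_llm is a placeholder for any chat-completion call, not a real API.
        reply = ask_llm(PROMPT.format(source=source, summary=summary))
        digits = [c for c in reply if c.isdigit()]
        return int(digits[0]) if digits else 3  # fall back to the midpoint on unparsable replies

    def alignment_with_experts(llm_scores: List[int], expert_scores: List[float]) -> float:
        # Spearman rank correlation between LLM and expert ratings; higher means better alignment.
        return spearmanr(llm_scores, expert_scores).correlation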

Overall, this thesis spans three important aspects of automatic summarization, namely input, output, and evaluation. The tasks studied are crucial for applying automatic summarization to real-life usage scenarios, and the success of these studies would help artificial intelligence better accommodate our everyday needs.