STRUCTURED INFORMATION EXTRACTION FOR SCIENTIFIC DOCUMENTS
15 Jan 2020 Wednesday, 02:00 PM to 03:30 PM
COM2 Level 4
Executive Classroom, COM2-04-02
We study and propose structured machine learning approaches for information extraction (IE) from contexts of varying size, as found in long-form scientific documents. We experiment with IE tasks on scientific texts, which provide challenging and robust use cases because they contain structures of varying complexity.
This thesis uses the context size as a loose organizational principle in framing the various IE tasks that we investigate.
We first consider the small-context scenario of reference string parsing, where the context lies within the reference string itself, and then gradually increase the extent of the exploitable context. For small contexts, modeled as sequence labeling, we show that a Bidirectional Long Short-Term Memory network with a Conditional Random Field layer (BiLSTM-CRF) outperforms baselines built on rich, handcrafted features. We further explore various modeling aspects of this architecture, resulting in a state-of-the-art reference string parsing system.
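To make the sequence-labeling framing concrete, here is a minimal sketch of the decoding step of a linear-chain CRF over a tokenized reference string. It is an illustration under stated assumptions, not the thesis's system: the label set, the tokens, and every score are hand-set and hypothetical, with the emission scores standing in for what a BiLSTM would produce.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain model."""
    # best[t][y] is the best score of any path ending in label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y] is the best previous label for y at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + emissions[t][y])
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy input: tokens of a reference string and made-up scores.
TOKENS = ["Smith", "J.", "2019", "Deep", "parsing"]
LABELS = ["author", "year", "title"]
EMISSIONS = [
    {"author": 2.0, "year": 0.0, "title": 1.0},
    {"author": 1.0, "year": 0.0, "title": 0.9},
    {"author": 0.0, "year": 3.0, "title": 0.5},
    {"author": 1.2, "year": 0.0, "title": 1.0},  # token-level score prefers "author"
    {"author": 0.0, "year": 0.0, "title": 2.0},
]
# Assumed transition scores: returning to "author" after "year"/"title" is penalized.
TRANSITIONS = {
    ("author", "author"): 0.5, ("title", "title"): 0.5,
    ("year", "author"): -2.0, ("title", "author"): -2.0,
}

decoded = viterbi(EMISSIONS, TRANSITIONS, LABELS)
print(list(zip(TOKENS, decoded)))
```

In this toy input the token "Deep" scores highest as an author in isolation, yet the transition scores lead the decoder to label it as part of the title. That is the kind of sequence-level consistency a CRF layer adds on top of per-token predictions.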
As we increase the context size to encompass a few lines, the sequential model still works reliably but requires more resources and, occasionally, global features. We focus on two medium-sized context tasks of identifying typed keyphrases from such scientific text excerpts and show how fully-featured and fully neural models perform on these tasks.
For larger contexts, spanning more than a few lines or paragraphs, the token-wise sequential model scales poorly as an end-to-end solution. For such cases, we propose graph-based deep neural networks. We incorporate structured components from traditional graph ranking models, such as TextRank. Fusing this structural information with deep neural networks yields models that build on sophisticated graph-based learners while accounting for traditionally known structures. Our proposed model yields state-of-the-art results for end-to-end, full-text keyphrase extraction.
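For readers unfamiliar with the graph ranking component mentioned above, the following is a minimal sketch of classic TextRank-style word scoring on a co-occurrence graph. It illustrates only the traditional ranking idea, not the proposed neural model; the token list, window size, and function name are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def textrank(tokens: List[str], window: int = 2, damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Score words by PageRank-style iteration on an undirected co-occurrence graph."""
    # Link each word to the distinct words among the next `window` tokens.
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window + 1]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        # Synchronous update: every new score is computed from the previous scores.
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[w])
            for w in neighbours
        }
    return scores


# Hypothetical, already-filtered candidate words from a short excerpt.
TOKENS = ["graph", "neural", "network", "graph", "ranking",
          "model", "graph", "keyphrase", "extraction"]
scores = textrank(TOKENS)
top_word = max(scores, key=scores.get)
print(top_word, round(scores[top_word], 3))
```

Words that co-occur with many other words accumulate higher scores, so "graph" ranks first in this toy input. A full keyphrase extractor would additionally filter candidates by part of speech and merge adjacent high-scoring words into phrases.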
Lastly, we expand the context beyond the document itself. In our final work, on the extensive multi-document context, we address the problem of noisy instances in large, real-world citation networks. We enhance the inherent capability of the graph convolutional network formalism by using multi-entity graphs. Our proposed technique explores the relationship between authors and documents using text signals drawn entirely from the documents. It formalizes a means of dealing with the characteristic noise in real-world, large-scale, multi-entity graphs. The resulting robust model shows improved performance for keyphrase extraction in partial-data scenarios.
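As a rough illustration of how a graph convolutional layer can pass document text signals to author nodes, here is a minimal sketch of one propagation step with the commonly used symmetric normalization. It is not the thesis's formulation: the three-node graph, the feature values, and the identity weight matrix are all hypothetical, and the sketch does not model noise handling.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adjacency: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so that each node also keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    hidden = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical multi-entity graph: nodes 0 and 1 are documents, node 2 is
# an author linked to both documents.
ADJACENCY = [[0.0, 0.0, 1.0],
             [0.0, 0.0, 1.0],
             [1.0, 1.0, 0.0]]
# Only documents carry text features; the author node starts at zero.
FEATURES = [[1.0, 0.0],
            [0.0, 1.0],
            [0.0, 0.0]]
WEIGHTS = [[1.0, 0.0],
           [0.0, 1.0]]  # identity, used in place of learned weights

output = gcn_layer(ADJACENCY, FEATURES, WEIGHTS)
print(output[2])  # author representation after one layer
```

After one layer, the author node, which had no features of its own, carries a mixture of the features of the documents it is linked to.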