PH.D DEFENCE - PUBLIC SEMINAR

Developing Data-Driven Information Systems for the Health Domain: Leveraging the Power of NLP

Speaker
Mr. Prakash Chandra Sukhwal
Advisor
Dr Atreyi Kankanhalli, Provost's Chair Professor, School of Computing


29 Jul 2024 Monday, 11:00 AM to 12:00 PM

TR9, COM2 01-08

Abstract:

In today's digital era, the Internet has become a popular source for fulfilling users' information needs, including healthcare information needs. However, the abundance of online data presents challenges for users, such as information overload. To address these challenges, there is a pressing need for data-driven information systems (IS) that can effectively analyze vast amounts of online data and provide valuable insights to stakeholders such as policymakers and the public. Through several instances, this thesis shows how to design and develop such data-driven IS artifacts using advanced natural language processing (NLP) techniques to help address users' information needs. By leveraging the power of NLP, we can extract meaningful information from online content and facilitate decision-making processes. Furthermore, we comprehensively evaluate the proposed frameworks and IS artifacts to assess their effectiveness.

We first focus on policymakers' need for feedback from the public to formulate and refine their policies. Public feedback, e.g., public sentiment, is a key means for governments to validate their policies. For instance, during health crises such as the COVID-19 pandemic, policymakers must develop a range of policies to alleviate the crisis. This requires them to obtain public feedback, e.g., sentiments from social media, for effective policy interventions. However, making sense of vast amounts of public opinion on social media amidst a fast-changing environment can be challenging. In particular, we study public sentiment and feedback on pandemic containment policies in Singapore using high-frequency data of ~240,000 posts on highly followed public Facebook groups during Jan–Nov 2020. To do so, we leverage NLP techniques to extract public sentiment and concerns as they originate and are shared on social media platforms. By combining NLP results with robust statistical methods, we demonstrate that policymakers can compute and monitor this feedback, i.e., the sentiments and concerns of the public, allowing for evidence-based refinement in policy design and implementation.

Next, we focus on lay users' need for disease-related information in healthcare. Lay users seeking answers to their disease-related queries often face challenges in navigating and comprehending the vast amounts of medical (including user-generated) data available from online sources. Extant research QA systems also have limitations in terms of automation and performance. To address these limitations in disease question answering (QA), we propose an approach based on knowledge graph (KG) and knowledge base (KB) creation, domain adaptation, and answer generation using language models (LMs) to build an automated disease QA assistant. Specifically, we design a novel, automated disease QA system that effectively utilizes both LM and KG techniques through a joint-reasoning approach to answer disease-related questions in a manner appropriate for lay users. Our evaluation of the system using a range of quality metrics demonstrates its efficacy over benchmark systems, including the popular ChatGPT.

Finally, we build on the proposed disease QA system to design a Conversational Question-Answering (CQA) system tailored for lay users seeking information on chronic diseases. Existing QA systems face several limitations, including limited dialogue capabilities, ambiguous or incorrect responses, lack of personalization, answers that are hard to read and comprehend, failure to use chat history to maintain ongoing conversation context and relevance, digression into out-of-domain topics or hallucinations, inability to motivate or engage users, and limited or outdated knowledge. To address these limitations, our CQA system extends the earlier disease QA system with personalized features and advanced dialogue capabilities to provide a user-centric experience. We systematically fine-tuned a Large Language Model (LLM) with a carefully crafted dialogue dataset on chronic disease conversations. By combining the joint-reasoning approach of the earlier system with Retrieval-Augmented Generation (RAG), our CQA system provides correct, complete, and concise answers free from hallucinations. The system also prompts users to ask more questions for better engagement, applies explicit personalization based on user profiles, uses guardrails to prevent answers outside the disease domain, and uses chat history to maintain ongoing conversation context. These features collectively enhance user utility and provide accurate and relevant information tailored to the user's needs and preferences. We also developed evaluation metrics for the CQA system. Two user studies (with lay users and medical professionals) using these metrics demonstrated the efficacy of the CQA system and the feasibility and effectiveness of our approach in addressing user queries in the healthcare domain.

In conclusion, this thesis aims to contribute to the development of data-driven IS artifacts utilizing advanced NLP techniques to address users' information needs. It focuses on three domains: public policy related to health, disease QA, and disease CQA, aiming to provide accurate and personalized information and improve decision-making processes. This research strives to advance intelligent system designs that effectively leverage NLP techniques for addressing users' information needs in the health domain.