Towards Complex and Cross-Domain Text-to-SQL Parsing through Schema Reference Resolution
Abstract:
Parsing natural language to SQL queries (text-to-SQL) is a long-standing problem in natural language processing. Two critical challenges are blocking existing models from practical use. One is generating complex SQL queries that include multiple clauses. The other is generalizing to unseen database schemas. This dissertation proposes to address these two challenges by accurately resolving column references, table references and value references in human utterances. We call these three types of references schema references. In this dissertation, we identify resolving schema references as the key task of complex and cross-domain text-to-SQL. We find that when schema references are resolved perfectly, the remaining tasks are much easier to address. We show that correct schema reference resolution can help address the query's domain-specific nature, leaving only domain-general SQL artefacts to be generated. When appropriately trained to decode the query along with reference representations, a simple BERT baseline then suffices to achieve state-of-the-art performance. When provided with oracle resolution results, our model achieves substantially higher performance on the development set of Spider, the largest existing dataset for the complex and cross-domain text-to-SQL task. Our further error analyses show that the model approaches the potential upper bound on performance, as most of the remaining errors we examined are not due to model capacity.
Biodata:
Weixin Wang is a master's candidate under the supervision of Prof. Min-Yen Kan and the mentorship of Dr. Wenqiang Lei. He is a member of the Web Information Retrieval / Natural Language Processing Group led by Prof. Min-Yen Kan. During his master's candidature, he has focused on the problem of converting natural language into SQL. Specifically, he has worked to solve the two most critical challenges of this research topic, cross-domain generalizability and complex query generation, in a unified way. He developed a model named RASQL with a simple architecture based on BERT, achieving rank 5 on Spider, a challenging large-scale complex and cross-domain text-to-SQL benchmark. He received his B.Eng. from East China Normal University and worked as a software engineer at Baidu, Inc. for two years.