Improvement and Evaluation of Genome Assembly

Mr Xie Luyu
Dr Wong Lim Soon, KITHCT Chair Professor, School of Computing

  25 Mar 2019 Monday, 03:00 PM to 04:30 PM

 MR3, COM2-02-26


Sequencing is a powerful tool for investigating genomes. It is instrumental in revealing the underlying mechanisms of many biological processes. In the late 1990s, a group of high-throughput sequencing techniques known as Next-Generation Sequencing (NGS) were developed and commercialized. Over the past decade, the cost of NGS has decreased dramatically. This makes sequencing affordable for many research projects and even some clinical applications.

The reads produced by sequencers are short fragments of a sample's genome. When a reference genome is absent, de novo assembly is usually performed to provide a global view of the sample's genome. Along with sequencing technologies, assembly methods have also evolved over the last ten years. Different representations have been proposed to better describe and model the information in sequencing data. However, these assembly methods still often produce assemblies that are fragmented, incomplete, and even contain misassembled segments. This is mainly due to repeat regions in the genome. It is very difficult to uniquely determine the flanking sequences of a repeat region when the repeat is too long to be spanned by a single read. Conservative assemblers leave these flanking sequences as separate contigs, resulting in fragmented assemblies. In contrast, aggressive assemblers try to resolve repeat regions based on subtle clues, at the cost of introducing misassemblies. Both misassemblies and poor connectivity hinder downstream analysis. Therefore, there is a need for a better approach to further improve the quality of draft assemblies.

To address this, I propose a new method, CAST (Correction And Scaffolding Tool), which improves a draft assembly using sequencing data from a progeny population. CAST inspects genetic coherence along the contigs of an initial draft assembly. Contigs are split between adjacent sites that are incoherent with each other, and then merged at sites that are coherent. In this way, the draft assembly is improved without the time-consuming construction of a genetic map. Hi-C verification and synteny analysis showed that CAST rearranged the draft genome correctly and took the final step from a scaffold-level assembly to a chromosome-level assembly.
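
The splitting step can be illustrated with a small sketch. It is not CAST itself: the abstract does not define genetic coherence, so the sketch assumes it is the fraction of progeny whose genotype calls agree at two sites, and the function names and the 0.8 threshold are hypothetical choices made only for illustration.

```python
def coherence(a, b):
    """Fraction of progeny whose genotype calls agree at two sites.

    ASSUMPTION: this is a stand-in definition of "genetic coherence";
    the abstract does not specify the measure CAST uses.
    """
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def split_contig(sites, threshold=0.8):
    """Split one contig wherever two adjacent sites are incoherent.

    sites: list of (position, genotype_vector), ordered along the contig.
    threshold: hypothetical cut-off below which adjacent sites count as incoherent.
    Returns a list of segments, each a list of site positions.
    """
    if not sites:
        return []
    segments = [[sites[0][0]]]
    for (_, prev_geno), (pos, cur_geno) in zip(sites, sites[1:]):
        if coherence(prev_geno, cur_geno) < threshold:
            segments.append([])  # break the contig between these two sites
        segments[-1].append(pos)
    return segments


# Toy contig: sites 100/200 share one segregation pattern across six progeny,
# sites 300/400 share another, which suggests a misjoin between 200 and 300.
toy_sites = [
    (100, [0, 0, 1, 1, 0, 1]),
    (200, [0, 0, 1, 1, 0, 1]),
    (300, [1, 0, 0, 1, 1, 0]),
    (400, [1, 0, 0, 1, 1, 0]),
]
toy_segments = split_contig(toy_sites)
```

Under the same assumption, the merging step would apply the coherence test to the end sites of different segments and join those that pass it.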

To better assess genome assemblies, I also propose a new metric, PDR. It measures the quality of an assembly as the average ratio of the distance between any pair of positions in the reference genome to their distance in the draft assembly. It not only integrates contiguity, completeness, and correctness, but also makes good biological sense.
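
A toy calculation may make the idea of a pairwise distance ratio concrete. It is a sketch under stated assumptions, not the definition of PDR: the reference is taken to be a single chromosome, a pair of positions that are missing from the assembly or placed on different contigs is scored 0, and the function name and input layout are hypothetical.

```python
from itertools import combinations


def pairwise_distance_ratio(placements):
    """Average reference-to-assembly distance ratio over all position pairs.

    placements: list of (ref_pos, contig_id, contig_pos); contig_id is None
    when the reference position is absent from the assembly.
    ASSUMPTIONS (not from the abstract): one reference chromosome; pairs that
    are missing, on different contigs, or collapsed to the same assembly
    position contribute 0.
    """
    total = 0.0
    n_pairs = 0
    for (r1, c1, p1), (r2, c2, p2) in combinations(placements, 2):
        n_pairs += 1
        if c1 is None or c2 is None or c1 != c2:
            continue  # no finite distance in the assembly
        d_ref = abs(r1 - r2)
        d_asm = abs(p1 - p2)
        if d_asm == 0:
            continue  # collapsed pair, scored 0 in this sketch
        total += d_ref / d_asm
    return total / n_pairs if n_pairs else 0.0


# One contig covering everything at the right spacing.
perfect = [(0, "a", 0), (10, "a", 10), (20, "a", 20), (30, "a", 30)]
# The same positions broken into two contigs.
fragmented = [(0, "a", 0), (10, "a", 10), (20, "b", 0), (30, "b", 10)]

score_perfect = pairwise_distance_ratio(perfect)
score_fragmented = pairwise_distance_ratio(fragmented)
```

In this toy setting, the fragmented assembly scores lower than the perfect one because only two of its six position pairs lie on the same contig. This is one way a single number could reflect contiguity and completeness together.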