Analyzing Lateral Gene Transfer with Machine Learning and Phylogenetic Methods

Ms Lu Bingxin
Dr Leong Hon Wai, Associate Professor, School of Computing

  11 Dec 2017 Monday, 10:00 AM to 11:30 AM

 MR1, COM1-03-19


Lateral gene transfer (LGT), the transfer of genetic material between two reproductively
isolated organisms, is an important process in evolution. LGT is also related to the spread
of antibiotic resistance and pathogenicity. To further understand the impact of LGT, it is
necessary to characterize its prevalence quantitatively. In this thesis, we mainly
study three related problems: how to detect large genomic regions that originated from LGT;
how to model LGT with phylogenetic networks; and how to detect LGT events. The aim
of our research is to develop computational methods that help solve these problems.

A large contiguous genomic region acquired by LGT is called a genomic island
(GI). The accurate inference of GIs is important for both evolutionary study and medical
research. However, the available GI detection methods still do not perform as well as
desired, and they may not be easily applied to newly sequenced microbial genomes. We
therefore developed two machine learning methods for better GI detection: GI-SVM, which
uses a one-class SVM based on k-mer frequencies, and GI-Cluster, which uses consensus
clustering based on multiple kinds of GI-related evidence. These two methods provide
researchers with better alternative tools to detect GIs. GI-SVM serves as a more sensitive
method for the first-pass detection of GIs. GI-Cluster provides a widely applicable
framework for GI analysis, which can generate more accurate results.
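
As an illustration of the kind of input a k-mer based detector works on, the following
Python sketch computes normalized k-mer frequency vectors over sliding windows of a genome.
It is not the GI-SVM implementation: the function names, the window and step sizes, and the
choice of k are assumptions made for this example, and the one-class SVM step itself is not
shown.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies of seq, in a fixed ACGT order.

    k-mers containing characters other than A, C, G, T are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def window_features(genome, window=8, step=8, k=2):
    """Return (start, feature_vector) pairs, one per full window.

    The window and step values are illustrative only; real genomes
    would use windows of thousands of bases.
    """
    return [(s, kmer_frequencies(genome[s:s + window], k))
            for s in range(0, len(genome) - window + 1, step)]
```

Each feature vector could then be passed to a one-class classifier, so that windows whose
composition deviates from the bulk of the genome are flagged as candidate GIs.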

LGT is one kind of reticulate evolutionary event that is suitable for modeling with
phylogenetic networks. However, reconstructing rooted phylogenetic
networks, including LGT networks, remains challenging. Since the relationships among phylogenetic networks,
phylogenetic trees and clusters serve as a basis for reconstructing phylogenetic
networks, we focus on two fundamental problems arising in network reconstruction:
the tree containment problem (TCP) and the cluster containment problem (CCP). Both
the TCP and CCP are NP-complete. We implemented fast exponential-time programs
for solving the two problems on arbitrary phylogenetic networks. The resulting CCP
program is further extended into a program for fast computation of the Soft Robinson-Foulds
distance between phylogenetic networks. The evaluation results show that these
programs are fast enough for use in practice, so they are likely to be valuable for the
application of phylogenetic networks in LGT modeling and evolutionary genomics.
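
To make the CCP concrete, the following Python sketch decides cluster containment by brute
force: it enumerates every way of keeping one incoming edge per reticulation node and
collects the leaf sets found below each node. It is a naive illustration, exponential in
the number of reticulations, and not the fast program described above. The edge-list
representation, the function names, and the definition of the soft Robinson-Foulds distance
as half the size of the symmetric difference of the two soft cluster sets are assumptions
made for this example.

```python
from itertools import product


def soft_clusters(edges, leaves):
    """Return all leaf sets below some node in some displayed tree.

    edges  -- list of (parent, child) pairs of a rooted network (a DAG)
    leaves -- set of leaf labels
    """
    parents, nodes = {}, set()
    for u, v in edges:
        parents.setdefault(v, []).append(u)
        nodes.update((u, v))
    # Reticulation nodes are those with more than one parent.
    retics = [v for v, ps in parents.items() if len(ps) > 1]
    clusters = set()
    # Try every choice of a single kept parent per reticulation node.
    for choice in product(*(parents[r] for r in retics)):
        kept_parent = dict(zip(retics, choice))
        children = {}
        for u, v in edges:
            if kept_parent.get(v, u) == u:
                children.setdefault(u, []).append(v)

        def below(n):
            """Leaves reachable from n using only the kept edges."""
            if n in leaves:
                return frozenset([n])
            return frozenset().union(*(below(c) for c in children.get(n, ())))

        clusters.update(c for c in map(below, nodes) if c)
    return clusters


def cluster_contained(edges, leaves, cluster):
    """Return True if cluster is a soft cluster of the network."""
    return frozenset(cluster) in soft_clusters(edges, leaves)


def soft_rf(net1, net2, leaves):
    """Half the symmetric difference of the soft cluster sets (assumed definition)."""
    return len(soft_clusters(net1, leaves) ^ soft_clusters(net2, leaves)) / 2
```

For a network with a single reticulation node, only two edge choices are enumerated; the
cost doubles (or more) with each additional reticulation, which is why faster algorithms
matter in practice.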

To detect LGTs, numerous computational methods of different categories have been
developed. However, estimates obtained from different methods often disagree,
and most methods are believed to be complementary. Since very
few studies have systematically investigated the complementary performance of diverse
methods in practice, we conducted a case study on cyanobacterial genomes, which have
been well studied in terms of LGT. Our results indicate very low overlap among predictions
from different methods, especially from methods of different categories, which is
consistent with previous findings. Therefore, to obtain more reliable LGT detections, it
is necessary to apply multiple methods of different kinds prudently and to examine their
predictions carefully whenever possible.
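
One simple way to quantify the overlap among predictions is a pairwise Jaccard index over
the sets of genes each method reports. The Python sketch below shows this measure; the
function names and the dictionary layout are assumptions made for this example, and the
thesis may quantify overlap differently.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty prediction sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(predictions):
    """Map each pair of method names to the Jaccard index of their predictions.

    predictions -- dict from method name to an iterable of predicted gene IDs
    """
    return {(m1, m2): jaccard(predictions[m1], predictions[m2])
            for m1, m2 in combinations(sorted(predictions), 2)}
```

Values close to 0 for most pairs would correspond to the low overlap described above.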