Towards more accurate protein function prediction in the twilight zone
Abstract:
With advances in next-generation sequencing technology, protein sequences are being generated at an ever-increasing rate, and the number of sequences in public databases is growing exponentially. Understanding how biological systems operate requires assigning functions to these sequences, which remains one of the most challenging tasks in biology. Because biological experiments require considerable time and resources to validate the functions of this growing body of sequences, reliable automated protein function prediction methods are needed.
Significant efforts have been made in recent years to annotate proteins using computational approaches. These approaches learn patterns from experimentally annotated proteins and use the learned patterns to make predictions for unannotated proteins. Many existing approaches exploit sequence similarity to transfer functional annotation, since proteins that share a function also share sequence similarity in many cases. However, a significant number of proteins, known as twilight zone proteins, have very low sequence similarity to reference proteins of known function. Current protein function prediction methods are not very accurate for proteins in the twilight zone. In this thesis, we propose using dissimilarity information along with similarity features to represent proteins and to provide accurate function prediction for twilight zone proteins.
Firstly, we propose EnsembleFam, a novel method aimed at better protein family modeling for twilight zone proteins. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. We train three separate Support Vector Machine (SVM) classifiers for each protein family and combine their outputs into an ensemble prediction to identify member proteins. The combination of similarity and dissimilarity features helps EnsembleFam capture essential information about different families, especially for twilight zone proteins. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) of proteins dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides much more accurate predictions for twilight zone proteins.
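The ensemble step for a single family can be illustrated with a minimal sketch. It assumes a simple majority vote over three binary classifiers; the abstract does not state how the three SVM outputs are actually combined, so the voting rule, the function name ensemble_predict, and the toy stand-in classifiers (plain threshold functions rather than trained SVMs) are all hypothetical.

```python
from typing import Callable, Sequence

# A feature vector for one protein, e.g. similarity and dissimilarity scores.
Features = Sequence[float]
# Stand-in for a trained per-family classifier: returns True for "member".
Classifier = Callable[[Features], bool]


def ensemble_predict(classifiers: Sequence[Classifier],
                     features: Features,
                     min_votes: int = 2) -> bool:
    """Return True if at least `min_votes` classifiers accept the protein.

    Assumption: majority voting. The actual combination rule used by
    EnsembleFam is not given in the abstract.
    """
    votes = sum(1 for clf in classifiers if clf(features))
    return votes >= min_votes


# Hypothetical stand-ins for the three classifiers of one family.
# features[0] is a toy similarity score, features[1] a toy dissimilarity score.
toy_classifiers = [
    lambda f: f[0] > 0.5,          # accepts on high similarity
    lambda f: f[1] < 0.3,          # accepts on low dissimilarity
    lambda f: f[0] - f[1] > 0.0,   # accepts when similarity exceeds dissimilarity
]

is_member = ensemble_predict(toy_classifiers, [0.4, 0.2])      # two of three accept
is_non_member = ensemble_predict(toy_classifiers, [0.1, 0.9])  # none accept
```

In this toy setting a protein with low similarity can still be accepted when the other two classifiers agree, which is the kind of behaviour an ensemble is meant to allow for twilight zone proteins.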
Secondly, we propose e-EnsembleFam, a method for predicting Enzyme Commission (EC) numbers. The EC system classifies enzymes hierarchically into four levels. Because of this hierarchical classification, enzymes are more heterogeneous in nature, which makes them more difficult to model using similarity features. We therefore build our e-EnsembleFam models relying more heavily on dissimilarity features. Dissimilarity features help us capture the patterns of differences and heterogeneity of enzymes more effectively than existing approaches. In this thesis, we build models to predict EC Level 3 and EC Level 4 from a given enzyme sequence. We compare the performance of our method with existing approaches on the Swiss-Prot dataset. e-EnsembleFam provides better sensitivity and precision for twilight zone enzymes as well as for high-similarity ones.
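To make the EC Level 3 and Level 4 targets concrete, the following minimal sketch truncates a complete four-field EC number to a chosen level. The helper name ec_prefix is hypothetical, and the sketch assumes fully specified numeric EC numbers; partial entries (for example, ones containing "-") are not handled.

```python
def ec_prefix(ec_number: str, level: int) -> str:
    """Return the first `level` fields of a four-field EC number.

    Assumes a fully specified EC number such as "1.1.1.1".
    """
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec_number!r}")
    if not 1 <= level <= 4:
        raise ValueError(f"level must be between 1 and 4, got {level}")
    return ".".join(parts[:level])


# EC 1.1.1.1 is alcohol dehydrogenase.
level3_label = ec_prefix("1.1.1.1", 3)  # "1.1.1"
level4_label = ec_prefix("1.1.1.1", 4)  # "1.1.1.1"
```

A Level 3 label groups together all enzymes that share the first three fields, so it is a coarser prediction target than the full Level 4 number.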
Thirdly, we propose m-EnsembleFam to annotate multi-domain proteins. Annotating multi-domain proteins is a more difficult problem, and many existing approaches can handle only single-domain proteins. Multi-domain protein sequences are inherently longer and contain more than one domain, which makes annotation difficult for computational approaches. To detect domain boundaries and predict the function of each domain, we build our ensemble models with dissimilarity features from five different sources and similarity features from two different sources. We compare our approach with other methods that can handle multi-domain proteins, using the multi-domain enzymes from the Swiss-Prot dataset. m-EnsembleFam provides significantly better performance and improved sensitivity in identifying the domains of multi-domain enzymes.
Lastly, we illustrate an application in which protein annotation tools help identify proteins of interest in a novel genome. Given an input genome, we obtain gene predictions and convert them into protein sequences, for which we then predict functions. Using an example fungal genome, we show that our proposed method provides more extensive predictions than other methods, which helps biologists identify enzymes of interest. Overall, this thesis illustrates the power of dissimilarity features in identifying difficult proteins and mapping them to their respective families more effectively and accurately.
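The conversion from a predicted gene to a protein sequence can be sketched as follows. This is only an illustration: it assumes the gene prediction has already been reduced to a spliced, in-frame coding sequence on the forward strand, and it uses the standard genetic code. The function name translate_cds is hypothetical, and the gene prediction tool used in the thesis is not named here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing ambiguous bases are translated as "X".
    A trailing partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


protein = translate_cds("ATGGCCTAA")  # "MA"
```

The resulting protein strings are what a function prediction tool would take as input in such a pipeline.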