Graduate School of Arts and Sciences, Washington University, Saint Louis, Missouri, UNITED STATES OF AMERICA.
In this work, computational methods for predicting sequence function on the basis of molecular evolution were developed and tested.
In sequence similarity searches, orthologous sequences (diverged by speciation) are more reliable predictors of biological function than paralogous sequences (diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation (“phylogenomics”) is widely recognized, but existing approaches are either manual or indirect (not based on phylogenetic trees). Therefore, a procedure for automated phylogenomics using explicit phylogenetic inference was developed.
At the core of a phylogenomic approach is the inference of gene duplications, made by comparing the gene tree containing the sequence under analysis with a trusted species tree. An algorithm for this purpose was developed. Its worst-case behavior is inferior to that of previously published algorithms, but it appears to be superior in most practical cases, partly because of its simplicity.
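The comparison of a gene tree with a species tree can be illustrated with a minimal sketch. It is not the algorithm developed in this work; it assumes the common last-common-ancestor mapping criterion, under which an internal gene-tree node is taken to be a duplication when it maps to the same species-tree node as one of its children. The nested-tuple tree encoding, the function names, and the toy sequences are all hypothetical.

```python
# Sketch only: trees are nested 2-tuples and leaves are strings (an assumed encoding).

def index_species_tree(tree):
    """Record the parent and depth of every node in the species tree."""
    parent, depth = {}, {}
    stack = [(tree, None, 0)]
    while stack:
        node, par, d = stack.pop()
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            stack.extend((child, node, d + 1) for child in node)
    return parent, depth


def lca(a, b, parent, depth):
    """Last common ancestor of two species-tree nodes, found by walking upward."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def label_events(gene_tree, species_of, parent, depth):
    """Label each internal gene-tree node as 'duplication' or 'speciation'."""
    events = []

    def visit(g):
        if not isinstance(g, tuple):
            return species_of[g]  # a leaf maps to the species it was sampled from
        left, right = visit(g[0]), visit(g[1])
        m = lca(left, right, parent, depth)
        # Duplication if the node maps to the same species node as a child.
        events.append((g, "duplication" if m in (left, right) else "speciation"))
        return m

    visit(gene_tree)
    return events


# Hypothetical toy data: two gene copies in human and mouse, one in worm.
species_tree = (("human", "mouse"), "worm")
parent, depth = index_species_tree(species_tree)
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "w1": "worm"}
gene_tree = ((("h1", "m1"), ("h2", "m2")), "w1")

for node, event in label_events(gene_tree, species_of, parent, depth):
    print(event, node)
```

In this toy case the node joining the two human/mouse pairs is labeled a duplication, and the other internal nodes are labeled speciations.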
A major caveat of all phylogenetic analyses is the unreliability of the resulting trees. Therefore, gene duplications were inferred over bootstrap-resampled phylogenetic trees to estimate the reliability of the orthology assignments. Additionally, supplementary measures extending the concepts of orthology and paralogy were introduced and assessed for their effectiveness in functional prediction.
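The idea of estimating the reliability of orthology assignments over bootstrap trees can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work. It assumes that the support for a candidate ortholog is the fraction of bootstrap gene trees in which its last common ancestor with the query is a speciation node, and that duplications are identified with the last-common-ancestor mapping criterion. The tree encoding, function names, and toy data are hypothetical.

```python
from collections import Counter

# Sketch only: trees are nested 2-tuples and leaves are strings (an assumed encoding).

def index_species_tree(tree):
    """Record the parent and depth of every node in the species tree."""
    parent, depth = {}, {}
    stack = [(tree, None, 0)]
    while stack:
        node, par, d = stack.pop()
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            stack.extend((child, node, d + 1) for child in node)
    return parent, depth


def lca(a, b, parent, depth):
    """Last common ancestor of two species-tree nodes, found by walking upward."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def orthologs_of(query, gene_tree, species_of, parent, depth):
    """Sequences whose last common ancestor with `query` is a speciation node."""
    found = set()

    def visit(g):
        # Returns (species-tree mapping, leaves below g, whether query is below g).
        if not isinstance(g, tuple):
            return species_of[g], [g], g == query
        (ml, leaves_l, has_l), (mr, leaves_r, has_r) = visit(g[0]), visit(g[1])
        m = lca(ml, mr, parent, depth)
        if m not in (ml, mr):  # speciation: the opposite subtree holds orthologs
            if has_l:
                found.update(leaves_r)
            if has_r:
                found.update(leaves_l)
        return m, leaves_l + leaves_r, has_l or has_r

    visit(gene_tree)
    return found


def orthology_support(query, bootstrap_trees, species_of, parent, depth):
    """Fraction of bootstrap trees in which each sequence is an ortholog of `query`."""
    counts = Counter()
    for tree in bootstrap_trees:
        counts.update(orthologs_of(query, tree, species_of, parent, depth))
    return {seq: n / len(bootstrap_trees) for seq, n in counts.items()}


# Hypothetical toy data: four bootstrap trees that disagree on the mouse sequences.
species_tree = (("human", "mouse"), "worm")
parent, depth = index_species_tree(species_tree)
species_of = {"h1": "human", "m1": "mouse", "m2": "mouse", "w1": "worm"}
tree_a = ((("h1", "m1"), "m2"), "w1")
tree_b = ((("h1", "m2"), "m1"), "w1")
support = orthology_support("h1", [tree_a, tree_a, tree_a, tree_b],
                            species_of, parent, depth)
print(support)
```

With these toy trees, `m1` is supported as an ortholog of `h1` in three of four replicates, `m2` in one of four, and `w1` in all four.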
The phylogenomic approach developed in this work was tested on the proteomes of the flowering plant Arabidopsis thaliana and the nematode Caenorhabditis elegans. The approach appears to be particularly useful for the automated detection of first representatives of novel protein subfamilies.