<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="discussion" dtd-version="3.0" xml:lang="EN"><front><journal-meta><journal-id journal-id-type="publisher-id">plos</journal-id><journal-id journal-id-type="publisher">pcbi</journal-id><journal-id journal-id-type="flc">plcb</journal-id><journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id><journal-id journal-id-type="pmc">ploscomp</journal-id><!--===== Grouping journal title elements =====--><journal-title-group><journal-title>PLoS Computational Biology</journal-title></journal-title-group><issn pub-type="ppub">1553-734X</issn><issn pub-type="epub">1553-7358</issn><publisher><publisher-name>Public Library of Science</publisher-name><publisher-loc>San Francisco, USA</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.1371/journal.pcbi.0020077</article-id><article-id pub-id-type="publisher-id">06-PLCB-EN-0087R2</article-id><article-id pub-id-type="sici">plcb-02-06-19</article-id><article-categories><subj-group subj-group-type="heading"><subject>Education</subject></subj-group><subj-group subj-group-type="Discipline"><subject>Computational Biology</subject></subj-group><subj-group subj-group-type="System Taxonomy"><subject>None</subject></subj-group></article-categories><title-group><article-title>Functional Classification Using Phylogenomic Inference</article-title><alt-title alt-title-type="running-head">Education</alt-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Brown</surname><given-names>Duncan</given-names></name><xref ref-type="fn" rid="n104"/></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Sjölander</surname><given-names>Kimmen</given-names></name><xref ref-type="corresp" rid="cor1">
            <sup>*</sup>
          </xref><xref ref-type="fn" rid="n104"/></contrib></contrib-group><contrib-group><contrib contrib-type="editor" xlink:type="simple"><name name-style="western"><surname>Lewitter</surname><given-names>Fran</given-names></name><role>Editor</role><xref ref-type="aff" rid="edit1"/></contrib></contrib-group><aff id="edit1">Whitehead Institute, United States of America</aff><author-notes><fn fn-type="con" id="ack1"><p>DB and KS wrote the paper.</p></fn><corresp id="cor1">* To whom correspondence should be addressed. E-mail: <email xlink:type="simple">kimmen@berkeley.edu</email></corresp><fn fn-type="current-aff" id="n104"><p>Duncan Brown and Kimmen Sjölander are at the University of California Berkeley, Berkeley, California, United States of America.</p></fn><fn fn-type="conflict" id="ack3"><p> The authors have declared that no competing interests exist.</p></fn></author-notes><pub-date pub-type="ppub"><month>6</month><year>2006</year></pub-date><pub-date pub-type="epub"><day>30</day><month>6</month><year>2006</year></pub-date><volume>2</volume><issue>6</issue><elocation-id>e77</elocation-id><!--===== Grouping copyright info into permissions =====--><permissions><copyright-year>2006</copyright-year><copyright-holder>Brown and Sjölander</copyright-holder><license><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p></license></permissions><funding-group><funding-statement>The authors received no specific funding for this article.</funding-statement></funding-group><counts><page-count count="5"/></counts><!--===== Restructure custom-meta-wrap to custom-meta-group =====--><custom-meta-group><custom-meta><meta-name>citation</meta-name><meta-value>Brown D, Sjölander K (2006) Functional classification using phylogenomic inference. PLoS Comput Biol 2(6): e77. 
DOI: <ext-link ext-link-type="doi" xlink:href="http://dx.doi.org/10.1371/journal.pcbi.0020077" xlink:type="simple">10.1371/journal.pcbi.0020077</ext-link></meta-value></custom-meta></custom-meta-group></article-meta></front><body><sec id="s1"><title/><p>Phylogenomic inference of protein (or gene) function attempts to address the question, <italic>“What function does this protein perform?”</italic> in an evolutionary context. As originally outlined by Jonathan Eisen [<xref ref-type="bibr" rid="pcbi-0020077-b001">1</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b003">3</xref>], phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and, finally, inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved. <xref ref-type="fig" rid="pcbi-0020077-g001">Figure 1</xref> shows an example of using annotated subfamily groupings to infer function, in a manner similar to [<xref ref-type="bibr" rid="pcbi-0020077-b001">1</xref>]. One of us, while at Celera Genomics, independently developed a similar approach for the functional classification of the human genome [<xref ref-type="bibr" rid="pcbi-0020077-b004">4</xref>], based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences [<xref ref-type="bibr" rid="pcbi-0020077-b005">5</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b006">6</xref>]. 
Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale [<xref ref-type="bibr" rid="pcbi-0020077-b007">7</xref>]—and the challenges we have faced in this effort—motivate this paper.</p><fig id="pcbi-0020077-g001" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020077.g001</object-id><label>Figure 1</label><caption><title>Phylogenomic Analysis of Protein Function Using Subfamily Annotation</title><p>In the example shown above, a phylogenetic tree has been constructed for a set of G protein–coupled receptors. The molecular function of some of the members of the family has been determined experimentally and is used to annotate individual subfamilies, similar to [<xref ref-type="bibr" rid="pcbi-0020077-b001">1</xref>]. Sequences without known function can be assigned a predicted molecular function using the tree topology to identify orthologs. When no experimental evidence is available for a subtree's molecular function (e.g., the <italic>Unknown Subtype</italic> subtree at top), the annotation would be left at a general level (e.g., “GPCR of unknown specificity, related to opioid, galanin, and somatostatin receptors”). By contrast, if the <italic>Unknown Subtype</italic> subtree were nested within a subtree whose members were consistently characterized, such as opioid receptors, a “subtree neighbors” approach could be used to assign the annotation “Putative opioid receptor” to that group [<xref ref-type="bibr" rid="pcbi-0020077-b014">14</xref>]. The use of subfamilies as the basis of phylogenomic inference is only one approach; as noted in the text, the general methodology does not rely on subfamily groupings and would ideally use the entire tree topology.</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020077.g001" xlink:type="simple"/></fig><p>In practice, phylogenomic inference of gene function is not often used. 
Far from it. The majority of novel sequences are assigned a putative function through the use of annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as “hypothetical” or “unknown”) had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error.</p><p>The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years [<xref ref-type="bibr" rid="pcbi-0020077-b008">8</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b010">10</xref>]. The root causes of these errors are as follows:</p><p><italic>Gene duplication</italic>. This enables protein superfamilies to evolve novel functions on the same structural template, so that the top database hit may have a function distinct from the query.</p><p><italic>Domain shuffling</italic>. Domain fusion and fission events add a further layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.</p><p><italic>Propagation of existing errors in database annotations</italic>. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.</p><p><italic>Evolutionary distance</italic>. 
Two proteins can share a common ancestor and domain structure, yet have very different functions simply because they occur in distantly related species.</p><p>Phylogenomic analysis, properly applied, avoids these errors and provides a mechanism for detecting existing database annotation errors [<xref ref-type="bibr" rid="pcbi-0020077-b003">3</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b007">7</xref>]. Why then is phylogenomic inference not used more widely? We believe there are four reasons. First, the actual frequency of annotation error is not known, so the gravity of the situation is not recognized. Second, phylogenomic inference is a much more complicated endeavor than a simple database search and requires significantly more expertise and computing resources. It is therefore not easily applied at the genome scale. Third, millions of dollars and years of effort have been poured into developing computational annotation systems that depend on annotation transfer from top database hits, perhaps overlaid with domain prediction methods such as PFAM or the NCBI CDD [<xref ref-type="bibr" rid="pcbi-0020077-b011">11</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b012">12</xref>]. Fourth, phylogenomic approaches to protein function prediction have arisen only in the last few years, while database search methods have been available for much longer. Revolutions do not normally take place overnight. As a result, phylogenomic inference is applied on a one-off basis, for a few protein superfamilies here and there.</p><p>This may be about to change. A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see <xref ref-type="table" rid="pcbi-0020077-t001">Table 1</xref>). 
Some of these methods have based annotation transfer on the identification of orthologs [<xref ref-type="bibr" rid="pcbi-0020077-b013">13</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b015">15</xref>] or of functional subfamilies [<xref ref-type="bibr" rid="pcbi-0020077-b006">6</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b016">16</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b021">21</xref>]. Other groups have used whole-tree analyses [<xref ref-type="bibr" rid="pcbi-0020077-b022">22</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b024">24</xref>]. Still other groups employ expert knowledge to define functional subtypes and then develop statistical models to allow users to classify novel sequences [<xref ref-type="bibr" rid="pcbi-0020077-b025">25</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b026">26</xref>]; these expert system-based approaches are unfortunately limited by the scarcity of experimental data for most protein families.</p><table-wrap content-type="2col" id="pcbi-0020077-t001" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020077.t001</object-id><label>Table 1</label><caption><p>Resources for Phylogenomic Analysis</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020077.t001" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb1col1" align="left" charoff="0" char=""/><col id="tb1col2" align="left" charoff="0" char=""/><col id="tb1col3" align="left" charoff="0" char=""/></colgroup><thead><tr><td align="left"><hr/>Database</td><td><hr/>URL</td><td><hr/>Description</td></tr></thead><tbody><tr><td>Astral</td><td><ext-link ext-link-type="uri" xlink:href="http://astral.berkeley.edu">http://astral.berkeley.edu</ext-link></td><td>Provides subsets of SCOP domains filtered to reduce redundancy at various levels of percent identity. 
Used to evaluate protein structure prediction methods.</td></tr><tr><td>COG</td><td><ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/COG">http://www.ncbi.nlm.nih.gov/COG</ext-link></td><td>Abbreviation for Clusters of Orthologous Groups. Clusters genes into orthologous groups based on reciprocal BLAST analysis.</td></tr><tr><td>GO</td><td><ext-link ext-link-type="uri" xlink:href="http://www.geneontology.org">http://www.geneontology.org</ext-link></td><td>Abbreviation for Gene Ontology. Presents a hierarchical graph of terms describing gene products in three areas: molecular function, biological process, and cellular component.</td></tr><tr><td>GOA</td><td><ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/GOA">http://www.ebi.ac.uk/GOA</ext-link></td><td>Abbreviation for GO Annotation project. Annotates genes and protein sequences with GO terms.</td></tr><tr><td>NCBI CDD</td><td><ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/cdd/cdd.shtml">http://www.ncbi.nlm.nih.gov/cdd/cdd.shtml</ext-link></td><td>Abbreviation for Conserved Domain Database. Provides profiles that model protein domains; the CDD can be searched automatically during BLAST submission.</td></tr><tr><td>Orthostrapper</td><td><ext-link ext-link-type="uri" xlink:href="http://orthostrapper.cgb.ki.se">http://orthostrapper.cgb.ki.se</ext-link></td><td>Assesses orthology between sequences of interest using a confidence value based on bootstrap tree resampling.</td></tr><tr><td>Panther</td><td><ext-link ext-link-type="uri" xlink:href="http://www.pantherdb.org">http://www.pantherdb.org</ext-link></td><td>Classifies proteins using HMMs into curated functional families and subfamilies.</td></tr><tr><td>PFAM</td><td><ext-link ext-link-type="uri" xlink:href="http://pfam.wustl.edu">http://pfam.wustl.edu</ext-link></td><td>Abbreviation for Protein Family Database. 
Uses HMMs representing conserved functional and structural domains.</td></tr><tr><td>PhyloFacts</td><td><ext-link ext-link-type="uri" xlink:href="http://phylogenomics.berkeley.edu/UniversalProteome">http://phylogenomics.berkeley.edu/UniversalProteome</ext-link></td><td>Provides structural and phylogenomic analysis of over 7,000 domains and full-length protein superfamilies. Includes GO terms and evidence codes, searchable HMMs for subfamilies and families, and a variety of bioinformatics analyses.</td></tr><tr><td>RIO</td><td><ext-link ext-link-type="uri" xlink:href="http://www.rio.wustl.edu">http://www.rio.wustl.edu</ext-link></td><td>Abbreviation for Resampled Inference of Orthologs. Provides estimates of the reliability of orthology assignments using bootstrap trees.</td></tr><tr><td>SCOP</td><td><ext-link ext-link-type="uri" xlink:href="http://scop.berkeley.edu">http://scop.berkeley.edu</ext-link></td><td>Abbreviation for Structural Classification of Proteins. Places structural domains into a hierarchical classification based on structural topology and evolutionary history.</td></tr><tr><td>SFLD</td><td><ext-link ext-link-type="uri" xlink:href="http://sfld.rbvi.ucsf.edu">http://sfld.rbvi.ucsf.edu</ext-link></td><td>Abbreviation for Structure&ndash;Function Linkage Database. Classifies diverse protein superfamilies by conserved chemical reaction mechanism.</td></tr><tr><td>SIFTER</td><td><ext-link ext-link-type="uri" xlink:href="http://sifter.berkeley.edu">http://sifter.berkeley.edu</ext-link></td><td>Abbreviation for Statistical Inference of Function Through Evolutionary Relationships. Propagates functional annotations across a tree topology using a noisy-or model of functional evolution.</td></tr><tr><td>TREEFAM</td><td><ext-link ext-link-type="uri" xlink:href="http://www.treefam.org">http://www.treefam.org</ext-link></td><td>Abbreviation for Tree Families Database. 
Provides phylogenetic trees and orthology predictions for animal gene families.</td></tr><tr><td>UniProt</td><td><ext-link ext-link-type="uri" xlink:href="http://www.pir.uniprot.org">http://www.pir.uniprot.org</ext-link></td><td>Provides a high-quality repository of protein sequence information, including external links to, e.g., references, Protein Data Bank structures, GO terms, and predicted domains.</td></tr></tbody></table> --></table-wrap><p>It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.</p></sec><sec id="s2"><title>Tree Topology Accuracy</title><p>Phylogenomic inference is based on a fundamental assumption: the phylogenetic tree topology used as the basis of functional inference is correct. This assumption must be questioned, particularly when highly divergent sequences (e.g., with pairwise identities less than 25%) are included in a tree.</p><p>Protein superfamilies pose distinct challenges to phylogenetic reconstruction. Following gene duplication, proteins can undergo significant structural and functional changes associated with neofunctionalization, so that loop regions and other parts of protein structures are not strictly homologous across all members of a multigene family (see <xref ref-type="fig" rid="pcbi-0020077-g002">Figure 2</xref>). Even among orthologs, evolutionary rates can vary greatly among lineages [<xref ref-type="bibr" rid="pcbi-0020077-b027">27</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b028">28</xref>]. 
This degree of extreme structural and sequence diversity clearly violates the assumptions of most simple (and therefore computationally tractable) models of evolution.</p><fig id="pcbi-0020077-g002" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020077.g002</object-id><label>Figure 2</label><caption><title>Structural and Functional Differences in Distantly Related Protein Superfamilies</title><p>The three proteins shown above are all members of the Structural Classification of Proteins (SCOP) scorpion toxin–related superfamily. All retain the same basic fold, but have significantly divergent functions. They function as part of the innate immune arsenal in plants and insects, but form part of the offense in scorpions. Evolution has conserved the basic structure, but many residues within the sequences are not structurally superposable. Such positions, often in the loop regions, can be significant in determining function.</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020077.g002" xlink:type="simple"/></fig><p>Assessing the expected accuracy of phylogenetic methods for protein superfamily reconstruction is a challenge in itself. Unlike phylogenetic reconstruction of species trees, where fossil evidence exists to help investigators assess tree accuracy, we have no fossil record for protein superfamilies. Simulation studies have tended to assume models of molecular evolution that are appropriate to single orthologous DNA sequences [<xref ref-type="bibr" rid="pcbi-0020077-b029">29</xref>], but do not normally address many of the complexities of protein multigene family evolution. This has begun to change; models have been introduced that incorporate a wider range of information, such as indel evolution and structural constraints [<xref ref-type="bibr" rid="pcbi-0020077-b030">30</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b033">33</xref>]. 
Still, we believe much work remains before simulation studies can effectively assess the expected accuracy of phylogenetic inference in protein superfamilies.</p><p>An additional complication in phylogenetic reconstruction of protein families is the almost universal dependence on an accurate MSA as input. Studies of alignment accuracy for pairs of proteins at different levels of evolutionary and structural divergence show dramatic increases in alignment errors with sequence divergence [<xref ref-type="bibr" rid="pcbi-0020077-b034">34</xref>]. Several recent methods have bypassed this issue by concurrently estimating a phylogeny and an MSA from unaligned sequences [<xref ref-type="bibr" rid="pcbi-0020077-b035">35</xref>–<xref ref-type="bibr" rid="pcbi-0020077-b037">37</xref>]; we look forward to future developments in this area.</p><p>Another barrier to the use of phylogenomic inference methods is their computational complexity. Owing to the large size of protein superfamilies (with hundreds or thousands of taxa), many applications of phylogenomic inference employ fast distance-based methods instead of character-based approaches or forgo even simple models of evolution in favor of faster hierarchical clustering algorithms (e.g., the Panther system [<xref ref-type="bibr" rid="pcbi-0020077-b038">38</xref>]). Without an objective understanding of the expected accuracy of individual phylogenetic tree estimation methods under different conditions, we cannot know whether functional inferences based on these analyses are accurate.</p><p>In practice, the likely accuracy of a particular tree is typically assessed through bootstrap analysis or by comparing trees constructed using different phylogenetic reconstruction methods. 
Analysis of multiple trees for a given family often shows regions of agreement as well as significant disagreement: closely related subtrees are often found consistently across different methods, while the main differences between trees lie in the coarse branching order among these conserved subtrees. Functional inferences can then be based on subtrees with high bootstrap support or on those subtrees that are found in the strict or majority consensus of several tree methods. However, these methods of analysis are time-consuming and impractical for large datasets or for high-throughput application.</p></sec><sec id="s3"><title>The Reliability and Source of Existing Database Annotations</title><p>Any system of functional inference depends on the accuracy of the annotations of its characterized members. The Gene Ontology Consortium has provided a mechanism whereby sequence annotations carry associated <italic>evidence codes</italic>, documenting the origin of the annotation (e.g., by electronic means, by direct assay, or by a traceable author statement) [<xref ref-type="bibr" rid="pcbi-0020077-b039">39</xref>]. We believe that annotation transfer, even in a phylogenomic context, should only be performed when solid <italic>experimental</italic> support is available. Our analysis of more than 300,000 proteins in the UniProt database shows that only 3% of proteins with functional annotations have experimental support. We suspect that many more proteins than these have been experimentally characterized, but that the results of these experiments are not being propagated efficiently (or at all) to the sequence databases or to the GO Annotation project [<xref ref-type="bibr" rid="pcbi-0020077-b040">40</xref>]. One reason for this is that standard sequence identifiers are not used properly in the biological literature, and we applaud the efforts at various journals to improve on this status quo (see, e.g., <italic>Genome Research</italic> and the PLoS journals). 
We would go further and recommend that sequence databases specifically encourage ontology annotation during sequence submission. We expect that advances in text-mining software will also help correct the imbalance, although the field is not yet able to contribute on a large scale [<xref ref-type="bibr" rid="pcbi-0020077-b041">41</xref>]. Finally, we believe that mechanisms must be put in place to enable annotation errors to be more easily corrected. The UniProt database responds to community requests for annotation error correction; other sequence databases might do well to follow its lead.</p></sec><sec id="s4"><title>Functional Inference Based on Assumed Orthology</title><p>Orthologs (genes or proteins related by speciation) are generally assumed to have greater functional similarity than paralogs, which are related by gene duplication. However, inference accuracy also depends on evolutionary distance and the particular functional attribute under consideration. Some attributes of protein families, such as the three-dimensional structure, persist across large evolutionary distances. Other attributes, such as substrate specificity, can be altered by a handful of amino acid substitutions in critical positions. The persistence of certain traits may be more limited in some families and more expansive in others. The assumption that orthology implies functional similarity must therefore be tempered by an assessment of evolutionary distance [<xref ref-type="bibr" rid="pcbi-0020077-b042">42</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b043">43</xref>].</p><p>Moreover, determining orthology is not always straightforward. RIO and Orthostrapper use phylogenetic trees to assess orthology between homologs [<xref ref-type="bibr" rid="pcbi-0020077-b014">14</xref>,<xref ref-type="bibr" rid="pcbi-0020077-b015">15</xref>]. This is clearly the most accurate method, although its accuracy depends on that of the estimated phylogeny. 
However, these methods require estimation of a new tree for each family of interest, and trees must be recomputed whenever novel sequences are added to the family. This limits their application in large-scale endeavors. The COG database makes the simplifying assumption that proteins are orthologs if they are reciprocal top BLAST hits [<xref ref-type="bibr" rid="pcbi-0020077-b013">13</xref>], but this restricts the orthology relationships that can be recovered, and domain shuffling, high sequence diversity within the family, and incomplete genome sequencing can all contribute to error.</p><p>Finally, the dearth of experimental evidence supporting functional annotations, together with ambiguous tree topology reconstruction, often limits the number of proteins that can be annotated effectively based strictly on orthology. Because restricting functional annotations to orthologs is so limiting, methods have been developed that extend functional inference beyond the strict confines of orthology. The SIFTER algorithm enables annotations to be propagated over a phylogenetic tree, using GO annotations and priors over existing annotations [<xref ref-type="bibr" rid="pcbi-0020077-b022">22</xref>]. We believe this Bayesian approach shows great promise in automating the functional annotation of novel sequences.</p></sec><sec id="s5"><title>The Future of Phylogenomic Inference</title><p>We have focused in this paper on phylogenomic inference of protein function. However, phylogenomic inference can be applied to a wide array of protein family attributes. Selection of templates for comparative model construction can be performed in a phylogenomic context, e.g., by picking the template that has the smallest tree distance to a target of unknown structure. 
Phylogenomic inference of pathway involvement may also be possible under some circumstances, for instance, when a subtree contains orthologs in closely related species.</p><p>Looking to the future of phylogenomic analysis, we believe that the greatest improvement to this field will take place when investigators have access to rigorously validated biological data against which the accuracy of phylogenomic methods can be assessed. The Structure Function Linkage Database [<xref ref-type="bibr" rid="pcbi-0020077-b044">44</xref>], which links protein structures with detailed information on partial chemical reactions, is an important contribution in this regard. Carefully designed benchmark datasets and community experiments, analogous to those developed by the protein structure prediction community (e.g., the Astral datasets [<xref ref-type="bibr" rid="pcbi-0020077-b045">45</xref>] and SCOP [<xref ref-type="bibr" rid="pcbi-0020077-b046">46</xref>], and the international biennial CASP experiment [<xref ref-type="bibr" rid="pcbi-0020077-b047">47</xref>]), have the potential to transform the field. The protein structure prediction field is one of the most mature in all of computational biology, and we believe this is due (at least in part) to the availability of challenging benchmark datasets and international experiments. The phylogenomic community needs analogous datasets appropriate for our own development and maturation. The natural competitiveness of computational biologists is put to good use when we can push our methods to ever-increasing levels of accuracy. 
</p></sec></body><back><ack><p>The authors would like to thank an anonymous reviewer for very helpful comments.</p></ack><glossary><title>Abbreviations</title><def-list><def-item><term>HMM</term><def><p>hidden Markov model</p></def></def-item><def-item><term>MSA</term><def><p>multiple sequence alignment</p></def></def-item><def-item><term>SCOP</term><def><p>Structural Classification of Proteins</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="pcbi-0020077-b001"><label>1</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Eisen</surname><given-names>JA</given-names></name><name name-style="western"><surname>Sweder</surname><given-names>KS</given-names></name><name name-style="western"><surname>Hanawalt</surname><given-names>PC</given-names></name></person-group>
					<year>1995</year>
					<article-title>Evolution of the SNF2 family of proteins: Subfamilies with distinct sequences and functions.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>23</volume>
					<fpage>2715</fpage>
					<lpage>2723</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b002"><label>2</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Eisen</surname><given-names>JA</given-names></name><name name-style="western"><surname>Kaiser</surname><given-names>D</given-names></name><name name-style="western"><surname>Myers</surname><given-names>RM</given-names></name></person-group>
					<year>1997</year>
					<article-title>Gastrogenomic delights: A movable feast.</article-title>
					<source>Nat Med</source>
					<volume>3</volume>
					<fpage>1076</fpage>
					<lpage>1078</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b003"><label>3</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Eisen</surname><given-names>JA</given-names></name></person-group>
					<year>1998</year>
					<article-title>Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis.</article-title>
					<source>Genome Res</source>
					<volume>8</volume>
					<fpage>163</fpage>
					<lpage>167</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b004"><label>4</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Venter</surname><given-names>JC</given-names></name><name name-style="western"><surname>Adams</surname><given-names>MD</given-names></name><name name-style="western"><surname>Myers</surname><given-names>EW</given-names></name><name name-style="western"><surname>Li</surname><given-names>PW</given-names></name><name name-style="western"><surname>Mural</surname><given-names>RJ</given-names></name><etal/></person-group>
					<year>2001</year>
					<article-title>The sequence of the human genome.</article-title>
					<source>Science</source>
					<volume>291</volume>
					<fpage>1304</fpage>
					<lpage>1351</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b005"><label>5</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Sjölander</surname><given-names>K</given-names></name></person-group>
					<year>1998</year>
					<article-title>Phylogenetic inference in protein superfamilies: Analysis of SH2 domains.</article-title>
					<source>Proc Int Conf Intell Syst Mol Biol</source>
					<volume>6</volume>
					<fpage>165</fpage>
					<lpage>174</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b006"><label>6</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Brown</surname><given-names>D</given-names></name><name name-style="western"><surname>Krishnamurthy</surname><given-names>N</given-names></name><name name-style="western"><surname>Dale</surname><given-names>JM</given-names></name><name name-style="western"><surname>Christopher</surname><given-names>W</given-names></name><name name-style="western"><surname>Sjölander</surname><given-names>K</given-names></name></person-group>
					<year>2005</year>
					<article-title>Subfamily HMMs in functional genomics.</article-title>
					<source>Pac Symp Biocomput</source>
					<volume>10</volume>
					<fpage>322</fpage>
					<lpage>333</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b007"><label>7</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Sjölander</surname><given-names>K</given-names></name></person-group>
					<year>2004</year>
					<article-title>Phylogenomic inference of protein molecular function: Advances and challenges.</article-title>
					<source>Bioinformatics</source>
					<volume>20</volume>
					<fpage>170</fpage>
					<lpage>179</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b008"><label>8</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Brenner</surname><given-names>SE</given-names></name></person-group>
					<year>1999</year>
					<article-title>Errors in genome annotation.</article-title>
					<source>Trends Genet</source>
					<volume>15</volume>
					<fpage>132</fpage>
					<lpage>133</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b009"><label>9</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Galperin</surname><given-names>MY</given-names></name><name name-style="western"><surname>Koonin</surname><given-names>EV</given-names></name></person-group>
					<year>1998</year>
					<article-title>Sources of systematic error in functional annotation of genomes: Domain rearrangement, non-orthologous gene displacement and operon disruption.</article-title>
					<source>In Silico Biol</source>
					<volume>1</volume>
					<fpage>55</fpage>
					<lpage>67</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b010"><label>10</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Koski</surname><given-names>LB</given-names></name><name name-style="western"><surname>Golding</surname><given-names>GB</given-names></name></person-group>
					<year>2001</year>
					<article-title>The closest BLAST hit is often not the nearest neighbor.</article-title>
					<source>J Mol Evol</source>
					<volume>52</volume>
					<fpage>540</fpage>
					<lpage>542</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b011"><label>11</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Bateman</surname><given-names>A</given-names></name><name name-style="western"><surname>Coin</surname><given-names>L</given-names></name><name name-style="western"><surname>Durbin</surname><given-names>R</given-names></name><name name-style="western"><surname>Finn</surname><given-names>RD</given-names></name><name name-style="western"><surname>Hollich</surname><given-names>V</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>The Pfam protein families database.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>32</volume>
					<fpage>D138</fpage>
					<lpage>D141</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b012"><label>12</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Marchler-Bauer</surname><given-names>A</given-names></name><name name-style="western"><surname>Panchenko</surname><given-names>AR</given-names></name><name name-style="western"><surname>Shoemaker</surname><given-names>BA</given-names></name><name name-style="western"><surname>Thiessen</surname><given-names>PA</given-names></name><name name-style="western"><surname>Geer</surname><given-names>LY</given-names></name><etal/></person-group>
					<year>2002</year>
					<article-title>CDD: A database of conserved domain alignments with links to domain three-dimensional structure.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>30</volume>
					<fpage>281</fpage>
					<lpage>283</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b013"><label>13</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Tatusov</surname><given-names>RL</given-names></name><name name-style="western"><surname>Galperin</surname><given-names>MY</given-names></name><name name-style="western"><surname>Natale</surname><given-names>DA</given-names></name><name name-style="western"><surname>Koonin</surname><given-names>EV</given-names></name></person-group>
					<year>2000</year>
					<article-title>The COG database: A tool for genome-scale analysis of protein functions and evolution.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>28</volume>
					<fpage>33</fpage>
					<lpage>36</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b014"><label>14</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Zmasek</surname><given-names>CM</given-names></name><name name-style="western"><surname>Eddy</surname><given-names>SR</given-names></name></person-group>
					<year>2002</year>
					<article-title>RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs.</article-title>
					<source>BMC Bioinformatics</source>
					<volume>3</volume>
					<fpage>14</fpage>
				</element-citation></ref><ref id="pcbi-0020077-b015"><label>15</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Storm</surname><given-names>CE</given-names></name><name name-style="western"><surname>Sonnhammer</surname><given-names>EL</given-names></name></person-group>
					<year>2002</year>
					<article-title>Automated ortholog inference from phylogenetic trees and calculation of orthology reliability.</article-title>
					<source>Bioinformatics</source>
					<volume>18</volume>
					<fpage>92</fpage>
					<lpage>99</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b016"><label>16</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Krause</surname><given-names>A</given-names></name><name name-style="western"><surname>Vingron</surname><given-names>M</given-names></name></person-group>
					<year>1998</year>
					<article-title>A set-theoretic approach to database searching and clustering.</article-title>
					<source>Bioinformatics</source>
					<volume>14</volume>
					<fpage>430</fpage>
					<lpage>438</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b017"><label>17</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Yona</surname><given-names>G</given-names></name><name name-style="western"><surname>Linial</surname><given-names>N</given-names></name><name name-style="western"><surname>Linial</surname><given-names>M</given-names></name></person-group>
					<year>1999</year>
					<article-title>ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space.</article-title>
					<source>Proteins</source>
					<volume>37</volume>
					<fpage>360</fpage>
					<lpage>378</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b018"><label>18</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Enright</surname><given-names>AJ</given-names></name><name name-style="western"><surname>Iliopoulos</surname><given-names>I</given-names></name><name name-style="western"><surname>Kyrpides</surname><given-names>NC</given-names></name><name name-style="western"><surname>Ouzounis</surname><given-names>CA</given-names></name></person-group>
					<year>1999</year>
					<article-title>Protein interaction maps for complete genomes based on gene fusion events.</article-title>
					<source>Nature</source>
					<volume>402</volume>
					<fpage>86</fpage>
					<lpage>90</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b019"><label>19</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Wicker</surname><given-names>N</given-names></name><name name-style="western"><surname>Perrin</surname><given-names>GR</given-names></name><name name-style="western"><surname>Thierry</surname><given-names>JC</given-names></name><name name-style="western"><surname>Poch</surname><given-names>O</given-names></name></person-group>
					<year>2001</year>
					<article-title>Secator: A program for inferring protein subfamilies from phylogenetic trees.</article-title>
					<source>Mol Biol Evol</source>
					<volume>18</volume>
					<fpage>1435</fpage>
					<lpage>1441</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b020"><label>20</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Abascal</surname><given-names>F</given-names></name><name name-style="western"><surname>Valencia</surname><given-names>A</given-names></name></person-group>
					<year>2002</year>
					<article-title>Clustering of proximal sequence space for the identification of protein families.</article-title>
					<source>Bioinformatics</source>
					<volume>18</volume>
					<fpage>908</fpage>
					<lpage>921</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b021"><label>21</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Dubey</surname><given-names>A</given-names></name><name name-style="western"><surname>Hwang</surname><given-names>S</given-names></name><name name-style="western"><surname>Rangel</surname><given-names>C</given-names></name><name name-style="western"><surname>Rasmussen</surname><given-names>CE</given-names></name><name name-style="western"><surname>Ghahramani</surname><given-names>Z</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>Clustering protein sequence and structure space with infinite Gaussian mixture models.</article-title>
					<source>Pac Symp Biocomput</source>
					<volume>9</volume>
					<fpage>399</fpage>
					<lpage>410</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b022"><label>22</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Engelhardt</surname><given-names>BE</given-names></name><name name-style="western"><surname>Jordan</surname><given-names>MI</given-names></name><name name-style="western"><surname>Muratore</surname><given-names>KE</given-names></name><name name-style="western"><surname>Brenner</surname><given-names>SE</given-names></name></person-group>
					<year>2005</year>
					<article-title>Protein molecular function prediction by Bayesian phylogenomics.</article-title>
					<source>PLoS Comput Biol</source>
					<volume>1</volume>
					<elocation-id>e45</elocation-id>
					<comment>DOI: <ext-link ext-link-type="doi" xlink:href="http://dx.doi.org/10.1371/journal.pcbi.0010045" xlink:type="simple">10.1371/journal.pcbi.0010045</ext-link></comment>
				</element-citation></ref><ref id="pcbi-0020077-b023"><label>23</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Li</surname><given-names>H</given-names></name><name name-style="western"><surname>Coghlan</surname><given-names>A</given-names></name><name name-style="western"><surname>Ruan</surname><given-names>J</given-names></name><name name-style="western"><surname>Coin</surname><given-names>LJ</given-names></name><name name-style="western"><surname>Heriche</surname><given-names>JK</given-names></name><etal/></person-group>
					<year>2006</year>
					<article-title>TreeFam: A curated database of phylogenetic trees of animal gene families.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>34</volume>
					<fpage>D572</fpage>
					<lpage>D580</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b024"><label>24</label><element-citation publication-type="other" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Sjölander</surname><given-names>K</given-names></name></person-group>
					<year>2006</year>
					<source>Berkeley Phylogenomics Group Universal Proteome Explorer</source>
					<comment>Available: <ext-link ext-link-type="uri" xlink:href="http://phylogenomics.berkeley.edu/UniversalProteome/" xlink:type="simple">http://phylogenomics.berkeley.edu/UniversalProteome/</ext-link>. Accessed 29 May 2006.</comment>
				</element-citation></ref><ref id="pcbi-0020077-b025"><label>25</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Karchin</surname><given-names>R</given-names></name><name name-style="western"><surname>Karplus</surname><given-names>K</given-names></name><name name-style="western"><surname>Haussler</surname><given-names>D</given-names></name></person-group>
					<year>2002</year>
					<article-title>Classifying G-protein coupled receptors with support vector machines.</article-title>
					<source>Bioinformatics</source>
					<volume>18</volume>
					<fpage>147</fpage>
					<lpage>159</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b026"><label>26</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Weston</surname><given-names>J</given-names></name><name name-style="western"><surname>Leslie</surname><given-names>C</given-names></name><name name-style="western"><surname>Ie</surname><given-names>E</given-names></name><name name-style="western"><surname>Zhou</surname><given-names>D</given-names></name><name name-style="western"><surname>Elisseeff</surname><given-names>A</given-names></name><etal/></person-group>
					<year>2005</year>
					<article-title>Semi-supervised protein classification using cluster kernels.</article-title>
					<source>Bioinformatics</source>
					<volume>21</volume>
					<fpage>3241</fpage>
					<lpage>3247</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b027"><label>27</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lopez</surname><given-names>P</given-names></name><name name-style="western"><surname>Casane</surname><given-names>D</given-names></name><name name-style="western"><surname>Philippe</surname><given-names>H</given-names></name></person-group>
					<year>2002</year>
					<article-title>Heterotachy, an important process of protein evolution.</article-title>
					<source>Mol Biol Evol</source>
					<volume>19</volume>
					<fpage>1</fpage>
					<lpage>7</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b028"><label>28</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lockhart</surname><given-names>P</given-names></name><name name-style="western"><surname>Novis</surname><given-names>P</given-names></name><name name-style="western"><surname>Milligan</surname><given-names>BG</given-names></name><name name-style="western"><surname>Riden</surname><given-names>J</given-names></name><name name-style="western"><surname>Rambaut</surname><given-names>A</given-names></name><etal/></person-group>
					<year>2006</year>
					<article-title>Heterotachy and tree building: A case study with plastids and eubacteria.</article-title>
					<source>Mol Biol Evol</source>
					<volume>23</volume>
					<fpage>40</fpage>
					<lpage>45</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b029"><label>29</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Hillis</surname><given-names>DM</given-names></name><name name-style="western"><surname>Huelsenbeck</surname><given-names>JP</given-names></name><name name-style="western"><surname>Cunningham</surname><given-names>CW</given-names></name></person-group>
					<year>1994</year>
					<article-title>Application and accuracy of molecular phylogenies.</article-title>
					<source>Science</source>
					<volume>264</volume>
					<fpage>671</fpage>
					<lpage>677</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b030"><label>30</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lio</surname><given-names>P</given-names></name><name name-style="western"><surname>Goldman</surname><given-names>N</given-names></name></person-group>
					<year>1998</year>
					<article-title>Models of molecular evolution and phylogeny.</article-title>
					<source>Genome Res</source>
					<volume>8</volume>
					<fpage>1233</fpage>
					<lpage>1244</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b031"><label>31</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Koshi</surname><given-names>JM</given-names></name><name name-style="western"><surname>Goldstein</surname><given-names>RA</given-names></name></person-group>
					<year>1998</year>
					<article-title>Models of natural mutations including site heterogeneity.</article-title>
					<source>Proteins</source>
					<volume>32</volume>
					<fpage>289</fpage>
					<lpage>295</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b032"><label>32</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Holmes</surname><given-names>I</given-names></name></person-group>
					<year>2003</year>
					<article-title>Using guide trees to construct multiple-sequence evolutionary HMMs.</article-title>
					<source>Bioinformatics</source>
					<volume>19</volume>
					<fpage>i147</fpage>
					<lpage>i157</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b033"><label>33</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Scheeff</surname><given-names>ED</given-names></name><name name-style="western"><surname>Bourne</surname><given-names>PE</given-names></name></person-group>
					<year>2005</year>
					<article-title>Structural evolution of the protein kinase-like superfamily.</article-title>
					<source>PLoS Comput Biol</source>
					<volume>1</volume>
					<elocation-id>e49</elocation-id>
					<comment>DOI: <ext-link ext-link-type="doi" xlink:href="http://dx.doi.org/10.1371/journal.pcbi.0010049" xlink:type="simple">10.1371/journal.pcbi.0010049</ext-link></comment>
				</element-citation></ref><ref id="pcbi-0020077-b034"><label>34</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Baker</surname><given-names>D</given-names></name><name name-style="western"><surname>Sali</surname><given-names>A</given-names></name></person-group>
					<year>2001</year>
					<article-title>Protein structure prediction and structural genomics.</article-title>
					<source>Science</source>
					<volume>294</volume>
					<fpage>93</fpage>
					<lpage>96</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b035"><label>35</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Edgar</surname><given-names>RC</given-names></name><name name-style="western"><surname>Sjölander</surname><given-names>K</given-names></name></person-group>
					<year>2003</year>
					<article-title>SATCHMO: Sequence alignment and tree construction using hidden Markov models.</article-title>
					<source>Bioinformatics</source>
					<volume>19</volume>
					<fpage>1404</fpage>
					<lpage>1411</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b036"><label>36</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Fleissner</surname><given-names>R</given-names></name><name name-style="western"><surname>Metzler</surname><given-names>D</given-names></name><name name-style="western"><surname>von Haeseler</surname><given-names>A</given-names></name></person-group>
					<year>2005</year>
					<article-title>Simultaneous statistical multiple alignment and phylogeny reconstruction.</article-title>
					<source>Syst Biol</source>
					<volume>54</volume>
					<fpage>548</fpage>
					<lpage>561</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b037"><label>37</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lunter</surname><given-names>G</given-names></name><name name-style="western"><surname>Miklos</surname><given-names>I</given-names></name><name name-style="western"><surname>Drummond</surname><given-names>A</given-names></name><name name-style="western"><surname>Jensen</surname><given-names>JL</given-names></name><name name-style="western"><surname>Hein</surname><given-names>J</given-names></name></person-group>
					<year>2005</year>
					<article-title>Bayesian coestimation of phylogeny and sequence alignment.</article-title>
					<source>BMC Bioinformatics</source>
					<volume>6</volume>
					<fpage>83</fpage>
				</element-citation></ref><ref id="pcbi-0020077-b038"><label>38</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Mi</surname><given-names>H</given-names></name><name name-style="western"><surname>Vandergriff</surname><given-names>J</given-names></name><name name-style="western"><surname>Campbell</surname><given-names>M</given-names></name><name name-style="western"><surname>Narechania</surname><given-names>A</given-names></name><name name-style="western"><surname>Majoros</surname><given-names>W</given-names></name><etal/></person-group>
					<year>2003</year>
					<article-title>Assessment of genome-wide protein function classification for <named-content content-type="genus-species" xlink:type="simple">Drosophila melanogaster</named-content>.</article-title>
					<source>Genome Res</source>
					<volume>13</volume>
					<fpage>2118</fpage>
					<lpage>2128</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b039"><label>39</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Ashburner</surname><given-names>M</given-names></name><name name-style="western"><surname>Ball</surname><given-names>CA</given-names></name><name name-style="western"><surname>Blake</surname><given-names>JA</given-names></name><name name-style="western"><surname>Botstein</surname><given-names>D</given-names></name><name name-style="western"><surname>Butler</surname><given-names>H</given-names></name><etal/></person-group>
					<year>2000</year>
					<article-title>Gene Ontology: Tool for the unification of biology.</article-title>
					<source>Nat Genet</source>
					<volume>25</volume>
					<fpage>25</fpage>
					<lpage>29</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b040"><label>40</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Camon</surname><given-names>E</given-names></name><name name-style="western"><surname>Magrane</surname><given-names>M</given-names></name><name name-style="western"><surname>Barrell</surname><given-names>D</given-names></name><name name-style="western"><surname>Binns</surname><given-names>D</given-names></name><name name-style="western"><surname>Fleischmann</surname><given-names>W</given-names></name><etal/></person-group>
					<year>2003</year>
					<article-title>The Gene Ontology Annotation (GOA) project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro.</article-title>
					<source>Genome Res</source>
					<volume>13</volume>
					<fpage>662</fpage>
					<lpage>672</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b041"><label>41</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Rebholz-Schuhmann</surname><given-names>D</given-names></name><name name-style="western"><surname>Kirsch</surname><given-names>H</given-names></name><name name-style="western"><surname>Couto</surname><given-names>F</given-names></name></person-group>
					<year>2005</year>
					<article-title>Facts from text—Is text mining ready to deliver?</article-title>
					<source>PLoS Biol</source>
					<volume>3</volume>
					<elocation-id>e65</elocation-id>
					<comment>DOI: <ext-link ext-link-type="doi" xlink:href="http://dx.doi.org/10.1371/journal.pbio.0030065" xlink:type="simple">10.1371/journal.pbio.0030065</ext-link></comment>
				</element-citation></ref><ref id="pcbi-0020077-b042"><label>42</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Rost</surname><given-names>B</given-names></name></person-group>
					<year>2002</year>
					<article-title>Enzyme function less conserved than anticipated.</article-title>
					<source>J Mol Biol</source>
					<volume>318</volume>
					<fpage>595</fpage>
					<lpage>608</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b043"><label>43</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Babbitt</surname><given-names>PC</given-names></name></person-group>
					<year>2003</year>
					<article-title>Definitions of enzyme function for the structural genomics era.</article-title>
					<source>Curr Opin Chem Biol</source>
					<volume>7</volume>
					<fpage>230</fpage>
					<lpage>237</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b044"><label>44</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Brown</surname><given-names>SD</given-names></name><name name-style="western"><surname>Gerlt</surname><given-names>JA</given-names></name><name name-style="western"><surname>Seffernick</surname><given-names>JL</given-names></name><name name-style="western"><surname>Babbitt</surname><given-names>PC</given-names></name></person-group>
					<year>2006</year>
					<article-title>A gold standard set of mechanistically diverse enzyme superfamilies.</article-title>
					<source>Genome Biol</source>
					<volume>7</volume>
					<fpage>R8</fpage>
				</element-citation></ref><ref id="pcbi-0020077-b045"><label>45</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Brenner</surname><given-names>SE</given-names></name><name name-style="western"><surname>Koehl</surname><given-names>P</given-names></name><name name-style="western"><surname>Levitt</surname><given-names>M</given-names></name></person-group>
					<year>2000</year>
					<article-title>The ASTRAL compendium for protein structure and sequence analysis.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>28</volume>
					<fpage>254</fpage>
					<lpage>256</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b046"><label>46</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lo Conte</surname><given-names>L</given-names></name><name name-style="western"><surname>Ailey</surname><given-names>B</given-names></name><name name-style="western"><surname>Hubbard</surname><given-names>TJ</given-names></name><name name-style="western"><surname>Brenner</surname><given-names>SE</given-names></name><name name-style="western"><surname>Murzin</surname><given-names>AG</given-names></name><etal/></person-group>
					<year>2000</year>
					<article-title>SCOP: A structural classification of proteins database.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>28</volume>
					<fpage>257</fpage>
					<lpage>259</lpage>
				</element-citation></ref><ref id="pcbi-0020077-b047"><label>47</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Moult</surname><given-names>J</given-names></name><name name-style="western"><surname>Fidelis</surname><given-names>K</given-names></name><name name-style="western"><surname>Rost</surname><given-names>B</given-names></name><name name-style="western"><surname>Hubbard</surname><given-names>T</given-names></name><name name-style="western"><surname>Tramontano</surname><given-names>A</given-names></name></person-group>
					<year>2005</year>
					<article-title>Critical assessment of methods of protein structure prediction (CASP)—Round 6.</article-title>
					<source>Proteins</source>
					<volume>61</volume>
					<fpage>3</fpage>
					<lpage>7</lpage>
				</element-citation></ref></ref-list></back></article>