<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="3.0" xml:lang="EN"><front><journal-meta><journal-id journal-id-type="publisher-id">plos</journal-id><journal-id journal-id-type="publisher">pcbi</journal-id><journal-id journal-id-type="flc">plcb</journal-id><journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id><journal-id journal-id-type="pmc">ploscomp</journal-id><!--===== Grouping journal title elements =====--><journal-title-group><journal-title>PLoS Computational Biology</journal-title></journal-title-group><issn pub-type="ppub">1553-734X</issn><issn pub-type="epub">1553-7358</issn><publisher><publisher-name>Public Library of Science</publisher-name><publisher-loc>San Francisco, USA</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.1371/journal.pcbi.0020046</article-id><article-id pub-id-type="publisher-id">05-PLCB-RA-0323R2</article-id><article-id pub-id-type="sici">plcb-02-05-03</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="Discipline"><subject>Computational Biology</subject><subject>Evolutionary Biology</subject><subject>Molecular Biology</subject></subj-group><subj-group subj-group-type="System Taxonomy"><subject>Homo (human)</subject><subject>Mus (mouse)</subject><subject>Primates</subject></subj-group></article-categories><title-group><article-title>Genome-Wide Survey for Biologically Functional Pseudogenes</article-title><alt-title alt-title-type="running-head">Survey for Functional Pseudogenes</alt-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Svensson</surname><given-names>Örjan</given-names></name><xref ref-type="corresp" rid="cor1">
            <sup>*</sup>
          </xref><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Arvestad</surname><given-names>Lars</given-names></name><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Lagergren</surname><given-names>Jens</given-names></name><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff id="aff1">
        <label>1</label>
        <addr-line>Stockholm Bioinformatics Centre, Royal Institute of Technology, Albanova University Center, Stockholm, Sweden</addr-line>
      </aff><contrib-group><contrib contrib-type="editor" xlink:type="simple"><name name-style="western"><surname>Pilpel</surname><given-names>Yitzhak</given-names></name><role>Editor</role><xref ref-type="aff" rid="edit1"/></contrib></contrib-group><aff id="edit1">Weizmann Institute of Science, Israel</aff><author-notes><fn fn-type="con" id="ack1"><p>ÖS, LA, and JL conceived and designed the experiments. ÖS performed the experiments. ÖS, LA, and JL analyzed the data. ÖS contributed reagents/materials/analysis tools. ÖS, LA, and JL wrote the paper.</p></fn><corresp id="cor1">* To whom correspondence should be addressed. E-mail: <email xlink:type="simple">osv@sbc.su.se</email></corresp><fn fn-type="conflict" id="ack3"><p> The authors have declared that no competing interests exist.</p></fn></author-notes><pub-date pub-type="ppub"><month>5</month><year>2006</year></pub-date><pub-date pub-type="epub"><day>5</day><month>5</month><year>2006</year></pub-date><pub-date pub-type="epreprint"><day>24</day><month>3</month><year>2006</year></pub-date><volume>2</volume><issue>5</issue><elocation-id>e46</elocation-id><history><date date-type="received"><day>14</day><month>11</month><year>2005</year></date><date date-type="accepted"><day>23</day><month>3</month><year>2006</year></date></history><!--===== Grouping copyright info into permissions =====--><permissions><copyright-year>2006</copyright-year><copyright-holder>Svensson et al</copyright-holder><license><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p></license></permissions><abstract><p>According to current estimates there exist about 20,000 pseudogenes in a mammalian genome. The vast majority of these are disabled and nonfunctional copies of protein-coding genes which, therefore, evolve neutrally. 
Recent findings that a Makorin1 pseudogene, residing on mouse Chromosome 5, is indeed vital in vivo and also evolutionarily preserved, encouraged us to conduct a genome-wide survey for other functional pseudogenes in human, mouse, and chimpanzee. We identify, to our knowledge, the first examples of conserved pseudogenes common to human and mouse, originating from one duplication predating the human–mouse species split and having evolved as pseudogenes since the species split. Functionality is one possible way to explain the apparently contradictory properties of such pseudogene pairs, i.e., high conservation and ancient origin. The hypothesis of functionality is tested by comparing expression evidence and synteny of the candidates with proper test sets. The tests suggest potential biological function. Our candidate set includes a small set of long-lived pseudogenes whose unknown potential function has been retained since before the human–mouse species split, and also a larger group of primate-specific ones found from human–chimpanzee searches. Two processed sequences are notable, their conservation since the human–mouse split being as high as that of most protein-coding genes; one is derived from the protein Ataxin 7-like 3 (ATXN7L3), and one from the Spinocerebellar ataxia type 1 protein (ATX1). Our approach is comparative and can be applied to any pair of species. It is implemented by a semi-automated pipeline based on cross-species BLAST comparisons and maximum-likelihood phylogeny estimations. To separate pseudogenes from protein-coding genes, we use standard methods, utilizing in-frame disablements, as well as a probabilistic filter based on Ka/Ks ratios.</p></abstract><abstract abstract-type="synopsis"><title>Synopsis</title><p>Svensson, Arvestad, and Lagergren conducted a genome-wide survey and analysis of human pseudogenes, i.e., gene copies with lost protein-coding ability, with the aim of discovering biologically functional ones. 
Their main motivation was a 2002 <italic>Nature</italic> paper revealing in vivo functionality for a mouse Makorin pseudogene, Makorin1-p1. Their work is in line with extensive research in recent years concerning ncRNA. The method consists of a BLAST-based pipeline augmented by modern maximum-likelihood phylogeny estimations. The authors found several examples of unknown genes and present in silico tests favoring the hypothesis that these are functional pseudogenes. The result set contains two examples from the Ataxin family, a poorly characterized gene family that nevertheless includes a number of genes related to neurodegenerative disorders. A discovery of new members in this gene family should be of great interest to experimentalists in the field. To the best of our knowledge, functional pseudogenes have never been observed in humans. The results suggest, however, that while functional pseudogenes are relatively rare on a long evolutionary timescale, they nevertheless exist. These deserve the same attention as any other previously uncharacterised gene.</p></abstract><funding-group><funding-statement>This work was supported by the Swedish Research Council.</funding-statement></funding-group><counts><page-count count="12"/></counts><!--===== Restructure custom-meta-wrap to custom-meta-group =====--><custom-meta-group><custom-meta><meta-name>Citation:</meta-name><meta-value>Svensson Ö, Arvestad L, Lagergren J (2006) Genome-wide survey for biologically functional pseudogenes. PLoS Comput Biol 2(5): e46. 
DOI: <ext-link ext-link-type="doi" xlink:href="http://dx.doi.org/10.1371/journal.pcbi.0020046" xlink:type="simple">10.1371/journal.pcbi.0020046</ext-link></meta-value></custom-meta></custom-meta-group></article-meta></front><body><sec id="s1"><title>Introduction</title><p>Pseudogenes are sequences of genomic DNA lacking the protein-coding capability of their paralogous counterpart [<xref ref-type="bibr" rid="pcbi-0020046-b001">1</xref>,<xref ref-type="bibr" rid="pcbi-0020046-b002">2</xref>]. A pseudogene can be arbitrarily similar to the original gene, but differs in that it accumulates <italic>disablements</italic> (in-frame stop codons and sequence frameshifts), which protein-coding genes do not. Because pseudogenes are not protein-coding, they are often thought of as being without function and therefore released from selective pressure. The origin of a pseudogene is generally either a segmental duplication, or a retrotransposition where mature mRNA is reverse transcribed into cDNA and reinserted in a new genomic position. The resulting pseudogene is in the latter case called <italic>processed</italic>, as opposed to duplicated or <italic>nonprocessed</italic>.</p><p>Studies of pseudogene populations are often motivated by the problem that their similarity to ordinary genes poses for gene finders and hybridization experiments. Pseudogene sequences can, given their nonfunctionality, be viewed as molecular fossils and have been used to measure background genomic substitution rates [<xref ref-type="bibr" rid="pcbi-0020046-b003">3</xref>,<xref ref-type="bibr" rid="pcbi-0020046-b004">4</xref>].</p><p>However, evidence has occasionally been found, in <italic>Drosophila</italic> and recently also in mouse [<xref ref-type="bibr" rid="pcbi-0020046-b005">5</xref>], of pseudogene functionality, as well as of conservation (see [<xref ref-type="bibr" rid="pcbi-0020046-b006">6</xref>] for a review). 
In [<xref ref-type="bibr" rid="pcbi-0020046-b005">5</xref>], evidence is given for a regulatory role of the mouse Makorin1 pseudogene <italic>Makorin1-p1</italic>. It was proposed in [<xref ref-type="bibr" rid="pcbi-0020046-b005">5</xref>] that the function of the transcribed pseudogene is to stabilize the Makorin1 mRNA. A follow-up study [<xref ref-type="bibr" rid="pcbi-0020046-b007">7</xref>] established that Makorin1-p1 is in fact conserved across several mouse species, although it is not found in more distantly related species such as rat or human.</p><p>Several surveys [<xref ref-type="bibr" rid="pcbi-0020046-b008">8</xref>–<xref ref-type="bibr" rid="pcbi-0020046-b010">10</xref>] have located and annotated pseudogenes in the human and mouse genomes. Despite using slightly different pseudogene definitions and methodologies for finding them, they end up with similar numbers of human pseudogenes (altogether about 20,000 sequences, of which some 8,000 show evidence of processing). The authors of [<xref ref-type="bibr" rid="pcbi-0020046-b011">11</xref>] used more restrictive criteria, and identified about 3,600 human processed pseudogenes. The main theme of these studies is that sequences sufficiently similar to a known protein sequence are considered potential pseudogenes. The final classification as pseudogene is based on proof of sequence disablements (primarily in-frame stop codons and sequence frameshifts), Ka/Ks values indicating neutral evolution, and, importantly, that the sequences do not overlap any known gene.</p><p>In a recent article [<xref ref-type="bibr" rid="pcbi-0020046-b012">12</xref>], the authors went further and looked specifically for transcribed human processed pseudogenes. They found that some 4%–6% of all human processed pseudogenes could be confidently mapped to sequences in expression databases. 
The same group then continued with a more careful annotation of pseudogenes on Chromosome 22, utilising tiling microarray technology [<xref ref-type="bibr" rid="pcbi-0020046-b013">13</xref>], concluding that this rate was probably an underestimate and that perhaps as many as one-fifth of all pseudogenes are transcribed. Another investigation in the same spirit [<xref ref-type="bibr" rid="pcbi-0020046-b014">14</xref>] found that the percentage of expressed pseudogenes differs significantly between human and mouse. They reported 2%–3% and 0.5%–1%, respectively, using the most restrictive criteria.</p><p>A human–mouse comparative study [<xref ref-type="bibr" rid="pcbi-0020046-b012">12</xref>] concludes that the vast majority of transcribed human pseudogenes are lineage specific. Only some 5% were found to have potential orthologs in mouse.</p><p>That a pseudogene is transcribed is not sufficient evidence of biological function. To obtain functional candidates, we decided to look for conserved pseudogenes common to human and mouse, originating from one duplication predating the human–mouse species split and having evolved as pseudogenes since the species split. In cases where the species split occurred sufficiently early, strong conservation and ancient origin give evidence of the potential functionality of the pseudogenes. We have developed a pairwise comparative genomics methodology based on an explicit evolutionary model, which focuses on pseudogenes common to the two lineages. We also test the potential functionality of the found pseudogenes using enrichment of transcription and synteny.</p><p>We describe our methodology using the example of a human–mouse comparison. Our procedure takes as input a quartet of sequences representing, respectively, a human gene, a corresponding human pseudogene, the orthologous mouse gene, and a corresponding mouse pseudogene, and analyzes how they have evolved. 
All four basic evolutionary scenarios that can occur with respect to duplication and gene-to-pseudogene transitions are described below. When analyzing how well a scenario describes the evolution of a quartet, different models of sequence evolution are used for gene and pseudogene lineages.</p><p>The first scenario, <italic>S</italic>1, is what we expect for conserved pseudogenes originating before the species split (see <xref ref-type="fig" rid="pcbi-0020046-g001">Figure 1</xref>). </p><fig id="pcbi-0020046-g001" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g001</object-id><label>Figure 1</label><caption><title>Evolutionary Scenario <italic>S</italic>1, Describing the Case where the Pseudogene Originated before the Species Split and Has Acquired as well as Maintained Function</title><p>G and <italic>ψ</italic> on tree branches refer to gene and pseudogene evolution, respectively.</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g001" xlink:type="simple"/></fig><p>An alternative scenario, <italic>S</italic>2, is expected if both pseudogenes originated independently of each other, after the species split (see <xref ref-type="fig" rid="pcbi-0020046-g002">Figure 2</xref>). In our human–mouse comparison, the evolution of most quartets is best described by <italic>S</italic>2. A likely explanation for this is that dead-on-arrival pseudogenes [<xref ref-type="bibr" rid="pcbi-0020046-b015">15</xref>] originating before the human–mouse species split have most often diverged beyond the limit of recognition. With approximately 0.5 substitutions per site, fewer than 10% of neutrally evolving genomic elements can be found using BLAST [<xref ref-type="bibr" rid="pcbi-0020046-b016">16</xref>]. 
</p><fig id="pcbi-0020046-g002" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g002</object-id><label>Figure 2</label><caption><title>Evolutionary Scenario <italic>S</italic>2, Describing the Common Case of Late and Independent Pseudogene Origin</title></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g002" xlink:type="simple"/></fig><p>The third scenario, <italic>S</italic>3 (see <xref ref-type="fig" rid="pcbi-0020046-g003">Figure 3</xref>), is similar to <italic>S</italic>1. The difference is that the transition from gene to pseudogene occurred subsequent to the species split. This could mean that the two pseudogenes were in fact functional genes prior to the transition, but have since evolved without selective pressure.</p><fig id="pcbi-0020046-g003" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g003</object-id><label>Figure 3</label><caption><title>Evolutionary Scenario <italic>S</italic>3, Describing Independent Transitions</title></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g003" xlink:type="simple"/></fig><p>A fourth scenario, in which the human gene has the mouse pseudogene as a sibling in the gene tree, is conceivable, but we have never observed it. </p><p>We have applied our comparative methodology to human–mouse as well as to human–chimpanzee and found the first examples of human pseudogenes showing signs of functionality.</p></sec><sec id="s2"><title>Results</title><p>We started with the 12,687 presumably orthologous protein pairs retrieved (see <xref ref-type="sec" rid="s4">Materials and Methods</xref>) from the Inparanoid web site [<xref ref-type="bibr" rid="pcbi-0020046-b017">17</xref>] for which gene sequences could be found. 
We used BLAST to scan the human and mouse genomes for potential pseudogenic sequence pairs (see <xref ref-type="sec" rid="s4">Materials and Methods</xref>). A pseudogene pair corresponding to a protein pair was then used together with the gene sequences to form the sequence quartets on which we base our analysis.</p><p>This initial search with subsequent refinement resulted in 168,855 such quartets. For the vast majority of these quartets, one or both pseudogene sequences overlap regions of <italic>known</italic> or <italic>predicted</italic> protein-coding genes. Gene position data from Ensembl were used to filter out known genes. Predicted genes are kept for further analysis, since it is known [<xref ref-type="bibr" rid="pcbi-0020046-b008">8</xref>] that gene predictors sometimes mistake pseudogenes for protein-coding genes.</p><p>The set that remains after filtering comprises 11,146 sequence quartets originating from 1,349 protein pairs. The distribution of quartets per protein pair is highly nonuniform. While many gene pairs lack corresponding pseudogene pairs, a handful (EF12, G3PT, LDHB, TSY1, UB46, and several ribosomal genes) are origins of large pseudogene families in both species. Using the mutual-best-hit filtering outlined in <xref ref-type="fig" rid="pcbi-0020046-g004">Figure 4</xref>, we removed a large number of pairs unlikely to be significant; after this step 1,453 sequence quartets remained. 
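</p><p>The mutual-best-hit step can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in this study: the Pair record, the function name, the sequence labels, and the similarity scores are all hypothetical, and any pairwise similarity measure could stand in for the score. A pairing is kept only when each pseudogene is the other's most similar partner.</p><preformat>
```python
from collections import namedtuple

# Hypothetical record: one candidate human-mouse pseudogene pairing
# together with an illustrative similarity score (higher is more similar).
Pair = namedtuple("Pair", ["human", "mouse", "score"])


def mutual_best_hits(pairs):
    """Keep pairs in which each pseudogene is the other's best-scoring partner."""
    best_for_human = {}
    best_for_mouse = {}
    for p in pairs:
        # Track the highest-scoring pairing seen so far for each sequence.
        if p.human not in best_for_human or p.score > best_for_human[p.human].score:
            best_for_human[p.human] = p
        if p.mouse not in best_for_mouse or p.score > best_for_mouse[p.mouse].score:
            best_for_mouse[p.mouse] = p
    # A pairing survives only if it is the best one from both sides.
    return [p for p in pairs
            if best_for_human[p.human] is p and best_for_mouse[p.mouse] is p]


# Made-up scores for two human and two mouse candidates (not data from the study).
candidates = [
    Pair("hs_chr12", "mm_chr10", 0.95),
    Pair("hs_chr12", "mm_other", 0.60),
    Pair("hs_chrX", "mm_chr10", 0.55),
    Pair("hs_chrX", "mm_other", 0.50),
]
for kept in mutual_best_hits(candidates):
    print(kept.human, kept.mouse)
```
</preformat><p>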
We divide these into four classes as follows: class 1, pairs that have detectable disablements and do not overlap any Ensembl gene prediction; class 2, pairs that have detectable disablements and overlap an Ensembl gene prediction; class 3, pairs that do not have detectable pseudogenic disablements and do not overlap any Ensembl gene prediction; and class 4, pairs that do not have detectable pseudogenic disablements and overlap an Ensembl gene prediction.</p><fig id="pcbi-0020046-g004" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g004</object-id><label>Figure 4</label><caption><title>Visualization of the Effect of Mutual-Best-Hit Filtering</title><p>The tree shows the evolutionary history for a sequence set associated with the ATXN7L3 orthologous proteins. We have here found two potentially pseudogenic sequences in each species and this gives us a total of four quartets to investigate: the gene-sequence pair together with any human–mouse combination of the pseudogenes. It is unlikely that the human chrX pseudogene is closely related to any of the mouse ones and therefore any quartet including the sequence from the X chromosome should be of limited interest. If we pair a particular human pseudogene only with the most similar mouse pseudogene (and vice versa), the sole remaining example is the human chr12–mouse chr10 pair.</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g004" xlink:type="simple"/></fig><p>We used the partition of our data induced by this classification in combination with mutual-best-hit filtering. The numbers of sequence quartets belonging to the classes are: class 1, 247 quartets; class 2, 299; class 3, 146; and class 4, 761 (see Table 1).</p><p>Our aim is to find those quartets for which <italic>S</italic>1 is the most likely explanation. 
We use a probabilistic methodology to compare scenarios (see <xref ref-type="sec" rid="s4">Materials and Methods</xref>), to obtain <italic>p</italic>-values for any possible alternative hypothesis with respect to the interesting one, namely, that <italic>S</italic>1 best describes a given quartet. For visualization purposes, we also consider quotients of type <italic>L</italic><sub>1</sub>/<italic>L</italic><sub>2</sub>, where <italic>L<sub>i</sub></italic> is the log-likelihood corresponding to scenario <italic>i;</italic> for a particular quartet, a value <italic>L</italic><sub>1</sub>/<italic>L</italic><sub>2</sub> &lt; 1 suggests that <italic>S</italic>1 is preferable to <italic>S</italic>2 (log-likelihoods are negative, so a quotient below 1 means that <italic>L</italic><sub>1</sub> is the larger of the two).</p><p>For the majority of our 1,453 quartets, data support <italic>S</italic>2 (<xref ref-type="fig" rid="pcbi-0020046-g005">Figure 5</xref>), the scenario where pseudogenes originated later than the species split. In 425 out of 1,453 cases (29%), the <italic>p</italic>-value for <italic>S</italic>1 being the scenario that best explains our data is less than 0.001.</p><fig id="pcbi-0020046-g005" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g005</object-id><label>Figure 5</label><caption><title>Histogram of Likelihood Quotients when Comparing Scenarios <italic>S</italic>1 and <italic>S</italic>2</title></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g005" xlink:type="simple"/></fig><p>Interestingly, we note a bimodal pattern with one large hump centered around 1.1 and another centered around 0.9. That is, in a large majority of cases, data show clear preference for <italic>either S</italic>1 <italic>or S</italic>2; it is only for a comparatively small number of cases that the quotients are close to 1.</p><p>We now use the same technique to compare <italic>S</italic>1 and <italic>S</italic>3. 
Remember that <italic>S</italic>3 is the scenario where the transitions from genes to pseudogenes were independent of each other and occurred <italic>subsequently</italic> to the speciation. Hence it is only the models of sequence evolution used for genes and pseudogenes, respectively, that distinguish <italic>S</italic>1 from <italic>S</italic>3. The likelihood values are in this case much less varied, yielding many quotients close to one (<xref ref-type="fig" rid="pcbi-0020046-g006">Figure 6</xref>).</p><fig id="pcbi-0020046-g006" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g006</object-id><label>Figure 6</label><caption><title>Histogram of Likelihood Quotients when Comparing Scenarios <italic>S</italic>1 and <italic>S</italic>3</title></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g006" xlink:type="simple"/></fig><p>For 73 of the 425 quartets, <italic>S</italic>1 is the explanation favored by our method and for 30 of these 73 the <italic>p</italic>-value is lower than 0.1. For 352 sequence pairs, <italic>S</italic>3 is the most likely topology, and 262 of those clearly favor <italic>S</italic>3 (<italic>p</italic>-value again lower than 0.1).</p><p>To summarize, we have 30 quartets for which the sequences suggest that: 1) the pseudogenes are evolutionarily conserved since before the human and mouse speciation; 2) they have been pseudogenes since prior to the speciation.</p><p>Because we find 30 such quartets, and the number of quartets expected to pass our scenario test is 1453 * 0.001 * 0.1 ≈ 0.15, it is reasonable to conclude that a significant number of these 30 quartets are ancient pseudogenes, i.e., satisfying 1) and 2). </p><p>We are now going to investigate these 30 sequence quartets further, with the aim of testing their potential biological function. 
The criteria that will be our focus are synteny, expression evidence, and conservation.</p><sec id="s2a"><title>Synteny</title><p>Synteny can be used as a means to evaluate our methodology's capacity to separate <italic>S</italic>1 and <italic>S</italic>3 from <italic>S</italic>2 quartets. It is also interesting to compare the fraction of syntenic quartets among <italic>S</italic>1, <italic>S</italic>3, and genes. The latter can be seen as a test of functionality.</p><p>It has long been known that eukaryotic genomes undergo rearrangements on both microscopic (intrachromosomal with a span &lt; 1 Mb) and macroscopic (intrachromosomal with larger span, as well as interchromosomal) levels during evolution [<xref ref-type="bibr" rid="pcbi-0020046-b018">18</xref>]. By using so-called sequence markers, often protein-coding segments, it has been possible to infer maps of syntenic regions, that is, regions of conserved marker order.</p><p>The orthologous pairs of protein-coding genes in our data set have the following synteny relations: 69% syntenic, 2% reversed syntenic, 11% corresponding chromosomes, 4% nonsyntenic, and 13% unknown synteny (see <xref ref-type="sec" rid="s4">Materials and Methods</xref>). We find 20 out of the 30 <italic>S</italic>1 pairs in synteny and five are found close to synteny (<xref ref-type="sec" rid="s4">Materials and Methods</xref>).</p><p>It is reasonable that sequences that have originated from duplication events prior to the species split (sequences belonging to <italic>S</italic>1 and <italic>S</italic>3 quartets) are primarily found in syntenic positions, as we have seen is the case for genes. Conversely, there is no reason to presuppose this for quartets showing preference for <italic>S</italic>2 (remember that the pseudogenes here are expected to have originated independently of each other). From inspection of <xref ref-type="table" rid="pcbi-0020046-t002">Table 2</xref>, we see the following: out of our 30 <italic>S</italic>1 sequence pairs, 20 are syntenic (67%). 
A similar proportion, 149 out of 262 (57%), of <italic>S</italic>3 sequence pairs are found in syntenic regions. Only 130 out of the total 702 (19%) <italic>S</italic>2 sequence pairs are syntenic. In fact, one could argue that the latter percentage is unexpectedly high. A possible explanation is that these are a result of tandem duplications; that is, the duplicated sequences are found near the original ones and are therefore in synteny. The tendency for class 4 to be found in syntenic regions could simply be due to the fact that these are detected by comparative gene finders, which often use synteny as a criterion [<xref ref-type="bibr" rid="pcbi-0020046-b019">19</xref>].</p><table-wrap id="pcbi-0020046-t001" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t001</object-id><label>Table 1</label><caption><p>Number of Human–Mouse Sequence Pairs prior to and following Mutual-Best-Hit Filtering</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.t001" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb1col1" align="left" charoff="0" char=""/><col id="tb1col2" align="char" charoff="0" char="."/><col id="tb1col3" align="char" charoff="0" char="."/></colgroup><thead><tr><td align="left"><hr/>Class</td><td><hr/>Before Filtering</td><td><hr/>After Filtering</td></tr></thead><tbody><tr><td>Class 1</td><td>3526</td><td>247</td></tr><tr><td>Class 2</td><td>2729</td><td>299</td></tr><tr><td>Class 3</td><td>937</td><td>146</td></tr><tr><td>Class 4</td><td>3954</td><td>761</td></tr><tr><td>Total</td><td>11146</td><td>1453</td></tr></tbody></table> --><!-- --></table-wrap><table-wrap content-type="2col" id="pcbi-0020046-t002" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t002</object-id><label>Table 2</label><caption><p>Number of Sequence Pairs in Each Class Favoring a Particular Scenario</p></caption><graphic mimetype="image" position="float" 
xlink:href="info:doi/10.1371/journal.pcbi.0020046.t002" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb2col1" align="left" charoff="0" char=""/><col id="tb2col2" align="char" charoff="0" char="."/><col id="tb2col3" align="char" charoff="0" char="."/><col id="tb2col4" align="char" charoff="0" char="."/><col id="tb2col5" align="char" charoff="0" char="."/><col id="tb2col6" align="char" charoff="0" char="."/><col id="tb2col7" align="char" charoff="0" char="."/><col id="tb2col8" align="char" charoff="0" char="."/><col id="tb2col9" align="char" charoff="0" char="."/><col id="tb2col10" align="char" charoff="0" char="."/><col id="tb2col11" align="char" charoff="0" char="."/><col id="tb2col12" align="char" charoff="0" char="."/><col id="tb2col13" align="char" charoff="0" char="."/><col id="tb2col14" align="char" charoff="0" char="."/><col id="tb2col15" align="char" charoff="0" char="."/><col id="tb2col16" align="char" charoff="0" char="."/><col id="tb2col17" align="char" charoff="0" char="."/><col id="tb2col18" align="char" charoff="0" char="."/><col id="tb2col19" align="char" charoff="0" char="."/><col id="tb2col20" align="char" charoff="0" char="."/><col id="tb2col21" align="char" charoff="0" char="."/><col id="tb2col22" align="char" charoff="0" char="."/><col id="tb2col23" align="char" charoff="0" char="."/><col id="tb2col24" align="char" charoff="0" char="."/><col id="tb2col25" align="char" charoff="0" char="."/><col id="tb2col26" align="char" charoff="0" char="."/><col id="tb2col27" align="char" charoff="0" char="."/><col id="tb2col28" align="char" charoff="0" char="."/><col id="tb2col29" align="char" charoff="0" char="."/><col id="tb2col30" align="char" charoff="0" char="."/><col id="tb2col31" align="char" charoff="0" char="."/></colgroup><thead><tr><td align="left" rowspan="2"><hr/>Scenario</td><td colspan="6"><hr/>Class 1</td><td colspan="6"><hr/>Class 2</td><td colspan="6"><hr/>Class 3</td><td colspan="6"><hr/>Class 4</td><td 
colspan="6"><hr/>Total</td></tr><tr><td><hr/>S</td><td><hr/>R</td><td><hr/>C</td><td><hr/>U</td><td><hr/>N</td><td><hr/>Total</td><td><hr/>S</td><td><hr/>R</td><td><hr/>C</td><td><hr/>U</td><td><hr/>N</td><td><hr/>Total</td><td><hr/>S</td><td><hr/>R</td><td><hr/>C</td><td><hr/>U</td><td><hr/>N</td><td><hr/>Total</td><td><hr/>S</td><td><hr/>R</td><td><hr/>C</td><td><hr/>U</td><td><hr/>N</td><td><hr/>Total</td><td><hr/>S</td><td><hr/>R</td><td><hr/>C</td><td><hr/>U</td><td><hr/>N</td><td><hr/>Total</td></tr></thead><tbody><tr><td>S1</td><td>4</td><td>0</td><td>0</td><td>0</td><td>2</td><td><bold>6</bold></td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td><bold>2</bold></td><td>4</td><td>0</td><td>0</td><td>0</td><td>1</td><td><bold>5</bold></td><td>10</td><td>0</td><td>5</td><td>2</td><td>0</td><td><bold>17</bold></td><td>20</td><td>0</td><td>5</td><td>2</td><td>3</td><td><bold>30</bold></td></tr><tr><td>S2</td><td>2</td><td>3</td><td>1</td><td>17</td><td>137</td><td><bold>170</bold></td><td>18</td><td>2</td><td>11</td><td>19</td><td>110</td><td><bold>160</bold></td><td>10</td><td>2</td><td>8</td><td>9</td><td>26</td><td><bold>55</bold></td><td>100</td><td>7</td><td>20</td><td>33</td><td>157</td><td><bold>317</bold></td><td>130</td><td>14</td><td>50</td><td>78</td><td>430</td><td><bold>702</bold></td></tr><tr><td>S3</td><td>2</td><td>0</td><td>4</td><td>6</td><td>14</td><td><bold>26</bold></td><td>23</td><td>1</td><td>6</td><td>2</td><td>14</td><td><bold>46</bold></td><td>18</td><td>0</td><td>3</td><td>4</td><td>3</td><td><bold>28</bold></td><td>106</td><td>4</td><td>25</td><td>6</td><td>21</td><td><bold>162</bold></td><td>149</td><td>5</td><td>38</td><td>18</td><td>52</td><td><bold>262</bold></td></tr></tbody></table> --><!-- <table-wrap-foot><fn id="nt201"><p>For each scenario and class, the number of sequence pairs that are syntenic (S), reversed syntenic (R), close to synteny (C), with unknown synteny (U), nonsyntenic (N) and total (bold). 
<italic>p</italic>-Values used are 0.001 to distinguish <italic>S</italic>1 and <italic>S</italic>3 from <italic>S</italic>2 examples, and 0.1 to separate <italic>S</italic>1 from <italic>S</italic>3.</p></fn></table-wrap-foot> --></table-wrap><p>It is notable that within classes 1, 2, and 3, ten of the 13 <italic>S</italic>1 pseudogene pairs are syntenic, while only 43 out of 100 <italic>S</italic>3 pairs are syntenic.</p><p>If we again consider the ATXN7L3 tree (<xref ref-type="fig" rid="pcbi-0020046-g004">Figure 4</xref>), we see that among the pseudogenes only the human Chr12/mouse Chr10 pair, the pair retained after mutual-best-hit filtering, is found in syntenic positions.</p></sec><sec id="s2b"><title>Pseudogene Expression</title><p>We investigated whether our candidates for potential function are enriched for transcription or not, by searching publicly available databases for transcript sequences, expressed sequence tags (ESTs) and mRNAs. An EST or mRNA sequence is postulated to come from a specific pseudogene if its sequence is more similar to the pseudogene than it is to any other known gene or pseudogene (see <xref ref-type="sec" rid="s4">Materials and Methods</xref> for details). We found that, out of the 30 sequence pairs showing preference for <italic>S</italic>1, 22 are transcribed in both human and mouse. For the 20 syntenic <italic>S</italic>1 sequence pairs, 17 are transcribed in both species and all but one are transcribed in either human or mouse (see <xref ref-type="table" rid="pcbi-0020046-t003">Table 3</xref>). 
Notable are the ATX1 and ATXN7L3 duplicates, both class 1 members, for which we find ESTs from many different tissues (human thyroid, colon, and prostate among them), and also class 3 ZNF629 duplicates, each perfectly matched by approximately 1,000-bp-long mRNAs.</p><table-wrap id="pcbi-0020046-t003" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t003</object-id><label>Table 3</label><caption><p><italic>p</italic>-Values for Scenario Comparisons and Pseudogene Expression Evidence (Number of Matching EST and mRNA Sequences) for the 20 Syntenic <italic>S</italic>1 Quartets</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.t003" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb3col1" align="left" charoff="0" char=""/><col id="tb3col2" align="char" charoff="0" char="."/><col id="tb3col3" align="char" charoff="0" char="."/><col id="tb3col4" align="char" charoff="0" char="."/><col id="tb3col5" align="char" charoff="0" char="."/><col id="tb3col6" align="char" charoff="0" char="."/><col id="tb3col7" align="char" charoff="0" char="."/><col id="tb3col8" align="char" charoff="0" char="."/><col id="tb3col9" align="char" charoff="0" char="."/><col id="tb3col10" align="char" charoff="0" char="."/></colgroup><thead><tr><td align="left"><hr/>Protein Name</td><td><hr/>Hs Chr</td><td><hr/>Mm Chr</td><td><hr/>Class</td><td><hr/>S1 versus S2 <italic>p</italic>-Value</td><td><hr/>S1 versus S3 <italic>p</italic>-Value</td><td><hr/>Human EST</td><td><hr/>Mouse EST</td><td><hr/>Human mRNA</td><td><hr/>Mouse 
mRNA</td></tr></thead><tbody><tr><td>ATX1</td><td>16</td><td>8</td><td>1</td><td>&lt;0.001</td><td>0.030</td><td>6</td><td>4</td><td>1</td><td>1</td></tr><tr><td>ATXN7L3</td><td>12</td><td>10</td><td>1</td><td>&lt;0.001</td><td>&lt;0.001</td><td>&gt;50</td><td>&gt;50</td><td>2</td><td>1</td></tr><tr><td>IMB1</td><td>X</td><td>X</td><td>1</td><td>&lt;0.001</td><td>&lt;0.001</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><td>PDZRN3</td><td>12</td><td>15</td><td>1</td><td>&lt;0.001</td><td>&lt;0.001</td><td>4</td><td>1</td><td>0</td><td>1</td></tr><tr><td>DYHC</td><td>11</td><td>7</td><td>2</td><td>&lt;0.001</td><td>&lt;0.001</td><td>4</td><td>6</td><td>0</td><td>1</td></tr><tr><td>ODF3</td><td>22</td><td>15</td><td>2</td><td>&lt;0.001</td><td>0.073</td><td>23</td><td>7</td><td>0</td><td>4</td></tr><tr><td>A8A1</td><td>15</td><td>7</td><td>3</td><td>&lt;0.001</td><td>0.065</td><td>1</td><td>9</td><td>1</td><td>1</td></tr><tr><td>TPC3</td><td>6</td><td>10</td><td>3</td><td>&lt;0.001</td><td>0.005</td><td>3</td><td>1</td><td>0</td><td>0</td></tr><tr><td>Q9P2K1</td><td>10</td><td>19</td><td>3</td><td>&lt;0.001</td><td>&lt;0.001</td><td>0</td><td>0</td><td>1</td><td>1</td></tr><tr><td>ZNF629</td><td>1</td><td>1</td><td>3</td><td>&lt;0.001</td><td>0.002</td><td>0</td><td>0</td><td>1</td><td>1</td></tr><tr><td>CA1C</td><td>3</td><td>9</td><td>4</td><td>&lt;0.001</td><td>0.058</td><td>1</td><td>0</td><td>0</td><td>1</td></tr><tr><td>DD17</td><td>1</td><td>1</td><td>4</td><td>&lt;0.001</td><td>0.076</td><td>13</td><td>8</td><td>2</td><td>3</td></tr><tr><td>Q7Z3F3</td><td>4</td><td>5</td><td>4</td><td>&lt;0.001</td><td>&lt;0.001</td><td>13</td><td>4</td><td>3</td><td>3</td></tr><tr><td>Q8IYB1</td><td>17</td><td>11</td><td>4</td><td>&lt;0.001</td><td>0.002</td><td>18</td><td>3</td><td>1</td><td>1</td></tr><tr><td>Q8N1K5</td><td>18</td><td>17</td><td>4</td><td>&lt;0.001</td><td>&lt;0.001</td><td>0</td><td>9</td><td>0</td><td>1</td></tr><tr><td>DNAH5</td><td>17</td><td>11</t
d><td>4</td><td>&lt;0.001</td><td>0.007</td><td>1</td><td>1</td><td>1</td><td>2</td></tr><tr><td>ERBB2IP</td><td>3</td><td>3</td><td>4</td><td>&lt;0.001</td><td>0.030</td><td>1</td><td>0</td><td>1</td><td>1</td></tr><tr><td>TOPORS</td><td>9</td><td>4</td><td>4</td><td>&lt;0.001</td><td>0.047</td><td>0</td><td>4</td><td>0</td><td>2</td></tr><tr><td>TDR1</td><td>2</td><td>12</td><td>4</td><td>&lt;0.001</td><td>0.035</td><td>0</td><td>0</td><td>1</td><td>2</td></tr><tr><td>Z142</td><td>4</td><td>8</td><td>4</td><td>&lt;0.001</td><td>0.027</td><td>6</td><td>4</td><td>1</td><td>0</td></tr></tbody></table> --><!-- --></table-wrap><p>Among the 20 syntenic S1 sequence pairs, the only completely unexpressed example is the IMB1 copy found on the X chromosome. This pair shows clear preference for <italic>S</italic>1, and is also unusually well-preserved for a nonfunctional pseudogene. This might indicate that the IMB1 pseudogene was functional for a short period after the species split, but it could also simply be an effect of the X-chromosome's lower mutation rate [<xref ref-type="bibr" rid="pcbi-0020046-b020">20</xref>].</p><p>The majority of these pseudogenes are much less expressed than their respective genes (to the latter, one can generally map large numbers of ESTs originating from tissues throughout the body).</p><p>To perform an enrichment test we need a good comparison set. We believe that <italic>S</italic>3 contains many young pseudogenes, that is, those which recently underwent the transition from gene to pseudogene, but also protein-coding genes. It is reasonable to assume that young pseudogenes are more frequently transcribed than older ones. 
For these reasons, <italic>S</italic>3 is not a good comparison set, and no other such set is available either.</p><p>Instead, we focus on the correlation between the <italic>S</italic>3 pairs' positioning of the gene-to-pseudogene transitions (see <xref ref-type="fig" rid="pcbi-0020046-g007">Figure 7</xref>) and their pseudogene expression. For this we adopted the following labeling scheme with notation from <xref ref-type="fig" rid="pcbi-0020046-g007">Figure 7</xref>:</p><disp-formula id="pcbi-0020046-e001"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.e001" xlink:type="simple"/><!-- <mml:math display='block'><mml:mrow><mml:mtext>genelike&minus;if</mml:mtext><mml:mspace width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>&gt;</mml:mo><mml:mn>10</mml:mn><mml:mspace width='2pt'/><mml:mtext>and</mml:mtext><mml:mspace width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>&gt;</mml:mo><mml:mn>10</mml:mn></mml:mrow></mml:math> --></disp-formula><disp-formula id="pcbi-0020046-e002"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.e002" xlink:type="simple"/><!-- <mml:math display='block'><mml:mrow><mml:mtext>late&minus;if</mml:mtext><mml:mspace 
width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>&gt;</mml:mo><mml:mn>3</mml:mn><mml:mspace width='2pt'/><mml:mtext>and</mml:mtext><mml:mspace width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>&gt;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:math> --></disp-formula><disp-formula id="pcbi-0020046-e003"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.e003" xlink:type="simple"/><!-- <mml:math display='block'><mml:mrow><mml:mtext>medium&minus;if</mml:mtext><mml:mspace width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>&gt;</mml:mo><mml:mn>1</mml:mn><mml:mspace width='2pt'/><mml:mtext>and</mml:mtext><mml:mspace width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>&gt;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math> --></disp-formula><disp-formula id="pcbi-0020046-e004"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.e004" xlink:type="simple"/><!-- <mml:math display='block'><mml:mrow><mml:mtext>early&minus;if</mml:mtext><mml:mspace 
width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>H</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>&lt;</mml:mo><mml:mn>1</mml:mn><mml:mspace width='2pt'/><mml:mtext>and</mml:mtext><mml:mspace width='2pt'/><mml:msubsup><mml:mi>t</mml:mi><mml:mi>G</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:mi>&ipsi;</mml:mi><mml:mrow><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>&lt;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math> --></disp-formula><disp-formula id="pcbi-0020046-e005"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.e005" xlink:type="simple"/><!-- <mml:math display='block'><mml:mtext>unclear&minus;otherwise.</mml:mtext></mml:math> --></disp-formula><fig id="pcbi-0020046-g007" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g007</object-id><label>Figure 7</label><caption><title><italic>S</italic>3 Topology with Gene-to-Pseudogene Breakpoints</title><p>
											<inline-formula id="pcbi-0020046-ex001"><inline-graphic mimetype="image" xlink:href="info:doi/10.1371/journal.pcbi.0020046.ex001" xlink:type="simple"/></inline-formula> refers to the length of the branch on which the human pseudogene has evolved genelike, and similarly for <inline-formula id="pcbi-0020046-ex002"><inline-graphic mimetype="image" xlink:href="info:doi/10.1371/journal.pcbi.0020046.ex002" xlink:type="simple"/></inline-formula>, <inline-formula id="pcbi-0020046-ex003"><inline-graphic mimetype="image" xlink:href="info:doi/10.1371/journal.pcbi.0020046.ex003" xlink:type="simple"/></inline-formula>, and <inline-formula id="pcbi-0020046-ex004"><inline-graphic mimetype="image" xlink:href="info:doi/10.1371/journal.pcbi.0020046.ex004" xlink:type="simple"/></inline-formula>.
										</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g007" xlink:type="simple"/></fig><p>If we assume that the rate of pseudogene creation does not vary over time, then the low number of detected early pairs (only nine of the 198 <italic>S</italic>3 examples that received a label other than unclear fall into this group) is a sign that most pairs of the same age as the early pairs have diverged beyond recognition.</p><p><xref ref-type="table" rid="pcbi-0020046-t004">Table 4</xref> shows the number of examples in each group together with an evaluation of their tendency to be expressed. We note that whereas in the genelike and late groups a large majority (94% and 80%, respectively) is expressed in both organisms, the figures are much lower for the early (22%) and unclear (44%) groups. These figures can be compared with the corresponding figure for the <italic>S</italic>1 examples, of which 22 out of 30 (73%) are expressed in both organisms. As can be seen in <xref ref-type="table" rid="pcbi-0020046-t004">Table 4</xref>, this tendency is even more pronounced if we count only syntenic examples. It is reasonable to conclude that the majority of the early pairs are nonfunctional, because they are expressed to such a low extent. 
Considering the higher age and the extent of expression for the 20 <italic>S</italic>1 pseudogenes, it is also reasonable to conclude that this set contains pseudogenes of potential biological function.</p><table-wrap content-type="1col" id="pcbi-0020046-t004" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t004</object-id><label>Table 4</label><caption><p>Human and Mouse Expression for 262 S3 Quartets Selected as Described in <xref ref-type="table" rid="pcbi-0020046-t001">Table 1</xref></p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.t004" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb4col1" align="left" charoff="0" char=""/><col id="tb4col2" align="char" charoff="0" char="("/><col id="tb4col3" align="char" charoff="0" char="."/><col id="tb4col4" align="char" charoff="0" char="."/><col id="tb4col5" align="char" charoff="0" char="("/></colgroup><thead><tr><td align="left" rowspan="2"><hr/>S3 Type</td><td colspan="3"><hr/>Expression</td><td rowspan="2"><hr/>Total</td></tr><tr><td><hr/>Both</td><td><hr/>One</td><td><hr/>None</td></tr></thead><tbody><tr><td>Genelike</td><td>33 (27)</td><td>2 (1)</td><td>0 (0)</td><td>35 (28)</td></tr><tr><td>Late</td><td>99 (67)</td><td>19 (8)</td><td>6 (4)</td><td>124 (79)</td></tr><tr><td>Medium</td><td>23 (15)</td><td>5 (0)</td><td>2 (0)</td><td>30 (15)</td></tr><tr><td>Early</td><td>2 (1)</td><td>7 (2)</td><td>0 (0)</td><td>9 (3)</td></tr><tr><td>Unclear</td><td>28 (16)</td><td>29 (7)</td><td>7 (1)</td><td>64 (24)</td></tr><tr><td>S1</td><td>22 (17)</td><td>6 (2)</td><td>2 (1)</td><td>30 (20)</td></tr></tbody></table> --><!-- <table-wrap-foot><fn id="nt401"><p>Figures in parentheses correspond to the number of syntenic examples.</p></fn></table-wrap-foot> --></table-wrap></sec><sec id="s2c"><title>Conservation</title><p>According to estimates in [<xref ref-type="bibr" rid="pcbi-0020046-b021">21</xref>], the 
neutral rate of substitution has been roughly 0.5 substitutions per site since the divergence of the human and mouse lineages. This estimate conforms to 67% sequence identity for orthologous regions under no selective pressure. At the other extreme, protein-coding sequences have, on average, approximately 85% conservation [<xref ref-type="bibr" rid="pcbi-0020046-b021">21</xref>].</p><p>We will now address where our putatively functional pseudogenes are placed along that scale. We note (<xref ref-type="fig" rid="pcbi-0020046-g008">Figure 8</xref> and <xref ref-type="table" rid="pcbi-0020046-t005">Table 5</xref>) that although the conservation for the 20 syntenic <italic>S</italic>1 pseudogene pairs is not as strong as for the corresponding genes, it is in most cases significantly above the 67% limit (<italic>p</italic>-values are computed using Hoeffding's bound, see <xref ref-type="sec" rid="s4">Materials and Methods</xref>). For instance, the ATX1 derivative shows conservation at least as high as a typical gene, even slightly higher than its paralogous gene (<xref ref-type="fig" rid="pcbi-0020046-g008">Figure 8</xref>). The ATXN7L3 duplicate, previously discussed, has conservation similar to that of a protein-coding gene. It is 77% conserved, counted over the total alignment, but a 288-bp-long section in the beginning is 89% conserved (<xref ref-type="fig" rid="pcbi-0020046-g009">Figure 9</xref>).</p><fig id="pcbi-0020046-g008" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g008</object-id><label>Figure 8</label><caption><title> Conservation between Human and Mouse Gene and Pseudogene Sequences for the 20 Syntenic <italic>S</italic>1 Sequences</title><p>Blue stars indicate genes. Red circles indicate pseudogenes. The histogram shows, for reference, the conservation of all genes giving rise to pseudogenes. 
Compare with <xref ref-type="table" rid="pcbi-0020046-t005">Table 5</xref>, which lists the same data.</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g008" xlink:type="simple"/></fig><table-wrap content-type="1col" id="pcbi-0020046-t005" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t005</object-id><label>Table 5</label><caption><p>Conservation Percentage in and around the Pseudogene</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.t005" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb5col1" align="left" charoff="0" char=""/><col id="tb5col2" align="char" charoff="0" char="."/><col id="tb5col3" align="char" charoff="0" char="."/><col id="tb5col4" align="char" charoff="0" char="."/><col id="tb5col5" align="char" charoff="0" char="."/></colgroup><thead><tr><td align="left" rowspan="2"><hr/>Protein Name</td><td colspan="2"><hr/>Conservation</td><td rowspan="2"><hr/>Upstream</td><td rowspan="2"><hr/>Downstream</td></tr><tr><td><hr/>Percent</td><td><hr/><italic>p</italic>-Value</td></tr></thead><tbody><tr><td>ATX1</td><td>91.1 &percnt;</td><td align="left">&lt;10<sup>&minus;50</sup></td><td>63.3 &percnt;</td><td>75.7 &percnt;</td></tr><tr><td>ATXN7L3</td><td>76.7 &percnt;</td><td>3.75*10<sup>&minus;9</sup></td><td>69.4 &percnt;</td><td>75.6 &percnt;</td></tr><tr><td>IMB1</td><td>72.9 &percnt;</td><td align="left">&lt;10<sup>&minus;50</sup></td><td>65.3 &percnt;</td><td>63.7 &percnt;</td></tr><tr><td>PDZRN3</td><td>79.4 &percnt;</td><td>1.25*10<sup>&minus;8</sup></td><td>61.8 &percnt;</td><td>47.2 &percnt;</td></tr><tr><td>DYHC</td><td>79.1 &percnt;</td><td align="left">&lt;10<sup>&minus;50</sup></td><td>74.1 &percnt;</td><td>75.1 &percnt;</td></tr><tr><td>ODF3</td><td>70.7 &percnt;</td><td>0.036</td><td>58.6 &percnt;</td><td>73.4 &percnt;</td></tr><tr><td>A8A1</td><td>89.7 
&percnt;</td><td>1.13*10<sup>&minus;16</sup></td><td>71.5 &percnt;</td><td>44.2 &percnt;</td></tr><tr><td>TPC3</td><td>89.4 &percnt;</td><td>1.84*10<sup>&minus;13</sup></td><td>47.7 &percnt;</td><td>60.3 &percnt;</td></tr><tr><td>Q9P2K1</td><td>85.2 &percnt;</td><td>1.67*10<sup>&minus;8</sup></td><td>62.0 &percnt;</td><td>62.3 &percnt;</td></tr><tr><td>ZNF629</td><td>84.8 &percnt;</td><td>1.25*10<sup>&minus;35</sup></td><td>69.2 &percnt;</td><td>46.5 &percnt;</td></tr><tr><td>CA1C</td><td>84.5 &percnt;</td><td>2.29*10<sup>&minus;20</sup></td><td>70.0 &percnt;</td><td>53.3 &percnt;</td></tr><tr><td>DD17</td><td>86.1 &percnt;</td><td>1.01*10<sup>&minus;13</sup></td><td>74.3 &percnt;</td><td>53.1 &percnt;</td></tr><tr><td>Q7Z3F3</td><td>100 &percnt;</td><td>1.64*10<sup>&minus;19</sup></td><td>64.1 &percnt;</td><td>66.1 &percnt;</td></tr><tr><td>Q8IYB1</td><td>82.7 &percnt;</td><td>1.02*10<sup>&minus;20</sup></td><td>76.4 &percnt;</td><td>66.4 &percnt;</td></tr><tr><td>Q8N1K5</td><td>73.8 &percnt;</td><td>5.62*10<sup>&minus;6</sup></td><td>45.7 &percnt;</td><td>47.5 &percnt;</td></tr><tr><td>DNAH5</td><td>88.1 &percnt;</td><td>1.64*10<sup>&minus;8</sup></td><td>43.9 &percnt;</td><td>56.2 &percnt;</td></tr><tr><td>ERBB2IP</td><td>75.9 &percnt;</td><td>2.06*10<sup>&minus;9</sup></td><td>55.7 &percnt;</td><td>55.9 &percnt;</td></tr><tr><td>TOPORS</td><td>71.1 &percnt;</td><td>7.30*10<sup>&minus;4</sup></td><td>70.0 &percnt;</td><td>46.0 &percnt;</td></tr><tr><td>TDR1</td><td>82.8 &percnt;</td><td>2.35*10<sup>&minus;7</sup></td><td>45.4 &percnt;</td><td>46.5 &percnt;</td></tr><tr><td>Z142</td><td>96.5 &percnt;</td><td>1.69*10<sup>&minus;49</sup></td><td>74.0 &percnt;</td><td>86.7 &percnt;</td></tr></tbody></table> --><!-- --></table-wrap><fig id="pcbi-0020046-g009" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g009</object-id><label>Figure 9</label><caption><title>Recognizing Pseudogenes by Inspecting Their Alignment</title><p>(A) An alignment, 
visualized with TeXshade [<xref ref-type="bibr" rid="pcbi-0020046-b034">34</xref>], of the processed copies to the ATXN7L3 human and mouse protein-coding genes. The human as well as the mouse ATXN7L3 contains 12 exons, which are all present in the respective duplicates. Approximate exon borders are shown in yellow.</p><p>The most interesting part consists of columns 1–468 (boxed green), which according to several EST and mRNA sequences is the only segment expressed. It consists of a highly conserved part, 1–288 (red), which is a potential open reading frame, followed by part 289–468 with pseudogenic disablements.</p><p>(B) Selected parts of the alignment of the ATX1 copies, which are also processed. The protein-coding genes contain eight exons, of which only parts of the last two code for protein. The entire segment of the pseudogenes corresponding to the protein-coding parts of the genes is expressed. The possibility that the processed copies are protein-coding cannot be completely ruled out: each pseudogene consists of a single 2,068-bp-long open reading frame. However, the frame induced by the alignments to the protein-coding genes contains several pseudogenic disablements.</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g009" xlink:type="simple"/></fig><p>Substitution rates vary along and between chromosomes. To make sure that it is the pseudogenes only, and not their genomic vicinities in general, that are conserved, we also aligned a 1,000-bp section upstream and downstream of each pseudogene. We observe in most cases (<xref ref-type="table" rid="pcbi-0020046-t005">Table 5</xref>) that the conservation of the flanking sequences is about the expected 67% and much lower than that of the pseudogene itself. 
The unexpectedly high value registered downstream of, among others, the ATX1 relative might be due to conservation of the 3′ UTR.</p><p>Conceivably, a potential pseudogene could have, in its close vicinity, protein-coding exons originating from the same gene. To exclude this possibility, we also checked the surrounding sequence for signs of exons originating from the same gene, with potentially intact protein-coding ability. No additional such protein-coding exons were found. For the vast majority of our pseudogenes, no hit could be found on the same chromosome, and in no case was any hit found closer than 10,000 bp.</p></sec><sec id="s2d"><title>Human–Chimpanzee Results</title><p>We also applied our methodology to the human–chimpanzee pair of genomes. This choice was motivated by our desire to discover young pseudogenes. Recall that the mouse Makorin pseudogene, although vital, has only been functional over a relatively short evolutionary period [<xref ref-type="bibr" rid="pcbi-0020046-b007">7</xref>].</p><p>The procedure was the same as for human–mouse. The chimpanzee data was downloaded from Ensembl, including assembly 1 as of April 2005 together with protein sequences and gene-sequence data. For human–chimpanzee, sequence conservation is less effective as a means to separate functional from nonfunctional pseudogenes. The reason is that many pseudogenes originating before the comparatively recent primate species split can be expected to be nonfunctional, although they have not diverged sufficiently to be easily recognized as such. Thus, while in the human–mouse case we can be relatively confident that syntenic pseudogenes that prefer <italic>S</italic>1 are functional, it is likely that many <italic>S</italic>1 pseudogenes found in a human–chimpanzee comparison are nonfunctional. In fact, conservation estimates can, even together with expression evidence, be expected to be insufficient for revealing whether an individual pseudogene is functional or not. 
What we can hope for is a signal in the data showing that the quartets preferring <italic>S</italic>1 include functional pseudogenes.</p><p>As expected, the human–chimpanzee comparison resulted in a large set of pseudogene pairs. We therefore restricted our analysis to the most interesting class, i.e., class 1. We found 742 class 1 pseudogenes belonging to quartets favoring <italic>S</italic>1 (using <italic>p</italic>-value 0.001 for comparing <italic>S</italic>1 and <italic>S</italic>2 and 0.1 for comparing <italic>S</italic>1 and <italic>S</italic>3). The aforementioned class 1 pseudogenes found in the human–mouse comparison all belong to this set. We are fairly confident that these 742 sequences have indeed evolved in a manner atypical for a protein-coding gene. A key question is whether these are functional or not. Note that more than 1/5 of them show transcriptional evidence (<xref ref-type="table" rid="pcbi-0020046-t006">Table 6</xref>).</p><table-wrap content-type="1col" id="pcbi-0020046-t006" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t006</object-id><label>Table 6</label><caption><p>Percentage of Expressed Pseudogenes in Relation to Their Conservation <italic>p</italic>-Values (Calculated with Hoeffding's Bound)</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.t006" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb6col1" align="left" charoff="0" char=""/><col id="tb6col2" align="char" charoff="0" char="."/><col id="tb6col3" align="char" charoff="0" char="."/><col id="tb6col4" align="char" charoff="0" char="."/><col id="tb6col5" align="char" charoff="0" char="."/></colgroup><thead><tr><td align="left" rowspan="2"><hr/><italic>p</italic>-Value</td><td rowspan="2"><hr/>Total Number</td><td colspan="3"><hr/>EST/mRNA 
Expression</td></tr><tr><td><hr/>EST</td><td><hr/>mRNA</td><td><hr/>Either</td></tr></thead><tbody><tr><td>0.01</td><td>27</td><td>19&percnt;</td><td>22&percnt;</td><td>33&percnt;</td></tr><tr><td>0.05</td><td>77</td><td>17&percnt;</td><td>19&percnt;</td><td>29&percnt;</td></tr><tr><td>Total</td><td>742</td><td>12&percnt;</td><td>15&percnt;</td><td>21&percnt;</td></tr></tbody></table> --><!-- --></table-wrap><p>Many pseudogenes have regulatory regions showing high similarity to those of the corresponding protein-coding genes. This is either because few mutations have occurred in these regions, or alternatively because many of the mutations that have occurred have been selected against, due to functionality of the pseudogenes.</p><p>To further purify our result set, i.e., the 742 pseudogenes favoring <italic>S</italic>1, we again looked at significant deviations of conservation. This approach requires a reliable estimate of the background conservation percentage and we used 5,000-bp-long alignments flanking each pseudogene to compute such an estimate.</p><p>Typical values for the background mismatch percentage range from 1.2% to 3% (counting the first but not subsequent indels in a gap), which conforms well with previous estimates of 1.4% [<xref ref-type="bibr" rid="pcbi-0020046-b022">22</xref>] and 1.6% [<xref ref-type="bibr" rid="pcbi-0020046-b023">23</xref>]. Note that the percentage of pseudogenes with EST and/or mRNA expression evidence is higher for more conserved pseudogenes. <xref ref-type="table" rid="pcbi-0020046-t007">Table 7</xref> contains a list of those pseudogenes that we have found to be expressed as well as conserved since the human–chimpanzee speciation. Notable is that the pseudogenes originating from proteins ATXN7L3, PDZRN3, or IMB1 are not found in <xref ref-type="table" rid="pcbi-0020046-t007">Table 7</xref>. 
ATXN7L3 and PDZRN3 are not sufficiently conserved (in the former case we remember from the human–mouse analysis that it is only sections of the pseudogenes that exhibit exceptional conservation). The IMB1 pair residing on chromosome X is indeed conserved enough (it has a conservation <italic>p</italic>-value of 4.3*10<sup>−4</sup>) but lacks, as was noted in the human–mouse section, expression evidence.</p><table-wrap content-type="2col" id="pcbi-0020046-t007" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.t007</object-id><label>Table 7</label><caption><p>Human–Chimpanzee Conserved and Expressed Pseudogene Pairs</p></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.t007" xlink:type="simple"/><!-- <table frame="hsides" rules="none"><colgroup><col id="tb7col1" align="left" charoff="0" char=""/><col id="tb7col2" align="left" charoff="0" char=""/><col id="tb7col3" align="char" charoff="0" char="."/><col id="tb7col4" align="left" charoff="0" char=""/><col id="tb7col5" align="left" charoff="0" char=""/><col id="tb7col6" align="char" charoff="0" char="."/><col id="tb7col7" align="left" charoff="0" char=""/><col id="tb7col8" align="left" charoff="0" char=""/></colgroup><thead><tr><td align="left">Hs Protein</td><td>Gene Name</td><td>Hs Chr</td><td>Hs Start</td><td>Hs End</td><td>Conservation <italic>p</italic>-Value</td><td colspan="2"><hr/>Expression</td></tr><tr><td 
colspan='6'><hr/></td><td><hr/>EST</td><td><hr/>mRNA</td></tr></thead><tbody><tr><td>ENSP00000244769</td><td>ATX1</td><td>16</td><td>70441078</td><td>70443214</td><td>0.019</td><td>Yes</td><td>Yes</td></tr><tr><td>ENSP00000262316</td><td>RHBDF1</td><td>3</td><td>14589363</td><td>14591300</td><td>0.0022</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000234739</td><td>BCL9</td><td>5</td><td>66968594</td><td>66970526</td><td>6.1*10<sup>&minus;4</sup></td><td>Yes</td><td>No</td></tr><tr><td>ENSP00000235329</td><td>MFN2</td><td>X</td><td>108617852</td><td>108619651</td><td>1.1*10<sup>&minus;8</sup></td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000327539</td><td>HNRPH1</td><td>X</td><td>142485593</td><td>142486960</td><td>0.0030</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000268661</td><td>RPL3L</td><td>5</td><td>60722282</td><td>60723464</td><td>0.035</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000313007</td><td>PABPC1</td><td>12</td><td>62502005</td><td>62503947</td><td>0.044</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000318000</td><td>NAB1</td><td>X</td><td>150065269</td><td>150067075</td><td>0.0021</td><td>Yes</td><td>No</td></tr><tr><td>ENSP00000327539</td><td>HNRPH1</td><td>6</td><td>160104224</td><td>160105428</td><td>0.014</td><td>Yes</td><td>No</td></tr><tr><td>ENSP00000223215</td><td>MEST</td><td>3</td><td>29103895</td><td>29104914</td><td>0.0010</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000349469</td><td>TPR4</td><td>X</td><td>92348904</td><td>92349903</td><td>0.0038</td><td>Yes</td><td>Yes</td></tr><tr><td>ENSP00000341327</td><td>SOCS4</td><td>6</td><td>113650996</td><td>113651931</td><td>0.032</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000313582</td><td>ZNF436</td><td>7</td><td>6465488</td><td>6467284</td><td>0.027</td><td>Yes</td><td>Yes</td></tr><tr><td>ENSP00000342024</td><td>ATP8A1</td><td>2</td><td>241221794</td><td>241223519</td><td>0.025</td><td>Yes</td><td>No</td></tr><tr><td>ENSP00000302684</td><td>DKFZp343F142</td><td>7</td><td>65814628
</td><td>65815402</td><td>0.011</td><td>Yes</td><td>No</td></tr><tr><td>ENSP00000319053</td><td>ZNF77</td><td>19</td><td>9495628</td><td>9496170</td><td>0.0077</td><td>Yes</td><td>No</td></tr><tr><td>ENSP00000317614</td><td>NP444270</td><td>10</td><td>97910042</td><td>97910556</td><td>0.0090</td><td>Yes</td><td>Yes</td></tr><tr><td>ENSP00000307858</td><td>ZBTB4</td><td>3</td><td>142645022</td><td>142645752</td><td>0.020</td><td>Yes</td><td>Yes</td></tr><tr><td>ENSP00000319233</td><td>TLE3</td><td>16</td><td>70023164</td><td>70024179</td><td>0.019</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000256682</td><td>ARF3</td><td>17</td><td>41069429</td><td>41069752</td><td>0.048</td><td>Yes</td><td>Yes</td></tr><tr><td>ENSP00000257498</td><td>CTSL</td><td>10</td><td>89137032</td><td>89139195</td><td>0.037</td><td>No</td><td>Yes</td></tr><tr><td>ENSP00000274192</td><td>SRD5A1</td><td>X</td><td>138254579</td><td>138255358</td><td>0.050</td><td>Yes</td><td>No</td></tr></tbody></table> --><!-- --></table-wrap></sec></sec><sec id="s3"><title>Discussion</title><p>We have presented and applied a semi-automated methodology to identify pseudogenes of potential biological function. To the best of our knowledge, functional pseudogenes have never been observed in human. Our method uses no prior knowledge other than publicly available data on orthologous relationships for proteins, gene sequences, gene positions, and synteny maps.</p><p>The term <italic>pseudogene</italic> is normally used for sequences derived from known proteins but with detectable disablements that make the translation to protein impossible. Detecting pseudogenes is complicated by the possibility that part of the copy can be disabled, while the rest is coding.</p><p>We use conserved ancient pseudogenes as candidates of potential function. A computational approach based on support for four different evolutionary scenarios is used to obtain putative ancient pseudogenes. 
The <italic>p</italic>-value thresholds used, as well as the tests applied, indicate that the set of putative ancient pseudogenes is significantly enriched for ancient pseudogenes. It is interesting to ask whether there are evolutionary mechanisms that could cause our scenario <italic>S</italic>1 to appear more likely than another correct scenario, i.e., mechanisms not taken into account in our approach. Notice that, for instance, homogenization, e.g., through gene conversion, cannot make <italic>S</italic>1 more likely, since <italic>S</italic>1 is supported by similarity between sequences in <italic>different</italic> species.</p><p>We test functionality of our candidates by means of enrichment of synteny as well as of transcriptional activity and degree of conservation. We see, as expected, a clear overrepresentation of synteny for human–mouse pseudogene pairs originating before the species split. Interestingly, we also see tendencies for those examples that have evolved as pseudogenes since the species split to be both more abundantly expressed and more often syntenic than those that have not evolved as pseudogenes. For the latter finding, we believe that enrichment of functionality among our pseudogenes is the most likely explanation.</p><p>Judging from what is known from earlier work, the number of detectable pseudogenes originating from before the human–mouse speciation is limited. In [<xref ref-type="bibr" rid="pcbi-0020046-b012">12</xref>], the authors found 11 examples of potentially orthologous pseudogene pairs. Although there is considerable overlap between their analysis and ours, a numerical comparison is not straightforward. First, they have stricter criteria for classifying a sequence as a pseudogene (deploying a careful filtering to make sure that only sequences that are processed pseudogenes are investigated). 
Second, they compute human and mouse orthologs using reciprocal BLAST comparisons only, and investigate how many of the transcribed human genes have mouse orthologs.</p><p>Three out of four (PDZRN3 being the exception) of our <italic>S</italic>1 examples belonging to class 1 are also classified as pseudogenes in [<xref ref-type="bibr" rid="pcbi-0020046-b008">8</xref>,<xref ref-type="bibr" rid="pcbi-0020046-b012">12</xref>]. We use the same databases for expression analyses as do the authors of [<xref ref-type="bibr" rid="pcbi-0020046-b012">12</xref>], and, as expected, the results are in agreement.</p><p>To determine functionality of a human pseudogene, it is probably not sufficient to use information about whether it has a mouse ortholog or not, because many young pseudogenes can be found among orthologous pairs. Instead, we select only those human pseudogenes with orthologous mouse pseudogenes that satisfy the additional constraint that the least common ancestor was a pseudogene.</p><p>The results we present suggest that while functional pseudogenes are relatively rare on a long evolutionary timescale, they nevertheless exist. Our findings include a handful of sequences that have been conserved since before the split of primates and rodents. Some of these are sequences predicted by gene finders to be protein-coding. We have found examples with, as well as without, detectable in-frame disablements. Apart from their apparent functional conservation and sometimes extensive expression activity, all of these are poorly characterized. This may be because some of the originating proteins are themselves not well known, or because of the common assumption that pseudogenes are nonfunctional. Further characterization of these genes, their respective pseudogenes, and the interactions between them is a topic for further study.</p><p>We have noted with interest recent research activity concerning two of our top candidates, ATX1 and ATXN7L3. 
As is the case for the Ataxin gene family in general, these are associated with a number of neurodegenerative disorders primarily caused by expanded polyglutamine tracts [<xref ref-type="bibr" rid="pcbi-0020046-b024">24</xref>,<xref ref-type="bibr" rid="pcbi-0020046-b025">25</xref>], but other than that their function is currently unknown [<xref ref-type="bibr" rid="pcbi-0020046-b026">26</xref>]. Findings indicating that these genes have a regulatory function [<xref ref-type="bibr" rid="pcbi-0020046-b027">27</xref>,<xref ref-type="bibr" rid="pcbi-0020046-b028">28</xref>] are of particular interest because it is reasonable to believe that their paralogs, if they are indeed functional pseudogenes, act at the RNA level. There are previously known examples of ncRNA genes in the Ataxin gene family. In [<xref ref-type="bibr" rid="pcbi-0020046-b027">27</xref>] it is shown that Ataxin 8 regulates Kelch-like 1 through an antisense mechanism.</p><p>When we extend the search to younger pseudogenes, i.e., apply our methodology to the human–chimpanzee pair, the number of pseudogenes obtained is substantially larger than in the human–mouse comparison. In this case, however, the assumption that nonfunctional pseudogenes originating before the speciation have diverged beyond recognition no longer holds. Consequently, filtering out nonfunctional pseudogenes is much harder than for the human–mouse case. Encouragingly, we found that the conservation of many pseudogenes is similar to that of nonsynonymous nucleotides in protein-coding genes (estimated to be 99.4% [<xref ref-type="bibr" rid="pcbi-0020046-b023">23</xref>]).</p><p>There is an apparent tradeoff between the number of pseudogenes in the result set and the certainty with which we can state that they are functional. It is quite possible that both our choices of species pairs are in fact suboptimal, human–mouse being too evolutionarily distant and human–chimpanzee not distant enough. 
It will be interesting to apply our methodology to an intermediate timescale, and we plan to conduct a comparison between human and rhesus macaque.</p></sec><sec id="s4"><title>Materials and Methods</title><p>Our methodology comprises three main parts (see <xref ref-type="fig" rid="pcbi-0020046-g010">Figure 10</xref>), here termed: 1) pseudogene finding, 2) cross-species matching, and 3) pseudogene pair evaluation.</p><fig id="pcbi-0020046-g010" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0020046.g010</object-id><label>Figure 10</label><caption><title>Flow Diagram of the Pseudogene Assignment Process</title></caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.g010" xlink:type="simple"/></fig><p>To locate pseudogenes, we adopted a large part of the methodology presented in [<xref ref-type="bibr" rid="pcbi-0020046-b008">8</xref>]. When searching for pseudogenes, a hit was considered to be significant if it had an e-value &lt; 10<sup>−10</sup> from a six-frame TBLASTN search with the protein sequence (the BLAST package was downloaded from <ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/blast" xlink:type="simple">ftp://ftp.ncbi.nih.gov/blast</ext-link>). We also made use of TBLASTN capabilities to detect stop codons and, importantly, sequence frameshifts.</p><p>We used the repeat-masked genomic sequence assemblies NCBI35 (human) and NCBIm33 (mouse), downloaded in January 2005 from the Ensembl database version 27. The Expasy protein database (sprot44, human_trembl, and rodent_trembl) was used to assemble protein sequence sets for the two species. 
We used the Inparanoid [<xref ref-type="bibr" rid="pcbi-0020046-b017">17</xref>] database to retrieve orthologous protein pairs, by selecting from each set of inparalogs the sequence with the highest score.</p><p>For each pair of orthologous proteins, mutual-best-hit filtering was performed by aligning each pair of pseudogenes from the respective species using bl2seq (from the BLAST package) and then selecting the pair with the best score. We aligned the resulting quartets using the Dialign package [<xref ref-type="bibr" rid="pcbi-0020046-b029">29</xref>], and we extracted from the Dialign output gap-free column triplets based on the reading frame induced by the genes. By using a <italic>local</italic> alignment program we reduced the risk of misalignments caused by introns in the pseudogenes.</p><p>To select the scenario (<italic>S</italic>1, <italic>S</italic>2, <italic>S</italic>3, <italic>S</italic>4) that best describes a given quartet, we adapted the method outlined in [<xref ref-type="bibr" rid="pcbi-0020046-b030">30</xref>]. We work with a model describing the instantaneous substitution rate from codon <italic>i</italic> to codon <italic>j, q<sub>ij</sub></italic>, given that the equilibrium frequency of codon <italic>j</italic> is <italic>π<sub>j</sub></italic>. In their model, the substitution rates are specified by the instantaneous rate matrix <italic>Q</italic> = {<italic>q<sub>ij</sub></italic>} defined by:
				<disp-formula id="pcbi-0020046-e006"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0020046.e006" xlink:type="simple"/><!-- <mml:math display='block'><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&equals;</mml:mo><mml:mrow><mml:mo stretchy='true'>&lcub;</mml:mo><mml:mtable><mml:mtr columnalign='left'><mml:mtd><mml:mn>0</mml:mn><mml:mtext>,&thinsp;</mml:mtext></mml:mtd><mml:mtd columnalign='left'><mml:mtext>if&thinsp;</mml:mtext><mml:mi>i</mml:mi><mml:mtext>&thinsp;and&thinsp;</mml:mtext><mml:mi>j</mml:mi><mml:mtext>&thinsp;differ at more than one position</mml:mtext></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd></mml:mtd><mml:mtd><mml:mtext>in a codon triplet</mml:mtext></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd><mml:mi>&mu;</mml:mi><mml:msub><mml:mi>&pi;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mtext>,&thinsp;</mml:mtext></mml:mtd><mml:mtd><mml:mtext>differ by a synonymous transversion</mml:mtext></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd><mml:mi>&mu;</mml:mi><mml:mi>&kappa;</mml:mi><mml:msub><mml:mi>&pi;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mtext>,&thinsp;</mml:mtext></mml:mtd><mml:mtd><mml:mtext>differ by a synonymous transition</mml:mtext></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd><mml:mi>&mu;</mml:mi><mml:mi>&omega;</mml:mi><mml:msub><mml:mi>&pi;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mtext>,&thinsp;</mml:mtext></mml:mtd><mml:mtd><mml:mtext>differ by a nonsynonymous transversion</mml:mtext></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd><mml:mi>&mu;</mml:mi><mml:mi>&omega;</mml:mi><mml:mi>&kappa;</mml:mi><mml:msub><mml:mi>&pi;</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mtext>,&thinsp;</mml:mtext></mml:mtd><mml:mtd><mml:mtext>differ by a nonsynonymous transition</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math> --></disp-formula>where <italic>μ</italic> is a 
normalizing rate factor, <italic>κ</italic> is the transition/transversion rate ratio, and <italic>ω</italic> is the nonsynonymous-to-synonymous rate ratio. In our model, we use one matrix, <italic>Q<sub>g</sub>,</italic> for the parts of the tree where the sequences are assumed to evolve as a gene, and another matrix, <italic>Q<sub>ψ</sub>,</italic> for the parts of the tree where the sequences are assumed to evolve as a pseudogene. The two matrices differ in four respects: each uses its own <italic>κ</italic>; <italic>ω</italic> is fixed to one for pseudogenes; substitutions to stop codons are not allowed for genes; and the equilibrium frequencies are estimated from gene or pseudogene sequences, respectively.
			</p><p>We used the nonparametric version of the Kishino–Hasegawa bootstrap test with 1,000 bootstrap replicates to obtain <italic>p</italic>-values for scenario support [<xref ref-type="bibr" rid="pcbi-0020046-b031">31</xref>].</p><sec id="s4a"><title>Synteny evaluation.</title><p>To infer whether two pseudogenes are in synteny, we used synteny maps from [<xref ref-type="bibr" rid="pcbi-0020046-b032">32</xref>]. Maps based on synteny blocks of a minimum size of 300 kb were downloaded from <ext-link ext-link-type="uri" xlink:href="http://www.cse.ucsd.edu/groups/bioinformatics/GRIMM" xlink:type="simple">http://www.cse.ucsd.edu/groups/bioinformatics/GRIMM</ext-link>. Each pair of pseudogenes was assigned to one of five categories: 1) syntenic, i.e., the two sequences originate from syntenic regions; 2) reversed syntenic, the same as above, but their mutual orientation differs from the main orientation of the synteny blocks; 3) close to synteny, the sequence from one species is found in a block adjacent to the syntenic block in the other species; 4) undetermined synteny, one or both sequences originate from positions which are not mapped to any synteny block; and 5) not in synteny.</p><p>Synteny relations were established for the 7,244 of 12,678 genes for which gene position data were available.</p></sec><sec id="s4b"><title>Gene expression evaluation.</title><p>To find transcription evidence we applied a reciprocal BLAST-based methodology to databases of ESTs and mRNAs. The EST-human, EST-mouse, and Unigene mRNA databases were downloaded from NCBI. 
Any reciprocal best hit longer than 100 bp and with more than 99% sequence identity to the query sequence was retrieved.</p></sec><sec id="s4c"><title>Hoeffding's bound for calculation of conservation <italic>p</italic>-value.</title><p>Hoeffding's theorem [<xref ref-type="bibr" rid="pcbi-0020046-b033">33</xref>] states the following: given a set of <italic>n</italic> independent Poisson trials <italic>X<sub>i</sub>,</italic> each taking value one with probability <italic>p<sub>i</sub>,</italic> and <italic>X =</italic> ∑ <italic>X<sub>i</sub>,</italic> with expectation <italic>E</italic>[<italic>X</italic>] = <italic>np</italic> (so that <italic>p</italic> is the mean of the <italic>p<sub>i</sub></italic>), it holds that <italic>Pr</italic>(<italic>X ≤ c</italic>) ≤ <italic>Bin</italic>(<italic>n,p,c</italic>) for any 0 ≤ <italic>c</italic> ≤ (<italic>np</italic> − 1), where <italic>Bin</italic>(<italic>n,p,c</italic>) denotes the probability of at most <italic>c</italic> successes in <italic>n</italic> Bernoulli trials with success probability <italic>p</italic>.</p><p>Given an alignment of length <italic>n,</italic> the theorem can be used to calculate a <italic>p</italic>-value for ≥ <italic>c/n</italic> matching residues based on the hypothesis that the alignment is generated from a (background) distribution with mismatch probability <italic>p</italic>.</p></sec></sec></body><back><ack><p>We thank Henrik Kaessman, Per Svensson, and three anonymous reviewers for their valuable comments on the manuscript; Ali Tofigh, Johannes Frey-Skött, and Samuel Andersson for constructive discussions; and the Center for Parallel Computers for computational support.</p></ack><glossary><title>Abbreviations</title><def-list><def-item><term>ATX1</term><def><p>Spinocerebellar ataxia type 1 protein</p></def></def-item><def-item><term>ATXN7L3</term><def><p>Ataxin 7-like 3</p></def></def-item><def-item><term>EST</term><def><p>expressed sequence tag</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="pcbi-0020046-b001"><label>1</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Vanin</surname><given-names>EF</given-names></name></person-group>
					<year>1985</year>
					<article-title>Processed pseudogenes: Characteristics and evolution.</article-title>
					<source>Annu Rev Genet</source>
					<volume>19</volume>
					<fpage>253</fpage>
					<lpage>272</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b002"><label>2</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Mighell</surname><given-names>AJ</given-names></name><name name-style="western"><surname>Smith</surname><given-names>NR</given-names></name><name name-style="western"><surname>Robinson</surname><given-names>PA</given-names></name><name name-style="western"><surname>Markham</surname><given-names>AF</given-names></name></person-group>
					<year>2000</year>
					<article-title>Vertebrate pseudogenes.</article-title>
					<source>FEBS Lett</source>
					<volume>468</volume>
					<fpage>109</fpage>
					<lpage>114</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b003"><label>3</label><element-citation publication-type="journal" xlink:type="simple">
					<collab xlink:type="simple">Graur D, Shuali Y, Li</collab>
					<year>1989</year>
					<article-title>Deletions in processed pseudogenes accumulate faster in rodents than in humans.</article-title>
					<source>J Mol Evol</source>
					<volume>28</volume>
					<fpage>279</fpage>
					<lpage>285</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b004"><label>4</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Zhang</surname><given-names>Z</given-names></name><name name-style="western"><surname>Gerstein</surname><given-names>M</given-names></name></person-group>
					<year>2003</year>
					<article-title>Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>31</volume>
					<fpage>5338</fpage>
					<lpage>5348</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b005"><label>5</label><mixed-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Hirotsune</surname><given-names>S</given-names></name><name name-style="western"><surname>Yoshida</surname><given-names>N</given-names></name><name name-style="western"><surname>Chen</surname><given-names>A</given-names></name><name name-style="western"><surname>Garrett</surname><given-names>L</given-names></name><name name-style="western"><surname>Sugiyama</surname><given-names>F</given-names></name><etal/></person-group>
					<year>2003</year>
					<article-title>An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene.</article-title>
					<source>Nature</source>
					<volume>423</volume>
					<fpage>91</fpage>
					<lpage>96</lpage>
				</mixed-citation></ref><ref id="pcbi-0020046-b006"><label>6</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Balakirev</surname><given-names>ES</given-names></name><name name-style="western"><surname>Ayala</surname><given-names>FJ</given-names></name></person-group>
					<year>2003</year>
					<article-title>Pseudogenes: Are they “junk” or functional DNA?</article-title>
					<source>Annu Rev Genet</source>
					<volume>37</volume>
					<fpage>123</fpage>
					<lpage>151</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b007"><label>7</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Podlaha</surname><given-names>O</given-names></name><name name-style="western"><surname>Zhang</surname><given-names>J</given-names></name></person-group>
					<year>2004</year>
					<article-title>Nonneutral evolution of the transcribed pseudogene Makorin1-p1 in mice.</article-title>
					<source>Mol Biol Evol</source>
					<volume>21</volume>
					<fpage>2202</fpage>
					<lpage>2209</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b008"><label>8</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Zhang</surname><given-names>Z</given-names></name><name name-style="western"><surname>Harrison</surname><given-names>PM</given-names></name><name name-style="western"><surname>Liu</surname><given-names>Y</given-names></name><name name-style="western"><surname>Gerstein</surname><given-names>M</given-names></name></person-group>
					<year>2003</year>
					<article-title>Millions of years of evolution preserved: A comprehensive catalog of the processed pseudogenes in the human genome.</article-title>
					<source>Genome Res</source>
					<volume>13</volume>
					<fpage>2541</fpage>
					<lpage>2558</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b009"><label>9</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Torrents</surname><given-names>D</given-names></name><name name-style="western"><surname>Suyama</surname><given-names>M</given-names></name><name name-style="western"><surname>Zdobnov</surname><given-names>E</given-names></name><name name-style="western"><surname>Bork</surname><given-names>P</given-names></name></person-group>
					<year>2003</year>
					<article-title>A genome-wide survey of human pseudogenes.</article-title>
					<source>Genome Res</source>
					<volume>13</volume>
					<fpage>2559</fpage>
					<lpage>2567</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b010"><label>10</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Zhang</surname><given-names>Z</given-names></name><name name-style="western"><surname>Carriero</surname><given-names>N</given-names></name><name name-style="western"><surname>Gerstein</surname><given-names>M</given-names></name></person-group>
					<year>2004</year>
					<article-title>Comparative analysis of processed pseudogenes in the mouse and human genomes.</article-title>
					<source>Trends Genet</source>
					<volume>20</volume>
					<fpage>62</fpage>
					<lpage>67</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b011"><label>11</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Ohshima</surname><given-names>K</given-names></name><name name-style="western"><surname>Masahira</surname><given-names>H</given-names></name><name name-style="western"><surname>Yada</surname><given-names>T</given-names></name><name name-style="western"><surname>Gojobori</surname><given-names>T</given-names></name><name name-style="western"><surname>Sakaki</surname><given-names>Y</given-names></name><etal/></person-group>
					<year>2003</year>
					<article-title>Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates.</article-title>
					<source>Genome Biol</source>
					<volume>4</volume>
					<fpage>R74</fpage>
				</element-citation></ref><ref id="pcbi-0020046-b012"><label>12</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Harrison</surname><given-names>PM</given-names></name><name name-style="western"><surname>Zheng</surname><given-names>D</given-names></name><name name-style="western"><surname>Zhang</surname><given-names>Z</given-names></name><name name-style="western"><surname>Carriero</surname><given-names>N</given-names></name><name name-style="western"><surname>Gerstein</surname><given-names>M</given-names></name></person-group>
					<year>2005</year>
					<article-title>Transcribed processed pseudogenes in the human genome: An intermediate form of expressed retrosequence lacking protein-coding ability.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>33</volume>
					<fpage>2374</fpage>
					<lpage>2383</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b013"><label>13</label><element-citation publication-type="journal" xlink:type="simple">
					<collab xlink:type="simple">Zheng D, Zhang Z, Harrison P, Karro J, Carriero N, et al.</collab>
					<year>2005</year>
					<article-title>Integrated pseudogene annotation for human chromosome 22: Evidence for transcription.</article-title>
					<source>J Mol Biol</source>
					<volume>349</volume>
					<fpage>27</fpage>
					<lpage>45</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b014"><label>14</label><element-citation publication-type="journal" xlink:type="simple">
					<collab xlink:type="simple">Yano Y, Saito R, Yoshida N, Yoshiki A, Wynshaw-Boris A, et al.</collab>
					<year>2004</year>
					<article-title>A new role for expressed pseudogenes as ncRNA: Regulation of mRNA stability of its homologous coding gene.</article-title>
					<source>J Mol Med</source>
					<volume>82</volume>
					<fpage>414</fpage>
					<lpage>422</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b015"><label>15</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Fleishman</surname><given-names>SJ</given-names></name><name name-style="western"><surname>Dagan</surname><given-names>T</given-names></name><name name-style="western"><surname>Graur</surname><given-names>D</given-names></name></person-group>
					<year>2003</year>
					<article-title>pANT: A method for the pairwise assessment of nonfunctionalization times of processed pseudogenes.</article-title>
					<source>Mol Biol Evol</source>
					<volume>20</volume>
					<fpage>1876</fpage>
					<lpage>1880</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b016"><label>16</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Elhaik</surname><given-names>E</given-names></name><name name-style="western"><surname>Sabath</surname><given-names>N</given-names></name><name name-style="western"><surname>Graur</surname><given-names>D</given-names></name></person-group>
					<year>2006</year>
					<article-title>The “inverse relationship between evolutionary rate and age of mammalian genes” is an artifact of increased genetic distance with rate of evolution and time of divergence.</article-title>
					<source>Mol Biol Evol</source>
					<volume>23</volume>
					<fpage>1</fpage>
					<lpage>3</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b017"><label>17</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Remm</surname><given-names>M</given-names></name><name name-style="western"><surname>Storm</surname><given-names>CE</given-names></name><name name-style="western"><surname>Sonnhammer</surname><given-names>EL</given-names></name></person-group>
					<year>2001</year>
					<article-title>Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.</article-title>
					<source>J Mol Biol</source>
					<volume>314</volume>
					<fpage>1041</fpage>
					<lpage>1052</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b018"><label>18</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Pevzner</surname><given-names>P</given-names></name><name name-style="western"><surname>Tesler</surname><given-names>G</given-names></name></person-group>
					<year>2003</year>
					<article-title>Genome rearrangements in mammalian evolution: Lessons from human and mouse genomes.</article-title>
					<source>Genome Res</source>
					<volume>13</volume>
					<fpage>37</fpage>
					<lpage>45</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b019"><label>19</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Korf</surname><given-names>I</given-names></name><name name-style="western"><surname>Flicek</surname><given-names>P</given-names></name><name name-style="western"><surname>Duan</surname><given-names>D</given-names></name><name name-style="western"><surname>Brent</surname><given-names>MR</given-names></name></person-group>
					<year>2001</year>
					<article-title>Integrating genomic homology into gene structure prediction.</article-title>
					<source>Bioinformatics</source>
					<volume>17</volume>
					<issue>(Supplement 1)</issue>
					<fpage>140</fpage>
					<lpage>148</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b020"><label>20</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Malcom</surname><given-names>CM</given-names></name><name name-style="western"><surname>Wyckoff</surname><given-names>GJ</given-names></name><name name-style="western"><surname>Lahn</surname><given-names>BT</given-names></name></person-group>
					<year>2003</year>
					<article-title>Genic mutation rates in mammals: Local similarity, chromosomal heterogeneity, and X-versus-autosome disparity.</article-title>
					<source>Mol Biol Evol</source>
					<volume>20</volume>
					<fpage>1633</fpage>
					<lpage>1641</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b021"><label>21</label><element-citation publication-type="journal" xlink:type="simple">
					<collab xlink:type="simple">Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, et al.</collab>
					<year>2002</year>
					<article-title>Initial sequencing and comparative analysis of the mouse genome.</article-title>
					<source>Nature</source>
					<volume>420</volume>
					<fpage>520</fpage>
					<lpage>562</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b022"><label>22</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Britten</surname><given-names>RJ</given-names></name></person-group>
					<year>2002</year>
					<article-title>Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels.</article-title>
					<source>Proc Natl Acad Sci U S A</source>
					<volume>99</volume>
					<fpage>13633</fpage>
					<lpage>13635</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b023"><label>23</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Wildman</surname><given-names>DE</given-names></name><name name-style="western"><surname>Uddin</surname><given-names>M</given-names></name><name name-style="western"><surname>Guozhen</surname><given-names>L</given-names></name><name name-style="western"><surname>Grossman</surname><given-names>L</given-names></name><name name-style="western"><surname>Goodman</surname><given-names>M</given-names></name></person-group>
					<year>2003</year>
					<article-title>Implications of natural selection in shaping 99.4% nonsynonymous DNA identity between humans and chimpanzees: Enlarging genus Homo.</article-title>
					<source>Proc Natl Acad Sci U S A</source>
					<volume>100</volume>
					<fpage>7181</fpage>
					<lpage>7188</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b024"><label>24</label><element-citation publication-type="journal" xlink:type="simple">
					<collab xlink:type="simple">Orr HT, Chung MY, Banfi S, Kwiatkowski TJ Jr, Servadio A, et al.</collab>
					<year>1993</year>
					<article-title>Expansion of an unstable trinucleotide CAG repeat in spinocerebellar ataxia type 1.</article-title>
					<source>Nat Genet</source>
					<volume>4</volume>
					<fpage>221</fpage>
					<lpage>226</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b025"><label>25</label><element-citation publication-type="other" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Banfi</surname><given-names>S</given-names></name><name name-style="western"><surname>Servadio</surname><given-names>A</given-names></name><name name-style="western"><surname>Chung</surname><given-names>MY</given-names></name><name name-style="western"><surname>Kwiatkowski</surname><given-names>TJ</given-names><suffix>Jr</suffix></name><name name-style="western"><surname>McCall</surname><given-names>AE</given-names></name><etal/></person-group>
					<year>1994</year>
					<article-title>Identification and characterization of the gene causing type 1 spinocerebellar ataxia.</article-title>
					<source>Nat Genet</source>
					<volume>7</volume>
					<fpage>513</fpage>
					<lpage>520</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b026"><label>26</label><element-citation publication-type="other" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Tsai</surname><given-names>CC</given-names></name><name name-style="western"><surname>Kao</surname><given-names>HY</given-names></name><name name-style="western"><surname>Mizutani</surname><given-names>A</given-names></name><name name-style="western"><surname>Banayo</surname><given-names>E</given-names></name><name name-style="western"><surname>Rajan</surname><given-names>H</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>Ataxin 1, a SCA1 neurodegenerative disorder protein, is functionally linked to the silencing mediator of retinoid and thyroid hormone receptors.</article-title>
					<source>Proc Natl Acad Sci U S A</source>
					<volume>101</volume>
					<fpage>4047</fpage>
					<lpage>4052</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b027"><label>27</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Nemes</surname><given-names>JP</given-names></name><name name-style="western"><surname>Benzow</surname><given-names>KA</given-names></name><name name-style="western"><surname>Koob</surname><given-names>MD</given-names></name></person-group>
					<year>2000</year>
					<article-title>The SCA8 transcript is an antisense RNA to a brain-specific transcript encoding a novel actin-binding protein (KLHL1).</article-title>
					<source>Hum Mol Genet</source>
					<volume>9</volume>
					<fpage>1543</fpage>
					<lpage>1551</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b028"><label>28</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Ström</surname><given-names>AL</given-names></name><name name-style="western"><surname>Forsgren</surname><given-names>L</given-names></name><name name-style="western"><surname>Holmberg</surname><given-names>M</given-names></name></person-group>
					<year>2005</year>
					<article-title>A role for both wild-type and expanded ataxin-7 in transcriptional regulation.</article-title>
					<source>Neurobiol Dis</source>
					<volume>20</volume>
					<fpage>646</fpage>
					<lpage>655</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b029"><label>29</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Morgenstern</surname><given-names>B</given-names></name></person-group>
					<year>1999</year>
					<article-title>DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment.</article-title>
					<source>Bioinformatics</source>
					<volume>15</volume>
					<fpage>211</fpage>
					<lpage>218</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b030"><label>30</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Bielawski</surname><given-names>JP</given-names></name><name name-style="western"><surname>Yang</surname><given-names>Z</given-names></name></person-group>
					<year>2003</year>
					<article-title>Maximum likelihood methods for detecting adaptive evolution after gene duplication.</article-title>
					<source>J Struct Funct Genomics</source>
					<volume>3</volume>
					<fpage>201</fpage>
					<lpage>212</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b031"><label>31</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Goldman</surname><given-names>N</given-names></name><name name-style="western"><surname>Anderson</surname><given-names>JP</given-names></name><name name-style="western"><surname>Rodrigo</surname><given-names>AG</given-names></name></person-group>
					<year>2000</year>
					<article-title>Likelihood-based tests of topologies in phylogenetics.</article-title>
					<source>Syst Biol</source>
					<volume>49</volume>
					<fpage>652</fpage>
					<lpage>670</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b032"><label>32</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Bourque</surname><given-names>G</given-names></name><name name-style="western"><surname>Pevzner</surname><given-names>P</given-names></name><name name-style="western"><surname>Tesler</surname><given-names>G</given-names></name></person-group>
					<year>2004</year>
					<article-title>Reconstructing the genomic architecture of ancestral mammals: Lessons from human, mouse, and rat genomes.</article-title>
					<source>Genome Res</source>
					<volume>14</volume>
					<fpage>507</fpage>
					<lpage>516</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b033"><label>33</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Hoeffding</surname><given-names>W</given-names></name></person-group>
					<year>1956</year>
					<article-title>On the distribution of the number of successes in independent trials.</article-title>
					<source>Ann Math Stat</source>
					<volume>27</volume>
					<fpage>713</fpage>
					<lpage>721</lpage>
				</element-citation></ref><ref id="pcbi-0020046-b034"><label>34</label><element-citation publication-type="journal" xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Beitz</surname><given-names>E</given-names></name></person-group>
					<year>2000</year>
					<article-title>TeXshade: Shading and labeling of multiple sequence alignments using LaTeX2e.</article-title>
					<source>Bioinformatics</source>
					<volume>16</volume>
					<fpage>135</fpage>
					<lpage>139</lpage>
				</element-citation></ref></ref-list></back></article>