<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="3.0" xml:lang="EN">
  <front>
    <journal-meta><journal-id journal-id-type="publisher-id">plos</journal-id><journal-id journal-id-type="publisher">pcbi</journal-id><journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id><journal-id journal-id-type="pmc">ploscomp</journal-id><!--===== Grouping journal title elements =====--><journal-title-group><journal-title>PLoS Computational Biology</journal-title></journal-title-group><issn pub-type="ppub">1553-734X</issn><issn pub-type="epub">1553-7358</issn><publisher>
        <publisher-name>Public Library of Science</publisher-name>
      </publisher></journal-meta>
    <article-meta><article-id pub-id-type="doi">10.1371/journal.pcbi.0010010</article-id><article-id pub-id-type="publisher-id">05-PLCB-RA-0018R2</article-id><article-id pub-id-type="sici">plcb-01-01-08</article-id><article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
        <subj-group subj-group-type="Discipline">
          <subject>Computational Biology</subject>
        </subj-group>
        <subj-group subj-group-type="System Taxonomy">
          <subject>None</subject>
        </subj-group>
      </article-categories><title-group><article-title>Extraction of Transcript Diversity from Scientific Literature</article-title><alt-title alt-title-type="running-head">Text Mining for Alternative Transcripts</alt-title></title-group><contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Shah</surname>
            <given-names>Parantu K</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">
            <sup>1</sup>
          </xref>
          <xref ref-type="aff" rid="aff2">
            <sup>2</sup>
          </xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Jensen</surname>
            <given-names>Lars J</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">
            <sup>1</sup>
          </xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Boué</surname>
            <given-names>Stéphanie</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">
            <sup>1</sup>
          </xref>
        </contrib>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Bork</surname>
            <given-names>Peer</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">
            <sup>1</sup>
          </xref>
          <xref ref-type="aff" rid="aff2">
            <sup>2</sup>
          </xref>
          <xref ref-type="fn" rid="cor1">
            <sup>*</sup>
          </xref>
        </contrib>
      </contrib-group><aff id="aff1">
				<label>1</label><addr-line> Structural and Computational Biology Program, European Molecular Biology Laboratory, Heidelberg, Germany,
			</addr-line></aff><aff id="aff2">
				<label>2</label><addr-line> Max Delbrück Centre for Molecular Medicine, Berlin-Buch, Germany
			</addr-line></aff><contrib-group>
        <contrib contrib-type="editor" xlink:type="simple">
          <name name-style="western">
            <surname>Bourne</surname>
            <given-names>Philip</given-names>
          </name>
          <role>Editor</role>
          <xref ref-type="aff" rid="edit1"/>
        </contrib>
      </contrib-group><aff id="edit1">University of California at San Diego, United States of America</aff><author-notes>
        <fn fn-type="con" id="ack2">
          <p>PKS, LJJ, and PB conceived and designed the experiments. PKS performed the experiments. PKS and SB analyzed the data. PKS contributed reagents/materials/analysis tools. PKS, LJJ, and PB wrote the paper.</p>
        </fn>
        <corresp id="cor1">* To whom correspondence should be addressed. E-mail: <email xlink:type="simple">bork@embl.de</email></corresp>
      <fn fn-type="conflict" id="ack1">
        <p> The authors have declared that no competing interests exist.</p>
      </fn></author-notes><pub-date pub-type="ppub">
        <month>6</month>
        <year>2005</year>
      </pub-date><pub-date pub-type="epub">
        <day>24</day>
        <month>6</month>
        <year>2005</year>
      </pub-date><volume>1</volume><issue>1</issue><elocation-id>e10</elocation-id><history>
        <date date-type="received">
          <day>1</day>
          <month>2</month>
          <year>2005</year>
        </date>
        <date date-type="accepted">
          <day>21</day>
          <month>5</month>
          <year>2005</year>
        </date>
      </history><!--===== Grouping copyright info into permissions =====--><permissions><copyright-year>2005</copyright-year><copyright-holder>Shah et al</copyright-holder><license><license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p></license></permissions><abstract>
        <p>Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at <ext-link ext-link-type="uri" xlink:href="http://www.bork.embl.de/LSAT/" xlink:type="simple">http://www.bork.embl.de/LSAT/</ext-link>.</p>
      </abstract><abstract abstract-type="alternate">
        <title>Synopsis</title>
        <sec id="st1">
          <title/>
          <p>Given the functional complexity of higher eukaryotes, the relatively small number of genes in the human and other mammalian genomes came as a surprise to the scientific community. Later it was discovered that the majority of genes are subject to alternative splicing (“cutting and pasting”) or associated mechanisms that ultimately increase the diversity of transcripts that code for proteins. Studies exploring transcript diversity are currently dominated by high-throughput experiments and computational methods; however, the quality of such data should be assessed against a reliable reference set based on single-gene studies. Unfortunately, the latter type of information is scattered throughout the scientific literature. The authors have thus developed a computational approach for extracting information on alternative transcripts from MEDLINE abstracts and used it to create a database, LSAT. LSAT (Literature Support for Alternative Transcripts) provides information for more than 4,000 genes from about 14,000 abstracts. This database can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression based on single-gene studies, which we show agrees well with EST-based studies (these studies involve tissue-specific splicing detected by the analysis of libraries of expressed sequence tags [ESTs]). These results indicate that mechanisms like alternative splicing, alternative promoters, and alternative polyadenylation work in concert to generate and regulate transcript diversity. More generally, information extraction of complex biological processes seems feasible and can also complement large-scale data generation in other areas to assign functions to genes.</p>
        </sec>
      </abstract><counts>
        <page-count count="7"/>
      </counts><!--===== Restructure custom-meta-wrap to custom-meta-group =====--><custom-meta-group>
        <custom-meta>
          <meta-name>Citation:</meta-name>
          <meta-value>Shah PK, Jensen LJ, Boué S, Bork P (2005) Extraction of transcript diversity from scientific literature. PLoS Comp Biol 1(1): e10.</meta-value>
        </custom-meta>
      </custom-meta-group></article-meta>
  </front>
  <body>
    <sec id="s1">
      <title>Introduction</title>
      <p>Although many model organisms have now been completely sequenced, we are still very far from understanding cellular function from genome sequence. One complicating factor is the expression of multiple alternative mRNA transcripts from a single gene using different mechanisms. Alternative promoters that are active in different tissues or at different developmental stages often regulate the expression of different mRNA isoforms, either directly through different transcription start sites or indirectly by promoter-directed exon inclusion in concert with alternative splicing (AS) [<xref ref-type="bibr" rid="pcbi-0010010-b01">1</xref>]. Various AS mechanisms are known: alternative 5′ or 3′ sites can result in exons of different size, exons can be included or skipped, or an entire intron may be retained [<xref ref-type="bibr" rid="pcbi-0010010-b02">2</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b05">5</xref>]. Alternative polyadenylation (AP), either alone or coupled with AS of 3′ terminal exons, may also generate transcript isoforms that are tissue- or developmental-stage-specific [<xref ref-type="bibr" rid="pcbi-0010010-b06">6</xref>].</p>
      <p>Generation of multiple alternative transcripts is important for the complexity and evolution of eukaryotic organisms [<xref ref-type="bibr" rid="pcbi-0010010-b05">5</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b07">7</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b09">9</xref>]. In addition, their spatial and temporal expression patterns are believed to be one of the important factors behind the functional specificity of different tissues and organs. Moreover, defects in these processes are associated with various diseases [<xref ref-type="bibr" rid="pcbi-0010010-b02">2</xref>]. Thus, developing an exhaustive catalog of alternative transcripts is a crucial task in order to fully understand the complexity of eukaryotes [<xref ref-type="bibr" rid="pcbi-0010010-b07">7</xref>].</p>
      <p>At present, high-throughput experiments and computational analyses dominate the mapping of the alternative transcript universe [<xref ref-type="bibr" rid="pcbi-0010010-b10">10</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b11">11</xref>]. However, the quality and the biological meaning of these assignments should be assessed against a highly reliable benchmark set, which can be extracted from single-gene studies published in the scientific literature [<xref ref-type="bibr" rid="pcbi-0010010-b03">3</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b12">12</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b13">13</xref>]. In addition, computational tools to explore the evolutionary conservation of mechanisms that generate transcript diversity (TD) are under development [<xref ref-type="bibr" rid="pcbi-0010010-b14">14</xref>], which will also require a trustworthy set for rule learning.</p>
      <p>Manual curation of experimentally determined biological events (physical interactions, AS, disease phenotypes, etc.) to generate trustworthy knowledge bases is slow compared to the rapid increase in the body of knowledge represented in the literature. Natural language processing tools thus play an increasingly important role in transferring information from free-form biomedical text to structured databases (see reviews [<xref ref-type="bibr" rid="pcbi-0010010-b15">15</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b18">18</xref>]). This task can be split into two steps: (1) a subset of documents describing events or scenarios of interest is identified (information retrieval [IR]), and (2) facts are extracted from these documents and deposited into structured fields (information extraction [IE]).</p>
      <p>IR can be performed at the level of full articles, pertinent paragraphs, or sentences. As current IE methods operate at the sentence level, it may be appropriate to perform IR at the same level. Support vector machines (SVMs) have become the method of choice for IR tasks because of their ability to learn patterns and generalize well while handling large sets of input features, a common attribute of text data [<xref ref-type="bibr" rid="pcbi-0010010-b19">19</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b21">21</xref>]. Most IE systems use rules written by domain experts to extract facts about events or scenarios of interest. The performance of most rule-based systems suffers because any event or scenario can be written in one of many syntactically correct ways. Thus, an extraction system based only on syntactic patterns would require an exhaustive collection of rules in order to cover all possible patterns. The problem posed by multiple syntactic patterns can be solved by merging them into a single semantic pattern using predicate–argument structures [<xref ref-type="bibr" rid="pcbi-0010010-b22">22</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b24">24</xref>]. Predicate–argument structures and SVMs are becoming prevalent in natural language processing and are widely believed to achieve good recall and precision; they were tested here for their applicability to the biomedical literature.</p>
      <p>Here we present the benchmark and the results of a new extraction procedure that combines an SVM classifier with rule-based extraction of semantic patterns. The extracted knowledge about TD was stored in a database and subsequently used to quantify the amount of TD in different tissues. We discuss applications of our work for the assignment of MeSH terms (from the National Library of Medicine's Medical Subject Headings thesaurus), providing functional annotations to genes and to the transcript variants generated by computational methods.</p>
    </sec>
    <sec id="s2">
      <title>Results/Discussion</title>
      <sec id="s2a">
        <title>Overall Strategy and Generation of the Database</title>
        <p>To extract information about TD and associated spatiotemporal information scattered throughout MEDLINE, we devised a two-step procedure (<xref ref-type="fig" rid="pcbi-0010010-g001">Figure 1</xref>). In the first step, sentences containing TD information were identified within the papers' abstracts. To do so, and in order to overcome the problem of multiple syntactic patterns, we trained an SVM classifier for the sentence classification task by inductive machine learning [<xref ref-type="bibr" rid="pcbi-0010010-b25">25</xref>] on an annotated corpus [<xref ref-type="bibr" rid="pcbi-0010010-b19">19</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b21">21</xref>]. We then processed the entire MEDLINE database and identified sentences describing TD within those abstracts. In the second step, sentences were parsed and the word phrases were assigned different meaningful (semantic) categories (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>).</p>
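        <p>The data flow of such a two-step procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in this work: the cue-phrase filter merely stands in for a trained SVM classifier, the regular expressions stand in for the semantic-pattern rules, and every name, cue phrase, and pattern below is hypothetical.</p>
        <preformat xml:space="preserve"><![CDATA[
```python
import re

# Hypothetical cue phrases; a stand-in for the trained SVM classifier (step 1).
TD_CUES = (
    "alternative splicing",
    "alternatively spliced",
    "splice variant",
    "alternative promoter",
    "alternative polyadenylation",
)

# Hypothetical patterns; a stand-in for the semantic categories (step 2).
CATEGORY_PATTERNS = {
    "mechanism": re.compile(
        r"alternative (?:splicing|promoters?|polyadenylation)", re.I
    ),
    "tissue": re.compile(r"\b(brain|liver|heart|muscle|kidney)\b", re.I),
}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [part.strip() for part in parts if part.strip()]


def is_td_sentence(sentence):
    """Step 1 stand-in: keep sentences that contain any cue phrase."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in TD_CUES)


def extract_categories(sentence):
    """Step 2 stand-in: collect the matches for each semantic category."""
    return {
        name: [match.group(0).lower() for match in pattern.finditer(sentence)]
        for name, pattern in CATEGORY_PATTERNS.items()
    }


def mine_abstract(abstract):
    """Run step 1 (sentence retrieval), then step 2 (extraction)."""
    return [
        (sentence, extract_categories(sentence))
        for sentence in split_sentences(abstract)
        if is_td_sentence(sentence)
    ]


example = (
    "The gene is expressed in many tissues. "
    "Alternative splicing generates two isoforms in brain and liver."
)
records = mine_abstract(example)
```
]]></preformat>
        <p>Only the second sentence of the example passes the retrieval step, and the extraction step then fills the hypothetical "mechanism" and "tissue" fields for it.</p>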
        <fig id="pcbi-0010010-g001" position="float">
          <object-id pub-id-type="doi">10.1371/journal.pcbi.0010010.g001</object-id>
          <label>Figure 1</label>
          <caption>
            <title>Creating Specialized Databases for Events of Interest</title>
            <p>A database of physiologically occurring AS events can be generated in two steps. Each step may involve machine learning or rule-based methods. The first step involves the identification of sentences from scientific text. These sentences can be parsed in a second step to extract frequently occurring semantic patterns.</p>
          </caption>
          <graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.g001" xlink:type="simple"/>
        </fig>
        <p>Finally, we mapped each abstract with information about alternative transcripts (retrieved by the SVM classifier) to entries in Swiss-Prot [<xref ref-type="bibr" rid="pcbi-0010010-b26">26</xref>], RefSeq [<xref ref-type="bibr" rid="pcbi-0010010-b27">27</xref>], GenBank [<xref ref-type="bibr" rid="pcbi-0010010-b28">28</xref>], and Ensembl [<xref ref-type="bibr" rid="pcbi-0010010-b29">29</xref>] databases, when possible. This not only provided the sequence information at genome, transcript, and protein level for the genes described in abstracts but also allowed us to access structural and functional information about these genes stored in various sequence databases. All this information obtained for each MEDLINE entry constitutes an entry in LSAT (<xref ref-type="supplementary-material" rid="pcbi-0010010-sg001">Figure S1</xref>).</p>
        <p>We identified eight different semantic categories describing biologically relevant data in the sentences describing TD, among which are event mechanism, species, tissue specificity, and experimental methods (<xref ref-type="table" rid="pcbi-0010010-t001">Table 1</xref>; see <xref ref-type="sec" rid="s3">Materials and Methods</xref>). In total we extracted 9,503 instances of event mechanisms from as many abstracts (<xref ref-type="supplementary-material" rid="pcbi-0010010-st001">Table S1</xref>) and 5,028 instances of tissues (<xref ref-type="supplementary-material" rid="pcbi-0010010-st002">Table S2</xref>) with associated gene names. Overall, the database contains 3,063, 874, and 207 nonredundant instances of AS, differential promoter usage (DP), and AP associated with genes and tissues extracted by entity taggers.</p>
        <table-wrap content-type="2col" id="pcbi-0010010-t001" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0010010.t001</object-id><label>Table 1</label><caption>
            <p>Extraction of Semantic Patterns</p>
          </caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.t001" xlink:type="simple"/></table-wrap>
      </sec>
      <sec id="s2b">
        <title>Performance of the SVM Classifier for Sentence Retrieval</title>
        <p>Our SVM classifier retrieved 31,123 putative TD-containing sentences from the MEDLINE database (12,948,515 abstracts). After false positives were removed by manual curation, 20,549 TD-containing sentences in 13,892 abstracts were left, corresponding to a precision of 66%. Details on the training set and SVM training procedure are described in <xref ref-type="sec" rid="s3">Materials and Methods</xref> and <xref ref-type="supplementary-material" rid="pcbi-0010010-sd001">Protocol S1</xref>.</p>
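        <p>The precision quoted above follows directly from the two sentence counts. The short Python sketch below reproduces the arithmetic; the function and variable names are illustrative only.</p>
        <preformat xml:space="preserve"><![CDATA[
```python
def precision(true_positives, retrieved):
    """Fraction of retrieved items that survive curation."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    return true_positives / retrieved


retrieved_sentences = 31123  # putative TD-containing sentences from the classifier
curated_sentences = 20549    # sentences left after manual removal of false positives

sentence_precision = precision(curated_sentences, retrieved_sentences)
print(f"precision = {sentence_precision:.1%}")
```
]]></preformat>
        <p>The ratio 20,549/31,123 rounds to the reported 66%.</p>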
        <p>We determined the recall of the classifier using manually curated AS annotations from MEDLINE and Swiss-Prot for human, mouse, rat, and <italic>Drosophila</italic>. All entries from MEDLINE 2004 annotated with the MeSH term “alternative splicing” and describing natural transcript generation (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>) were compared with our results. For each of these four species, we also analyzed our results on MEDLINE entries referred to in Swiss-Prot entries annotated with the keyword “alternative splicing” [<xref ref-type="bibr" rid="pcbi-0010010-b26">26</xref>]. The average recall (sensitivity) of the classifier was 61% (<xref ref-type="table" rid="pcbi-0010010-t002">Table 2</xref>; see <xref ref-type="sec" rid="s3">Materials and Methods</xref>). With a precision of 66% and an average recall of 61%, the SVM classifier can thus be used to retrieve sentences describing biological events.</p>
        <table-wrap content-type="2col" id="pcbi-0010010-t002" position="float"><object-id pub-id-type="doi">10.1371/journal.pcbi.0010010.t002</object-id><label>Table 2</label><caption>
            <p>Recall of the SVM Classifier</p>
          </caption><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.t002" xlink:type="simple"/></table-wrap>
      </sec>
      <sec id="s2c">
        <title>Performance of the IE Step</title>
        <p>From the sentences retrieved by the SVM classifier, we extracted instances of eight semantic categories (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>) and evaluated the precision and recall by manually inspecting 300 randomly selected sentences for each category (see <xref ref-type="table" rid="pcbi-0010010-t001">Table 1</xref>). Both precision and recall are highly satisfactory; however, it should be noted that accuracy in finding tag boundaries was not considered. Also, the recall is good for all categories, but not all eight categories are equally represented in the sentences (see <xref ref-type="table" rid="pcbi-0010010-t001">Table 1</xref>).</p>
      </sec>
      <sec id="s2d">
        <title>Proposing New Annotations in Curated Databases</title>
        <p>Annotators at the National Library of Medicine have manually assigned the MeSH term “alternative splicing” to 8,133 abstracts. During the IE step, we identified 1,536 additional abstracts that mention AS but lack the MeSH term “alternative splicing,” corresponding to a 19% increase in annotation. We also identified DP and AP in 874 and 219 abstracts, respectively, for which we propose the new MeSH terms “alternative promoters” and “alternative polyadenylation” (<xref ref-type="supplementary-material" rid="pcbi-0010010-st003">Tables S3</xref>–<xref ref-type="supplementary-material" rid="pcbi-0010010-st006">S6</xref>).</p>
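        <p>The logic behind the proposed MeSH assignments can be illustrated as a set difference followed by a ratio. In the Python sketch below, the abstract identifiers are hypothetical and serve only to show the set operation; the two counts passed to the ratio are the ones reported in the text.</p>
        <preformat xml:space="preserve"><![CDATA[
```python
def propose_new_annotations(extracted_ids, annotated_ids):
    """Abstracts found by extraction that lack the manual MeSH term."""
    return set(extracted_ids) - set(annotated_ids)


def relative_increase(n_new, n_existing):
    """New annotations expressed as a fraction of the existing ones."""
    if n_existing <= 0:
        raise ValueError("n_existing must be positive")
    return n_new / n_existing


# Hypothetical abstract identifiers, for illustration only.
extracted = {101, 102, 103, 104}
annotated = {102, 104, 105}
candidates = propose_new_annotations(extracted, annotated)

# Counts reported in the text: 1,536 additional versus 8,133 annotated abstracts.
increase = relative_increase(1536, 8133)
print(f"annotation increase = {increase:.0%}")
```
]]></preformat>
        <p>With the reported counts, 1,536/8,133 rounds to the 19% increase stated above.</p>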
        <p>We also quantified the number of Ensembl genes for which we can propose new annotations for AS (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>). The annotation increase observed was 20%, 52%, and 105% for human, mouse, and rat genomes, respectively (<xref ref-type="supplementary-material" rid="pcbi-0010010-sg002">Figure S2</xref>). These tentative assignments can supplement the work of curators, and the numbers are likely to reflect the current extent of manual curation for these different genomes. The annotation increase for the human genes was relatively small compared with that for the rat genes because 3,438 genes are already annotated in Swiss-Prot and RefSeq for AS in human, whereas only 342 genes are annotated for AS in rat. Even more annotations could be obtained by manually curating extracted events that could not be automatically mapped to a sequence database entry; we have manually mapped 190 genes exhibiting tissue-specific splicing. The observed increase in the annotation emphasizes the need for automated methods to speed up the process of database curation.</p>
      </sec>
      <sec id="s2e">
        <title>Quantification of the Different Mechanisms That Lead to TD</title>
        <p>The majority of vertebrate multi-exon genes undergo AS [<xref ref-type="bibr" rid="pcbi-0010010-b10">10</xref>]. Moreover, different promoters may control the transcription of different mRNA isoforms, which may result in directed 5′ exon inclusion/exclusion, and AP signals can control the tissue specificity of alternative 3′ exons. While examples of synergy between these mechanisms are known, its extent is still being explored. We found DP co-mentioned with AS in 14% of abstracts describing genes with differential promoters. A total of 19% of the abstracts providing information about alternative first exon usage also mentioned usage of different promoters. A total of 17% of the abstracts describing AP also mentioned AS.</p>
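        <p>Co-mention percentages of this kind are conditional fractions: among the abstracts tagged with one mechanism, the share that is also tagged with another. The Python sketch below shows the computation on hypothetical abstract identifiers and tags; it does not reproduce the counts behind the percentages given above.</p>
        <preformat xml:space="preserve"><![CDATA[
```python
def co_mention_fraction(tags_by_abstract, given, also):
    """Fraction of abstracts tagged `given` that are also tagged `also`."""
    with_given = [tags for tags in tags_by_abstract.values() if given in tags]
    if not with_given:
        return 0.0
    return sum(1 for tags in with_given if also in tags) / len(with_given)


# Hypothetical abstracts and mechanism tags, for illustration only.
tags_by_abstract = {
    "abs1": {"DP", "AS"},
    "abs2": {"DP"},
    "abs3": {"DP"},
    "abs4": {"DP"},
    "abs5": {"AS", "AP"},
}

dp_with_as = co_mention_fraction(tags_by_abstract, given="DP", also="AS")
ap_with_as = co_mention_fraction(tags_by_abstract, given="AP", also="AS")
```
]]></preformat>
        <p>Note that the fraction is not symmetric: the share of DP abstracts that mention AS generally differs from the share of AS abstracts that mention DP, so the conditioning set must be stated with each percentage.</p>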
        <p>The extent to which various mechanisms are utilized for increasing TD may vary across different anatomical systems. To study this, we mapped all vertebrate tissue information to anatomical systems using the MeSH anatomy terms and counted the number of nonredundant events extracted for each mechanism in each system (<xref ref-type="fig" rid="pcbi-0010010-g002">Figure 2</xref>, top panel). AS is utilized equally in most organs except in the nervous system, where AS is significantly overrepresented (<xref ref-type="fig" rid="pcbi-0010010-g002">Figure 2</xref>, bottom panel). Similarly, there is significant overrepresentation of DP in the connective tissues and to a lesser extent in the digestive system and in the genitalia.</p>
        <fig id="pcbi-0010010-g002" position="float">
          <object-id pub-id-type="doi">10.1371/journal.pcbi.0010010.g002</object-id>
          <label>Figure 2</label>
          <caption>
            <title>Preference for the Utilization of TD-Generating Mechanisms across Anatomical Systems</title>
            <p>Nonredundant instances of AS, DP, and AP are plotted against anatomical systems in which expression was found. The color of each square in the top panel signifies the ratio of the number of events detected for the system to the highest number of events within the row. The total number of nonredundant instances for each mechanism is on the left. The bottom panel shows the negative logarithm of <italic>p</italic>-values (see <xref ref-type="sec" rid="s3">Materials and Methods</xref> for details). The anatomical systems are as follows: A, cardiovascular system; B, cells; C, connective tissues; D, digestive system; E, fetal/embryonic structures; F, endocrine system; G, exocrine glands; H, genitalia; I, immune system; J, integumentary system; K, musculoskeletal system; L, nervous system; M, respiratory system; N, sense regions; O, urinary system.</p>
          </caption>
          <graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.g002" xlink:type="simple"/>
        </fig>
        <p>To our knowledge, the information about alternative promoter usage linked with specific gene names and tissues extracted in this study is the largest such collection available. We expect it to provide a reliable dataset for the development of computational methods for predicting tissue-specific promoter usage.</p>
      </sec>
      <sec id="s2f">
        <title>Tissue-Specific Differences in the Extent of AS</title>
        <p>AS has been shown to play an important role in creating functional specialization of tissues and development stages [<xref ref-type="bibr" rid="pcbi-0010010-b30">30</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b31">31</xref>], but only a small number of instances of tissue-specific splicing are listed in the current AS databases [<xref ref-type="bibr" rid="pcbi-0010010-b32">32</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b33">33</xref>]. With a large collection of high-quality AS events in hand, tissue-specific differences in AS should become visible. We checked entries in our database containing the field “specificity.” We identified 959 events describing tissue specificity in AS. These represented 675 AS events for pairs of tissues and 284 events where only one tissue was reported. The results contained 400 nonredundant events for 183 human genes. We also mapped a further 190 genes (not included above) from various species to Swiss-Prot identifiers during the manual curation.</p>
        <p>To study the extent of tissue-specific AS, we mapped tissues and organs to respective systems as described in the previous section and plotted the results (<xref ref-type="fig" rid="pcbi-0010010-g003">Figure 3</xref>, left panel). The nervous system, genitalia, and immune, digestive, and musculoskeletal systems showed extensive tissue specificity in inter- and intra-systemic AS. These systems also showed expression of unique AS transcripts, with the nervous system showing the highest number of unique transcripts. These tissue-specific patterns of expression extracted from the literature strongly overlap with the 667 tissue-specific AS events derived from analysis of the EST data [<xref ref-type="bibr" rid="pcbi-0010010-b33">33</xref>] for 454 human genes across 46 tissues (<xref ref-type="fig" rid="pcbi-0010010-g003">Figure 3</xref>, right panel).</p>
        <fig id="pcbi-0010010-g003" position="float">
          <object-id pub-id-type="doi">10.1371/journal.pcbi.0010010.g003</object-id>
          <label>Figure 3</label>
          <caption>
            <title>Tissue Specificity in AS</title>
            <p>The figure shows the body system distribution of differential/specific splicing. The instances were obtained from literature mining (left panel) and analysis of EST data ([<xref ref-type="bibr" rid="pcbi-0010010-b33">33</xref>]; right panel). Each square is colored according to the ratio between the corresponding count and the highest count within the panel. Letter codes for anatomical systems are as in <xref ref-type="fig" rid="pcbi-0010010-g002">Figure 2</xref>. P represents a unique transcript.</p>
          </caption>
          <graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.g003" xlink:type="simple"/>
        </fig>
        <p>The knowledge extracted from the literature confirms EST-based studies [<xref ref-type="bibr" rid="pcbi-0010010-b31">31</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b33">33</xref>] and earlier experimental studies [<xref ref-type="bibr" rid="pcbi-0010010-b34">34</xref>] that showed AS as the preferred mechanism for generating TD across the nervous system. EST-based studies [<xref ref-type="bibr" rid="pcbi-0010010-b31">31</xref>] have also suggested that genes in liver (digestive system) and testis (genitalia) show distinct patterns of splicing with alternative exons. Our results indicate that these transcripts may show these different patterns of splicing in combination with different promoter regions. This conclusion seems plausible since AS of first exons is influenced by alternative promoter regions in at least 19% of cases (see above; [<xref ref-type="bibr" rid="pcbi-0010010-b35">35</xref>]), and it should be explored further.</p>
      </sec>
      <sec id="s2g">
        <title>Assigning Function to the Transcripts Generated by Computational Analysis</title>
        <p>Experimental biologists sometimes speculate, on the basis of a limited number of experiments, about the mechanism responsible for the multiple transcripts they observe, but the corresponding transcripts are not deposited in GenBank. For example, work by Pisarra et al. [<xref ref-type="bibr" rid="pcbi-0010010-b36">36</xref>] on human <italic>Dopachrome tautomerase</italic> describes two transcripts in melanocytes and melanomas with a “different carboxyl-terminus” generated, concluding that “dopachrome tautomerase can yield different isoforms by alternative poly(A) site usage or by alternative splicing” (<xref ref-type="fig" rid="pcbi-0010010-g004">Figure 4</xref>).</p>
        <fig id="pcbi-0010010-g004" position="float">
          <object-id pub-id-type="doi">10.1371/journal.pcbi.0010010.g004</object-id>
          <label>Figure 4</label>
          <caption>
            <title>Assignment of Function Using Database Knowledge</title>
            <p>This figure shows a database entry that derives very little functional annotation from sequence databases. Text extraction rules were successful in identifying gene name, tissue, and event mechanism for the <italic>Dopachrome tautomerase</italic> gene. Multiple transcripts of the gene were produced using SPLICE-POA [<xref ref-type="bibr" rid="pcbi-0010010-b37">37</xref>] by utilizing alternative 3′ splice sites and polyadenylation signals, as speculated in the research article (bottom panel). Pink rectangles denote the exons, black lines describe constitutive splice sites, and blue lines show alternative splice sites. Black arrows show the different proteins generated via AS.</p>
          </caption>
          <graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.g004" xlink:type="simple"/>
        </fig>
        <p>On the other hand, various methods, including those based on aligning EST and other sequence data to genomic regions, are currently in use for detecting AS on a large scale. The function of the isoforms thus generated is largely unknown [<xref ref-type="bibr" rid="pcbi-0010010-b37">37</xref>], and these transcripts are poorly annotated in sequence databases.</p>
        <p>Using the heaviest bundling algorithm [<xref ref-type="bibr" rid="pcbi-0010010-b37">37</xref>] with genomic sequence data from Ensembl [<xref ref-type="bibr" rid="pcbi-0010010-b38">38</xref>], and transcript data from UniGene [<xref ref-type="bibr" rid="pcbi-0010010-b39">39</xref>] clusters for the gene, we were able to generate two transcript isoforms for <italic>Dopachrome tautomerase</italic> (<xref ref-type="fig" rid="pcbi-0010010-g004">Figure 4</xref>, bottom) resembling those described by Pisarra et al. [<xref ref-type="bibr" rid="pcbi-0010010-b36">36</xref>] and were able to detect an AS event in the 3′ region. Hence, the use of large-scale methods may provide detailed information about underlying events, and text mining would provide functional annotations to the transcript isoforms observed.</p>
      </sec>
      <sec id="s2h">
        <title>Conclusions</title>
        <p>We successfully extracted information about the genes that express multiple transcripts and associated spatiotemporal information using state-of-the-art methods in natural language processing and utilized it for function annotations. The information extracted far exceeds current manual curation efforts and generates reliable results. Our results indicate that mechanisms like AS, DP, and AP work in concert for the generation and regulation of TD. They also suggest that the nervous system preferentially relies on AS over other mechanisms to express the largest set of tissue-specific transcripts. In contrast, genitalia and the digestive system more frequently make use of alternative promoter regions. The knowledge stored in the database about synergy and preference for TD-generating mechanisms across tissues will be integrated with high-throughput data in the future. More generally, IE of complex biological processes seems feasible and can also complement large-scale data generation in other areas to assign function.</p>
      </sec>
    </sec>
    <sec id="s3">
      <title>Materials and Methods</title>
      <sec>
        <title/>
        <sec id="s3a1">
          <title>Training corpus and SVM learning.</title>
          <p>A set of 4,240 sentences describing physiological TD and 13,520 negative sentences were selected as a training corpus from article titles and abstracts. Sentences describing mutations, clinical studies involving patients, nucleotide transversions, and splicing mechanisms were considered negative sentences. Sentences describing natural gene expression, gene paralogs, and aberrant transcripts showed word usage similar to those describing TD, making the classification task more challenging. A description of the learning corpus can be found in <xref ref-type="supplementary-material" rid="pcbi-0010010-sd001">Protocol S1</xref> and <xref ref-type="supplementary-material" rid="pcbi-0010010-sg003">Figure S3</xref>.</p>
          <p>The text in all the abstracts was split into sentences using the Oak system (S. Sekine, unpublished data; <ext-link ext-link-type="uri" xlink:href="http://nlp.cs.nyu.edu/oak/" xlink:type="simple">http://nlp.cs.nyu.edu/oak/</ext-link>). All the sentences were tagged with Tree-tagger [<xref ref-type="bibr" rid="pcbi-0010010-b40">40</xref>] to give words their part-of-speech tags. Sentences were broken down into constituent words and stemmed to act as features to learn from. Stop words and words with certain part-of-speech tags were removed from the primary features. To add domain knowledge and enrich the features to learn from, frequently occurring word bi-grams and tri-grams were also defined from unprocessed sentences. The feature file was large, containing 23,742 features.</p>
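          <p>The feature construction described above can be sketched as follows. This is a minimal illustration in Python and not the pipeline used in this study: the stop-word list, the suffix-stripping stemmer, and all function names are hypothetical stand-ins, and the part-of-speech filtering step is omitted.</p>
          <preformat xml:space="preserve"><![CDATA[
```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); the list used in the study is not given.
STOP_WORDS = {"the", "of", "in", "a", "an", "is", "are", "and", "by", "to"}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_features(sentence):
    """Stemmed tokens with stop words removed (the primary features)."""
    return [crude_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(sentences, n, min_count):
    """n-grams of unprocessed sentences that occur at least min_count times."""
    counts = Counter(g for s in sentences for g in ngrams(tokenize(s), n))
    return {g for g, c in counts.items() if c >= min_count}


def sentence_features(sentence, frequent):
    """Bag of stemmed words plus any frequent bi-/tri-grams in the raw sentence."""
    raw = tokenize(sentence)
    features = Counter(word_features(sentence))
    for n in (2, 3):
        features.update(g for g in ngrams(raw, n) if g in frequent)
    return features


corpus = [
    "The gene is alternatively spliced in brain",
    "Alternatively spliced transcripts were detected in liver",
]
frequent_bigrams = frequent_ngrams(corpus, 2, min_count=2)
print(sentence_features(corpus[0], frequent_bigrams))
```
]]></preformat>
          <p>In this sketch the n-grams are counted on unstemmed, unfiltered tokens, mirroring the statement that bi-grams and tri-grams were defined from unprocessed sentences; the frequency threshold is an arbitrary choice made for the example.</p>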
          <p>The procedure of inductive learning (see <xref ref-type="supplementary-material" rid="pcbi-0010010-sd001">Protocol S1</xref>) was applied for the sentence classification task, using the feature vectors described above as input. We compared the performance of naïve Bayes, expectation maximization, maximum entropy, variants of TF-IDF, K-nearest neighbors, and support vector machines for the task [<xref ref-type="bibr" rid="pcbi-0010010-b21">21</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b41">41</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b43">43</xref>]. The SVM with a radial basis function kernel (gamma = 1.5 and <italic>C</italic> = 100) outperformed other methods and SVM classifiers with linear and sigmoid kernel functions (P. K. Shah and P. Bork, unpublished data).</p>
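          <p>For readers unfamiliar with the kernel named above, the following sketch evaluates a radial basis function kernel in plain Python. It assumes the common parameterization K(x, y) = exp(−gamma · ||x − y||²); the text does not state which SVM implementation or parameterization was used, and the function name and toy vectors are hypothetical. The parameter <italic>C</italic> is the soft-margin penalty of the SVM optimization and does not enter the kernel itself.</p>
          <preformat xml:space="preserve"><![CDATA[
```python
import math


def rbf_kernel(x, y, gamma=1.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance).

    gamma=1.5 is the value reported in the text; the functional form is the
    common parameterization and is an assumption here.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy feature vectors (invented for illustration).
same = rbf_kernel([1.0, 0.0], [1.0, 0.0])   # identical vectors give 1.0
near = rbf_kernel([1.0, 0.0], [0.0, 0.0])   # squared distance 1 gives exp(-1.5)
print(same, near)
```
]]></preformat>
          <p>The kernel value decays from 1 toward 0 as two feature vectors move apart, so a larger gamma makes the classifier more sensitive to small differences between sentences.</p>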
          <p>The classifier was trained to extract only the natural TD from the written text, as opposed to aberrant transcripts that are caused by genetic changes. For consistency, we removed 2,767 of the 8,133 MEDLINE entries with the MeSH term “alternative splicing” because they also had the MeSH term “mutation,” had no abstract text, or had an erroneous assignment of the MeSH term “alternative splicing.”</p>
        </sec>
        <sec id="s3a2">
          <title>Definitions of precision and recall.</title>
          <p>Precision and recall are used in IR to measure the performance of methods; they are defined as follows.</p>
          <disp-formula id="pcbi-0010010-e001"><graphic mimetype="image" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.e001" xlink:type="simple"/></disp-formula>
          <p>Here, TP, TN, FP, and FN denote the numbers of true-positive, true-negative, false-positive, and false-negative elements, respectively, according to a classification criterion.</p>
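          <p>As a worked illustration, the sketch below computes both measures from raw counts, assuming the standard IR definitions (precision = TP / (TP + FP) and recall = TP / (TP + FN)). The counts are invented for the example and are not results from this study; returning 0.0 for an empty denominator is a convention chosen here.</p>
          <preformat xml:space="preserve"><![CDATA[
```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    predicted_positive = tp + fp
    # Convention chosen for this sketch: 0.0 when nothing was predicted positive.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found: TP / (TP + FN)."""
    actual_positive = tp + fn
    # Convention chosen for this sketch: 0.0 when there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 80, 20, 10
print(precision(tp, fp), recall(tp, fn))
```
]]></preformat>
          <p>Note that TN does not appear in either measure, which is why precision and recall remain informative when negative sentences greatly outnumber positive ones.</p>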
        </sec>
        <sec id="s3a3">
          <title>Parsing of the sentences using semantic patterns.</title>
          <p>An event or a scenario is described in a sentence via the combination of a predicate (normally a verb) and its arguments [<xref ref-type="bibr" rid="pcbi-0010010-b22">22</xref>–<xref ref-type="bibr" rid="pcbi-0010010-b24">24</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b44">44</xref>]. While the same biological relation can be described in many syntactically different ways, only a limited number of semantic categories (e.g., gene name or tissue name) may accompany the predicates (see <xref ref-type="supplementary-material" rid="pcbi-0010010-sd001">Protocol S1</xref> for further discussion). Therefore, at this step we can apply rule-based methods without much loss of sensitivity.</p>
          <p>We constructed semantic patterns similar to those described in the PASBio database of predicate–argument structure [<xref ref-type="bibr" rid="pcbi-0010010-b22">22</xref>]. These patterns match informative parts of sentences, e.g., <italic>“gene</italic> lacks exon <italic>n</italic> in <italic>tissue.”</italic> The Stanford lexical parser was used for parsing the sentences [<xref ref-type="bibr" rid="pcbi-0010010-b45">45</xref>,<xref ref-type="bibr" rid="pcbi-0010010-b46">46</xref>]. Sentence trees were viewed using the TigerSearch tool to generate rules for extracting the semantic patterns from sentences [<xref ref-type="bibr" rid="pcbi-0010010-b47">47</xref>]. (See <xref ref-type="supplementary-material" rid="pcbi-0010010-sd001">Protocol S1</xref> for examples of rules.)</p>
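          <p>To make the idea of a semantic pattern concrete, the sketch below matches the example pattern with a regular expression. This is a deliberately simplified stand-in: the rules in this study operate on parse trees, not on flat strings, and the pattern, the function name, and the gene name GENE1 are hypothetical.</p>
          <preformat xml:space="preserve"><![CDATA[
```python
import re

# Flat-string approximation of the pattern "gene lacks exon n in tissue".
# Real extraction rules work on syntactic trees; this regex is an assumption
# made purely for illustration.
LACKS_EXON = re.compile(
    r"(?P<gene>[A-Za-z0-9-]+)\s+lacks\s+exon\s+(?P<exon>\d+[A-Za-z]?)"
    r"\s+in\s+(?:the\s+)?(?P<tissue>[A-Za-z]+)"
)


def extract_event(sentence):
    """Return the gene, exon, and tissue slots if the pattern matches, else None."""
    match = LACKS_EXON.search(sentence)
    return match.groupdict() if match else None


# Invented example sentences.
print(extract_event("The short isoform of GENE1 lacks exon 3 in the liver."))
print(extract_event("GENE1 is expressed in liver."))
```
]]></preformat>
          <p>A flat pattern of this kind breaks as soon as the word order changes (for example, "exon 3 is absent from GENE1 in liver"), which illustrates why rules defined over predicate–argument structure are preferable.</p>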
          <p>The success in assigning gene, species, and event mechanisms to abstracts is as follows (<xref ref-type="supplementary-material" rid="pcbi-0010010-sg003">Figure S3</xref>). A total of 46% of all abstracts were directly mapped to literature entries in sequence databases such as Swiss-Prot, RefSeq, and GenBank. A further 15% of all abstracts were assigned gene names using a gene tagger [<xref ref-type="bibr" rid="pcbi-0010010-b48">48</xref>], with the species name extracted from the sentences and/or from the MeSH terms mapped with the synonym list. However, only 54% of all abstracts could be unambiguously assigned to a unique species (see <xref ref-type="fig" rid="pcbi-0010010-g002">Figure 2</xref>, category A in lower right histogram). The rest of the abstracts may have had gene and species information but they could not be assigned to a sequence database. Tissues were tagged using a dictionary made of tissue lists from the Swiss-Prot and RefSeq databases. They were assigned to the relevant anatomical system (top level MeSH anatomy terms) using the MeSH browser. We have submitted these entries for manual curation to EMBL-EBI's Alternative Exon Database [<xref ref-type="bibr" rid="pcbi-0010010-b32">32</xref>].</p>
        </sec>
        <sec id="s3a4">
          <title>Quantifying the gain in gene annotation.</title>
          <p>To quantify the gain in gene annotation, first we mapped sequence information to the MEDLINE identifiers from the SVM classification using literature entries in Swiss-Prot, RefSeq, and GenBank. Second, we mapped sequence-containing entries for human, mouse, and rat genes present in our results and in those databases to Ensembl gene identifiers using the EnsMart system. Then we compared our annotations to those of Swiss-Prot and RefSeq to identify genes that were missed during the manual curation of AS. Special care was taken to avoid annotations that may have arisen because of a single literature entry mapping to multiple database entries. Hence, these annotations were highly significant.</p>
        </sec>
        <sec id="s3a5">
          <title>Associating TD-generating mechanisms with organ systems.</title>
          <p>The significance of the association of each TD-generating mechanism with each organ system was evaluated using the hypergeometric distribution. We corrected these <italic>p</italic>-values for multiple testing by calculating the false discovery rate using the Benjamini-Hochberg formula [<xref ref-type="bibr" rid="pcbi-0010010-b49">49</xref>]. We found 14 significant associations (out of 45) at a 5% false discovery rate, three of which were also significant at a 1% false discovery rate.</p>
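          <p>The two statistical steps above can be sketched as follows. The sketch assumes a one-sided over-representation test (the upper tail of the hypergeometric distribution) and the usual step-up form of the Benjamini-Hochberg adjustment; the exact test configuration is not specified in the text, and the function names and numbers are hypothetical.</p>
          <preformat xml:space="preserve"><![CDATA[
```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Illustrative reading (assumption): population = all events, successes =
    events for one mechanism, draws = events in one anatomical system,
    k = events for that mechanism in that system.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy numbers, for illustration only.
p = hypergeom_upper_tail(2, population=10, successes=4, draws=3)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in adjusted]
print(p, adjusted, significant)
```
]]></preformat>
          <p>An association is called significant at a 5% false discovery rate when its adjusted value is at most 0.05. The 45 tests reported above presumably correspond to the three mechanisms crossed with the 15 anatomical systems.</p>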
        </sec>
      </sec>
    </sec>
    <sec id="s4">
      <title>Supporting Information</title>
      <supplementary-material id="pcbi-0010010-sg001" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.sg001" xlink:type="simple">
        <label>Figure S1</label>
        <caption>
          <title>An Example Database Entry</title>
          <p>(1.7 MB TIF).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-sg002" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.sg002" xlink:type="simple">
        <label>Figure S2</label>
        <caption>
          <title>Distribution of the Results of the IE Step</title>
          <p>(4.6 MB TIF).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-sg003" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.sg003" xlink:type="simple">
        <label>Figure S3</label>
        <caption>
          <title>Description of the Training Set</title>
          <p>(60 KB PDF).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-sd001" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.sd001" xlink:type="simple">
        <label>Protocol S1</label>
        <caption>
          <title>Supplementary Text</title>
          <p>(112 KB PDF).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-st001" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.st001" xlink:type="simple">
        <label>Table S1</label>
        <caption>
          <title>Genes and Associated TD-Generating Mechanism</title>
          <p>(423 KB TXT).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-st002" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.st002" xlink:type="simple">
        <label>Table S2</label>
        <caption>
          <title>Genes and Tissues</title>
          <p>(120 KB TXT).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-st003" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.st003" xlink:type="simple">
        <label>Table S3</label>
        <caption>
          <title>Abstracts Describing AS</title>
          <p>(445 KB XLS).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-st004" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.st004" xlink:type="simple">
        <label>Table S4</label>
        <caption>
          <title>Abstracts Describing Alternative Promoters</title>
          <p>(76 KB XLS).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-st005" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.st005" xlink:type="simple">
        <label>Table S5</label>
        <caption>
          <title>Abstracts Describing Alternative Initiation</title>
          <p>(20 KB XLS).</p>
        </caption>
      </supplementary-material>
      <supplementary-material id="pcbi-0010010-st006" position="float" xlink:href="info:doi/10.1371/journal.pcbi.0010010.st006" xlink:type="simple">
        <label>Table S6</label>
        <caption>
          <title>Abstracts Describing AP</title>
          <p>(29 KB XLS).</p>
        </caption>
      </supplementary-material>
    </sec>
  </body>
  <back>
    <ack>
      <p>The authors would like to thank Yi Xing and Dr. Christopher Lee for providing the code for SPLICE-POA and the isoform generation algorithm.</p>
    </ack>
    
    <glossary>
      <title>Abbreviations</title>
      <def-list>
        <def-item>
          <term>AP</term>
          <def>
            <p>alternative polyadenylation</p>
          </def>
        </def-item>
        <def-item>
          <term>AS</term>
          <def>
            <p>alternative splicing</p>
          </def>
        </def-item>
        <def-item>
          <term>DP</term>
          <def>
            <p>differential promoter usage</p>
          </def>
        </def-item>
        <def-item>
          <term>IE</term>
          <def>
            <p>information extraction</p>
          </def>
        </def-item>
        <def-item>
          <term>IR</term>
          <def>
            <p>information retrieval</p>
          </def>
        </def-item>
        <def-item>
          <term>SVM</term>
          <def>
            <p>support vector machine</p>
          </def>
        </def-item>
        <def-item>
          <term>TD</term>
          <def>
            <p>transcript diversity</p>
          </def>
        </def-item>
      </def-list>
    </glossary>
    <ref-list>
      <title>
        <bold>References</bold>
      </title>
      <ref id="pcbi-0010010-b01">
        <label>1</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Landry</surname><given-names>JR</given-names></name><name name-style="western"><surname>Mager</surname><given-names>DL</given-names></name><name name-style="western"><surname>Wilhelm</surname><given-names>BT</given-names></name></person-group>
					<year>2003</year>
					<article-title>Complex controls: The role of alternative promoters in mammalian genomes.</article-title>
					<source>Trends Genet</source>
					<volume>19</volume>
					<fpage>640</fpage>
					<lpage>648</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b02">
        <label>2</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Garcia-Blanco</surname><given-names>MA</given-names></name><name name-style="western"><surname>Baraniak</surname><given-names>AP</given-names></name><name name-style="western"><surname>Lasda</surname><given-names>EL</given-names></name></person-group>
					<year>2004</year>
					<article-title>Alternative splicing in disease and therapy.</article-title>
					<source>Nat Biotechnol</source>
					<volume>22</volume>
					<fpage>535</fpage>
					<lpage>546</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b03">
        <label>3</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Modrek</surname><given-names>B</given-names></name><name name-style="western"><surname>Lee</surname><given-names>C</given-names></name></person-group>
					<year>2002</year>
					<article-title>A genomic view of alternative splicing.</article-title>
					<source>Nat Genet</source>
					<volume>30</volume>
					<fpage>13</fpage>
					<lpage>19</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b04">
        <label>4</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Black</surname><given-names>DL</given-names></name></person-group>
					<year>2003</year>
					<article-title>Mechanisms of alternative pre-messenger RNA splicing.</article-title>
					<source>Annu Rev Biochem</source>
					<volume>72</volume>
					<fpage>291</fpage>
					<lpage>336</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b05">
        <label>5</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Boue</surname><given-names>S</given-names></name><name name-style="western"><surname>Letunic</surname><given-names>I</given-names></name><name name-style="western"><surname>Bork</surname><given-names>P</given-names></name></person-group>
					<year>2003</year>
					<article-title>Alternative splicing and evolution.</article-title>
					<source>Bioessays</source>
					<volume>25</volume>
					<fpage>1031</fpage>
					<lpage>1034</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b06">
        <label>6</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Edwalds-Gilbert</surname><given-names>G</given-names></name><name name-style="western"><surname>Veraldi</surname><given-names>KL</given-names></name><name name-style="western"><surname>Milcarek</surname><given-names>C</given-names></name></person-group>
					<year>1997</year>
					<article-title>Alternative poly(A) site selection in complex transcription units: Means to an end?</article-title>
					<source>Nucleic Acids Res</source>
					<volume>25</volume>
					<fpage>2547</fpage>
					<lpage>2561</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b07">
        <label>7</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Graveley</surname><given-names>BR</given-names></name></person-group>
					<year>2001</year>
					<article-title>Alternative splicing: Increasing diversity in the proteomic world.</article-title>
					<source>Trends Genet</source>
					<volume>17</volume>
					<fpage>100</fpage>
					<lpage>107</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b08">
        <label>8</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Brett</surname><given-names>D</given-names></name><name name-style="western"><surname>Pospisil</surname><given-names>H</given-names></name><name name-style="western"><surname>Valcarcel</surname><given-names>J</given-names></name><name name-style="western"><surname>Reich</surname><given-names>J</given-names></name><name name-style="western"><surname>Bork</surname><given-names>P</given-names></name></person-group>
					<year>2002</year>
					<article-title>Alternative splicing and genome complexity.</article-title>
					<source>Nat Genet</source>
					<volume>30</volume>
					<fpage>29</fpage>
					<lpage>30</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b09">
        <label>9</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lareau</surname><given-names>LF</given-names></name><name name-style="western"><surname>Green</surname><given-names>RE</given-names></name><name name-style="western"><surname>Bhatnagar</surname><given-names>RS</given-names></name><name name-style="western"><surname>Brenner</surname><given-names>SE</given-names></name></person-group>
					<year>2004</year>
					<article-title>The evolving roles of alternative splicing.</article-title>
					<source>Curr Opin Struct Biol</source>
					<volume>14</volume>
					<fpage>273</fpage>
					<lpage>282</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b10">
        <label>10</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Johnson</surname><given-names>JM</given-names></name><name name-style="western"><surname>Castle</surname><given-names>J</given-names></name><name name-style="western"><surname>Garrett-Engele</surname><given-names>P</given-names></name><name name-style="western"><surname>Kan</surname><given-names>Z</given-names></name><name name-style="western"><surname>Loerch</surname><given-names>PM</given-names></name><etal/></person-group>
					<year>2003</year>
					<article-title>Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays.</article-title>
					<source>Science</source>
					<volume>302</volume>
					<fpage>2141</fpage>
					<lpage>2144</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b11">
        <label>11</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Hu</surname><given-names>GK</given-names></name><name name-style="western"><surname>Madore</surname><given-names>SJ</given-names></name><name name-style="western"><surname>Moldover</surname><given-names>B</given-names></name><name name-style="western"><surname>Jatkoe</surname><given-names>T</given-names></name><name name-style="western"><surname>Balaban</surname><given-names>D</given-names></name><etal/></person-group>
					<year>2001</year>
					<article-title>Predicting splice variant from DNA chip expression data.</article-title>
					<source>Genome Res</source>
					<volume>11</volume>
					<fpage>1237</fpage>
					<lpage>1245</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b12">
        <label>12</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Modrek</surname><given-names>B</given-names></name><name name-style="western"><surname>Resch</surname><given-names>A</given-names></name><name name-style="western"><surname>Grasso</surname><given-names>C</given-names></name><name name-style="western"><surname>Lee</surname><given-names>C</given-names></name></person-group>
					<year>2001</year>
					<article-title>Genome-wide detection of alternative splicing in expressed sequences of human genes.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>29</volume>
					<fpage>2850</fpage>
					<lpage>2859</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b13">
        <label>13</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Modrek</surname><given-names>B</given-names></name><name name-style="western"><surname>Lee</surname><given-names>CJ</given-names></name></person-group>
					<year>2003</year>
					<article-title>Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss.</article-title>
					<source>Nat Genet</source>
					<volume>34</volume>
					<fpage>177</fpage>
					<lpage>180</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b14">
        <label>14</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Philipps</surname><given-names>DL</given-names></name><name name-style="western"><surname>Park</surname><given-names>JW</given-names></name><name name-style="western"><surname>Graveley</surname><given-names>BR</given-names></name></person-group>
					<year>2004</year>
					<article-title>A computational and experimental approach toward a priori identification of alternatively spliced exons.</article-title>
					<source>RNA</source>
					<volume>10</volume>
					<fpage>1838</fpage>
					<lpage>1844</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b15">
        <label>15</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Andrade</surname><given-names>MA</given-names></name><name name-style="western"><surname>Bork</surname><given-names>P</given-names></name></person-group>
					<year>2000</year>
					<article-title>Automated extraction of information in molecular biology.</article-title>
					<source>FEBS Lett</source>
					<volume>476</volume>
					<fpage>12</fpage>
					<lpage>17</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b16">
        <label>16</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>de Bruijn</surname><given-names>B</given-names></name><name name-style="western"><surname>Martin</surname><given-names>J</given-names></name></person-group>
					<year>2002</year>
					<article-title>Getting to the (c)ore of knowledge: Mining biomedical literature.</article-title>
					<source>Int J Med Inform</source>
					<volume>67</volume>
					<fpage>7</fpage>
					<lpage>18</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b17">
        <label>17</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Shatkay</surname><given-names>H</given-names></name><name name-style="western"><surname>Feldman</surname><given-names>R</given-names></name></person-group>
					<year>2003</year>
					<article-title>Mining the biomedical literature in the genomic era: An overview.</article-title>
					<source>J Comput Biol</source>
					<volume>10</volume>
					<fpage>821</fpage>
					<lpage>855</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b18">
        <label>18</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Hirschman</surname><given-names>L</given-names></name><name name-style="western"><surname>Park</surname><given-names>JC</given-names></name><name name-style="western"><surname>Tsujii</surname><given-names>J</given-names></name><name name-style="western"><surname>Wong</surname><given-names>L</given-names></name><name name-style="western"><surname>Wu</surname><given-names>CH</given-names></name></person-group>
					<year>2002</year>
					<article-title>Accomplishments and challenges in literature data mining for biology.</article-title>
					<source>Bioinformatics</source>
					<volume>18</volume>
					<fpage>1553</fpage>
					<lpage>1561</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b19">
        <label>19</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Cristianini</surname><given-names>N</given-names></name><name name-style="western"><surname>Shawe-Taylor</surname><given-names>J</given-names></name></person-group>
					<year>2000</year>
					<source>An introduction to support vector machines and other kernel-based learning methods</source>
					<publisher-loc>Cambridge</publisher-loc>
					<publisher-name>Cambridge University Press</publisher-name>
					<size units="page">189</size>
					<comment>p.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b20">
        <label>20</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Vapnik</surname><given-names>VN</given-names></name></person-group>
					<year>2000</year>
					<source>The nature of statistical learning theory, 2nd ed</source>
					<publisher-loc>New York</publisher-loc>
					<publisher-name>Springer</publisher-name>
					<size units="page">314</size>
					<comment>p.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b21">
        <label>21</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Joachims</surname><given-names>T</given-names></name></person-group>
					<year>2001</year>
					<source>Learning to classify text using support vector machines: Methods, theory and algorithms</source>
					<publisher-loc>Boston</publisher-loc>
					<publisher-name>Kluwer Academic Publishers</publisher-name>
					<size units="page">205</size>
					<comment>p.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b22">
        <label>22</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Wattarujeekrit</surname><given-names>T</given-names></name><name name-style="western"><surname>Shah</surname><given-names>P</given-names></name><name name-style="western"><surname>Collier</surname><given-names>N</given-names></name></person-group>
					<year>2004</year>
					<article-title>PASBio: Predicate-argument structures for event extraction in molecular biology.</article-title>
					<source>BMC Bioinformatics</source>
					<volume>5</volume>
					<fpage>155</fpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b23">
        <label>23</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Marcus</surname><given-names>M</given-names></name></person-group>
					<year>1994</year>
					<source>The Penn Treebank: A revised corpus design for extracting predicate-argument structure</source>
					<comment>1994 ARPA Human Language Technology Workshop; 1994 March; Princeton, New Jersey.</comment>
					<publisher-loc>San Francisco</publisher-loc>
					<publisher-name>Morgan Kaufmann</publisher-name>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b24">
        <label>24</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Surdeanu</surname><given-names>M</given-names></name><name name-style="western"><surname>Harabagiu</surname><given-names>S</given-names></name><name name-style="western"><surname>Williams</surname><given-names>J</given-names></name><name name-style="western"><surname>Aarseth</surname><given-names>P</given-names></name></person-group>
					<year>2003</year>
					<source>Using predicate-argument structures for information extraction</source>
					<comment>Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics; 2003; Sapporo, Japan. pp. 8–15.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b25">
        <label>25</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Mitchell</surname><given-names>TM</given-names></name></person-group>
					<year>1997</year>
					<source>Machine learning</source>
					<publisher-loc>New York</publisher-loc>
					<publisher-name>McGraw-Hill</publisher-name>
					<size units="page">414</size>
					<comment>p.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b26">
        <label>26</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Bairoch</surname><given-names>A</given-names></name><name name-style="western"><surname>Apweiler</surname><given-names>R</given-names></name></person-group>
					<year>2000</year>
					<article-title>The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>28</volume>
					<fpage>302</fpage>
					<lpage>303</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b27">
        <label>27</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Pruitt</surname><given-names>KD</given-names></name><name name-style="western"><surname>Maglott</surname><given-names>DR</given-names></name></person-group>
					<year>2001</year>
					<article-title>RefSeq and LocusLink: NCBI gene-centered resources.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>29</volume>
					<fpage>137</fpage>
					<lpage>140</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b28">
        <label>28</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Benson</surname><given-names>DA</given-names></name><name name-style="western"><surname>Karsch-Mizrachi</surname><given-names>I</given-names></name><name name-style="western"><surname>Lipman</surname><given-names>DJ</given-names></name><name name-style="western"><surname>Ostell</surname><given-names>J</given-names></name><name name-style="western"><surname>Wheeler</surname><given-names>DL</given-names></name></person-group>
					<year>2004</year>
					<article-title>GenBank: Update.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>32</volume>
					<fpage>D23</fpage>
					<lpage>D26</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b29">
        <label>29</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Birney</surname><given-names>E</given-names></name><name name-style="western"><surname>Andrews</surname><given-names>TD</given-names></name><name name-style="western"><surname>Bevan</surname><given-names>P</given-names></name><name name-style="western"><surname>Caccamo</surname><given-names>M</given-names></name><name name-style="western"><surname>Chen</surname><given-names>Y</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>An overview of Ensembl.</article-title>
					<source>Genome Res</source>
					<volume>14</volume>
					<fpage>925</fpage>
					<lpage>928</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b30">
        <label>30</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Grabowski</surname><given-names>PJ</given-names></name><name name-style="western"><surname>Black</surname><given-names>DL</given-names></name></person-group>
					<year>2001</year>
					<article-title>Alternative RNA splicing in the nervous system.</article-title>
					<source>Prog Neurobiol</source>
					<volume>65</volume>
					<fpage>289</fpage>
					<lpage>308</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b31">
        <label>31</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Yeo</surname><given-names>G</given-names></name><name name-style="western"><surname>Holste</surname><given-names>D</given-names></name><name name-style="western"><surname>Kreiman</surname><given-names>G</given-names></name><name name-style="western"><surname>Burge</surname><given-names>CB</given-names></name></person-group>
					<year>2004</year>
					<article-title>Variation in alternative splicing across human tissues.</article-title>
					<source>Genome Biol</source>
					<volume>5</volume>
					<fpage>R74</fpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b32">
        <label>32</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Thanaraj</surname><given-names>TA</given-names></name><name name-style="western"><surname>Stamm</surname><given-names>S</given-names></name><name name-style="western"><surname>Clark</surname><given-names>F</given-names></name><name name-style="western"><surname>Riethoven</surname><given-names>JJ</given-names></name><name name-style="western"><surname>Le Texier</surname><given-names>V</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>ASD: The Alternative Splicing Database.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>32</volume>
					<fpage>D64</fpage>
					<lpage>D69</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b33">
        <label>33</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Xu</surname><given-names>Q</given-names></name><name name-style="western"><surname>Modrek</surname><given-names>B</given-names></name><name name-style="western"><surname>Lee</surname><given-names>C</given-names></name></person-group>
					<year>2002</year>
					<article-title>Genome-wide detection of tissue-specific alternative splicing in the human transcriptome.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>30</volume>
					<fpage>3754</fpage>
					<lpage>3766</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b34">
        <label>34</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Mirnics</surname><given-names>K</given-names></name><name name-style="western"><surname>Pevsner</surname><given-names>J</given-names></name></person-group>
					<year>2004</year>
					<article-title>Progress in the use of microarray technology to study the neurobiology of disease.</article-title>
					<source>Nat Neurosci</source>
					<volume>7</volume>
					<fpage>434</fpage>
					<lpage>439</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b35">
        <label>35</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Zavolan</surname><given-names>M</given-names></name><name name-style="western"><surname>Kondo</surname><given-names>S</given-names></name><name name-style="western"><surname>Schonbach</surname><given-names>C</given-names></name><name name-style="western"><surname>Adachi</surname><given-names>J</given-names></name><name name-style="western"><surname>Hume</surname><given-names>DA</given-names></name><etal/></person-group>
					<year>2003</year>
					<article-title>Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome.</article-title>
					<source>Genome Res</source>
					<volume>13</volume>
					<fpage>1290</fpage>
					<lpage>1300</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b36">
        <label>36</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Pisarra</surname><given-names>P</given-names></name><name name-style="western"><surname>Lupetti</surname><given-names>R</given-names></name><name name-style="western"><surname>Palumbo</surname><given-names>A</given-names></name><name name-style="western"><surname>Napolitano</surname><given-names>A</given-names></name><name name-style="western"><surname>Prota</surname><given-names>G</given-names></name><etal/></person-group>
					<year>2000</year>
					<article-title>Human melanocytes and melanomas express novel mRNA isoforms of the tyrosinase-related protein-2/DOPAchrome tautomerase gene: Molecular and functional characterization.</article-title>
					<source>J Invest Dermatol</source>
					<volume>115</volume>
					<fpage>48</fpage>
					<lpage>56</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b37">
        <label>37</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Lee</surname><given-names>C</given-names></name></person-group>
					<year>2003</year>
					<article-title>Generating consensus sequences from partial order multiple sequence alignment graphs.</article-title>
					<source>Bioinformatics</source>
					<volume>19</volume>
					<fpage>999</fpage>
					<lpage>1008</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b38">
        <label>38</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Birney</surname><given-names>E</given-names></name><name name-style="western"><surname>Andrews</surname><given-names>D</given-names></name><name name-style="western"><surname>Bevan</surname><given-names>P</given-names></name><name name-style="western"><surname>Caccamo</surname><given-names>M</given-names></name><name name-style="western"><surname>Cameron</surname><given-names>G</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>Ensembl 2004.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>32</volume>
					<fpage>D468</fpage>
					<lpage>D470</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b39">
        <label>39</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Wheeler</surname><given-names>DL</given-names></name><name name-style="western"><surname>Church</surname><given-names>DM</given-names></name><name name-style="western"><surname>Edgar</surname><given-names>R</given-names></name><name name-style="western"><surname>Federhen</surname><given-names>S</given-names></name><name name-style="western"><surname>Helmberg</surname><given-names>W</given-names></name><etal/></person-group>
					<year>2004</year>
					<article-title>Database resources of the National Center for Biotechnology Information: Update.</article-title>
					<source>Nucleic Acids Res</source>
					<volume>32</volume>
					<fpage>D35</fpage>
					<lpage>D40</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b40">
        <label>40</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Schmid</surname><given-names>H</given-names></name></person-group>
					<year>1994</year>
					<source>Probabilistic part-of-speech tagging using decision trees</source>
					<comment>Proceedings of the International Conference on New Methods in Language Processing; 1994 September.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b41">
        <label>41</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Nigam</surname><given-names>K</given-names></name><name name-style="western"><surname>Lafferty</surname><given-names>J</given-names></name><name name-style="western"><surname>McCallum</surname><given-names>A</given-names></name></person-group>
					<year>1999</year>
					<source>Using maximum entropy for text classification</source>
					<comment>IJCAI-99 Workshop on Machine Learning for Information Filtering. pp. 61–67. Available: <ext-link ext-link-type="uri" xlink:href="http://www-ai.cs.uni-dortmund.de/EVENTS/IJCAI99-MLIF/papers" xlink:type="simple">http://www-ai.cs.uni-dortmund.de/EVENTS/IJCAI99-MLIF/papers</ext-link> (nigam.ps.gz). Accessed 26 May 2005.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b42">
        <label>42</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Nigam</surname><given-names>K</given-names></name><name name-style="western"><surname>McCallum</surname><given-names>A</given-names></name><name name-style="western"><surname>Thrun</surname><given-names>S</given-names></name><name name-style="western"><surname>Mitchell</surname><given-names>T</given-names></name></person-group>
					<year>2000</year>
					<article-title>Text classification from labeled and unlabeled documents using EM.</article-title>
					<source>Mach Learn</source>
					<volume>39</volume>
					<fpage>103</fpage>
					<lpage>134</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b43">
        <label>43</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>McCallum</surname><given-names>A</given-names></name><name name-style="western"><surname>Nigam</surname><given-names>K</given-names></name></person-group>
					<year>1998</year>
					<article-title>A comparison of event models for naive Bayes text classification.</article-title>
					<comment>In:</comment>
					<source>Learning for text categorization: Papers from the AAAI Workshop</source>
					<comment>1998 July 27; Madison, Wisconsin. Technical Report WS-98-05.</comment>
					<publisher-loc>Menlo Park (California)</publisher-loc>
					<publisher-name>AAAI Press</publisher-name>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b44">
        <label>44</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Tateisi</surname><given-names>Y</given-names></name><name name-style="western"><surname>Ohta</surname><given-names>T</given-names></name><name name-style="western"><surname>Tsujii</surname><given-names>J</given-names></name></person-group>
					<year>2004</year>
					<source>Annotation of predicate-argument structure on molecular biology text</source>
					<comment>IJCNLP 2004 Workshop on Beyond Shallow Analysis; 2004; Hainan, China.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b45">
        <label>45</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Klein</surname><given-names>D</given-names></name><name name-style="western"><surname>Manning</surname><given-names>CD</given-names></name></person-group>
					<year>2003</year>
					<source>Accurate unlexicalized parsing</source>
					<comment>Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics; 2003; Sapporo, Japan.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b46">
        <label>46</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Klein</surname><given-names>D</given-names></name><name name-style="western"><surname>Manning</surname><given-names>CD</given-names></name></person-group>
					<year>2002</year>
					<source>Fast exact inference with a factored model</source>
					<comment>Neural Information Processing Systems Conference; 2002. Available: <ext-link ext-link-type="uri" xlink:href="http://books.nips.cc/papers/files/nips15/CS01.pdf" xlink:type="simple">http://books.nips.cc/papers/files/nips15/CS01.pdf</ext-link>. Accessed 26 May 2005.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b47">
        <label>47</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Voormann</surname><given-names>H</given-names></name></person-group>
					<year>2002</year>
					<source>TIGERin: Grafische Eingabe von Suchanfragen in TIGERSearch [diploma thesis]</source>
					<publisher-loc>Stuttgart</publisher-loc>
					<publisher-name>Universität Stuttgart</publisher-name>
					<size units="page">81</size>
					<comment>p.</comment>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b48">
        <label>48</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Mika</surname><given-names>S</given-names></name><name name-style="western"><surname>Rost</surname><given-names>B</given-names></name></person-group>
					<year>2004</year>
					<article-title>Protein names precisely peeled off free text.</article-title>
					<source>Bioinformatics</source>
					<volume>20</volume>
					<fpage>I241</fpage>
					<lpage>I247</lpage>
				</element-citation>
      </ref>
      <ref id="pcbi-0010010-b49">
        <label>49</label>
        <element-citation xlink:type="simple">
					<person-group person-group-type="author"><name name-style="western"><surname>Reiner</surname><given-names>A</given-names></name><name name-style="western"><surname>Yekutieli</surname><given-names>D</given-names></name><name name-style="western"><surname>Benjamini</surname><given-names>Y</given-names></name></person-group>
					<year>2003</year>
					<article-title>Identifying differentially expressed genes using false discovery rate controlling procedures.</article-title>
					<source>Bioinformatics</source>
					<volume>19</volume>
					<fpage>368</fpage>
					<lpage>375</lpage>
				</element-citation>
      </ref>
    </ref-list>
  </back>
</article>