^{1}

^{2}

^{2}

^{3}

^{2}

^{3}

^{4}

^{5}

^{*}

RRE and AR conceived and designed the experiments. RRE performed the experiments. RRE, II, and AR analyzed the data. II contributed reagents/materials/analysis tools. AR wrote the paper.

The authors have declared that no competing interests exist.

Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts—to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term

Current automated approaches for extracting biologically important facts from scientific articles are imperfect: while being capable of efficient, fast, and inexpensive analysis of enormous quantities of scientific prose, they make errors. To emulate the human experts evaluating the quality of the automatically extracted facts, we have developed an artificial intelligence program (“a robotic curator”) that closely approaches human experts in the quality of distinguishing the correctly extracted facts from the incorrectly extracted ones.

… he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire. —Matthew 3:12 [

Unfortunately, the current tools of information extraction produce imperfect, noisy results. Although even imperfect results are useful, it is highly desirable for most applications to have the ability to rank the text-derived facts by the confidence in the quality of their extraction (as we did for relations involving

Each directed arc from an entity

Several earlier studies have examined aspects of evaluating the quality of text-mined facts. For example, Sekimizu et al. and Ono et al. attempted to attribute different confidence values to different verbs that are associated with extracted relations, such as

Our goal is to design a tool that can be used with any information-extraction system developed for molecular biology. In this study, our training data came from the GeneWays project (specifically, GeneWays 6.0 database [

Our approach followed the path of supervised machine-learning. First, we generated a large training set of facts that were originally gathered by our information-extraction system, and then manually labeled as “correct” or “incorrect” by a team of human curators. Second, we used a battery of machine-learning tools to imitate computationally the work of the human evaluators. Third, we split the training set into ten parts, so that we could evaluate the significance of performance differences among the several competing machine-learning approaches.

With the help of a text-annotation company, ForScience, we generated a training set of approximately 45,000 multiply annotated unique facts, or almost 100,000 independent evaluations. These facts were originally extracted by the GeneWays pipeline, then were annotated by biology-savvy doctoral-level curators as “correct” or “incorrect,” referring to quality of information extraction. Examples of automatically extracted relations, sentences corresponding to each relation, and the labels provided by three evaluators are shown in

A Sample of Sentences That Were Used as an Input to Automated Information Extraction, Biological Relations Extracted from These Sentences, and the Corresponding Evaluations

Each extracted fact was evaluated by one, two, or three different curators. The complete evaluation set comprised 98,679 individual evaluations performed by four different people, so most of the statement–sentence pairs were evaluated multiple times, with each person evaluating a given pair at most once. In total, 13,502 statement–sentence pairs were evaluated by just one person, 10,457 by two people, 21,421 by three people, and 57 by all four people. Examples of both high inter-annotator agreement and low-agreement sentences are shown in

The statements in the training dataset were grouped into chunks; each chunk was associated with a specific biological project, such as analysis of interactions in

To facilitate evaluation, we developed a Sentence Evaluation Tool implemented in Java programming language by Mitzi Morris and Ivan Iossifov. This tool presented to an evaluator a set of annotation choices regarding each extracted fact; the choices are listed in

List of Annotation Choices Available to the Evaluators

For the convenience of representing the results of manual evaluation, we computed an evaluation score for each statement as follows. Each sentence–statement score was computed as a sum of the scores assigned by individual evaluators; for each evaluator, −1 was added if the expert believed that the presented information was extracted incorrectly, and +1 was added if he or she believed that the extraction was correct. For a set of three experts, this method permitted four possible scores: 3(1,1,1), 1(1,1,−1), −1(1,−1, −1), and −3. Similarly, for just two experts, the possible scores are 2(1,1), 0(1,−1), and −2(−1,−1) (

The objects that we want to classify, the fact–sentence pairs, have complex properties. We want to place each of the objects into one of two classes, ^{th} object (the ^{th} fact–sentence pair) comes with a set of known features or properties that we encode into a feature vector, _{i}

In the following description we use _{correct}_{incorrect}_{j}^{th} element of _{1} would have value 1 because the upstream term

The full Bayesian classifier assigns the ^{th} object to the ^{th} class if the posterior probability _{k}_{i}^{th} class than for any alternative class. This posterior probability is computed in the following way (a restated version of Bayes' theorem).

In the real-life applications, we estimate probability _{i}_{k}_{k} and_{i}_{k}

In other words, we estimate the conditional probability for every possible value of the feature vector _{i}^{th} feature and

Clearly, even for a space of only 20 binary features (^{20} −1) × 2 = 2,097,150, which exceeds several times the number of datapoints in our training set.

The most affordable approximation to the full Bayesian analysis is the Naïve Bayes classifier. It is based on the assumption of conditional independence of features:

Obviously, we can estimate _{j}_{i,j}_{k}

In an application with ^{th} feature has _{i}_{i=1}_{,n} (_{i}

We can find an intermediate ground between the full and Naïve Bayes classifiers by assuming that features in the random vector ^{th} object (_{i}_{j}^{th} cluster of features; _{i,j}^{th} object, and

The Clustered Bayes classifier is based on the following assumption about conditional independence of

We tested two versions of the Clustered Bayes classifier: one version used all 68 features (Clustered Bayes 68) with a coarser discretization of feature values; another version used a subset of 44 features (Clustered Bayes 44) but allowed for more discrete values for each continuous-valued feature, see legend to

The half-matrix below the diagonal was derived from analysis of the whole GeneWays 6.0 database; the half-matrix above the diagonal represents a correlation matrix estimated from only the manually annotated dataset. The white dotted lines outline clusters of features, suggested by analysis of the annotated dataset; we used these clusters in implementation of the Clustered Bayes classifier. We used two versions of the Clustered Bayes classifier: with all 68 features (Clustered Bayes 68), and with a subset of only 44 features but a higher number of discrete values allowed for nonbinary features (Clustered Bayes 44). The Clustered Bayes 44 classifier did not use features 1, 6, 7, 8, 9, 12, 27, 28, 31, 34, 37, 40, 42, 47, 48, 49, 52, 54, 55, 60, 62, 63, and 65.

Another method that can be viewed as an approximation to full Bayesian analysis is Discriminant Analysis invented by Sir Ronald A. Fisher [_{i}_{k}_{k}_{k}_{k}

In this study we present results for QDA; the difference from the linear discriminant analysis was insignificant for our data (unpublished data). In terms of the number of parameters to estimate, QDA uses only two symmetrical class-specific covariance matrices and the two class-specific mean vectors. For 68 features the method requires estimation of 2 × (68 × 69)/2 + 2 × 68 = 4,828 parameters.

The current version of the maximum-entropy method was formulated by E. T. Jaynes [_{k}_{i}

Parameters _{i,k}_{x,y,k}

We tested two versions of the maximum-entropy classifier. MaxEnt 1 uses only information about the first moments of features in the training data (

For two classes

A typical feed-forward artificial neural network is a directed acyclic graph organized into three (or more) layers. In our case, we chose a three-layered network, with a set of nodes of the _{i}_{i=}_{1,…,Nx}, nodes of the _{j}_{j=}_{1,…,Ny}, and a single node representing the _{1}, see _{x,}_{y,}_{y}_{y}

We used a similar network with 68 input units (one unit per classification feature) and ten hidden-layer units.

The values of the input nodes, {_{i}_{i=}_{1,…,Nx,} are feature values of the object that we need to classify. The value of each node, _{j},_{j,k}_{1} is determined as a linear combination of the values of all hidden nodes:
_{k}_{1} corresponded to the class _{x}_{y}_{y}

The Support Vector Machines (SVM, [

The SVM is a ^{L} (_{1},_{2}) = 〈_{1},_{2}〉 is simply the inner product of the two input feature vectors; an SVM with the linear kernel searches for a class-separating hyperplane in the original space of the data. Using a polynomial kernel,

In the most real-world cases the two classes cannot be separated perfectly by a hyperplane, and some classification errors are unavoidable. SVM algorithms use the

Parameter Values Used for Various SVM Classifiers in This Study

The output of an SVM analysis is not probabilistic, but there are tools to convert an SVM classification output into “posterior probabilities” (see chapter by J. Platt in [

The number of support vectors used by the SVM classifier depends on the size and properties of the training dataset. The average number of (1 × 68-dimensional) support vectors used in ten cross-validation experiments was 12,757.5, 11,994.4, 12,092, 12,289.9, 12,679.7, and 14,163.8, for SVM, SVM-t1-d2, SVM-t1-d3, SVM-t2-g0.5, SVM-t2-g1, and SVM-t2-g2 classifiers, respectively. The total number of data-derived values (which we loosely call “parameters”) used by the SVM in our cross-validation experiments was therefore, on average, between 827,614 and 880,270 for various SVM versions.

We implemented the meta-classifier on the basis of the SVM algorithm (linear kernel with

A summary of the sources of software used in our study is shown in

Machine Learning Methods Used in This Study and Their Implementations

We selected 68 individual features covering a range of characteristics that could help in the classification (see

List of the Features That We Used in the Present Study

To represent the second-order features (pairs of features), we defined a new feature as a product of the normalized values of two features. We obtained the normalized values of features by subtracting the mean value from each feature value, then dividing the result by the standard deviation for this feature.

After a number of feature-selection experiments for the MaxEnt 2 method, we settled on using

To evaluate the success of our classifiers we used a 10-fold cross-validation approach, where we used

To quantify and compare success of the various classification methods, we used receiver operating characteristic (ROC) scores, also called

An ROC score is computed in the following way. All test-set predictions of a particular classification method are ordered by the decreasing quality score provided by this method; for example, in the case of the Clustered Bayes algorithm, the quality score is the posterior probability that the test object belongs to the class

We show only the linear-kernel SVM and the Clustered Bayes 44 ROC curves to avoid excessive data clutter.

The ROC score is an estimate of the probability that the classifier under scrutiny will label correctly a pair of statements, one of which is from the class

The raw extracted facts produced by our system are noisy. Although many relation types are extracted with accuracy above 80%, and even above 90% (see

The accuracy was computed by averaging over all individual specific information extraction examples manually evaluated by the human curators. The plot compactly represents both the per-relation accuracy of the extraction process (indicated with the length of the corresponding bar) and the abundance of the corresponding relations in the database (represented by the bar color). There are relations extracted with a high precision; there are also many noisy relationships. The database accuracy was markedly increased by the automated curation outlined in this study (see

The classification problem of separating correctly and incorrectly extracted facts appears to belong to a class of easier problems. Even the simplest Naïve Bayes method had an average ROC score of 0.84, which more sophisticated approaches surpassed to reach almost 0.95. Judging by the average ROC score, the quality of prediction increased in the following order of methods: Clustered Bayes 68, Naïve Bayes, MaxEnt 1, Clustered Bayes 44, QDA, artificial neural network, SVMs, and MaxEnt 2/MaxEnt 2-v (see

ROC Scores for Methods Used in This Study, with Error Bars Calculated in 10-Fold Cross-Validation

It is a matter of both academic curiosity and of practical importance to know how the performance of our artificial intelligence curator compares with that of humans. If we define the

Comparison of the Performance of Human Evaluators and of the MaxEnt 2 Algorithm

The features that we used in our analysis are obviously not all equally important. To elucidate the relative importance of the individual features and of feature pairs, we computed the mutual information between all pairs of features and the class variable (see _{i},F_{j}

The plot indicates that a significant amount of information critical for classification is encoded in pairs of weakly correlated features. The white dotted lines outline clusters of features, suggested by analysis of the annotated dataset; we used these clusters in implementation of the Clustered Bayes classifier.

Assignment of facts to classes

Comparison of Human Evaluators and a Program That Mimicked Their Work

As evidenced by

Precision is defined as

Experiments with any standalone set of data generate results insufficient to allow us to draw conclusions about the general performance of different classifiers. Nevertheless, we can speculate about the reasons for the observed differences in performance of the methods when applied to our data. The modest performance of the Naïve Bayes classifier is unsurprising: we know that many pairs of features used in our analysis are highly or weakly correlated (see

This plot represents both the per-relation accuracy after both information extraction and automated curation were done. Accuracy is indicated with the length of the relation-specific bars, while the abundance of the corresponding relations in the manually curated dataset is represented by color. Here, the MaxEnt 2 method was used for the automated curation. The results shown correspond to a score-based decision threshold set to zero; that is, all negative-score predictions were treated as “incorrect.” An increase in the score-based decision boundary can raise the precision of the output at the expense of a decrease in the recall (see

Our explanation for the superior performance of the MaxEnt 2 algorithm with respect to the remainder of the algorithms in the study batch is that MaxEnt 2 requires the least parameter tweaking in comparison with other methods of similar complexity. Performance of the Clustered Bayes method is highly sensitive to the definition of feature clusters and to the way we discretize the feature values—essentially presenting the problem of selecting an optimal model from an extensive set of rival models, each model defined by a specific set of feature clusters. Our initial intuition was that a reasonable choice of clusters can become clear from analysis of an estimated feature-correlation matrix. We originally expected that more highly correlated parameters would belong to the same cluster. However, the correlation matrices estimated from the complete GeneWays 6.0 database and from a subset of annotated facts turned out to be rather different—see

Similar to optimizing the Clustered Bayes algorithm through model selection, we can experiment with various kernel functions in the SVM algorithm, and can try alternative designs of the artificial neural network. These optimization experiments are likely to be computationally expensive, but are almost certain to improve the prediction quality. Furthermore, there are bound to exist additional useful classification features waiting to be discovered in future analyses. Finally, we speculate that we can improve the quality of the classifier by increasing the number of human evaluators who annotate each datapoint in the training set. This would allow us to improve the gold standard itself, and could lead to development of a computer program that performs the curation job consistently at least as accurately as an average human evaluator.

(47 KB PDF)

The authors are grateful to Mr. Murat Cokol and to Ms. Lyn Dupré Oppenheim for comments on the earlier version of the manuscript, and to Mr. Marc Hadfield and to Ms. Mitzi Morris for programming assistance.

Quadratic Discriminant Analysis

radial basis function

receiver operating charecteristic

support vector machines