
The authors have declared that no competing interests exist.

Conceived and designed the experiments: AKY DD. Performed the experiments: AKY DK. Analyzed the data: AKY DK DD. Contributed reagents/materials/analysis tools: AKY DK DD. Wrote the paper: AKY DD.

The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2% respectively. The approach is applicable to different search methodologies: separate as well as concatenated database searches, high mass accuracy searches, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed improved performance. We have shown that an appropriate threshold learnt from decoys can be very effective in improving database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.

Database searching is an important step in high-throughput proteomics analysis and requires computational tools that can assign spectra with good statistical confidence. Due to an inherent lack of complete fragmentation knowledge it is difficult to separate the interesting spectra (containing peptide sequence information) from the uninteresting noisy ones. Controlling an expected proportion of false positives above a threshold is a useful and preferred methodology

The linear regression line of decoy hits is represented by the line equation

Our method is aimed at utilizing the information content from decoy results for increasing the sensitivity and specificity of database search results taking MassWiz

Utilizing the decoy database as a null model, we explored the decoy results to gain insights into MassWiz and Mascot score properties, understand their inherent weaknesses and improve the results, if possible. MassWiz is based on peptide fragmentation heuristics that include product ion continuity, intensities, supporting neutral losses and immonium ions customized for different mass spectrometric (MS) platforms, which give it good discriminative power. It has one shortcoming: as the peptide mass increases, so do the scores. This results in neglecting true hits from the low mass region and accepting false hits from the high mass region. A similar but opposite effect is observed for Mascot scores. The degree of this effect varies across charge states and across data sets from different MS platforms. Therefore, proposed methods for score normalization and re-calibration (as in the case of XCorr) did not work. Setting different thresholds for different mass regions using mass-bin based approaches

(A) The composition of the databases used for searching the standard mix datasets is shown. Each database consists of standard mix proteins and common contaminants, both of which are considered true proteins (shown in green). It also contains sequences from an unrelated organism, which represent the entrapment sequences or false proteins (shown in red). The sizes of these two parts show that the true proteins were outnumbered by entrapment sequences. (B) For evaluating the FlexiFDR method, the definitions of true and false positives and negatives are relative to the unique sets identified by only one method, either FDR or FlexiFDR.

Comparison of spectra and peptides assigned by FDR (pink) and FlexiFDR (blue) for concatenated database search. The number of spectra is shown on top, with the number of peptides in brackets beneath. For the standard mixtures, the true positives (green) and false positives (red) identified exclusively by one method are highlighted. FlexiFDR identifies a higher number of true unique spectra and peptides than FDR in almost all cases. The proportion of false positives in the exclusively identified set is higher for FDR than for FlexiFDR. A star symbol (*) indicates that, although FDR gives a non-zero number of true positive spectral identifications in a few cases, these did not bring in any new peptide identification. The peptides they identified were already identified by other spectra (which are shared by both FDR and FlexiFDR).

The MassWiz scores were found to be correlated with peptide mass. With an increase in mass, the decoy scores increased and this effect was seen to be affected by charge state (

Top and bottom panels depict the spectra and corresponding peptide comparisons from a concatenated search. The blue bars represent the unique true hits added by FlexiFDR alone, while the green bars represent unique true hits from FDR alone. Similarly, the pink bars denote false hits from FlexiFDR alone, while the red bars denote false hits from FDR alone. The spectral hits from FDR can be mapped to unique peptides directly below in the lower panel. The false spectral hits from FDR alone bring in more false peptide identifications than those from FlexiFDR (compare bars from A to B vertically). FlexiFDR brings in more unique true hits and fewer unique false hits than FDR. This enhances the true positives and decreases the false positives in the datasets shown.

For comparative evaluation, the related terminology is explained in _{s} and FDR_{c} respectively. The comparisons for concatenated search are shown as Venn diagrams in

This figure shows the versatility of FlexiFDR on ppm-based (plasma data), semi-tryptic (QTOF) and phosphorylation modification (Phospho data) searches. Details of the searches are given in the Methods. All of these searches show improved performance after applying FlexiFDR. Panel A shows direct Venn comparisons for the three searches. Their corresponding unique spectra are compared in the bar graphs below them in panel B to show the effect.

FlexiFDR was also applied to Mascot and results for concatenated searches are shown for the unique identifications in some standard mix datasets. Except for QTOF, the other datasets showed improvement in number of spectra and peptide identifications.

Analysis of identifications unique to FDR and FlexiFDR provides a better depiction of the merit of one method over the other. A comparison of the unique identifications from FDR_{c} for standard data sets is represented as bar graphs in

Term | Definitions (also see |
True Positive | All identified matches (PSMs/Peptides) at 1% FDR that come from a standard protein or a known contaminant and found only in FlexiFDR but not simple FDR |
False Positive | All identified matches (PSMs/Peptides) at 1% FDR that come from unrelated/entrapment organism ( |
True Negative | All identified matches (PSMs/Peptides) at 1% FDR that come from |
False Negative | All identified matches at 1% FDR that correspond to the standard mix proteins and identified contaminants, and are found only in simple FDR but not FlexiFDR |
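The relative definitions above can be sketched as a small set computation. This is an illustrative sketch only: the function name `classify_unique`, the identifier sets and the `is_true` predicate are hypothetical, and the False Positive and True Negative cases are assumed, by symmetry with the other two rows, to be entrapment matches unique to FlexiFDR and unique to simple FDR respectively.

```python
def classify_unique(fdr_ids, flexi_ids, is_true):
    """Count identifications unique to one method, split by origin.

    fdr_ids / flexi_ids: identifiers (PSMs or peptides) passing 1% FDR
    under simple FDR and under FlexiFDR (hypothetical inputs).
    is_true: callable returning True for matches to standard-mix proteins
    or known contaminants, False for entrapment matches.
    """
    fdr_only = set(fdr_ids) - set(flexi_ids)
    flexi_only = set(flexi_ids) - set(fdr_ids)
    return {
        # true matches gained by FlexiFDR
        "TP": sum(1 for i in flexi_only if is_true(i)),
        # ASSUMPTION: entrapment matches found only by FlexiFDR
        "FP": sum(1 for i in flexi_only if not is_true(i)),
        # true matches lost by FlexiFDR (found only by simple FDR)
        "FN": sum(1 for i in fdr_only if is_true(i)),
        # ASSUMPTION: entrapment matches found only by simple FDR
        "TN": sum(1 for i in fdr_only if not is_true(i)),
    }


# Toy example with made-up identifiers.
true_ids = {"a", "b", "c", "d"}
counts = classify_unique({"a", "b", "x"}, {"a", "c", "d", "y"},
                         lambda i: i in true_ids)
```

Identifications shared by both methods cancel out of every count, which is why only the two unique sets are inspected.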

In general, it is known that lower mass peptides have a greater chance of being false positives. By lowering the threshold in the low mass region, one should expect more false positives. However, we have shown that a proper threshold learnt from decoys can be very effective in improving the results even in the lower mass region. Employing a charge-based threshold allows for flexible modeling irrespective of the slope of the linear regression.

For the complex data sets from E. coli and Yeast, since the true and false identifications cannot be easily defined, we compared their identifications by showing the number of spectral and peptide identifications (_{s} and FDR_{c}. The average percentage gain was 8.29% in spectral identifications and 7.05% in peptide identifications. Unique identifications increased more than twofold in both spectral and peptide numbers.

To check whether the trends hold true for different kinds of searches, we carried out high mass accuracy searches (ppm level), searches with the semi-tryptic option and searches with variable modifications of phosphorylation at serine, threonine and tyrosine residues. In all these searches, similar trends were observed and applying FlexiFDR resulted in better performance (

To further explore the mass dependency, we examined its effect on different search algorithms. We found that X!Tandem and OMSSA, being dependent on calibrated e-values, do not have such a bias. Interestingly, X!Tandem’s raw score, the hyperscore, shows such a dependence (

This approach has several advantages: it adapts itself to different instruments, data types and MS platforms. Given any dataset, it learns from the decoys and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size. It recovers many borderline true spectra. By recovering true spectra and eliminating false ones, this method will aid label-free quantitation studies. It is also easily applicable to other algorithms once the correlated variables have been found. Although we have shown charge and mass dependence in this work, other variables could be relevant for different algorithms.

The slopes of the decoy regression lines shown in this study are positive, but FlexiFDR is not restricted to such data. It will work even if different charge states have different slopes, including a mixture of positive and negative slopes. It has been successfully applied to Mascot results, which demonstrates its broader utility. For higher charge states (>5), sometimes only a small number of spectra are acquired. The method may not be suitable in such cases if the points are too few and skewed towards one side. Large datasets should benefit more from the FlexiFDR method. A related pitfall is that FlexiFDR might not work on very small datasets, since it needs enough data to learn accurately. This is a general property of any FDR method, and FlexiFDR therefore cannot be used where FDR cannot. The method is simple to use and extensible in design. It can be freely downloaded from

Several standard mixtures of increasing complexity (18 mix, 49 mix and 200 mix), along with a few complex data sets of Yeast and E. coli from high resolution instruments, were used to demonstrate the enhancement due to FlexiFDR. Additionally, one dataset from our previous study

The MS/MS spectra (MIX 3) were taken from 18 protein mixture

A standard 49-mix dataset (PSE 108) was downloaded from Peptidome, converted to mgf and collated together. It was searched against a database of the 49 proteins with contaminants (true) and an appended database of Mycobacterium tuberculosis H37Rv (entrapment or false). The search used the following parameters: trypsin enzyme with 1 missed cleavage, fixed modification of Carbamidomethylation, variable modification of Methionine oxidation, peptide tolerance 1 Da and fragment tolerance 0.8 Da.

A complex standard mix of 200 proteins, SC 200 (Seattle Children 200), developed by Bauman et al.

Mid log phase Yeast dataset

E. coli dataset

A semi-tryptic search in MassWiz was carried out for the QTOF dataset with similar parameters as above, except for semi-tryptic cleavage. MassWiz search for Phosphorylated dataset

The effect of mass on the X!Tandem hyperscore was observed on the QTOF dataset, searched with the following parameters: 2 Da precursor tolerance, 0.6 Da fragment tolerance, trypsin with one missed cleavage, carbamidomethyl as a fixed modification and methionine oxidation as a variable modification.

For analysis and validation of the robustness of an algorithm/analysis pipeline, a gold standard dataset is an important pre-requisite. A protein mixture with known proteins (and well known contaminants) can effectively act as a standard dataset. Several attempts at providing such standard datasets have advanced the computational proteomics field

All searches were initially conducted as separate target-decoy searches. FDR was calculated for both the separate target-decoy method (FDR_{s}) and the concatenated method (FDR_{c}).

The target and decoy scores were sorted in descending order, and the FDR was calculated with each decoy score taken in turn as the threshold. The score at which the FDR was 1% or immediately below 1% (i.e. FDR ≤1%) was taken as the score threshold.
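The threshold search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `score_threshold` is hypothetical, the FDR estimator is assumed to be the simple ratio of decoy hits to target hits at or above the cut-off, and when several decoy scores satisfy FDR ≤ 1% the lowest one is chosen (one reading of the text).

```python
def score_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest decoy score usable as a cut-off with FDR <= max_fdr.

    ASSUMPTION: FDR at a cut-off is (#decoys >= cut-off) / (#targets >= cut-off).
    Returns None when no decoy score gives an acceptable FDR.
    """
    best = None
    # Visit candidate cut-offs from the highest decoy score downwards.
    for cut in sorted(set(decoy_scores), reverse=True):
        n_target = sum(1 for s in target_scores if s >= cut)
        n_decoy = sum(1 for s in decoy_scores if s >= cut)
        if n_target == 0:
            continue  # FDR is undefined without target hits
        if n_decoy / n_target <= max_fdr:
            best = cut  # keep the lowest cut-off that still passes
    return best


# Toy data: 300 target scores (101..400) and four decoy scores.
targets = [float(s) for s in range(101, 401)]
decoys = [150.0, 120.0, 110.0, 105.0]
threshold = score_threshold(targets, decoys)
```

In this toy data the cut-off at 120 passes (2 decoys against 281 targets) while the cut-off at 110 does not (3 decoys against 291 targets), so 120 is returned.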

Better separation of target and decoy results is an important aspect of current proteomics research. Decoy results from multiple searches were explored to understand the reasons for false positives and negatives. MassWiz decoy scores were observed to be dependent on peptide mass and charge state. Performing a linear regression on the decoy hits based on mass for different charge states provides a better alternative for FDR calculation. A bin-based approach could help, but it does not provide fine control, whereas a linear regression gives a smooth threshold; it can be considered akin to calculating FDR with infinitely small bins. To rescore the results for better discrimination, the decoy scores were fitted against peptide mass using a linear regression model. The mass dependence is an indirect effect of peptide length and charge, which are known to cause differential fragmentation in collision-induced dissociation (CID)

To implement the FlexiFDR algorithm, the linear regression of decoy hits was modelled as the equation of a line, which provides an analytical function for estimating the mean decoy score at a particular mass.

In other words, using this analytical function, one can predict what would be the average random hit score for any given mass. But this is of little use directly since we are not interested in knowing the average decoy score.

By drawing a line parallel to this decoy regression line, flexible threshold for FDR can be calculated for different charge states.

Before the regression, all decoy peptides that matched target peptides were removed; Leu and Ile are indistinguishable and were therefore treated as identical. Linear regression is then performed for each charge state, taking mass as the independent variable and score as the dependent one. After the regression line is calculated and the slope m is determined, we can draw a parallel line through every point (with coordinates mass, score), which gives a projection (in the form of an intercept) on the y-axis. For every decoy and target hit with score y’ and the known slope m, we calculate the intercept c’, which becomes the new score.
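Written out, with x denoting peptide mass and y the score, the relations described here take the following form. The symbols m, y' and c' follow the text; x, x' and the subscript "thr" are notation assumed for this sketch.

```latex
\begin{align}
y &= m x + c
  && \text{(decoy regression line for one charge state)} \\
c' &= y' - m x'
  && \text{(intercept of the parallel line through a hit at } (x', y') \text{)} \\
y_{\mathrm{thr}}(x) &= m x + c'_{\mathrm{thr}}
  && \text{(raw-score threshold implied by a cut-off } c'_{\mathrm{thr}} \text{ on } c' \text{)}
\end{align}
```

A single cut-off on c' therefore corresponds to a raw-score threshold that rises or falls with mass according to m, separately for each charge state.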

This is easy to calculate from the above equation. The next step calculates FDR using this new score, called FlexiScore. In effect, this rescoring brings about the desired flexible threshold using an analytical algebraic function, which in essence gives the score’s projection on y-axis after learning the trend from decoy hits. The advantage of this method is the ease of calculation, robustness and accuracy.
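The rescoring step can be sketched in a few lines. This is a hedged illustration rather than the released FlexiFDR code: the names `fit_slope` and `flexi_scores` and the PSM dictionary keys are hypothetical, decoys matching target peptides are assumed to have been removed already, and a slope of zero (raw score left unchanged) is assumed for charge states with too few decoys to fit a line.

```python
from collections import defaultdict


def fit_slope(masses, scores):
    """Ordinary least-squares slope of score (y) regressed on mass (x)."""
    n = len(masses)
    if n < 2:
        return 0.0  # ASSUMPTION: no usable fit, fall back to a flat line
    mean_x = sum(masses) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in masses)
    if sxx == 0:
        return 0.0  # all decoys share one mass; slope is undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(masses, scores))
    return sxy / sxx


def flexi_scores(psms):
    """Return the rescored value c' = y' - m * mass for every PSM.

    psms: list of dicts with hypothetical keys
          'mass', 'score', 'charge', 'is_decoy'.
    The slope m is learnt from decoy hits only, per charge state.
    """
    decoys = defaultdict(list)
    for p in psms:
        if p["is_decoy"]:
            decoys[p["charge"]].append(p)
    slopes = {
        z: fit_slope([d["mass"] for d in ds], [d["score"] for d in ds])
        for z, ds in decoys.items()
    }
    # Charge states without any decoys keep their raw score (slope 0).
    return [p["score"] - slopes.get(p["charge"], 0.0) * p["mass"] for p in psms]


# Toy example: three 2+ decoys on the line score = 0.01 * mass.
psms = [
    {"mass": 1000.0, "score": 10.0, "charge": 2, "is_decoy": True},
    {"mass": 2000.0, "score": 20.0, "charge": 2, "is_decoy": True},
    {"mass": 3000.0, "score": 30.0, "charge": 2, "is_decoy": True},
    {"mass": 1500.0, "score": 40.0, "charge": 2, "is_decoy": False},
    {"mass": 2500.0, "score": 33.0, "charge": 3, "is_decoy": False},
]
rescored = flexi_scores(psms)
```

The rescored values can then be passed through the usual target-decoy FDR calculation in place of the raw scores; the zero-slope fallback reflects the caveat that sparse charge states may not support a reliable fit.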


Spectra and peptide identifications from concatenated and separate database searches for the standard mix data sets.

(DOC)

Spectra and peptide identifications from separate and concatenated database searches for the E. coli and Yeast data sets.

(DOC)

The authors thank Rishi Das Roy for helpful discussions, and Dr. Anurag Agrawal, Dr. V. Sabareesh and Dr. Shantanu Sengupta for insightful comments while proof-reading the manuscript. The authors thank Dr. Eugene Kolker and Natalie Kolker for their help in providing access to the SC-200 mix standard dataset for this study.

The authors also thank the reviewers for providing helpful suggestions that improved the merit and organization of the manuscript.