
The authors have declared that no competing interests exist.

The quality of docking models is currently assessed using three measures, F_{nat}, LRMS, and iRMS, as proposed and standardized by CAPRI. These measures quantify different aspects of the quality of a particular docking model and need to be viewed together to reveal the true quality; e.g., a model with a relatively poor LRMS (>10Å) might still qualify as 'acceptable' with a decent F_{nat} (>0.50) and iRMS (<3.0Å). This is also the reason why the so-called CAPRI criteria for assessing the quality of docking models are defined by applying various cutoffs to these measures. DockQ combines F_{nat}, LRMS, and iRMS into a single score in the range [0, 1] that can be used to assess the quality of protein docking models. By using DockQ on CAPRI models it is possible to almost completely reproduce the original CAPRI classification into Incorrect, Acceptable, Medium and High quality. An average PPV of 94% at 90% recall demonstrates that there is no need to apply predefined ad-hoc cutoffs to classify docking models. Since DockQ recapitulates the CAPRI classification almost perfectly, it can be viewed as a higher-resolution version of the CAPRI classification, making it possible to estimate model quality in a more quantitative way using Z-scores or the sum of top-ranked models, which has been so valuable for the CASP community. The possibility to directly correlate a quality measure with a scoring function has been crucial for the development of scoring functions for protein structure prediction, and DockQ should be useful in a similar development in the protein docking field. DockQ is available at

Protein-Protein Interactions (PPI) are involved in almost all biological processes. To understand these processes, the structure of the protein complex is essential. Despite significant efforts in traditional structural biology and the structural genomics projects that aim at high-throughput complex structure determination [ C^{α}-RMSD, GDT_TS [ F_{nat}, LRMS and iRMS as proposed and standardized by the Critical Assessment of PRedicted Interactions (CAPRI) community [ F_{nat} is then defined as the fraction of native interfacial contacts preserved in the interface of the predicted complex. LRMS is the Ligand Root Mean Square deviation calculated for the backbone of the shorter chain (ligand) of the model after superposition of the longer chain (receptor) [ F_{nat}. The backbone atoms of these 'interface' residues are then superposed on their equivalents in the predicted complex (model) to compute the iRMS [ Incorrect (F_{nat} < 0.1 or (LRMS > 10 and iRMS > 4.0)), Acceptable ((F_{nat} ≥ 0.1 and F_{nat} < 0.3) and (LRMS ≤ 10.0 or iRMS ≤ 4.0) or (F_{nat} ≥ 0.3 and LRMS > 5.0 and iRMS > 2.0)), Medium ((F_{nat} ≥ 0.3 and F_{nat} < 0.5) and (LRMS ≤ 5.0 or iRMS ≤ 2.0) or (F_{nat} ≥ 0.5 and LRMS > 1.0 and iRMS > 1.0)), or High (F_{nat} ≥ 0.5 and (LRMS ≤ 1.0 or iRMS ≤ 1.0)) [ F_{nat}, LRMS, and iRMS to yield a score in the range [0, 1], corresponding to low and high quality, respectively. This new measure can essentially be used to recapitulate the original CAPRI classification, and can also be used for more detailed analyses of similarity and prediction performance.
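The CAPRI cutoffs quoted above can be written as a small decision function. The following is a minimal sketch, not the CAPRI or DockQ reference code: the function name `capri_class` is hypothetical, and it assumes that the four conditions are tested in the order listed (Incorrect, Acceptable, Medium, High), which matters because the Acceptable and Medium conditions as written overlap for some inputs.

```python
def capri_class(fnat, lrms, irms):
    """Return the CAPRI class for one model (LRMS and iRMS in Å).

    Assumption: conditions are evaluated in the order they are listed in
    the text, so the first matching class wins.
    """
    if fnat < 0.1 or (lrms > 10.0 and irms > 4.0):
        return "Incorrect"
    if (0.1 <= fnat < 0.3 and (lrms <= 10.0 or irms <= 4.0)) or (
        fnat >= 0.3 and lrms > 5.0 and irms > 2.0
    ):
        return "Acceptable"
    if (0.3 <= fnat < 0.5 and (lrms <= 5.0 or irms <= 2.0)) or (
        fnat >= 0.5 and lrms > 1.0 and irms > 1.0
    ):
        return "Medium"
    # Only models with fnat >= 0.5 and (lrms <= 1.0 or irms <= 1.0) reach here.
    return "High"


# A poor LRMS combined with good F_nat and iRMS is still 'Acceptable'.
print(capri_class(0.6, 12.0, 2.5))
```

With this ordering, a model with LRMS above 10Å but F_{nat} above 0.5 and iRMS between 2 and 3Å falls into the Acceptable class, which illustrates why the three measures have to be read together.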

Furthermore, the recent growth in using machine learning methods to score models would not have been possible without the development of single quality measures, like TM-score, GDT_TS and S-score, to serve as target functions. These methods have been successful in CASP for predicting the quality of protein structure models [ (F_{nat}, LRMS, iRMS) as well as the classification into incorrect, acceptable, medium and high quality models could potentially be used as target functions in regression or classification schemes, it is natural and more convenient to use the combined single measure, DockQ, which covers all the different quality attributes captured by the individual CAPRI measures. The potential use of DockQ as a target function in the design of a docking scoring function, by training support vector regression machines to predict the quality of docking models, has already been demonstrated in a separate study [

A set from a recent benchmark of docking scoring function [

For independent testing, a subset based on the CAPRI Score_set [ (F_{nat}, LRMS, iRMS) was assembled. For simplicity, two targets with multiple correct chain packings, i.e. the same sequence binding at two different locations, were removed (Target 37: 2W83, Target 40: 3E8L). The final

To avoid the problem of arbitrarily large RMS values that are essentially equally bad, RMS values were scaled using the inverse square scaling technique adapted from the S-score formula [

$$\mathrm{RMS}_{scaled}(\mathrm{RMS}, d_i) = \frac{1}{1 + \left(\frac{\mathrm{RMS}}{d_i}\right)^{2}}$$

where RMS_{scaled} represents the scaled RMS deviation corresponding to either of the two terms, LRMS or iRMS (RMS), and d_{i} is a scaling factor, d_{1} for LRMS and d_{2} for iRMS, optimized to d_{1} = 8.5Å and d_{2} = 1.5Å (see
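The effect of the inverse-square scaling can be illustrated numerically. The snippet below is a minimal sketch assuming the scaling has the form 1 / (1 + (RMS/d)^2); the name `scaled_rms` is chosen here for illustration and is not taken from the DockQ software.

```python
def scaled_rms(rms, d):
    """Inverse-square scaling of an RMS value (Å) with scaling factor d (Å).

    Gives 1 at rms = 0, 0.5 at rms = d, and approaches 0 for large rms.
    """
    return 1.0 / (1.0 + (rms / d) ** 2)


D1 = 8.5  # Å, scaling factor for LRMS (value reported in the text)
D2 = 1.5  # Å, scaling factor for iRMS (value reported in the text)

# Large LRMS values all map close to zero, so "equally bad" models score alike.
for lrms in (0.0, 8.5, 20.0, 50.0):
    print(f"LRMS = {lrms:5.1f} Å -> scaled = {scaled_rms(lrms, D1):.3f}")
```

The scaling factor d sets the RMS value at which the scaled score drops to one half, which is why a smaller factor is used for iRMS than for LRMS.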

The hallmark of inverse-square scaling is the smooth asymptotic decline of the scaled function (Y) with a gradual increase of the raw score (X) (Figure A in

The aim of this study was to derive a continuous quality measure that can be used to rank docking models and to compare the performance of methods scoring docking models in a direct way. To keep it simple and promote wide acceptance, we chose to base the scoring function, named DockQ, on the already established quality measures for docking, F_{nat}, LRMS, and iRMS, used in CAPRI [ F_{nat}, LRMS, and iRMS into one score by the mean of F_{nat} and the two RMS values scaled according to RMS_{scaled}:

$$\mathrm{DockQ} = \frac{F_{nat} + \mathrm{RMS}_{scaled}(\mathrm{LRMS}, d_1) + \mathrm{RMS}_{scaled}(\mathrm{iRMS}, d_2)}{3}$$

where d_{1} and d_{2} are scaling parameters that determine how fast large RMS values should be scaled to zero, and need to be set based on the score range for LRMS and iRMS. The advantage of the non-linear scaling of the RMS values is that the function ( RMS_{scaled} score.
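As a worked illustration of the combination described above, the following minimal sketch computes DockQ as the mean of F_{nat} and the two scaled RMS terms. It assumes the inverse-square scaling 1 / (1 + (RMS/d)^2) and the reported values d_{1} = 8.5Å and d_{2} = 1.5Å as defaults; the function names are illustrative and not the interface of the DockQ software.

```python
def scaled_rms(rms, d):
    """Inverse-square scaling of an RMS value (Å) with scaling factor d (Å)."""
    return 1.0 / (1.0 + (rms / d) ** 2)


def dockq(fnat, lrms, irms, d1=8.5, d2=1.5):
    """Mean of F_nat, scaled LRMS and scaled iRMS; the result lies in [0, 1]."""
    return (fnat + scaled_rms(lrms, d1) + scaled_rms(irms, d2)) / 3.0


# Hypothetical models, listed as (F_nat, LRMS in Å, iRMS in Å).
models = {
    "near-native": (0.85, 1.2, 0.8),
    "intermediate": (0.40, 6.0, 2.5),
    "wrong site": (0.02, 35.0, 14.0),
}
for name, (fnat, lrms, irms) in models.items():
    print(f"{name:>12}: DockQ = {dockq(fnat, lrms, irms):.3f}")
```

Because each of the three terms lies in [0, 1], their mean does as well, so a perfect model scores 1 and a model with no native contacts and very large RMS values scores close to 0.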

The two parameters in the DockQ score, d_{1} and d_{2}, were optimized in a grid search on the MOAL-set by calculating DockQ for d_{1} and d_{2} in the range 0.5 to 10Å for d_{1}, and 0.5 to 5Å for d_{2}, in steps of 0.5Å. For each d_{1}, d_{2} d_{1} and d_{2}. The maximum average F1-score (0.91) was obtained for d_{1} = 8.5Å and d_{2} = 1.5Å (Figure B in
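The grid itself is straightforward to enumerate. The sketch below only shows the search skeleton over the stated ranges: the `objective` callable is a placeholder for the average F1-score over the CAPRI classes, whose exact computation is not reproduced here, and the toy objective in the usage line is purely hypothetical.

```python
from itertools import product


def grid(start, stop, step):
    """Evenly spaced values from start to stop inclusive."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]


D1_GRID = grid(0.5, 10.0, 0.5)  # candidate d1 values (Å), 20 points
D2_GRID = grid(0.5, 5.0, 0.5)   # candidate d2 values (Å), 10 points


def grid_search(objective):
    """Return the (d1, d2) pair that maximizes objective(d1, d2)."""
    return max(product(D1_GRID, D2_GRID), key=lambda pair: objective(*pair))


# Toy objective (an assumption, for demonstration only) peaking at (8.5, 1.5).
best = grid_search(lambda d1, d2: -((d1 - 8.5) ** 2 + (d2 - 1.5) ** 2))
print(best)
```

In an actual optimization, the objective would compute DockQ for every model at the given (d_{1}, d_{2}) and score how well DockQ separates the CAPRI classes.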

The colored bars corresponding to each CAPRI class represent the frequency distribution of models predicted to fall in a particular class, normalized by the total number of models in that class.

The CAPRI-set was used as an independent benchmark to assess DockQ performance and compare it to IS-score, which is similar in design to the TM-score for protein structure prediction, but for interfaces. The cutoffs optimized on the MOAL-set are also close to optimal for the CAPRI-set, within ±0.02 (

Models are colored according to CAPRI classification as Incorrect (blue), Acceptable (cyan), Medium (red), High (green). The overall correlation is R = 0.98, while the correlation within the different quality classes is 0.77, 0.82, 0.90, and 0.65, respectively.

The area under the curves (AUC) for DockQ and IS-score are (0.98, 0.99, 0.97) and (0.89, 0.92, 0.82) respectively for (A) Acceptable, (B) Medium and (C) High.

Assessing dimer quality is a current challenge in CASP and CAPRI. In view of this, the DockQ software has been built with the functionality to deal with multiple interacting chains. Monomer-dimer or dimer-dimer interfaces are common in, for example, antigen-antibody interactions, due to the internal symmetry in the biological assembly of the heavy and light variable chains of the immunoglobulin, where the partner antigen can potentially bind asymmetrically at the antigen binding sites [

DockQ is a continuous protein-protein docking model quality score that performs as well as the three original CAPRI measures (F_{nat}, LRMS, iRMS) in separating models into the four CAPRI quality classes. If the CAPRI measures are already calculated, it is simple to calculate DockQ using d_{1} = 8.5Å and d_{2} = 1.5Å. Since DockQ recapitulates the CAPRI classification almost perfectly, it can be viewed as a higher-resolution version of the CAPRI classification. The fact that it is continuous makes it possible to estimate model quality in a more quantitative way using Z-scores or the sum of top-ranked models, which has been so valuable for the CASP community. It should also be very useful for comparing the performance of energy functions used for ranking and scoring docking models in more detail, by analyzing complete rankings (not only the top-ranked models), correlations, and DockQ vs. energy scatter plots. In addition, DockQ can be used as a target function in developing new knowledge-based scoring functions using, for instance, machine learning, a feature that has been investigated in a separate study [

d_{1} and d_{2} on the MOAL-set. Each value is smoothed by taking an average over its nearest neighbors to remove the effect of outliers.
