The authors have declared that no competing interests exist.

Detection of protein structure similarity is a central challenge in structural bioinformatics. Comparisons are usually performed at the polypeptide chain level; however, the functional form of a protein within the cell is often an oligomer. This fact, together with the recent growth in the number of oligomeric structures in the Protein Data Bank (PDB), demands more efficient approaches to oligomeric assembly alignment and retrieval. Traditional methods use atom-level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangements. These challenges can be overcome by comparing electron density volumes directly. However, brute-force alignment of 3D data is a compute-intensive search problem. We developed a 3D Zernike moment normalization procedure to orient electron density volumes and assess similarity with unprecedented speed. Similarity searching with this approach enables real-time retrieval of proteins and protein assemblies resembling a target, from the PDB or user input, together with the resulting alignments (

Protein structures possess wildly varied shapes, but patterns at different levels are frequently reused by nature. Finding and classifying these similarities is fundamental to understanding evolution. Given the continued growth in the number of known protein structures in the Protein Data Bank, the task of comparing them to find common patterns is becoming increasingly complicated. This is especially true for complete protein assemblies with several polypeptide chains, whose large sizes further complicate the issue. Here we present a novel method that detects similarity between protein shapes and that works equally fast for proteins or assemblies of any size. The method treats proteins as volumes of density distribution, departing from the more usual practice in the field: similarity assessment based on atomic coordinates and chain connectivity. A volumetric function can be decomposed with a mathematical tool known as 3D Zernike polynomials, resulting in a compact description as vectors of Zernike moments. The tool was introduced in the 1990s, when it was suggested that the moments could be normalized to be invariant to rotations without losing information. Here we demonstrate that this normalization is in fact possible and that it offers a much more accurate method for assessing similarity between shapes than previous attempts.

Structure similarity searching within the growing PDB archive [

Structure superposition tools were initially developed in the 1970s [

Protein functional units are, however, not necessarily confined to the boundaries of domains or individual chains. They are often oligomeric, sometimes with multiple distinct quaternary structures resulting in similar functional units. As of April 2020, approximately half of the structures in the PDB are oligomeric. In the wake of the 3DEM “resolution revolution”, the fraction of oligomeric structures represented in the archive is growing year on year.

The ever-increasing amount of structural data, combined with rising complexity of the structures, requires development of faster, more accurate methods to process and classify structure similarity. Traditional comparison methods use atom level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangement(s) within an oligomeric assembly. While solutions that address these problems exist [

Alternative approaches looking beyond purely atomic information have been explored. One utilizes geometric descriptors,

Herein, we exploit a 3D Zernike moment normalization procedure to implicitly orient electron density volumes and assess similarity in moment space with unprecedented speed. The general approach was suggested in [

Based on these principles, we have developed a search system that enables real-time retrieval of similar protein assemblies to a target assembly, obtained from the PDB or uploaded by a user, together with their alignment (

We follow the derivations by Canterakis in [_{nl} are the radius-dependent normalizing factors.

Top Layer: values of weighted 3D Zernike functions of order

The moments can be expressed as a (2

As in [

To obtain the rotated value for a particular ^{m}. For example,

As the SO(3) group has three degrees of freedom and Zernike moments are complex numbers, we proceed by setting one moment (two real constraints) and the imaginary part of another (a third constraint) to zero (see

(a) Rotational degrees of freedom are fixed by constraining values of chosen moments with respect to the 3D rotation group. The solution defines a rotation of the weighted 3D Zernike functions to a ‘standard’ position. (b) Alignment of two structures of human transportin 3: in unliganded form (4C0P) and in complex with ASF/SF2 (4C0O). Normalization order 2 is equivalent to alignment of the densities’ principal axes. Normalization order 6 matches finer detail in the density, such as

Let us continue with the example ^{th}

Next, we fix one more degree of freedom by setting the imaginary part of another moment to 0. Let us choose ^{nd}

Assuming ℜ{

Finally, the rotated 3D Zernike moments can be obtained from:

BioZernike descriptors include two rotation-invariant shape descriptors: one based on the Canterakis Norms (CNs) and one based on the simple geometric features (GEO). In addition, we provide a CN-based alignment descriptor (

Every atomic structure in the PDB (a) is converted to a volume by selecting representative atoms per residue (b) and placing a Gaussian density at each of their positions (c). The geometric features (GEO) can be calculated directly from the coordinates of the representative atoms, whilst the Zernike moments and their Canterakis Norms of various orders are calculated from the volume (note that different normalizations are offset in

For the 3D Zernike moments calculation, the structure coordinates are converted to the volumetric representation as follows. First, the grid width is chosen in the range 0.25Å–16Å to keep the volume’s average dimension between 50Å and 200Å, if possible. Subsequently, for every representative atom a Gaussian density is placed into the volume that corresponds to the amino acid/nucleotide weight and spherically averaged size. Representative atoms are defined as C

For the vector of geometric features GEO, we calculate the distribution of distances from the center of mass of the structure to all its representative atoms. Next, we include in the vector the moments of this distribution (standard deviation, skewness, and kurtosis) as well as its 10th, 20th, …, 90th percentiles. In addition, we include the structure's radius of gyration, its nominal molecular weight, and the standard deviations of the coordinates along the principal axes, which correspond to the dimensions of the structure. The final GEO descriptor has 17 components: 3 distribution moments, 9 percentiles, the radius of gyration, the molecular weight, and 3 principal-axis standard deviations.
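The 17 GEO components can be sketched as follows. This is a minimal illustration rather than the BioZernike library code: the function names, the ordering of the components, the use of population (biased) skewness and kurtosis, the inclusive percentile convention, and the mass weighting of the radius of gyration and the covariance matrix are all assumptions made here for concreteness.

```python
import math
import statistics


def sym3_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix, largest first (trigonometric method)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    if p2 == 0.0:
        return (q, q, q)  # matrix is a multiple of the identity
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det_b / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return (e1, 3.0 * q - e1 - e3, e3)


def geo_descriptor(coords, weights):
    """Return [std, skew, kurt, p10..p90, Rg, weight, 3 principal-axis stds].

    coords: representative-atom coordinates (at least two points);
    weights: one nominal residue weight per representative atom.
    """
    total = sum(weights)
    com = [sum(w * c[i] for c, w in zip(coords, weights)) / total for i in range(3)]
    centered = [[c[i] - com[i] for i in range(3)] for c in coords]
    dists = [math.sqrt(sum(x * x for x in c)) for c in centered]

    # Moments of the center-of-mass distance distribution (population form).
    mean = statistics.fmean(dists)
    std = statistics.pstdev(dists)
    if std > 0.0:
        skew = sum((d - mean) ** 3 for d in dists) / (len(dists) * std ** 3)
        kurt = sum((d - mean) ** 4 for d in dists) / (len(dists) * std ** 4)
    else:
        skew = kurt = 0.0  # degenerate case: all atoms equidistant from the center
    deciles = statistics.quantiles(dists, n=10, method="inclusive")  # 9 cut points

    # Size-related features.
    rg = math.sqrt(sum(w * d * d for w, d in zip(weights, dists)) / total)
    cov = [[sum(w * c[i] * c[j] for c, w in zip(centered, weights)) / total
            for j in range(3)] for i in range(3)]
    axis_std = [math.sqrt(max(ev, 0.0)) for ev in sym3_eigenvalues(cov)]

    return [std, skew, kurt, *deciles, rg, total, *axis_std]
```

Because every component is computed from the representative atoms alone, the descriptor costs a single pass over the structure plus one 3x3 eigenvalue problem.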

The alignment descriptor consists of two components: complete 3D Zernike moments calculated up to the order of 6 and the coordinates of the structure’s center of mass (required because this information is not preserved by the volume scaling procedure). To perform structure alignment, we compute all possible CNs of the given moments and find a normalization _{opt} (and the induced rotation) that minimizes distance _{1}, _{2}|_{1} and _{2} as follows:

Then the optimal rotation _{opt} is selected by

The Eqs

After applying the rotation, the structures are superposed using coordinates of their centers of mass.
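The superposition step can be sketched as follows. This is a minimal illustration, not the library implementation: it assumes that each normalization yields a 3x3 rotation matrix (the hypothetical arguments `r_mobile` and `r_target`) that brings the centered structure to the shared standard position, and it uses the unweighted centroid as a stand-in for the center of mass.

```python
import math


def centroid(points):
    """Unweighted centroid, used here in place of the center of mass."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def matvec(m, v):
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_z(angle):
    """Rotation about the z axis; a helper for building test cases only."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def superpose(mobile, r_mobile, target, r_target):
    """Map `mobile` coordinates into the frame of `target`.

    r_mobile / r_target rotate each centered structure into the same
    standard position, so r_target^T * r_mobile takes the mobile frame
    to the target frame.
    """
    c_mobile = centroid(mobile)
    c_target = centroid(target)
    r = matmul(transpose(r_target), r_mobile)
    out = []
    for p in mobile:
        centered = tuple(p[i] - c_mobile[i] for i in range(3))
        rotated = matvec(r, centered)
        # The final translation places the rotated structure on the target's center.
        out.append(tuple(rotated[i] + c_target[i] for i in range(3)))
    return out
```

The point of the sketch is that no pairwise search is needed: once both structures have been normalized independently, the relative rotation is a product of two known matrices.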

For the distance function training, we prepared a CATH-based dataset as follows: A non-redundant subset of CATH domains with up to 40% sequence identity was obtained from

We defined the distance function for the composite BioZernike feature vectors as:
_{1} and _{2} are the geometric feature vectors being compared, _{1} and _{2} are the CN-based feature vectors, and _{g} and _{m} are the respective weights.

The weights were fitted to the training set using regularized logistic regression. Ten-fold cross-validation was performed at the superfamily level. The regularization parameter that maximized the Matthews correlation coefficient of the predictions on the held-out data was used for the final training with the entire dataset. Learned weight coefficients were constrained to non-negative values, which led to sparse solutions (

Importantly, the obtained distance function is by no means definitive; rather, it illustrates a general approach. The procedure can (and should) be repeated with problem-specific training sets, yielding appropriate distance functions based on the BioZernike descriptors.
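Repeating the procedure on a new training set could look roughly like the sketch below. It is an illustration under stated assumptions, not the training code used here: the per-component dissimilarities `diffs`, the linear form of the distance, the L2 penalty, and the projected gradient descent optimizer are all choices made for this example, and the cross-validated selection of the regularization strength is omitted.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_nonnegative_weights(diffs, labels, l2=0.01, lr=0.5, steps=500):
    """Fit P(similar) = sigmoid(bias - sum_i w[i] * diffs[k][i]) with w >= 0.

    diffs[k][i]: non-negative dissimilarity of pair k in descriptor component i.
    labels[k]:   1 if pair k belongs to the same class, else 0.
    """
    n, dim = len(diffs), len(diffs[0])
    w = [0.0] * dim
    bias = 0.0
    for _ in range(steps):
        grad_w = [l2 * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(diffs, labels):
            dist = sum(wi * xi for wi, xi in zip(w, x))
            err = sigmoid(bias - dist) - y  # d(log-loss) / d(logit)
            grad_b += err / n
            for i in range(dim):
                grad_w[i] -= err * x[i] / n  # the logit depends on -w[i] * x[i]
        bias -= lr * grad_b
        # Projection step: clip at zero to keep every weight non-negative.
        w = [max(0.0, wi - lr * gi) for wi, gi in zip(w, grad_w)]
    return w, bias
```

Clipping at zero is what tends to drive uninformative components to exactly zero weight, which is one way to obtain the kind of sparse solution described above.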

The domain test set was prepared from an independent ECOD subset using the same procedure as for the CATH-based training set. Additionally, if an F-group representative domain could be aligned to any CATH superfamily representative domain with a TM-score of 0.75 or more, the group was excluded from the test set. Ultimately, 761 domain structures (divided among 34 families) remained. All-

500 biological assemblies were randomly selected from all PDB entries such that no two assemblies have density correlation score [

3D-Surfer 3DZD descriptors were obtained directly from the web server using default parameters. During the course of this work, we discovered a bug in the original 3DZD library [

As Omokage score is not available for use with arbitrary structures, we implemented the procedure described in [

We achieve fast similarity computation using protein structural descriptors. Robust descriptors must capture information relevant to their intended use (e.g., binding sites for virtual drug screening, solvent-accessible surfaces for protein docking, structural organization for establishing functional or evolutionary relationships) while being inexpensive to compute, quick to compare with other descriptors, and readily interpretable. Most high-throughput structure analysis pipelines involve balancing the tradeoff between speed and accuracy of the underlying representation.

3D Zernike moments, derived by Canterakis in [

Limiting their use, 3D Zernike moments are not invariant under rotation. While special properties of the spherical harmonic functions can be exploited to align two sets of moments, the resulting procedure is slower than classical coordinate-based methods [

In his work Canterakis [

Here we generalize the approach of Canterakis by developing normalization routines with wider applicability. These routines yield complete, rotationally invariant 3D Zernike moments (referred to hereafter as Canterakis Norms or CNs). Conceptually, a CN rotates an object so that selected moments become equal to predefined values. This orients an object in a uniquely determined standard position (

CNs immediately give rise to a computationally inexpensive global structure alignment. Indeed, if the same moments are normalized to their standard values for two objects, their induced standard positions are likewise equivalent (

Note that the system of polynomial equations mentioned above has multiple solutions. While a theoretically sound approach to breaking this ambiguity is described in [

Finally, but most importantly, the complete moments of objects oriented in the same standard position are

Our search procedure is depicted schematically in

The simpler 3DZD descriptors are usually calculated for an object surface, following the original implementation [

CNs of different orders may be more or less appropriate for various shapes. Moreover, resolving the ambiguity among multiple solutions depends on the particular symmetry that an object may possess. To make the CN-based descriptor versatile while retaining performance, we used several CNs of orders 2 to 5 and then averaged the absolute values of the solutions (

The 3D Zernike moments are defined for objects scaled to the unit ball, which discards size-related information. To compensate, we developed a geometry-based descriptor (GEO). It includes features that can be quickly obtained from the set of representative atoms, such as the structure's dimensions along its principal axes or statistical properties of the distribution of distances from the center of mass to the representative atoms (see

Together, the CN-based and the GEO descriptors constitute what we term a BioZernike descriptor. To judge the similarity of two structures, we developed a distance measure that compares their BioZernike descriptors. We hypothesized that the often-used Euclidean distance is a suboptimal choice for comparing 3D Zernike moment-based descriptors because of the hierarchical structure of the representation (

Retrieval of similar structures was evaluated on a non-redundant subset of ECOD [

As shown in

Non-redundant sets of domains (a,b) and assemblies (c,d) were used for the evaluation. Receiver operating characteristic (a,c) and precision-recall (b,d) curves are shown. For the domain set, performance of the Canterakis norms (CNs) is plotted separately as well as in conjunction with the geometry descriptor (GEO). ‘3DZD (our implementation)’ corresponds to our implementation of the 3DZD descriptors that takes into account the whole density distribution, rather than the protein surface only.

BioZernike components: Weighted CN, Weighted 3DZD, Density 3DZD, GEO, Weighted CN+GEO. Reference methods: 3DSurfer 3DZD, Omokage.

Dataset | Metric | Weighted CN | Weighted 3DZD | Density 3DZD | GEO | Weighted CN+GEO | 3DSurfer 3DZD | Omokage |
---|---|---|---|---|---|---|---|---|
Domains | ROC AUC | 0.95 | 0.94 | 0.93 | 0.94 | 0.97 | 0.85 | 0.84 |
Domains | PR AUC | 0.76 | 0.73 | 0.71 | 0.59 | 0.79 | 0.39 | 0.38 |
Domains | MCC | 0.69 | 0.66 | 0.65 | 0.54 | 0.72 | 0.38 | 0.38 |
Assemblies | ROC AUC | – | – | – | 1.0 | 1.0 | 0.95 | 0.99 |
Assemblies | PR AUC | – | – | – | 0.79 | 0.94 | 0.71 | 0.71 |
Assemblies | MCC | – | – | – | 0.75 | 0.93 | 0.70 | 0.71 |

An important result of this study is an open-source, customizable library that implements all routines required to obtain a BioZernike descriptor starting from a protein structure

The BioZernike library includes structure-to-volume conversion based on the

The library is developed and continuously validated for processing large amounts of structural data, such as those at RCSB PDB, which has led to a highly optimized and efficient implementation. For example, calculating a full BioZernike descriptor for PDB ID 5J7V (the largest macromolecular assembly represented in the PDB archive at the time of writing; an 8,280-component homo-oligomer containing 5,340,600 amino acid residues) takes ∼10 seconds. For a more typical oligomeric PDB structure, such as PDB ID 4HHB (hemoglobin α2β2 hetero-tetramer containing 574 residues), the processing time is ∼30 milliseconds. Processing the entire archive as of November 2019 (all assemblies and all polymeric chains) takes 7 hours using 6 parallel threads. The time needed for descriptor comparison and moment alignment is negligible. The performance of the BioZernike library is showcased on the website shape.rcsb.org. The library’s comprehensive implementation, together with its flexibility and speed, makes it especially useful for developers who wish to create novel applications for classification and comparison of structural data.

The main rcsb.org website has also used the BioZernike library since the April 2020 release. The integration broadens the applicability of this system by combining structure search with other types of searches. Two structure search modes are available at rcsb.org: “strict” and “relaxed”. The modes correspond to two different threshold sets, based on training against the assemblies dataset (see

An interesting set of comparisons involves folds that are globally conserved across evolution while their subcomponents have been rearranged in different ways as a result of gene fusion or duplication. The resulting biological assemblies are thus very similar overall but have different stoichiometries.

A striking example of this property is the Macrophage migration inhibitory factor (MIF). These tautomerase enzymes are conserved across the entire tree of life (

(a,b) Global shape search finds similar assemblies regardless of a particular stoichiometry. (c) Search-by-parts mode allows discovery of the ‘transformer’ proteins that form different assemblies from similar components.

Another well-known case in this category is that of the DNA-clamps (overall quasi D6 symmetry, with A3 stoichiometry in archaea and eukaryotes and A2 in bacteria) [

Domain swaps are a further example that belongs to this category and represent a widespread phenomenon in structural biology [

Another case that is easy to find and study with the BioZernike system involves divergent quaternary structure assemblies composed of identical or highly similar subunits.

For example, the TRAP (trp RNA-binding attenuation protein) proteins from

Other well-known cases were revealed through a mutation or an environmental change (

Our search system makes possible automatic identification of such proteins. In this case, the search can be performed only on parts of an assembly, such as an individual chain or a domain. Moreover, a distance function can be designed specifically to focus on higher-order Zernike moments so that structural features of the sub-components are weighted more heavily than the overall shape of the assembly.
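One way such an order-focused distance could be written is sketched below. The layout of the descriptor as a mapping from Zernike order to a list of invariant values, the function name, and the default weight that grows linearly with the order are all hypothetical choices made for illustration; an actual weighting would be fitted to a suitable training set.

```python
def order_weighted_distance(desc_a, desc_b, weight=lambda order: float(order)):
    """Weighted L1 distance between two descriptors grouped by Zernike order.

    desc_a, desc_b: dicts mapping order n -> list of rotation-invariant values.
    weight:         function of the order; larger values for higher orders
                    emphasize fine structural detail over overall shape.
    """
    total = 0.0
    for order in sorted(set(desc_a) & set(desc_b)):
        a, b = desc_a[order], desc_b[order]
        if len(a) != len(b):
            raise ValueError(f"order {order}: descriptor lengths differ")
        total += weight(order) * sum(abs(x - y) for x, y in zip(a, b))
    return total
```

With the default weight, a discrepancy at order 6 counts three times as much as the same discrepancy at order 2, so two components that differ only in their overall envelope stay close.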

Protein structure retrieval and alignment with volumetric data provides several advantages

At the same time, volumetric data preserves information content far better than shape descriptions based on a surface representation. Thus, as the benchmarks above demonstrate, our volume-based system has clear advantages over surface-based methods [

One important disadvantage of this method is its inability to find local matches, e.g., a conserved domain shared by two chains that have different domain architectures and thus do not match globally. However, there are ways to work around the problem, for instance by decomposing both the query and the database into domains and performing domain-specific searches.

A key to the success of our method is the fact that the newly derived Canterakis normalizations (CNs) preserve much more information than the widely used 3DZDs. This is clearly demonstrated by the alignments that are naturally obtained from CNs.

As shown in the benchmark, our system outperforms other fast descriptor-based search approaches in terms of both precision and sensitivity. At the same time, it allows real-time (milliseconds) PDB archive-wide retrieval without resorting to ad hoc strategies for speeding up the calculation. Currently available services (Dali, TopSearch, PDBeFold) achieve run times of seconds to minutes only by relying on additional speed-up strategies such as pre-clustering or parallelization. Our system’s faster performance applies equally to user-supplied atomic coordinates, which add only minimal overhead, typically measured in milliseconds. Such a system constitutes a valuable tool for structural biologists, allowing for real-time hypothesis generation at the conclusion of a structure determination campaign.

Importantly, the speed and accuracy of this method open up the possibility of automated structural classification at any level, an avenue that we shall explore in future work. One interesting application is multiple structure profiles: using normalized complete moments to parameterize entire protein families. A related ‘consensus shapes’ notion was introduced in [

The score calculated by our implementation (axis

(TIF)

Dear Dr Duarte,

Thank you very much for submitting your manuscript "Preparing for the 3D data deluge: real time structural search at the scale of the PDB and beyond" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.


Sincerely,

Charlotte M Deane

Associate Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology


Reviewer's Responses to Questions

Reviewer #1: Review comments attached as pdf. Predominantly, what is needed is the clarification of the method through the use of non-ambiguous language.

Reviewer #2: The authors present an approach for fast retrieval of similar structures from PDB, using 3D Zernike moments. The proposed approach is an improvement over existing tools and the speed of the method is a big advantage and shows better performance in authors benchmark. The implementation seems robust and is a useful tool for the community. I have a few minor points that will help with user interpretation of results.

1) How does the approach perform (and rank) in case of partial alignments where only part of the larger protein is matched?

2) What are the safe thresholds of the total score to find close and distantly related structures? How much does Zernike descriptors contribute to the total score in general? A discussion would help users.

3) The method can deal with EM volumes but the current implementation and tests doesn’t include these. It is useful to discuss this although I noted that this is mentioned as part of future developments.

**********

Large-scale datasets should be made available via a public repository as described in the

Reviewer #1: No: The approach used for generating the dataset is included but seemingly not the list of structures that resulted. list/spreadsheet of optimised zernike/geometric distance metric weights could also be included.

Reviewer #2: No: The benchmark datasets used for tests


Dear Dr Duarte,

We are pleased to inform you that your manuscript 'Real time structural search of the Protein Data Bank' has been provisionally accepted for publication in PLOS Computational Biology.



***********************************************************

PCOMPBIOL-D-19-01993R1

Real time structural search of the Protein Data Bank

Dear Dr Duarte,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.


With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom