The authors have declared that no competing interests exist.

Conceived and designed the experiments: SG FL MM DT SM EB XL JC. Performed the experiments: SG FL MM DT SM EB XL JC. Analyzed the data: SG FL MM DT SM EB XL JC. Contributed reagents/materials/analysis tools: SG FL MM DT SM EB XL JC. Wrote the paper: SG FL MM DT SM EB XL JC.

The discovery of peptides possessing high biological activity is very challenging because of the enormous diversity of peptides, of which only a minority have the desired properties. To lower costs and reduce the time needed to obtain promising peptides, machine learning approaches can greatly assist in the process and even partly replace expensive laboratory experiments by learning a predictor from existing data or from a smaller amount of newly generated data. Unfortunately, once the model is learned, selecting peptides having the greatest predicted bioactivity often requires a prohibitive amount of computational time. For this combinatorial problem, heuristics and stochastic optimization methods are not guaranteed to find adequate solutions. We focused on recent advances in kernel methods and machine learning to learn a predictive model with proven success. For this type of model, we propose an efficient algorithm based on graph theory that is guaranteed to find the peptides for which the model predicts maximal bioactivity. We also present a second algorithm capable of sorting the peptides of maximal bioactivity. Extensive analyses demonstrate how these algorithms can be part of an iterative combinatorial chemistry procedure to speed up the discovery and the validation of peptide leads. Moreover, the proposed approach does not require the use of known ligands for the target protein, since it can leverage recent multi-target machine learning predictors where ligands for similar targets can serve as initial training data. Finally, we validated the proposed approach in vitro with the discovery of new cationic antimicrobial peptides. Source code freely available at

Part of the complexity of drug discovery is the sheer chemical diversity to explore, combined with all the requirements a compound must meet to become a commercial drug. Hence, it makes sense to automate this chemical exploration endeavor in a wise, informed, and efficient fashion. Here, we focused on peptides as they have properties that make them excellent drug starting points. Machine learning techniques may replace expensive

Drug discovery faces important challenges in terms of cost, complexity and the amount of time required to yield promising compounds. To avoid side effects, a valuable drug precursor must have high affinity with the target protein while minimizing interactions with other proteins. Unfortunately, only a few have such properties and these have to be identified from an astronomical number of candidate compounds. Other factors, such as bioavailability and stability have to be considered; but this combinatorial search problem, by itself, is very challenging [

For novel and less studied targets, screening compound libraries remains the method of choice for rapid data generation. To fully exploit the great conformational and functional diversity, combinatorial peptide chemistry is certainly a powerful tool [

However, it is important to note that combinatorial peptide chemistry cannot cover a significant part of the peptide diversity when peptides are longer than a few amino acids. For example, 2^{9} peptides. Consequently, it is almost certain that the best peptides will not be present and most synthesized peptides will have low bioactivity. Hence, drug discovery is a combinatorial problem which, unfortunately, cannot be solved using combinatorial chemistry alone. The process of discovering novel compounds with both high bioactivity and low toxicity must therefore be optimized.
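To give a sense of the scale involved, the short sketch below counts the distinct peptides of a given length. It assumes an alphabet of the 20 standard amino acids (modified residues are ignored), and the function name `peptide_space_size` is ours, introduced only for illustration.

```python
# Size of the peptide search space for a given peptide length.
# Assumption: only the 20 standard amino acids are considered.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_space_size(length: int, alphabet: str = AMINO_ACIDS) -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return len(alphabet) ** length


# Pentapeptides and pentadecapeptides, the two lengths studied in this work.
for n in (5, 15):
    print(n, peptide_space_size(n))
```

The count grows by a factor of 20 with each additional residue, which is why exhaustive synthesis quickly becomes impractical.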

Machine learning and kernel methods [

This work explores the use of learning algorithms to design and enhance the pharmaceutical properties of compounds [_{50}, etc), we expect that a state-of-the-art kernel method will give a bioactivity model which is sufficiently accurate to find new peptides with activities higher than the 100 used to learn the model. This is possible because each peptide that possesses a small binding affinity contains information about subsequences of residues that can bind to the target. Learning a model can accelerate, but not solve, this costly process.

We demonstrate that for a large class of kernel based models, it is possible to design an efficient algorithm guaranteed to find the peptide of maximal predicted bioactivity. This algorithm makes use of graph theory and recent work [

String kernels are symmetric positive semi-definite similarity functions between strings. In our context, strings are sequences of amino acids. Such kernels have been widely used in applications of machine learning to biology. For example, the local-alignment kernel [

To characterize the similarity between peptides, two different

Meinicke and colleagues [_{p} is a parameter that controls the length of the decay.

Toussaint and colleagues [_{1}(_{2}(_{d}(_{i}(_{1}, …, _{k} and

More recently, the Generic String (GS) kernel was proposed for small biological sequences and pseudo-sequences of binding interfaces [_{p}, _{c} are chosen by cross-validation.

This GS kernel is very versatile since, depending on the chosen hyper-parameters, it can be specialized to eight known kernels [
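As an illustration, the following sketch implements a GS-style kernel in the form we understand from the cited GS kernel work: every pair of substrings of length at most k is compared, with one Gaussian term penalizing the distance between their positions and another penalizing the distance between their encoded residues. The exact expression is not legible in this text, so the formula, the function name `gs_kernel`, and the two-dimensional `TOY_ENCODING` should all be read as assumptions rather than as the reference implementation. The sketch also requires strictly positive values for both sigma parameters.

```python
import math

# Hypothetical residue encoding (two made-up properties per residue).
TOY_ENCODING = {"A": (0.0, 1.0), "K": (1.0, 0.0), "W": (0.5, 0.5)}


def gs_kernel(x, y, k, sigma_p, sigma_c, encode):
    """GS-style similarity between strings x and y (sketch, sigma > 0)."""
    total = 0.0
    for l in range(1, k + 1):                      # substring length
        for i in range(len(x) - l + 1):            # start position in x
            u = [v for a in x[i:i + l] for v in encode[a]]
            for j in range(len(y) - l + 1):        # start position in y
                w = [v for a in y[j:j + l] for v in encode[a]]
                sq_dist = sum((p - q) ** 2 for p, q in zip(u, w))
                # Penalize substrings that start far apart.
                pos_term = math.exp(-((i - j) ** 2) / (2 * sigma_p ** 2))
                # Penalize substrings whose encoded residues differ.
                sim_term = math.exp(-sq_dist / (2 * sigma_c ** 2))
                total += pos_term * sim_term
    return total


print(gs_kernel("AKW", "WKA", k=2, sigma_p=1.0, sigma_c=1.0, encode=TOY_ENCODING))
```

In practice the encoding would hold physico-chemical descriptors of the amino acids, and the three hyper-parameters would be selected by cross-validation.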

Recently [

In the binary classification setting, the learning task is to predict whether a peptide has a specific property such as binding to a target molecule. In the regression setting, the learning task is to predict a real value that quantifies the quality of a peptide, for example, its bioactivity, inhibitory concentration, binding affinity, or bioavailability. In contrast to classification and regression, the task we consider here (described in the next section) is ultimately to predict a string of amino acids.

In this paper, each learning example ((

A predictor is a function

In contrast, a predictor _{y}(

Given a training set {((_{1}, _{1}), _{1}), …, ((_{m}, _{m}), _{m})}, a large class of learning algorithms produce multi-target predictors _{q} is the weight on the _{q}(_{y} from a multi-target predictor, we use _{y} is target-specific predictor learned only with peptides binding to _{q}(_{q}. The remainder of this manuscript will focus on target-specific predictor in the form of

The weight vector

For the sake of comparison, we would like to highlight that when _{q}(_{p} = 0, and _{c} = 0 the predictor _{y}(_{q}(_{p}, and _{c} can be arbitrary, the class of predictors we consider here is much more general.

Indeed, a PSWM consists of a position frequency matrix _{i, j} denotes the frequency of the
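The sketch below builds a position frequency matrix from a set of equal-length peptides and scores a peptide with it. Scoring by the sum of log-frequencies and smoothing with a pseudocount are common conventions that we assume here; the function names and the three-letter alphabet are hypothetical and serve only to keep the example small.

```python
import math
from collections import Counter


def position_frequency_matrix(peptides, alphabet, pseudocount=1.0):
    """One dict per position mapping each residue to its smoothed frequency."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    denom = len(peptides) + pseudocount * len(alphabet)
    matrix = []
    for i in range(length):
        counts = Counter(p[i] for p in peptides)
        matrix.append({a: (counts[a] + pseudocount) / denom for a in alphabet})
    return matrix


def pswm_score(peptide, matrix):
    """Additive score: each position contributes independently of the others."""
    return sum(math.log(matrix[i][a]) for i, a in enumerate(peptide))


matrix = position_frequency_matrix(["AKW", "AKA", "WKW"], alphabet="AKW")
# Because positions are independent, the best peptide is found position by position.
best = "".join(max(col, key=col.get) for col in matrix)
print(best, pswm_score(best, matrix))
```

The position-by-position maximization in the last lines works only because a PSWM treats positions independently, which is exactly the restriction that the more general class of predictors considered here removes.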

The main motivation for learning a predictor from training data is that, once an accurate predictor is obtained, finding druggable peptides would be greatly facilitated. It is true that replacing or reducing the number of expensive laboratory experiments by an _{y}, has the maximal bioactivity with _{y}, the contribution of an amino acid at a certain position also depends on the _{y}(

When facing such a task, heuristics and stochastic optimization methods were generally the methods of choice [

In the next section, we present an efficient algorithm guaranteed to solve

Here, we assume that we have, for a fixed target _{y}(_{y}(

In this graph, every source-sink path represents a peptide of size 5 (

A directed bipartite graph is a graph whose vertices can be divided into two disjoint sets such that every directed edge connects a vertex of the first set to the second set. The construction of the graph will proceed as follows.

Let _{i} contains all tuples (_{i} = ((_{i}, _{i+1}), _{i}) be the _{i} is defined as follows. Similarly as in the de Bruijn graph, there is a directed edge ((_{i} to (_{i+1} if and only if the last _{i} only go from vertices in _{i} to vertices in _{i+1}. There are exactly _{i} and there are exactly _{i+1}. Moreover, for any chosen integer _{i} to a vertex in _{i+1}.

We define a

Finally, let ^{hy} be a _{1} and from all nodes of _{n} to

Throughout this manuscript, we will only focus on paths starting at _{1}, …, _{n}, _{i} ∈ _{i}. For example, in

Let us now describe how edges in ^{hy} are weighted in order for the length of a path associated to _{y}(_{y}(^{hy} except those heading to the sink vertex _{t}(

For ^{hy} represents a unique peptide _{y}(

The problem of finding the peptide of highest predicted activity thus reduces to the problem of finding the longest path in ^{hy}. Despite being NP-hard in the general case, the longest path problem can be solved by dynamic programming in ^{hy} is clearly acyclic and its vertices can always be topologically ordered by visiting them in the following order: _{1}, …, _{n}, ^{hy} has
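The dynamic programming idea can be sketched as follows, under the assumption that the predicted bioactivity decomposes into a sum of position-specific k-mer contributions. The callable `weight(i, kmer)` stands for the edge weights and is hypothetical, as are `toy_weight` and `best_peptide`; the edge weights derived from an actual predictor are not reproduced here. Vertices of layer i are k-mers starting at position i, and an edge joins two k-mers that overlap on k − 1 residues.

```python
from itertools import product


def toy_weight(i, kmer):
    """Hypothetical edge weight: contribution of `kmer` starting at position i."""
    return ((31 * i + sum((j + 1) * ord(c) for j, c in enumerate(kmer))) % 17) / 4.0


def best_peptide(n, k, alphabet, weight):
    """Longest source-sink path in the layered k-mer graph: (score, peptide)."""
    if n < k:
        raise ValueError("peptide length n must be at least k")
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    # First layer: one edge from the source to every k-mer.
    best = {s: (weight(0, s), s) for s in kmers}
    # Layers are visited in topological order; each vertex keeps its best prefix.
    for i in range(1, n - k + 1):
        nxt = {}
        for s, (score, prefix) in best.items():
            for a in alphabet:
                t = s[1:] + a                      # successor overlaps on k-1 residues
                cand = score + weight(i, t)
                if t not in nxt or cand > nxt[t][0]:
                    nxt[t] = (cand, prefix + a)
        best = nxt
    # Edges to the sink carry no weight, so the answer is the best last-layer vertex.
    return max(best.values())


score, peptide = best_peptide(n=5, k=2, alphabet="AKW", weight=toy_weight)
print(peptide, score)
```

This sketch visits each of the n − k + 1 layers once and relaxes |A| outgoing edges for each of the |A|^k vertices of a layer, so it avoids enumerating all |A|^n peptides.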

Note that

Small values of ^{hy} is given in

In the previous section, we demonstrated how the problem of finding the peptide of greatest predicted bioactivity was reduced to the problem of finding a path of maximal length in the graph ^{hy}. By using the same arguments, finding the peptide with the second greatest predicted activity reduces to the problem of finding the second longest path in ^{hy}. By induction, it follows that the problem of finding the ^{hy}. The closely-related ^{hy}, Yen’s algorithm for the ^{hy}. It uses a variant of the longest path algorithm presented in the previous section that allows a path to start from any node of the graph. Lawler’s improvement is omitted from the presented algorithm to avoid unnecessary confusion, but it is part of the implementation we provide. The time complexity of the resulting algorithm is competitive with the latest work on
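For readers who want a compact way to experiment with ranked enumeration, the sketch below lists the K highest-scoring peptides on a layered k-mer graph. It is not Yen's algorithm: it uses a simpler k-best dynamic program in which every vertex keeps its K best prefixes, which is enough for a layered acyclic graph. The additive scoring through `weight(i, kmer)`, together with the names `toy_weight` and `k_best_peptides`, are assumptions made for illustration.

```python
import heapq
from itertools import product


def toy_weight(i, kmer):
    """Hypothetical edge weight: contribution of `kmer` starting at position i."""
    return ((31 * i + sum((j + 1) * ord(c) for j, c in enumerate(kmer))) % 17) / 4.0


def k_best_peptides(n, k, alphabet, weight, K):
    """K highest-scoring peptides of length n as (score, peptide), best first."""
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    # Each vertex stores at most K (score, prefix) pairs.
    best = {s: [(weight(0, s), s)] for s in kmers}
    for i in range(1, n - k + 1):
        cands = {}
        for s, entries in best.items():
            for a in alphabet:
                t = s[1:] + a
                w = weight(i, t)
                cands.setdefault(t, []).extend((sc + w, p + a) for sc, p in entries)
        # A K-best path into t must extend a K-best path into a predecessor.
        best = {t: heapq.nlargest(K, c) for t, c in cands.items()}
    everything = [e for entries in best.values() for e in entries]
    return heapq.nlargest(K, everything)


for score, peptide in k_best_peptides(n=4, k=2, alphabet="AKW", weight=toy_weight, K=5):
    print(peptide, score)
```

Since every peptide corresponds to exactly one source-sink path, the returned list contains no duplicates.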

The algorithm of _{p}, and _{c}.

Having the

It is easy to use the _{y}(_{y}(

Split and pool combinatorial peptide synthesis is a simple but efficient way to synthesize a very wide spectrum of peptide ligands. It has been used for the discovery of ligands for receptors [

Clearly, not every peptide has an equal probability of binding to a target. More restrictive protocols have been proposed to increase the hit ratio of this combinatorial experiment. For example, one could fix certain amino acids at specific positions or limit the set of possible amino acids at a given position (for example, only use hydrophobic amino acids). Such a practice will impact the outcome of the combinatorial experiment. One can probably increase the hit ratio by wisely modifying the proportion of amino acids that can be found at different positions in the peptides. To explore this possibility more thoroughly, let us define a (combinatorial chemistry)

We present a method for efficiently computing exact statistics on the screening outcome of a peptide library synthesized according to a protocol

Such statistics will, for example, assist chemists in designing a protocol with a greater hit ratio and avoid superfluous experiments. Furthermore, we will demonstrate in the next section that the computation of these statistics can be part of an iterative procedure to accelerate the discovery of bioactive peptides. Indeed, having the average predicted bioactivity will help with the design of a protocol that synthesizes as many potentially active candidates as possible. In addition, the predicted bioactivity variance will allow for better control of the exploration/exploitation trade-off of the experiment. Finally, as described in the previous section, a widely used practice for optimizing peptides is to assign residues at certain positions or restrict them to those that have specific properties such as charge or hydrophobicity. It is now possible to quantify how such a procedure will impact the bioactivity of combinatorially synthesized peptides.

The proposed approach makes use of the graph ^{hy}, the protocol _{y} when peptides are drawn according to the distribution
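A minimal sketch of how such exact statistics can be obtained without enumerating the library is given below. It assumes that a protocol assigns an independent distribution over amino acids to each position and that the predicted bioactivity is a sum of position-specific k-mer weights. The names `toy_weight`, `bioactivity_moments` and `protocol`, as well as the example proportions, are hypothetical. For each vertex, the forward pass accumulates the probability mass of the prefixes reaching it, together with the first and second moments of their partial scores.

```python
from itertools import product


def toy_weight(i, kmer):
    """Hypothetical edge weight: contribution of `kmer` starting at position i."""
    return ((31 * i + sum((j + 1) * ord(c) for j, c in enumerate(kmer))) % 17) / 4.0


def bioactivity_moments(n, k, protocol, weight):
    """Exact mean and variance of the score when residue j is drawn from protocol[j]."""
    # state[kmer] = (p, m1, m2): probability mass, sum of p*S and sum of p*S^2
    # over all prefixes ending with this k-mer, where S is the partial score.
    state = {}
    for letters in product(*(protocol[j].items() for j in range(k))):
        s = "".join(a for a, _ in letters)
        p = 1.0
        for _, q in letters:
            p *= q
        w = weight(0, s)
        state[s] = (p, p * w, p * w * w)
    for i in range(1, n - k + 1):
        nxt = {}
        for s, (p, m1, m2) in state.items():
            # The k-mer starting at position i ends with the residue at i + k - 1.
            for a, q in protocol[i + k - 1].items():
                t = s[1:] + a
                w = weight(i, t)
                p0, a1, a2 = nxt.get(t, (0.0, 0.0, 0.0))
                nxt[t] = (p0 + q * p,
                          a1 + q * (m1 + w * p),
                          a2 + q * (m2 + 2 * w * m1 + w * w * p))
        state = nxt
    mean = sum(m1 for _, m1, _ in state.values())
    second = sum(m2 for _, _, m2 in state.values())
    return mean, second - mean * mean


# Hypothetical protocol for peptides of length 4: position 1 is fixed to K.
protocol = [
    {"A": 0.5, "K": 0.5},
    {"K": 1.0},
    {"A": 0.2, "K": 0.3, "W": 0.5},
    {"W": 0.6, "A": 0.4},
]
print(bioactivity_moments(n=4, k=2, protocol=protocol, weight=toy_weight))
```

Changing the proportions in `protocol` and recomputing the two moments gives a quick way to compare candidate protocols before any synthesis.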

We propose an iterative process that makes use of the proposed algorithms to accelerate the discovery of bioactive peptides. The procedure is illustrated in _{y}. Next, _{y} is used for the generation of ^{hy} as described previously. A protocol

Two public datasets were used to test and validate our approach. The first dataset consisted of 101 cationic antimicrobial pentadecapeptides (CAMPs) from the synthetic antibiotic peptides database [

The second dataset consisted of 31 bradykinin-potentiating pentapeptides (BPPs) reported in [

To assess the capability of the proposed approach to improve upon known peptides, two experiments were carried out using the CAMPs and BPPs peptide datasets. For both experiments, a predictor of biological activity was learned by kernel ridge regression (KRR) for each dataset: _{CAMP} and _{BPP}. Hyper-parameters for the GS kernel (_{c}, _{p}) and the kernel ridge regression (_{c} = 6.4, _{p} = 0.8, and _{CAMP} and _{c} = 0.8, _{p} = 0.2, and _{BPP}.
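For reference, kernel ridge regression can be sketched in a few lines. The version below uses the standard closed form in which the weights solve (K + λI)β = y, and the predictor is the weighted sum of kernel values between the training peptides and a new peptide. Some formulations scale λ by the number of examples, so the exact regularization used in this work should be treated as unspecified here. The `match_kernel`, the training sequences and their activities are toy placeholders, not the GS kernel or data from the CAMPs or BPPs datasets.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def match_kernel(x, y):
    """Toy kernel: number of positions at which the two peptides agree."""
    return float(sum(a == b for a, b in zip(x, y)))


def krr_fit(peptides, activities, kernel, lam):
    """Weights beta solving (K + lam * I) beta = y."""
    gram = [[kernel(p, q) + (lam if i == j else 0.0)
             for j, q in enumerate(peptides)]
            for i, p in enumerate(peptides)]
    return solve(gram, activities)


def krr_predict(x, peptides, beta, kernel):
    """Predicted activity: weighted sum of similarities to the training peptides."""
    return sum(b * kernel(p, x) for b, p in zip(beta, peptides))


# Toy training set (hypothetical sequences and activities).
TRAIN_PEPTIDES = ["AKW", "AKA", "WKW", "KAA"]
TRAIN_ACTIVITIES = [1.0, 0.4, 0.7, 0.1]
beta = krr_fit(TRAIN_PEPTIDES, TRAIN_ACTIVITIES, match_kernel, lam=0.1)
print(krr_predict("AKK", TRAIN_PEPTIDES, beta, match_kernel))
```

With λ > 0 the fitted predictor does not reproduce the training activities exactly, which is the intended effect of regularization.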

Using the

For the CAMPs dataset, the proposed approach predicted that peptide WWKWWKRLRRLFLLV should have an antibacterial potency of 1.09, a logarithmic improvement of 0.266 over the best peptide in the training set (GWRLIKKILRVFKGL, 0.824), and a substantial improvement over the average potency of that dataset (0.39). The antimicrobial activity of the top 100,000 peptides is shown in

On the BPPs dataset, the proposed approach predicted that the pentapeptide IEWAK should have an activity of 2.195, slightly less than the measured activity of the best peptide of the training set (VEWAK, 2.73, predicted as 2.192). However, the predicted activity of IEWAK is much better than the average peptide activity of the dataset, which is 0.71. One may ask why IEWAK has a lower predicted biological activity than the measured activity of VEWAK, which was part of the training data. It is common for machine learning algorithms to sacrifice accuracy on the training data to prevent overfitting. Despite this small discrepancy, the model is very accurate on the training data (correlation coefficient of 0.97). Another possible explanation is that the measured biological activity of VEWAK could be slightly erroneous, as the learning algorithm could not find a simple model that fits such an outlier. Indeed, the predicted activity of VEWAK appears more coherent with the rest of the data than its measured activity.

Hence, our proposed learning algorithm predicts new peptides having biological activities equivalent to the best of the training set and, in some cases, substantially improved activities. The next section presents an

To further validate the approach, a number of antimicrobial peptides identified during the _{CAMP}, the same predictor used during the previous validation.

The two most active peptides of the CAMPs dataset (Peptides #5 and #6) were synthesized for comparison. We also synthesized one peptide with poor activity (Peptide #7) as a control. We used the proposed approach with the predictor _{CAMP} to generate a list of

Previously, we described a methodology (illustrated in

Hence, to validate the proposed methodology, we replaced the laboratory experiments that would quantify the bioactivity level of peptides by an oracle for each dataset. We chose to use _{CAMP} and _{BPP} as oracles as they represent, so far, the best understanding of the studied phenomena. These oracles are used to quantify the bioactivity level of randomly generated peptides and of those proposed by our methodology. Note that the examples used to learn the oracles are not available to our algorithm during the validation. Consequently, the validation method used was the following.

We randomly generated

To measure the bioactivities, we replaced the laboratory experiments by the oracle.

We used these random peptides of low bioactivities to learn a second predictor _{random}.

The predictor _{random} is used to initiate the graph-based approach. We then obtained the

The bioactivities of the new peptides are validated by the oracle (instead of performing laboratory experiments).

Finally, we compared the bioactivities of the initial set of peptides (randomly generated) and those proposed by our approach.

Dataset | Value of R | Random peptides: Average | Random peptides: Max. | Proposed peptides: Average | Proposed peptides: Max. | Correlation Coef. |
---|---|---|---|---|---|---|
CAMPs | 100 | −0.58 | 0.17 | 0.76 | 0.83 | 0.51 |
CAMPs | 1000 | −0.59 | 0.18 | 1.07 | 1.09 | 0.90 |
BPPs | 100 | 0.31 | 1.39 | 1.50 | 2.04 | 0.67 |
BPPs | 1000 | 0.26 | 1.36 | 1.66 | 2.20 | 0.93 |

Bioactivity comparison between the standard combinatorial screening (_{random} were computed using the oracle.

# | Peptide Sequence | MIC | MIC | Most Similar Peptide in the Training Set | % Similarity |
---|---|---|---|---|---|
1 | YWKKWKKLRRIFMLV | 2 | 8 | LWKLFKKIRRVLRVL | 40.0 |
2 | WWKRWKKLRRIFLML | 4 | 4 | LWKLFKKIRRVLRVL | 40.0 |
3 | WWKRWKRIRRIFMMV | 4 | 8 | LWKLFKKIRRVLRVL | 40.0 |
4 | WWKWWKRLRRLFLLV | 16 | 16 | LWKLFKKIRRLLKVL | 46.6 |
5 | KWKLFKGIRAVLKVL | 4 | 8 | - | - |
6 | GWRLIKKILRVFKGL | 4 | 4 | - | - |
7 | KWKLFLGILAVLKVL | > 32 | > 32 | - | - |

Minimal inhibitory concentration (MIC) from

As expected, on both datasets, the number of peptides drawn (

Using the same _{random}), we were able to reach an antimicrobial potency of 0.83 (according to oracle, not to the prediction of _{random}). Such antimicrobial potency is similar to the best peptide of the (unseen) CAMPs dataset and much better than the best of the

_{random} and the oracle accuracies on the CAMPs and BPPs databases. _{random} was used to predict the bioactivity values of unseen but _{random} predictions and the values in both databases. Since, in this simulation, _{random} was learned only with random peptides that, as pointed out above, have low bioactivity, it is interesting to evaluate its accuracy on these databases.

Correlation coefficients are shown in the last column of _{random} is bound to be less accurate than the oracle, these results demonstrate the capability of our approach to learn a predictor using low bioactivity peptides to obtain highly active ones.
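For completeness, a correlation coefficient between predicted and reference bioactivities can be computed as sketched below. We assume the Pearson coefficient is meant, since the text does not name the variant; the function name `pearson` and the example values are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical predicted and reference values for four peptides.
print(pearson([0.1, 0.4, 0.7, 1.0], [0.2, 0.3, 0.9, 1.1]))
```

A value close to 1 indicates that the predictor ranks and scales peptides much like the reference, even if individual predictions are offset.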

_{random} on the CAMPs data when varying R, the number of random peptides used for training. Near optimal accuracy is reached when _{random} is initiated with approximately

The results presented here demonstrate the ability of the proposed approach to predict potential functional motifs, and compare it with position-specific weight matrices (PSWMs), which can also be illustrated as motifs.

For the CAMPs dataset, we used _{CAMP} as the oracle and hid all peptides in this dataset from the rest of the procedure. Using the oracle, we predicted the best

Top motif: the best 1,000 peptides obtained from the oracle. Middle motif: the best 1,000 peptides obtained from _{random}. Bottom motif: the best 1,000 out of 1,000,000 random peptides.

Using only the predictor _{random}, trained on _{random}). The motif is shown in the middle panel of _{random}. To push the analysis even further, we also computed the motif when _{random} is trained with only

This provides evidence that the proposed approach could uncover complex signals for new, poorly understood, proteins. For example, one could learn a multi-target predictor for peptides binding to the major histocompatibility complex [

To compare our approach to PSWM, we took the same _{random} and generated a PSWM. The signal in the PSWM motif was very poor, generating a meaningless motif (not shown). We increased the number of random peptides to

This clearly illustrates the potential of the proposed approach for accelerating the discovery of potential peptidic effectors and, possibly, for achieving a better understanding of the binding mechanisms of polymorphic molecules.

We proposed an efficient graph-based algorithm to predict peptides with the highest biological activity for machine learning predictors using the GS kernel. Combined with a multi-target model, it can be used to predict binding motifs for targets with no known ligands.

To increase the hit ratio of combinatorial libraries, we demonstrated how a combinatorial chemistry protocol relates to a PSWM. This allowed us to compute the expected predicted bioactivity and its variance that can be exploited in combinatorial chemistry. These steps can be part of an iterative drug discovery process that will have immediate use in both the pharmaceutical industry and academia. This methodology will reduce costs and the time to obtain lead peptides as well as facilitating their optimization. Finally, the proposed approach was validated in a real world test for the discovery of new antimicrobial peptides. These

The authors thank Pascal Germain for his insightful comments.