The authors have declared that no competing interests exist.

Conceived and designed the experiments: JKM JPB ALF AC TN BDW. Performed the experiments: JKM SO. Analyzed the data: JPB ALF JKM. Wrote the paper: JKM JPB ALF AC TN BDW. Led the computational work: JPB ALF AC.

Viral immune evasion by sequence variation is a major hindrance to HIV-1 vaccine design. To address this challenge, our group has developed a computational model, rooted in physics, that aims to predict the fitness landscape of HIV-1 proteins in order to design vaccine immunogens that lead to impaired viral fitness, thus blocking viable escape routes. Here, we advance the computational models to address previous limitations, and directly test model predictions against ^{−6}) are strongly correlated, and this was further strengthened in the regularized Ising model (^{−12}). Performance of the Potts model (^{−9}) was similar to that of the Ising model, indicating that the binary approximation is sufficient for capturing fitness effects of common mutants at sites of low amino acid diversity. However, we show that the Potts model is expected to improve predictive power for more variable proteins. Overall, our results support the ability of the computational models to robustly predict the relative fitness of mutant viral strains, and indicate the potential value of this approach for understanding viral immune evasion, and harnessing this knowledge for immunogen design.

At least 70 million people have been infected with HIV since the beginning of the epidemic, and an effective vaccine remains elusive. The high mutation rate and diversity of HIV strains enable the virus to effectively evade host immune responses, presenting a significant challenge for HIV vaccine design. We have developed an approach to translate clinical databases of HIV sequences into mathematical models quantifying the capacity of the virus to replicate as a function of mutations within its genome. We have previously shown how such “fitness landscapes” can be used to guide the design of vaccines to attack vulnerable regions from which it is difficult for the virus to escape by mutation. Here, using new modeling approaches, we have improved on our previous models of the HIV fitness landscape by accounting for undersampling of HIV sequences and the specific identity of mutant amino acids. We experimentally tested the accuracy of the improved models in predicting the fitness of HIV with multiple mutations in the Gag protein. The experimental data are in strong agreement with model predictions, supporting the value of these models as a novel approach for determining mutational vulnerabilities of HIV-1, which, in turn, can inform vaccine design.

The ideal way to combat the spread of HIV-1 is with an effective prophylactic or therapeutic vaccine

CD8+ T cells are instrumental in reducing viral load in HIV-1 acute infection

Our group has developed computational models to identify such vulnerable regions of the HIV-1 proteome and to predict the fitness landscape of HIV-1 proteins, providing tools for designing vaccine immunogens that may limit both HIV-1 evasion of CD8+ T cell responses and the development of compensatory mutations

This approach, however, does not allow us to determine precisely which residues should be targeted, as it does not quantify the relative replicative viability of viral strains bearing specific mutations. Nor does it identify viable escape routes that remain upon targeting residues in the vulnerable regions, or inform how best to block them. To begin to address these issues, we developed a computational model, rooted in statistical physics, which aims to predict the viral fitness landscape (viral fitness as a function of amino acid sequence) from sequence data alone and applied it to HIV-1 Gag

The idea underlying our approach is to first characterize the distribution of sequences in the population, which we expect to be correlated with fitness (see below). Due to the small number of available sequences compared to the size of the sequence space, direct estimation of the probability distribution characterizing the available sequences is precluded. Thus, we instead aim to infer the least biased probability distribution of sequences that fits the observed frequency of mutations at each site, and all correlations between pairs of mutations (the one- and two-point mutational probabilities). Mathematically, “least biased” implies the distribution that has maximum entropy in the information-theoretic sense
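As a minimal sketch of the quantities being fit, the one- and two-point mutational probabilities can be estimated from a binarized alignment as follows. The function name `mutation_frequencies` and the toy alignment are hypothetical and serve only to illustrate the bookkeeping; the encoding (0 for the consensus residue, 1 for any mutant) is an assumption.

```python
from itertools import combinations


def mutation_frequencies(msa):
    """Return one-point and two-point mutation frequencies for a binary MSA.

    msa: list of equal-length sequences, each a list of 0 (consensus)
    or 1 (mutant) entries. This encoding is an illustrative assumption.
    """
    n_seq = len(msa)
    n_sites = len(msa[0])
    # Fraction of sequences carrying a mutation at each site.
    one_point = [sum(seq[i] for seq in msa) / n_seq for i in range(n_sites)]
    # Fraction of sequences carrying mutations at both sites of each pair.
    two_point = {
        (i, j): sum(seq[i] * seq[j] for seq in msa) / n_seq
        for i, j in combinations(range(n_sites), 2)
    }
    return one_point, two_point


# Hypothetical four-sequence, three-site alignment.
toy_msa = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
one_point, two_point = mutation_frequencies(toy_msa)
```

In this toy alignment, sites 0 and 2 are never mutated together, so their two-point frequency is exactly zero even though both single mutations are observed.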

We expect more prevalent sequences to be more fit, consistent with expectations from simple models of evolution

Our aim in the current work is twofold. First, we present new advances in the inference and modeling of viral fitness landscapes that address previous theoretical and computational limitations. Second, we describe new

Our key hypothesis in formulating models of HIV fitness is that the prevalence of viruses with a given sequence, that is, how often the sequence is observed, is related to its fitness. Simply, fitter viruses should be more frequent in the population than those that are unfit. This hypothesis can be proven for some idealized evolutionary models

As described in our previous publication

In our original approach, we fit the Ising model parameters to precisely reproduce the observed one- and two-residue mutational correlations within the MSA. However, simultaneous mutations at certain pairs of residues were never observed. This led to another deficiency of our original modeling approach: pairs of mutations not observed in the MSA were predicted to be completely unviable (

In this work, we present three significant advances over our original model to predict viral fitness, which also address the aforementioned limitations. First, we incorporate Bayesian regularization into our fitting procedure to eliminate the prediction of zero replicative fitness for mutations not present within our MSA. Second, we implement a new algorithm for inferring an Ising model from sequence data, which dramatically accelerates the computation of model parameters. Third, we relax the binary approximation to infer viral fitness landscapes that explicitly retain the amino acid identities at each position. We achieve this by describing the viral fitness landscape using a multistate generalization of the Ising model known as the Potts model, another established and well-studied model in statistical physics.
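The difference between the binary approximation and a representation that retains amino acid identity can be illustrated with a short sketch. The helper names, the three-residue sequences, and the per-site alphabets below are hypothetical and are not taken from the Gag alignment.

```python
def binarize(sequence, consensus):
    """Binary (Ising-style) encoding: 0 if the residue matches consensus, else 1."""
    return [0 if a == c else 1 for a, c in zip(sequence, consensus)]


def potts_encode(sequence, alphabet_per_site):
    """Multistate (Potts-style) encoding: index of each residue in its site's alphabet."""
    return [alphabet.index(a) for a, alphabet in zip(sequence, alphabet_per_site)]


# Hypothetical consensus and two mutants differing only in the mutant residue.
consensus = "TQA"
mutant_e = "TEA"
mutant_k = "TKA"
# Hypothetical per-site alphabets, consensus residue listed first.
alphabets = [["T"], ["Q", "E", "K"], ["A"]]

binary_e = binarize(mutant_e, consensus)
binary_k = binarize(mutant_k, consensus)
potts_e = potts_encode(mutant_e, alphabets)
potts_k = potts_encode(mutant_k, alphabets)
```

Under the binary encoding the two mutants are indistinguishable, whereas the multistate encoding keeps them apart, which is the information a Potts model can use and an Ising model cannot.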

Inference of the parameters of the Ising models, commonly referred to as the inverse Ising problem, is a canonical inverse problem lacking an analytical solution that may be tackled in many ways

To control the effects of undersampling and to improve the predictive power of the inferred fitness models, we incorporate Bayesian regularization into our inference algorithm
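The effect of regularization on unobserved mutations can be seen in a deliberately simplified setting. The sketch below is not the inference procedure used in this work: it assumes a single independent site with mutation probability exp(−h)/(1 + exp(−h)) and a Gaussian prior of strength `gamma` on the field h, both of which are illustrative assumptions. Without the prior, a mutation never seen in the alignment yields an infinite field; with it, the maximum a posteriori field is finite.

```python
import math


def map_field(n_mutant, n_total, gamma, lo=-50.0, hi=50.0, tol=1e-10):
    """MAP estimate of a single-site field h under a Gaussian prior (gamma > 0).

    Assumed toy model: P(mutant) = exp(-h) / (1 + exp(-h)).
    Objective: log-likelihood - gamma * h**2 / 2. Its derivative,
    n_total * P(mutant) - n_mutant - gamma * h, is strictly decreasing
    in h, so bisection locates the unique maximum.
    """
    def gradient(h):
        p_mutant = 1.0 / (1.0 + math.exp(h))
        return n_total * p_mutant - n_mutant - gamma * h

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gradient(mid) > 0:
            lo = mid  # objective still increasing: the optimum lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# A mutation never observed in 1000 sequences: the unregularized estimate
# ln((N - k) / k) diverges, while the regularized field stays finite.
h_unseen = map_field(n_mutant=0, n_total=1000, gamma=1.0)

# With a very weak prior and an observed mutation, the estimate approaches
# the unregularized value ln((N - k) / k).
h_seen = map_field(n_mutant=10, n_total=1000, gamma=1e-9)
```

The same qualitative behavior, finite rather than infinite energies for mutation combinations absent from the MSA, is what regularization provides in the full pairwise models.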

In an algorithmic advance over our previous fitting procedure, we fit the parameters of our regularized Ising model using the selective cluster expansion algorithm of Cocco and Monasson

An ideal model of viral fitness would be able to capture the full (unknown) distribution of correlated mutations throughout the sequence, and thus reproduce the prevalence of every viral strain. Sequences in the MSA represent a sample of the possible strains of the virus, providing information about the distribution of point mutations, pairs of simultaneous mutations, triplets of simultaneous mutations, and all higher orders. However, since the number of available sequences in the MSA is very small compared to the size of the accessible sequence space, and because mutations at most sites are rare, higher order mutations will be severely undersampled. Thus, following our previous approach we appeal to the maximum entropy principle to seek the simplest possible model capable of reproducing the single site and pair amino acid frequencies

To introduce the Potts model, we represent the sequence of a particular m-residue protein as a vector, _{k}

In analogy with the statistical physics literature, we refer to E as a dimensionless “energy,” the function _{i}_{ij}
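In the standard pairwise Potts form, the energy of a sequence A is E(A) = Σ_i h_i(A_i) + Σ_{i<j} J_ij(A_i, A_j), with sequence prevalence proportional to exp(−E). The following sketch evaluates this energy for a given sequence. The data layout (nested lists for fields, a dictionary of coupling matrices) and all numerical values are hypothetical, as is the choice of assigning zero field and coupling to state 0.

```python
def potts_energy(sequence, fields, couplings):
    """Evaluate E(A) = sum_i h_i(A_i) + sum_{i<j} J_ij(A_i, A_j).

    sequence:  list of state indices, one per site.
    fields:    fields[i][a] is h_i(a).
    couplings: couplings[(i, j)][a][b] is J_ij(a, b) for i < j;
               pairs missing from the dictionary contribute zero.
    """
    energy = sum(fields[i][a] for i, a in enumerate(sequence))
    for (i, j), matrix in couplings.items():
        energy += matrix[sequence[i]][sequence[j]]
    return energy


# Hypothetical two-site model: site 0 has three states, site 1 has two.
fields = [[0.0, 2.0, 3.0], [0.0, 1.5]]
couplings = {(0, 1): [[0.0, 0.0], [0.0, -1.0], [0.0, 0.5]]}

e_reference = potts_energy([0, 0], fields, couplings)
e_compensated = potts_energy([1, 1], fields, couplings)
e_penalized = potts_energy([2, 1], fields, couplings)
```

In this toy model a negative coupling lowers the energy of the double mutant [1, 1] below the sum of its single-mutant fields, mimicking a compensatory pair, while a positive coupling raises the energy of [2, 1].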

To fit the Potts model, we implemented a generalization of the semi-analytical extension of the iterative gradient descent implemented by Mora and Bialek _{ij}^{2}), where

To test the accuracy of these models in predicting the fitness landscape of HIV-1 Gag, we performed

Mutant | Gag subunit | Category of pairs/triple | E |
186I | p24 | | 78.74 |
269E | p24 | | 43.43 |
186I269E | p24 | Sector 3 | Infinity |
295E | p24 | | 22.81 |
186I295E | p24 | Sector 3, high E | Infinity |
181R | p24 | | 44.62 |
310T | p24 | | 6.26 |
181R310T | p24 | Sector 3, high E | Infinity |
182S | p24 | | 25.13 |
198V | p24 | | Infinity |
182S198V | p24 | Sector 3, high E | Infinity |
179G | p24 | | 56.09 |
229K | p24 | | 44.63 |
179G229K | p24 | Sector 3, high E | 97.01 |
174G | p24 | | Infinity |
243P | p24 | | 66.65 |
174G243P | p24 | Sector 3, high E | Infinity |
168I | p24 | | 38.58 |
315G | p24 | | 19.11 |
168I315G | p24 | HLA-associated, high E | Infinity |
331R | p24 | | 11.77 |
186I331R | p24 | HLA-associated, high E | Infinity |
302R | p24 | | 11.10 |
302R315G | p24 | HLA-associated, high E | Infinity |
315G331R | p24 | HLA-associated, high E | Infinity |
190I | p24 | | 41.52 |
190I302R | p24 | HLA-associated, high E | Infinity |
219Q | p24 | | 6.73 |
242N | p24 | | 8.68 |
219Q242N | p24 | p24, low E, compensatory | 10.80 |
146P | p24 | | 7.22 |
147L | p24 | | 3.42 |
146P147L | p24 | p24, low E, compensatory | 6.58 |
326S | p24 | | 4.59 |
310T326S | p24 | p24, low E, sector 3 | 10.53 |
173T | p24 | | 5.92 |
173T286K | p24 | | 4.89 |
173T286K147L | p24 | p24, low E, triple | 4.12 |
12K | p17 | | 3.74 |
12K54A | p17 | p17, low E | 4.84 |
86F | p17 | | 8.00 |
92M | p17 | | 8.74 |
86F92M | p17 | p17, high E | Infinity |

Energy cost predicted by original Ising model

Mutation pairs within an immunologically vulnerable group of co-evolving residues, termed sector 3, that we previously identified qualitatively

The tested mutants can be divided into 4 categories, _{ij}

We introduced these mutation combinations into the HIV-1 NL4-3 plasmid by site-directed mutagenesis and their presence was confirmed by sequencing, as described previously

The values of E predicted by our original and new modeling approaches for the 43 HIV-1 NL4-3 Gag mutants tested here are shown in ^{−11}, two-tailed test).

Mutant | Gag subunit | Ising E | Regularized Ising E | Regularized Potts E |
186I | p24 | 78.74 | 9.98 | 11.24 |
269E | p24 | 43.43 | 11.46 | 12.18 |
186I269E | p24 | Infinity | 17.77 | 18.97 |
295E | p24 | 22.81 | 9.05 | 11.03 |
186I295E | p24 | Infinity | 15.36 | 17.79 |
181R | p24 | 44.62 | 13.55 | 12.12 |
310T | p24 | 6.26 | 5.87 | 7.20 |
181R310T | p24 | Infinity | 15.74 | 14.87 |
182S | p24 | 25.13 | 7.11 | 9.68 |
198V | p24 | Infinity | 12.32 | - |
182S198V | p24 | Infinity | 15.77 | - |
179G | p24 | 56.09 | 11.14 | 11.57 |
229K | p24 | 44.63 | 10.52 | 11.68 |
179G229K | p24 | 97.01 | 17.99 | 18.81 |
174G | p24 | Infinity | 15.47 | 11.71 |
243P | p24 | 66.65 | 11.10 | 11.08 |
174G243P | p24 | Infinity | 22.90 | 18.32 |
168I | p24 | 38.58 | 9.80 | 10.30 |
315G | p24 | 19.11 | 6.85 | 10.64 |
168I315G | p24 | Infinity | 14.78 | 16.39 |
331R | p24 | 11.77 | 7.37 | 9.17 |
186I331R | p24 | Infinity | 13.68 | 15.85 |
302R | p24 | 11.10 | 7.75 | 9.23 |
302R315G | p24 | Infinity | 12.40 | 15.30 |
315G331R | p24 | Infinity | 10.56 | 15.22 |
190I | p24 | 41.52 | 8.20 | 11.41 |
190I302R | p24 | Infinity | 12.28 | 16.12 |
219Q | p24 | 6.73 | 5.65 | 6.90 |
242N | p24 | 8.68 | 6.70 | 8.05 |
219Q242N | p24 | 10.80 | 8.04 | 10.07 |
146P | p24 | 7.22 | 5.62 | 6.26 |
147L | p24 | 3.42 | 4.25 | 6.54 |
146P147L | p24 | 6.58 | 5.77 | 4.74 |
326S | p24 | 4.59 | 4.78 | 5.69 |
310T326S | p24 | 10.53 | 7.72 | 8.81 |
173T | p24 | 5.92 | 5.81 | 7.02 |
173T286K | p24 | 4.89 | 6.56 | 7.75 |
173T286K147L | p24 | 4.12 | 5.93 | 6.78 |
12K | p17 | 3.74 | 1.91 | 4.38 |
12K54A | p17 | 4.84 | 3.19 | 5.63 |
86F | p17 | 8.00 | 4.53 | 6.00 |
92M | p17 | 8.74 | 6.01 | 9.43 |
86F92M | p17 | Infinity | 9.52 | 12.57 |

Ising E: E is 2.98 for wild-type NL4-3 p24 and 3.43 for wild-type NL4-3 p17.

Regularized Ising E: E is 3.67 for wild-type NL4-3 p24 and 1.64 for wild-type NL4-3 p17.

Regularized Potts E: E is 4.43 for wild-type NL4-3 p24 and 2.81 for wild-type NL4-3 p17.

The 198V mutation was not observed within the MSA used to fit the Potts model, precluding the fitted model from assigning an energy to viral strains containing this point mutation.

The

Graphs show replication capacities of NL4-3 viruses encoding (A) Gag p24 mutation pairs with high E values that were previously identified to be in vulnerable co-evolving groups

Briefly, none of the Gag p24 sector 3 mutation pairs with high E values were viable in our assay system, and all were assigned a replication capacity of zero (

Those p24 mutation combinations, including known compensatory pairs, that were predicted to have low E values displayed replication capacities similar to that of wild-type NL4-3, indicating that these combinations had little or no cost to HIV-1 replication capacity in accordance with predictions (

Overall, the fitness measurement disagreed with the E value prediction of high or low fitness cost for only two of the 17 mutant pairs (86F92M and 315G331R). Notably, this disparity between E values and measured replication capacities is somewhat mitigated in the regularized models. Both mutants were assigned an E value of infinity by the original Ising model, but their regularized Ising E values are lower than those of the other mutants previously assigned infinite energies, and the same is true for mutant 86F92M in the regularized Potts model.

Next, we assessed the relationship between fitness measurements and E values predicted by our original Ising, regularized Ising and regularized Potts models using Pearson's correlation tests. There is a strong correlation between the metric of fitness (values of E, ^{−6}, two-tailed) (^{−12}, two-tailed) (^{−11}, two-tailed). There is also a strong agreement between the residue-specific Potts model energies and replication capacity (Pearson's correlation, ^{−9}, two-tailed) (
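For reference, the Pearson statistic used in these comparisons can be computed as in the sketch below. Because the measured replication capacities are not reproduced in the text, the example instead correlates the regularized Ising and regularized Potts E values of five mutants from the table above (186I, 269E, 186I269E, 310T, 147L). This pairing is chosen only to exercise the function and is not one of the analyses reported here; the function name `pearson_r` is likewise an illustrative choice.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# E values for 186I, 269E, 186I269E, 310T and 147L, taken from the table above.
regularized_ising_e = [9.98, 11.46, 17.77, 5.87, 4.25]
regularized_potts_e = [11.24, 12.18, 18.97, 7.20, 6.54]

r_models = pearson_r(regularized_ising_e, regularized_potts_e)
```

Replacing one of the two lists with measured replication capacities for the same mutants would give the kind of energy versus fitness correlation described in the text, which is expected to be negative because higher E corresponds to lower fitness.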

Scatter plots showing strong correlations between measured replication capacities of mutants and E values predicted by (A) original Ising (Pearson's correlation,

In practice, one may be concerned with a more coarse-grained measure of viral fitness: will a virus with a given sequence be able to replicate with similar efficiency to the wild-type, or will it be significantly impaired? To explore this point, we grouped the experimentally tested mutants into two categories, “fit” (

Graphs show the ability of E classifiers, predicted by regularized Ising (panel A) and Potts (panel B) models, to correctly classify HIV-1 NL4-3 Gag mutants into unfit (
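One common way to summarize how well an energy threshold separates two groups is the area under the ROC curve, which equals the probability that a randomly chosen unfit mutant has a higher E than a randomly chosen fit mutant. The sketch below computes this quantity and applies a single threshold. It is an illustration under stated assumptions rather than the analysis performed here: the function names, the energies, and the threshold are all hypothetical toy values, not measurements from this study.

```python
def auc_from_energies(unfit_energies, fit_energies):
    """Probability that a random unfit mutant has higher E than a random fit one.

    Ties count as one half. This equals the area under the ROC curve for a
    classifier that labels a mutant "unfit" when its E exceeds a threshold.
    """
    wins = 0.0
    for u in unfit_energies:
        for f in fit_energies:
            if u > f:
                wins += 1.0
            elif u == f:
                wins += 0.5
    return wins / (len(unfit_energies) * len(fit_energies))


def classify(energy, threshold):
    """Label a mutant by comparing its predicted E to a chosen threshold."""
    return "unfit" if energy > threshold else "fit"


# Hypothetical toy energies, chosen only to exercise the functions.
toy_unfit = [15.0, 12.0, 9.0]
toy_fit = [5.0, 7.0, 10.0]

toy_auc = auc_from_energies(toy_unfit, toy_fit)
toy_labels = [classify(e, threshold=8.0) for e in toy_unfit + toy_fit]
```

A value of 1 would indicate perfect separation of the two groups by E, and a value of 0.5 would indicate no discriminating power.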

In this study, we have substantially advanced our modeling approaches and tested the predictive power of these models by

Simple theoretical analysis suggests that models which differentiate between different mutant amino acids at the same site, like the Potts model employed here, will be necessary to make fitness predictions for highly mutable proteins such as Env and Nef, or to predict the fitness of sequences containing sites with mutations to less frequently observed amino acids. Using a simple toy model, we show in

While this study confirms the usefulness of this method for predicting HIV-1 replicative fitness, at least for closely related sequences, caution will be necessary in applying this method to predict the relative fitness of multiple strains separated by a large number of mutations. In the measure of prevalence used to infer the Ising and Potts model fitness landscapes, factors such as phylogeny are implicitly included. Analysis conducted in _{i}, and that a correction should be included for predictions of energy or fitness. This form of a correction is sensible, as phylogenetic effects should make mutations at individual residues less frequent, leading to larger inferred fields. For closely related strains such as those studied here experimentally, any systematic inaccuracies in the energy due to phylogeny should be similar in magnitude, and thus

In addition to phylogeny, other factors such as host-pathogen interactions and pure stochastic fluctuations affect the observed distribution of sequences, and could complicate fitness predictions. In another work

We also note that some caution should be taken in comparing E values for sequences belonging to different proteins. The fitness predictions of the Ising and Potts models are unchanged by a constant shift in energy for all sequences, thus comparisons of absolute energy values are not physically meaningful. Differences in energy between two sequences in the same protein, however, can be unambiguously interpreted as the fitness ratio of those sequences. This is the approach we have taken when examining E values from sequences with mutations in p17 and p24 together: rather than comparing the absolute energies, we compare the differences in energy between the mutant and the NL4-3 reference sequence in each protein, which reflect the fitness of the mutant relative to the NL4-3 reference sequence. Finally, translation from differences in energy to differences in fitness might depend on the specific protein that is being considered. While comparisons of energy differences and relative fitnesses of p17 and p24 mutants performed here exhibit no obvious incongruences, further study is needed to confirm the generality of fitness predictions across proteins.
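The comparison of energy differences, rather than absolute energies, can be sketched as follows using regularized Ising E values from the table above. Two assumptions are made and should be read as such: that the wild-type values of 3.67 (p24) and 1.64 (p17) reported above belong to the regularized Ising model, and that prevalence is proportional to exp(−E), so that exp(−ΔE) gives the predicted prevalence of a mutant relative to the NL4-3 reference. The variable and function names are illustrative.

```python
import math

# Assumed regularized Ising energies of the wild-type NL4-3 reference sequences.
WILD_TYPE_E = {"p24": 3.67, "p17": 1.64}

# Regularized Ising E values for four mutant pairs from the table above.
MUTANTS = {
    "219Q242N": ("p24", 8.04),
    "186I269E": ("p24", 17.77),
    "12K54A": ("p17", 3.19),
    "86F92M": ("p17", 9.52),
}


def delta_e(protein, energy):
    """Energy of a mutant relative to the wild-type reference of the same protein."""
    return energy - WILD_TYPE_E[protein]


deltas = {name: delta_e(protein, e) for name, (protein, e) in MUTANTS.items()}

# Assumed Boltzmann-like form: predicted prevalence relative to the reference.
relative_prevalence = {name: math.exp(-d) for name, d in deltas.items()}
```

Working with ΔE removes the arbitrary constant offset in each protein's energy scale, so a p17 mutant and a p24 mutant are each judged against their own reference sequence.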

In the case of two mutant pairs (186I295E and 186I331R) that were predicted by the models to have very low fitness, partial reversions and/or additional mutations spontaneously arose in culture that restored virus viability. However, with the exception of one of the spontaneous mutations (260D) observed in combination with 186I331R that modestly decreased the predicted energy (increased fitness) in the regularized Ising model but not the Potts or original Ising models, the models do not predict lower energies (increased fitness) for these mutant pairs in combination with the additional mutations arising

In future work, the model predictions will be further validated in animal models by testing the viable escape pathways predicted to emerge following immunization with immunogens containing vulnerable HIV-1 regions only. The validated fitness landscape could then be used to design vaccine immunogens containing epitopes from the vulnerable regions that could be presented by people with diverse HLAs and that target residues particularly harmful to HIV-1 when mutated simultaneously, thereby substantially diminishing viral fitness and/or blocking viable mutational escape
