Conceived and designed the experiments: TZ XFW. Performed the experiments: ZC. Analyzed the data: ZC TZ XFW. Contributed reagents/materials/analysis tools: XFW. Wrote the paper: ZC TZ XFW.

The authors have declared that no competing interests exist.

Influenza viruses have caused great loss of life around the world and continue to present a major public health challenge. Antigenic characterization based on the hemagglutination inhibition (HI) assay is one of the routine procedures for influenza vaccine strain selection. However, the HI assay is only a crude experiment reflecting the antigenic correlations among testing antigens (viruses) and reference antisera (antibodies). Moreover, antigenic characterization is usually based on more than one HI dataset, and combining multiple datasets results in an incomplete HI matrix with many unobserved entries. This paper proposes a new computational framework for constructing an influenza antigenic cartography from this incomplete matrix, which we refer to as Matrix Completion-Multidimensional Scaling (MC-MDS). In this approach, we first reconstruct the HI matrices with viruses and antibodies using low-rank matrix completion, and then generate the two-dimensional antigenic cartography using multidimensional scaling. For influenza HI tables with a herd immunity effect (such as those from human influenza viruses), we also propose a temporal model to reduce the inherent temporal bias of HI tables caused by herd immunity. By applying our method to HI datasets containing H3N2 influenza A viruses isolated from 1968 to 2003, we identified eleven clusters of antigenic variants, representing all major antigenic drift events in these 36 years. Our results showed that both the completed HI matrix and the antigenic cartography obtained via MC-MDS are useful in identifying influenza antigenic variants and thus can be used to facilitate influenza vaccine strain selection. The webserver is available at

Influenza antigenic cartography is an analog of geographic cartography: it projects influenza antigens onto a two- or three-dimensional map on which we can visualize and measure the antigenic distances between influenza antigens, just as we visualize and measure geographic distances between cities on a geographic map. Thus, influenza antigenic cartography can be used to identify influenza antigenic variants, and it is useful for influenza vaccine strain selection. Here we develop a new computational framework for constructing influenza antigenic cartography based on the hemagglutination inhibition assay, a routine antigenic characterization method in influenza surveillance and vaccine strain selection. This method can be used for antigenic characterization in vaccine strain selection for both seasonal influenza and pandemic influenza.

An influenza virus is a negative-stranded RNA virus that belongs to the

The peak influenza season in the northern hemisphere is from January to April every year. More than

In the United States, vaccination is the primary option for reducing the effects of influenza. The seasonal influenza vaccines used in the past decades include three viral components: H1N1 influenza A virus, H3N2 influenza A virus, and influenza B virus. In an effective vaccination program, vaccine strain selection is the most important step, since the highest protection can be achieved only if the HA and NA antigens of the vaccine and the epidemic virus match antigenically, especially HA, which is the primary target of the human immune system. However, as an RNA virus, influenza A virus mutates rapidly in these two proteins, and such mutations can change antigenicity, thus making vaccines ineffective. Mutations in HA and NA are also referred to as antigenic drift.

Immunological tests, such as the hemagglutination inhibition (HI) assay, enzyme-linked immunosorbent assay (ELISA), and microneutralization assay, have been used to identify antigenic variants among the circulating influenza strains. Among these assays, HI has been one of the routine procedures in influenza vaccine strain selection. The HI assay is an experiment that measures how a testing influenza antigen (virus) and a reference antiserum (antibody) react. The antibody is usually diluted in

In a typical influenza HI assay, generally less than

High reactors are in the diagonal zone, and low reactors and missing values appear more often when approaching the challenge zone.

The goal of this paper is to present a computational framework for influenza cartography construction, which we call Matrix Completion-Multidimensional Scaling (MC-MDS). An important aspect of this framework is that temporal modeling can be easily incorporated, which, as we shall show, is useful for dealing with HI tables with herd-immunity-induced temporal bias. Our framework includes two integrated steps: (1) a low rank matrix completion algorithm is first employed to fill in the entries of the HI matrix; (2) an MDS algorithm is used to map the antigens (or, similarly, antibodies) into a two-dimensional space for visualization. Our approach explicitly separates the visualization (cartography) step from the matrix completion step, making it easier to incorporate temporal models. Our experience shows that while temporal modeling is beneficial in both steps, it is less important in the first step, for which we may simply employ a sliding window approach; however, it is more essential in the second step, for which we propose a more complex herd-immunity temporal regularization model as described in the

While greater details are given in the

In this work we are specifically interested in HI datasets that reflect accumulating immunity, such as the immunological datasets of human origin. In a typical HI dataset, three types of data entries are present: Type I, a regular HI titre; Type II (low reactors), the value is defined as “less than a threshold”, e.g.

The specific benchmark dataset used in our study includes

Three types of data are present in HI data: Type I, a regular HI titre marked with a red cross; Type II, the value is defined as ‘less than a threshold’, e.g.

As pointed out above, most Type I data are located along the diagonal of the HI matrix, which deviates significantly from the “missing uniformly at random” assumption in classical matrix completion. In order to reduce this bias, we adopt a sliding window approach in which each low rank matrix completion is performed on an HI sub-matrix that contains fewer Type II and Type III entries and therefore satisfies the “missing uniformly at random” assumption more closely. The remaining entries that are not covered by the (sliding window) sub-matrices can be filled with a global matrix completion algorithm; those entries will be predicted with less accuracy because the banded structure of the HI data violates the “missing uniformly at random” assumption.
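The sliding window idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-year isolation arrays, the `complete_fn` callback, and the choice to average predictions from overlapping windows are all assumptions made for illustration.

```python
import numpy as np

def sliding_windows(antigen_years, serum_years, window):
    """Yield (start_year, antigen_idx, serum_idx) for each yearly window.

    Hypothetical helper: a window keeps the antigens and antisera whose
    isolation years fall within [start, start + window - 1].
    """
    antigen_years = np.asarray(antigen_years)
    serum_years = np.asarray(serum_years)
    first = int(min(antigen_years.min(), serum_years.min()))
    last = int(max(antigen_years.max(), serum_years.max()))
    for start in range(first, max(first, last - window + 1) + 1):
        end = start + window - 1
        a_idx = np.flatnonzero((antigen_years >= start) & (antigen_years <= end))
        s_idx = np.flatnonzero((serum_years >= start) & (serum_years <= end))
        if a_idx.size and s_idx.size:
            yield start, a_idx, s_idx

def complete_by_windows(hi, antigen_years, serum_years, window, complete_fn):
    """Complete each windowed sub-matrix and average overlapping predictions.

    `hi` uses NaN for unobserved entries; `complete_fn` is any routine that
    returns a filled copy of a sub-matrix. Averaging overlaps is an assumption.
    """
    total = np.zeros(hi.shape, dtype=float)
    count = np.zeros(hi.shape, dtype=float)
    for _, a_idx, s_idx in sliding_windows(antigen_years, serum_years, window):
        block = complete_fn(hi[np.ix_(a_idx, s_idx)])
        total[np.ix_(a_idx, s_idx)] += block
        count[np.ix_(a_idx, s_idx)] += 1
    out = np.full(hi.shape, np.nan)  # entries outside every window stay NaN
    covered = count > 0
    out[covered] = total[covered] / count[covered]
    return out
```

Entries left as NaN by this sketch are the ones that fall outside every window and would be handed to a global completion pass.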

The windows are based on the temporal spans of influenza A viruses. In order to complete the entire matrix, the algorithm slides yearly along both the antigen and antiserum dimensions to ensure that the time difference between any antigen and antiserum is within a certain window size. In order to obtain an optimal window size and the best rank in matrix completion, we tested six different sizes, including

Table: matrix completion results for each rank and window size (W).

After the matrix completion step, we need to project the influenza antigens onto a two-dimensional (2D) map. In order to obtain accurate global distances, we incorporate a temporal model in MDS based on the fact that the influenza antigens continue to evolve under the accumulating immune pressures of human population

In HI tables, a Type II value results from the experimental limitations of the HI assay and reflects a weak (or low) immunological reaction between a testing antigen/antiserum pair. Although this value is not as informative as a Type I value, it is more useful than a Type III value (missing value). In particular, if one virus has Type I values with a certain set of antibodies that show strong reactions, while another virus reacts weakly with the same set of antibodies (resulting in Type II values), then the global distance between their 2D cartography embeddings should be relatively large. A set of constraints on global distances can be derived from this observation. The details can be found in the

There are four parameters

To demonstrate the potential impacts of Type II data (low reactors) and Type III data (missing values) on the influenza cartography, we performed experiments using simulated HI matrices containing

To assess whether MC-MDS and metric MDS can accurately recover the HI values in the HI data, we calculated the local RMSEs for the Type I data using

HI recovering | Cartography construction | ||||

Local RMSE |
Robustness | Global distance measurement | |||

Correlation coefficient (CC value) |
Maximum distance(MD value) |
Pairwise distance RMSE (PD value) |
|||

Simulation | MC-MDS | ||||

Metric MDS | |||||

H3N2 | MC-MDS | — | |||

Metric MDS | — |

HI recovery ability is assessed by calculating the RMSE values on the Type I data using

A correlation coefficient (CC value) is calculated from the pairwise distances among antigens for every two independent runs. The CC values in this table were calculated from

A maximum distance (MD value) refers to the difference between the maximum distance among any antigens in the benchmark cartography and that from the method being evaluated (either MC-MDS or metric MDS). The MD values in this table were calculated from

A pairwise distance RMSE (PD value) is the difference between the pairwise distances among all antigens in the benchmark cartography and those from the method being evaluated. The PD values in this table were calculated from

The value in the bracket is the standard deviation of the associated parameter.

The effectiveness of a cartography construction algorithm can be assessed using figures of merit that measure its robustness and correctness. The robustness of a method is determined by the correlation coefficient (CC value) that is calculated from the pairwise distances among antigens for every two independent runs. The correctness of cartography is measured by two values: the difference between the maximum distances (MD value) between any antigens in the benchmark cartography and that from the method being evaluated (either MC-MDS or metric MDS); the pairwise distance RMSEs (PD value), calculated by measuring the difference between the pairwise distances among all antigens in the benchmark cartography and those from the method being evaluated. We performed

As specified in the

(a) HI matrix (

For comparison, we implemented the metric MDS method of

While these two methods achieve similar matrix completion accuracies, the reconstructed cartographies reveal a more significant difference. As we pointed out earlier, this is because the temporal bias (of data type distribution) in HI tables has stronger impact in the MDS step, especially when we compare global distances. Without temporal modeling, the accuracy of global distances between two points (representing two viruses) in the 2D cartography decays more rapidly when the two points become further apart in time. While this reduction of accuracy is an unavoidable limitation of the banded structure in HI tables (

Finally we summarize the main observations from this simulation study as follows. Both MC-MDS and metric MDS methods achieved similar accuracy in recovering HI values. This means that they achieve comparable performance in the matrix completion sub-task, which is less sensitive to the temporal bias problem in HI tables. However, without temporal modeling, the global distances among far away points in the reconstructed cartography become inaccurate. Therefore it is helpful to incorporate temporal modeling into the MDS step in order to reduce the temporal bias effect. The proposed MC-MDS framework (with herd-immunity temporal model) is effective in reducing the bias problem, and it leads to more accurate cartography. The metric MDS appears to be less stable and it generates less accurate cartographies because the method does not address the temporal bias problem.

In the second experiment, we use MC-MDS to construct influenza cartography for H3N2 influenza A viruses from 1968 to 2003 using the HI datasets from Smith

The

The resulting cartography can be compared to the published antigenic map in Smith

The influenza viruses labeled with the cluster names SI87, BE89, BE92, WU95 are defined by

Similar to the simulated HI data experiments, we can assess the robustness of MC-MDS and metric MDS on the H3N2 data (

From the

Each year, about

As an analog of geographic cartography, influenza cartography can be used to visualize and measure antigenic distances between influenza viruses. An essential criterion for a new influenza vaccine strain is significant antigenic divergence (e.g.

In this study, we proposed a new computational framework for constructing an influenza antigenic cartography, and demonstrated its usefulness in antigenic characterization. This computational framework has two integrated steps: (1) through a matrix completion algorithm, influenza antigenic distance matrices are constructed; (2) through MDS (with herd-immunity temporal model), influenza antigens (viruses) are projected onto a two-dimensional cartography. We specifically pay attention to the major challenge that is caused by the temporal bias in HI datasets. That is, the banded structure of HI entries indicates that the matrix entries are not missing uniformly at random (

Although MC-MDS is presented as a 2D cartography construction method in this paper, it can be extended easily for 3D (or even higher dimensional) cartography by modifying the resulting cartography dimension in the MDS step of our computational framework.

The temporal regularization in MC-MDS is based on the fact that the influenza antigens continue to evolve under the accumulating immune pressures of human population

Besides the immunological datasets for influenza viruses that face accumulating immunity from their hosts (such as those of human origin), there are other immunological datasets for influenza viruses whose mutations are not necessarily driven by accumulating immunity, such as those of swine or avian origin. For the latter case, our limited experiments in H5 and H7 studies suggested that users can apply MC-MDS directly without the temporal model (data not shown). However, there might be additional structures to explore in such data. This requires more extensive investigation in the future.

We introduced a new computational framework for influenza antigenic cartography construction from HI datasets. This approach, which we refer to as MC-MDS, integrates two mathematical procedures: matrix completion and MDS projection (with temporal modeling). Using the AGD matrix completion algorithm on HI datasets from 1968 to 2003, we successfully identified the eleven reported clusters of antigenic variants that represent major antigenic drift events during these

The benchmark HI dataset is adopted from

Within this dataset, three types of data points are present: Type I, a regular HI titre; Type II, the value is defined as ‘less than a threshold’, e.g.

To study the effect of temporal bias on influenza cartography, we simulate HI matrices with

In the HI matrix with Type II values, all the HI values no more than a titre of

The goal of matrix completion is to fill the missing entries in an incomplete matrix based on appropriate mathematical models of the matrix. It is a traditional mathematical problem that has been studied for many decades. Early contributions on this problem include Schur

If we do not consider the temporal bias effect, then the antigenic cartography task can be formulated as a matrix completion problem. Simply, in an HI matrix, there are

Here we describe the AGD matrix completion method, which is based on gradient descent. The AGD method assumes the low rank matrix completion model (1).

If type II data are not present, one can employ the following optimization formulation to estimate the missing values

The function

In the AGD method, we take

First, the algorithm uses SVD to obtain the factorization

We then apply the following alternating optimization procedure until convergence or until a certain number of iterations is reached.

Fix

Update

The first two steps are repeated until convergence or reaching a pre-defined number of iterations.
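The alternating structure described above (SVD initialization, then updating one factor while the other is fixed) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the AGD algorithm as published: the squared-error loss on observed entries, the L2 penalty `lam`, the fixed step size, and the function name `agd_complete` are all hypothetical choices, since the paper's objective (2) and its loss function are not reproduced here.

```python
import numpy as np

def agd_complete(hi, rank=2, lam=0.1, step=0.01, iters=2000):
    """Low-rank completion of `hi` (NaN = unobserved) by alternating gradient steps.

    Assumed objective (illustrative only):
        0.5 * sum over observed (i, j) of (U V^T - hi)_ij^2
        + 0.5 * lam * (||U||^2 + ||V||^2)
    """
    observed = ~np.isnan(hi)
    filled = np.where(observed, hi, 0.0)

    # Initial factorization from the SVD of the zero-filled matrix.
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    U = u[:, :rank] * np.sqrt(s[:rank])
    V = vt[:rank, :].T * np.sqrt(s[:rank])

    for _ in range(iters):
        # Fix V, take a gradient step on U.
        resid = np.where(observed, U @ V.T - filled, 0.0)
        U = U - step * (resid @ V + lam * U)
        # Fix U, take a gradient step on V.
        resid = np.where(observed, U @ V.T - filled, 0.0)
        V = V - step * (resid.T @ U + lam * V)

    return U @ V.T
```

A fixed iteration count stands in for the convergence check; in practice the step size would need tuning to the scale of the titre values.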

The gradient of

Although Type II data are not as informative as Type I data, they still provide useful information. Therefore we have to modify (2) to include Type II data. First we introduce threshold values

One advantage of this formulation is that we can employ any optimization algorithm that solves (2) to solve (5). We start with an initial estimate of

Let

Update

The procedure is a principled approach to handle Type II data, and it can be used with any algorithm that optimizes (2).
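One way to read the Type II procedure is as an outer loop around any solver for (2): a Type II entry only states that the titre is below a threshold, so its working target is the current prediction when that prediction is already below the threshold, and the threshold otherwise. The Python sketch below illustrates this reading; it is an assumption about the form of (5), and `complete_with_type2`, `row_mean_solver`, and the argument layout are hypothetical names introduced for illustration.

```python
import numpy as np

def row_mean_solver(targets):
    """Toy stand-in for a solver of (2): predict every entry by its row mean."""
    means = np.nanmean(targets, axis=1, keepdims=True)
    return np.broadcast_to(means, targets.shape).copy()

def complete_with_type2(values, is_type2, solver, outer_iters=10):
    """Refit `solver` while re-deriving targets for Type II entries.

    values   : float array; Type I cells hold titres, Type II cells hold
               their threshold, Type III cells are NaN.
    is_type2 : boolean mask of the Type II cells.
    solver   : any routine mapping a target matrix to a completed matrix.
    """
    estimate = solver(values.copy())
    for _ in range(outer_iters):
        targets = values.copy()
        # Keep predictions that already respect "below threshold";
        # otherwise pull the entry back to the threshold.
        targets[is_type2] = np.minimum(estimate[is_type2], values[is_type2])
        estimate = solver(targets)
    return estimate
```

With this construction a Type II entry never pulls a prediction upward toward its threshold; it only penalizes predictions that exceed it.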

The two parameters,

The performance of matrix completion is evaluated using the following three criteria in this study: root mean squared error (RMSE), correlation coefficient, and biological interpretation. Given

For the temporal based MDS, we can define the distance between two viruses as the Euclidean norm between the rows of the completed HI table corresponding to the two viruses. In evaluation, we use the local pairwise distances among temporally close by viruses because these distances are more reliable. In particular, the local pairwise distances are partitioned into

The correlation coefficient (CC) between two vectors measures the strength and direction of their linear relationship. Let

Clearly, a larger CC value indicates that the two vectors are more closely related. For every two runs, we will have two distance vectors and one CC value. In

The biological interpretation is based on separation and quantification of the reported antigenic variant groups in the influenza antigenic cartography.
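The two numerical criteria above (RMSE on observed entries and the correlation coefficient between pairwise-distance vectors from two runs) can be sketched in Python as follows. The function names are hypothetical, and distances are taken as Euclidean norms between rows, as described for the completed HI table.

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean squared error over the entries where `observed` is not NaN."""
    mask = ~np.isnan(observed)
    return float(np.sqrt(np.mean((predicted[mask] - observed[mask]) ** 2)))

def pairwise_distances(rows):
    """Euclidean distances between all pairs of rows, as a flat vector (i < j)."""
    diff = rows[:, None, :] - rows[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(rows.shape[0], k=1)
    return dist[upper]

def correlation_coefficient(x, y):
    """Pearson correlation coefficient between two vectors."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))
```

For a robustness check, `pairwise_distances` would be applied to the 2D coordinates from two independent runs and the two resulting vectors passed to `correlation_coefficient`; because only distances are compared, the value is unaffected by rotation or translation of either map.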

In order to reduce the temporal bias in HI matrices, we adopt a sliding window approach in the matrix completion step. The rationale for sliding window matrix completion is that the temporal bias effect is much smaller in temporally grouped sub-matrices than in the entire HI matrix, so the effect of temporal bias is reduced when we complete each sub-matrix separately. Therefore, in our approach, low rank matrix completion is performed separately in each HI sub-matrix. In order to complete the entire matrix, the algorithm slides yearly along both the antigen and antiserum dimensions to ensure that the time difference between any antigen and antiserum is within a certain window size

Multidimensional scaling (MDS) is a statistical technique widely used in information visualization. It embeds a set of data into low-dimensional vectors while preserving their pairwise distances. The projection of viruses into two- or three-dimensional space can be viewed as an analog of geographic cartography; thus it is referred to as influenza antigenic cartography. Due to the temporal bias effect in HI tables, we have to incorporate a temporal model into the MDS algorithm to reconstruct global distances more accurately. In this work, we consider a biologically motivated temporal regularization criterion. The regularization in our temporal model is based on the fact that the influenza antigens continue to evolve under the accumulating immune pressures of human population

This intuition is mathematically incorporated into our temporal regularization condition. Specifically we assume that on the 2D cartography, influenza viruses tend to evolve along (approximate) straight-line segments during short time spans; that is, they tend to evolve in directions as far away from recently appeared viruses as possible. The concrete mathematical formulation is described below.

First we denote by

The first term is the standard MDS. The second term means that viruses within each group should be close to each other. The third term is the mathematical formulation that formalizes the biological intuition that viruses tend to evolve along straight-line segments during short time spans.
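The three-term structure just described can be illustrated with a small Python sketch of an error function. The exact functional forms and weights of the paper's error function are not reproduced here; in particular, measuring straightness by the second difference of consecutive group centroids, the weights `alpha` and `beta`, and the name `temporal_mds_loss` are assumptions made for illustration.

```python
import numpy as np

def temporal_mds_loss(coords, target_dist, groups, alpha=1.0, beta=1.0):
    """Illustrative three-term error for a temporally regularized MDS.

    coords      : (n, 2) map coordinates of the viruses.
    target_dist : (n, n) symmetric target distances, NaN where unknown.
    groups      : (n,) integer time-group index per virus, 0..G-1 in
                  temporal order, every group non-empty (assumed).
    """
    # Term 1: standard MDS stress on the known distances.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    known = ~np.isnan(target_dist)
    stress = np.sum((dist[known] - target_dist[known]) ** 2) / 2.0  # pairs counted twice

    # Term 2: viruses in the same group stay close to their group centroid.
    n_groups = int(groups.max()) + 1
    centroids = np.stack([coords[groups == g].mean(axis=0) for g in range(n_groups)])
    within = np.sum((coords - centroids[groups]) ** 2)

    # Term 3: consecutive centroids should lie near a straight line,
    # measured here by their second difference (an assumed form).
    second_diff = centroids[2:] - 2.0 * centroids[1:-1] + centroids[:-2]
    straight = np.sum(second_diff ** 2)

    return float(stress + alpha * within + beta * straight)
```

Such a function could be passed to a generic optimizer over `coords`; the constraints on global distances derived from Type II data would have to be added separately.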

Besides the above error function, we impose constraints on global distances that can be derived from the original dataset. We know that each reference antiserum is associated to an antigen. Let the reference antiserum be

The metric MDS method is developed by Smith

Similar to AGD, this algorithm also requires a pre-defined dimension (rank)

We are grateful for the critical comments from three anonymous reviewers, which helped improve this study very much. Approved for publication as Journal Article No. J11876 of the Mississippi Agricultural and Forestry Experiment Station, Mississippi State University.