I have read the journal's policy and the authors of this manuscript have the following competing interests: JS declares partial ownership of SK Analytics. TKY and SK declare no competing interests.

Recent research has produced a number of methods for forecasting seasonal influenza outbreaks. However, differences among the predicted outcomes of competing forecast methods can limit their use in decision-making. Here, we present a method for reconciling these differences using Bayesian model averaging. We generated retrospective forecasts of peak timing, peak incidence, and total incidence for seasonal influenza outbreaks in 48 states and 95 cities using 21 distinct forecast methods, and combined these individual forecasts to create weighted-average superensemble forecasts. We compared the relative performance of these individual and superensemble forecast methods by geographic location, timing of forecast, and influenza season. We find that, overall, the superensemble forecasts are more accurate than any individual forecast method and less prone to producing a poor forecast. Furthermore, we find that these advantages increase when the superensemble weights are stratified according to the characteristics of the forecast or geographic location. These findings indicate that different competing influenza prediction systems can be combined into a single more accurate forecast product for operational delivery in real time.

Timely forecasts of infectious disease transmission can help public health officials, health care providers, and individuals better prepare for and respond to disease outbreaks. Work in recent years has led to the development of a number of forecast systems. These systems provide important information on future disease incidence; however, all forecasting systems contain inaccuracies, or error. This error can be reduced by combining information from multiple forecasting systems into a superensemble using Bayesian averaging methods. Here we compare 21 forecasting systems for seasonal influenza outbreaks and use them together to create superensemble forecasts. The superensemble produces more accurate forecasts than the individual systems, improving our ability to predict the timing and severity of seasonal influenza outbreaks.

In the United States, seasonal influenza outbreaks occur every winter; however, the timing and severity of these seasonal outbreaks vary considerably from year to year. Accurate forecast of key characteristics of influenza outbreaks would allow public health agencies to better prepare for and respond to epidemics and pandemics. To this end, there has been significant work in recent years to develop forecasts of seasonal influenza outbreaks[

While some forecast methods may be consistently superior to others, the relative performance of individual forecast methods can also vary according to the specifics of the population, location, or outbreak of interest. For example, a forecast method developed for a densely populated urban area may be less accurate in a sparse rural setting. Alternatively, some forecast methods may be better suited for prediction of outbreaks that are smaller or larger than typically observed. The optimal forecast method may also depend on when in the season the forecast is being generated.

Previous comparisons of subsets of the individual forecast methods used in this study have indeed revealed differences in forecast performance. Yang et al. [

In weather and climate forecasting, discordant forecasts from competing models are combined into superensemble forecasts in order to offset the biases of each individual model. The resulting superensemble forecasts are more accurate than forecasts from any one model form [

Here, we compare the accuracy of a suite of 21 competing forecast methods, as well as a superensemble forecast, in retrospective forecasts of influenza epidemics (See

We find that the superensemble forecasts are more accurate than any individual forecast system, and that this advantage increases when superensemble weights are stratified according to the characteristics of the forecast, or by geographic location.

For the ensemble and particle filter systems, observations of influenza incidence from the beginning of the season to the week of forecast initiation are used to optimize each of the 4 mathematical models. The optimized model is then run to the end of the season to generate a weekly forecast of influenza incidence, as measured by ILI+, an estimate of influenza positive patients per 100 patient visits to outpatient health care providers (see

Thus, each individual forecast method produces a weekly estimate of ILI+ from the time of forecast through the end of the influenza season. These trajectories were used to calculate key characteristics of each outbreak: the peak incidence, peak timing, and total incidence. Forecast skill was assessed based on the mean absolute error (MAE) in predictions of each of these metrics.
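To make the scoring concrete, the three target metrics and the MAE can be computed from a weekly ILI+ trajectory as in the following sketch (the function names and toy trajectory are illustrative, not the authors' code):

```python
import numpy as np

def outbreak_metrics(trajectory):
    """Summarize a weekly ILI+ trajectory: peak incidence, peak week
    (0-indexed), and total incidence."""
    trajectory = np.asarray(trajectory, dtype=float)
    return {
        "peak_incidence": float(trajectory.max()),
        "peak_week": int(trajectory.argmax()),
        "total_incidence": float(trajectory.sum()),
    }

def mean_absolute_error(predicted, observed):
    """MAE of a set of point predictions against observed values."""
    return float(np.mean(np.abs(np.asarray(predicted, dtype=float)
                                - np.asarray(observed, dtype=float))))

# Toy example: a 20-week season peaking in week 8 at ILI+ of 4
observed = 4.0 * np.exp(-0.5 * ((np.arange(20) - 8) / 3.0) ** 2)
metrics = outbreak_metrics(observed)
# metrics["peak_week"] → 8
```

Skill for a forecast method is then the MAE of its point predictions of each metric, averaged over forecasts.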

Distributions of seasonal influenza outbreak peak timing, peak ILI+ and total ILI+ across 48 states and 97 cities are shown in

Boxplots show the range of metric values observed each season across 97 cities and 48 states. ILI+ measures the number of influenza positive patients per 100 patient visits to outpatient health care providers.

The baseline level MAE for each metric is the corresponding score of the baseline superensemble forecast (listed in

Line plots show MAE for forecast peak week (top row), peak ILI+ (middle row) and total ILI+ (bottom row), averaged over all locations. Each line shows the MAE of an individual forecast method. For the model-filter systems, colors indicate groupings by filter type.

For the initial baseline set of superensemble forecasts, a single set of weights for each target metric was applied to the competing individual forecasts each season (

The colors indicate the weight assigned to an individual forecast system specified by the horizontal axes and the influenza season specified by the vertical axes. For example, the upper-right square in each subplot shows the weight assigned to the EAKF-SEIR forecast system for the 2005–2006 influenza season.

The MAEs of superensemble forecasts are shown in

Forecast error averaged over all seasons, forecast weeks, and locations.

| Weighting scheme | Peak week MAE | Peak ILI+ MAE | Total ILI+ MAE |
|---|---|---|---|
| Baseline | 1.85 | 1.19 | 0.47 |
| Week | 1.72 | 1.18 | 0.47 |
| HHS region | 1.86 | 1.15 | 0.45 |
| Forecast lead | 1.74 | 1.19 | 0.46 |
| Actual lead | 1.39 | 1.13 | 0.46 |
| Population density | 1.87 | 1.18 | 0.47 |
| Population size | 1.85 | 1.18 | 0.47 |

We then stratified the weights by geographical region, forecast week, lead time relative to predicted peak, lead time relative to observed outbreak peak, population size and population density. These variables were pre-specified on the basis of previous work indicating that they may influence the accuracy of individual forecast methods (for example [

Each line shows the results of one forecast, with grey dotted lines representing the 21 individual forecasts and colored lines representing superensemble forecasts. SE-baseline refers to the baseline superensemble forecast, while SE-week, SE-region, SE-forlead and SE-actlead refer to superensemble forecasts with weights stratified by forecast week, HHS region, lead relative to predicted peak, and lead relative to observed peak, respectively.

Averaged within influenza seasons, the superensemble forecasts generally outperformed individual forecasts (

Stratifying superensemble weights by forecast week or lead time relative to predicted peak, which can serve as real-time proxies for actual lead time, led to decreases in MAE for peak timing, but had only a small effect on forecasts of peak and total incidence. Meanwhile, stratifying by geographical region and population density led to small decreases in MAE for peak incidence, but degraded forecast accuracy for peak timing (

We further assessed the accuracy of superensemble forecast credible intervals by determining the fraction of observations falling within the credible intervals specified by each forecast. In a well-calibrated forecast, we would expect this fraction to correspond to the value of the credible intervals; for example, 95% of observed outcomes should fall within the 95% credible intervals of a forecast method. Overall, the forecasts were well-calibrated. The calibration varied between the three target metrics, as well as between choices of stratification variable for superensemble weighting (
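The calibration check described above amounts to computing empirical coverage: the fraction of observed outcomes that land inside their stated credible intervals. A minimal sketch (variable names are ours):

```python
import numpy as np

def empirical_coverage(observed, lower, upper):
    """Fraction of observations falling inside their forecast credible
    intervals; a well-calibrated 95% interval should yield ~0.95."""
    observed = np.asarray(observed, dtype=float)
    inside = (observed >= np.asarray(lower, dtype=float)) \
             & (observed <= np.asarray(upper, dtype=float))
    return float(inside.mean())

# Toy check: 3 of 4 observations fall inside their intervals
cov = empirical_coverage([1.0, 2.0, 3.0, 10.0],
                         lower=[0.5, 1.5, 2.5, 3.5],
                         upper=[1.5, 2.5, 3.5, 4.5])
# cov → 0.75
```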

In addition to comparing forecast errors averaged over many forecasts, we also compiled a ranking of individual forecast outcomes. For each forecast, we ranked the 21 point-estimates from the individual forecasts and the resulting baseline superensemble forecast from 1 (lowest absolute error) through 22 (highest absolute error) and summed the frequency of each ranking (

The colors indicate the frequency of each forecast ranking, with rank 1 assigned to the most accurate forecast and rank 22 assigned to the least accurate forecast. More optimal forecasts have a higher frequency of top rankings (warm colors on the left) and a lower frequency of bottom rankings (cold colors on the right). The upper image shows peak week, the middle shows peak incidence, and the bottom shows total incidence. This analysis includes all forecasts of total season incidence (n = 22640), and all forecasts made at or prior to the observed outbreak peak for peak week and peak incidence (n = 15187). Forecasts made after the peak had been observed were excluded from the ranking, as were forecasts where all 22 forecasts predicted the same outcome.

The superensemble gave the most accurate forecast 8.6% of the time for peak timing, more frequently than 12 of the individual forecast methods, but less frequently than BWO and most model-filter combinations using SEIR and SEIRS models. While the BWO, pMCMC-SEIR and pMCMC-SEIRS forecasts of peak timing had the most frequent first place rankings, they also produced many predictions that received the lowest rankings. In contrast, the superensemble forecast, as well as the SEIR and SEIRS models coupled with the EAKF, EKF, and RHF ensemble filter methods, had few low rankings. When superensemble weights were stratified by lead time of the forecast relative to the observed outbreak peak, the resulting forecast rankings improved dramatically, with 54.3% of superensemble forecasts receiving a rank of 1 through 4, surpassing all individual forecast methods (

In forecasts of peak incidence, the baseline superensemble forecast gave the best prediction 10.8% of the time, which was more frequent than all individual forecasts except BWO, which was the highest ranked method for 17.1% of predictions. The superensemble was among the 4 worst forecasts less than 0.5% of the time, compared to 17.7% of BWO forecasts (

For forecasts of total incidence, the baseline superensemble forecast had an average number of first place forecasts (5%), and was most often ranked between 7 and 13. However, as with the other two metrics, the superensemble had far fewer low rankings than any individual method. Among individual ensemble models, those using the SEIRS structure were ranked highest. BWO, pMCMC-SEIR, pMCMC-SEIRS, PF-SEIR and PF-SEIRS had frequent rankings in both the top 4 and the bottom 4.

Disagreement between competing forecasts of infectious disease outbreaks presents an obstacle to the interpretation and utilization of such forecasts. Here, we have presented a method for reconciling the disagreement among forecasts, while simultaneously improving overall forecast accuracy. We have shown that overall forecast accuracy for the timing and magnitude of peak influenza transmission is improved by combining individual forecasts into weighted-average superensemble forecasts. These superensemble forecasts were, on average, more accurate than any individual forecast method. In particular, the superensemble was less prone to producing a poor forecast. The advantage of the superensemble approach increases in circumstances where the relative accuracy of individual forecasts varies according to characteristics of the outbreak, or the location being forecast.

The 21 individual forecast methods compared in this study varied in their performance, as well as their reliability. The SEIRS dynamical model coupled with the EKF, EAKF and RHF ensemble filter methods was consistently among the better individual forecasts. Other forecast methods, namely pMCMC-SEIRS and pMCMC-SEIR, and to a lesser extent BWO, performed inconsistently: their predictions were either among the best or the worst of the competing forecasts. This type of inconsistent performance presents a challenge to the superensemble approach, as the accurate forecasts can cause such a method to receive a relatively high weighting in the superensemble. However, by identifying the circumstances that lead to differences in relative forecast performance, adaptive weighting can be employed to weight an individual forecast method highly when it is prone to perform well and to discount it in other circumstances. Here we found that the performance of individual forecast methods varied according to geographic location, influenza season, and the timing of the forecast.

We found that the timing of the forecast with respect to the outbreak peak was an important factor in determining relative forecast accuracy; consequently, stratifying superensemble weights by the actual lead time of the forecasts led to improvements in superensemble forecast accuracy. This improvement outweighed the gains made by simply eliminating the two most inconsistent forecast methods (pMCMC-SEIR and pMCMC-SEIRS). While the idealized process of weighting individual forecasts by actual forecast lead is not possible in real time, stratifying weights by forecast-predicted lead or simply calendar week, which can serve as real-time proxies for actual forecast lead, proved beneficial in improving forecast accuracy.

Stratifying superensemble weights by HHS region improved forecasts of peak and total incidence. This benefit may be related to regional differences in baseline and seasonal levels of influenza activity, or could be reflective of differences in the progression of influenza among regions. These findings provide a robust methodology for generating superensemble forecasts of influenza and other infectious diseases; however, the superensemble weights and the optimal stratification partitioning must be continually reevaluated and updated as new years of data become available, or as the geographical scale or resolution are altered.

While this study focused on the forecast of point estimate outcomes, the superensemble approach can also be used to produce probability distribution functions of target metrics. For example, the superensemble forecasts presented here were associated with reasonable credible intervals, which were influenced by the choice of stratification variable (

The algorithm used to create the superensemble is flexible, and can combine any number and any type of individual forecast method, provided retrospective forecasts are available for superensemble training. These findings can thus be applied operationally to competing forecasts of infectious disease in order to improve forecast accuracy, and to present a streamlined prediction to public health decision-makers.

Regional influenza activity is monitored by the U.S. Centers for Disease Control and Prevention (CDC) through the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). The CDC provides weekly near real-time estimates of regional influenza-like illness (ILI), defined as the number of patients with flu-like symptoms (fever with sore throat and/or cough) divided by the total number of patient visits at ILINet outpatient health care facilities, in order to account for temporal and spatial variability in patient volume and reporting rates of health care providers [

ILI is not specific to influenza, as it encompasses a range of respiratory infections. The World Health Organization and National Respiratory and Enteric Virus Surveillance System (NREVSS) provide weekly reports of the proportion of laboratory-confirmed positive tests for influenza virus. A more specific estimate of influenza activity can be obtained by multiplying ILI with corresponding weekly regional viral isolation information, resulting in a measure we refer to as ILI+, defined as the number of influenza positive patients per 100 patient visits. As in our previous studies, we use city and state level GFT ILI estimates multiplied by regional NREVSS viral isolation rates to obtain ILI+, as the metric of observed influenza incidence [
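The ILI+ calculation itself is a simple product of the two weekly series; for example (illustrative values, not surveillance data):

```python
import numpy as np

def ili_plus(ili, positivity):
    """ILI+ = ILI (flu-like patients per 100 visits) scaled by the weekly
    fraction of laboratory tests positive for influenza (0-1)."""
    return np.asarray(ili, dtype=float) * np.asarray(positivity, dtype=float)

# Toy example: 5 ILI per 100 visits with 40% test positivity gives ILI+ of 2
weekly = ili_plus([5.0, 6.0], [0.4, 0.5])
# weekly → array([2., 3.])
```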

We produced weekly forecasts of three target metrics for each influenza outbreak: the highest observed weekly ILI+ (peak incidence); the week during which peak ILI+ occurred (peak week); and the total ILI+ over the influenza season, which we define as a 20-week period beginning on the 45^{th} calendar week of the year (total incidence).

Weekly forecasts of influenza outbreak trajectories, outbreak peak ILI+, total ILI+ and outbreak peak timing were produced using 21 different forecast methods. Of the 21 methods, 20 are variations of a mathematical model of disease transmission coupled with a data assimilation, or filtering, method, and 1 is a statistical model based on historically observed outbreaks. We generated retrospective forecasts for 95 cities and the 48 contiguous United States with available records during the 2005–2006 through 2014–2015 influenza seasons, excluding the pandemic seasons of 2008–2009 and 2009–2010. Pandemic seasons were excluded because the individual forecast systems used in this study were designed specifically for seasonal outbreaks. While the model-filter forecasts could be adapted to forecast pandemics or irregularly timed outbreaks (for example [

Our previous studies using model-filter forecasts of influenza feature a humidity-forced Susceptible-Infectious-Recovered-Susceptible (SIRS) model to simulate influenza transmission [

The transmission rate is derived from the basic reproductive number, R0, which is modulated daily by humidity; R0min and R0max are the minimum and maximum daily basic reproductive numbers, respectively.

In addition to the SIRS model, we consider three alternate model structures: Susceptible-Infectious-Recovered (SIR); Susceptible-Exposed-Infectious-Recovered (SEIR); and Susceptible-Exposed-Infectious-Recovered-Susceptible (SEIRS). The Exposed compartment in the SEIR and SEIRS models represents a latent period of infection. The equations describing these additional model structures are provided in
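As a point of reference, a deterministic SEIRS model of the kind described here can be stepped forward as in the sketch below. Note this uses a constant transmission rate for simplicity, whereas the models in this study are humidity-forced; the parameter values are arbitrary and the function is ours, not the authors' code.

```python
def simulate_seirs(beta, latent_d, infectious_d, immune_d,
                   s0, e0, i0, n, weeks):
    """Minimal deterministic SEIRS with daily Euler steps, aggregated to
    weekly incidence (new symptom onsets per week). beta is the daily
    transmission rate; latent_d, infectious_d and immune_d are the mean
    latent, infectious and immune periods in days."""
    s, e, i = s0, e0, i0
    weekly = []
    for _ in range(weeks):
        new_cases = 0.0
        for _ in range(7):                      # 7 daily steps per week
            r = n - s - e - i                   # recovered (immune)
            exposures = beta * s * i / n        # S -> E flow
            onsets = e / latent_d               # E -> I flow
            s += -exposures + r / immune_d      # waning immunity refills S
            e += exposures - onsets
            i += onsets - i / infectious_d
            new_cases += onsets
        weekly.append(new_cases)
    return weekly

# Arbitrary parameters: R0 = beta * infectious_d = 2.4
traj = simulate_seirs(beta=0.6, latent_d=2.0, infectious_d=4.0,
                      immune_d=365.0, s0=9000.0, e0=50.0, i0=50.0,
                      n=10000.0, weeks=20)
```

The resulting trajectory rises to a single peak and declines within the 20-week season, the qualitative shape the forecast systems fit to observed ILI+.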

The mathematical models described above are coupled with filter methods for the weekly assimilation of ILI+ observations and optimization of the model state variables and parameters. These filters use Bayesian inference to calculate the posterior conditional distribution of parameters and state variables given observed ILI+, based on the prior distribution of the model ensemble and the conditional distribution of observed ILI+ given model parameters and state variables.

Five different data assimilation methods were used with each of the four model structures. These consist of three ensemble filter methods—the ensemble Kalman filter (EKF)[
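To illustrate the general idea of an ensemble filter update, the following is a perturbed-observation ensemble Kalman filter analysis step for a single scalar observation. It is a generic sketch, not the specific EKF, EAKF, or RHF implementations compared here, and all names are ours.

```python
import numpy as np

def enkf_update(ensemble, obs, obs_var, h_index=0, rng=None):
    """One perturbed-observation EnKF analysis step for a scalar observation.
    `ensemble` is (n_members, n_state); the observed variable is state
    column `h_index` (e.g. modeled ILI+)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(ensemble, dtype=float)
    hx = x[:, h_index]                           # modeled observations
    cov_xy = np.cov(x.T, hx)[:-1, -1]            # state-observation covariances
    var_y = hx.var(ddof=1)
    gain = cov_xy / (var_y + obs_var)            # Kalman gain (one per state)
    perturbed = obs + rng.normal(0.0, np.sqrt(obs_var), size=x.shape[0])
    # Nudge each member toward its perturbed copy of the observation
    return x + np.outer(perturbed - hx, gain)

rng = np.random.default_rng(1)
prior = rng.normal(loc=[5.0, 2.0], scale=[1.0, 0.5], size=(200, 2))
posterior = enkf_update(prior, obs=3.0, obs_var=0.01, rng=rng)
# The posterior mean of the observed variable moves toward the observation
```

Unobserved state variables and parameters are adjusted through their sample covariance with the observed variable, which is how these filters "nudge" the full model state each week.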

The combined model-filter system recursively updates the model state variable and parameter estimates with each new weekly ILI+ observation. Through this process, these estimates of the observed and unobserved variables and parameters values are nudged closer to their true values. Following the assimilation of the most recent weekly ILI+ observation, the optimized set of model simulations is propagated forward in free simulation through the remainder of the season, producing a forecast of future ILI+ observations, from which we calculate the predicted timing and magnitude of peak influenza incidence, and total incidence.

The final forecast method, which we call Bayesian weighted outbreaks (BWO), is a statistical method that uses Bayesian model averaging to describe the trajectory of ILI+ for the outbreak of interest as a weighted average of outbreak trajectories from prior seasons. This method has been used in weather forecasting[

The pool of candidate trajectories used in the BWO retrospective forecasts included ILI+ observations from all 145 locations (48 states and 97 cities) available for the seasons prior to the season being forecast (e.g. retrospective forecasts of the 2005–2006 season considered ILI+ trajectories from 2003–2004 and 2004–2005—seasons 1 and 2), excluding the pandemic years of 2008–2009 and 2009–2010. More generally, retrospective BWO forecasts of influenza season

Superensemble forecasts were created for the 2005–2006 through 2014–2015 influenza seasons (excluding the pandemic seasons 2008–2009 and 2009–2010) by taking the weighted average of the 21 individual weekly forecasts for each location. The superensemble weights, which dictate the contribution of each individual forecast to the superensemble, are determined using maximum likelihood estimation of the conditional probability distribution function (PDF) over a selected number of training forecasts:
p(y_m | f_{1,m}, …, f_{21,m}) = Σ_k w_k g_k(y_m | f_{k,m}),

where w_k is the probability that individual forecast method k provides the most accurate forecast, and g_k(y_m | f_{k,m}) is the conditional PDF of the observed outcome y_m given that f_{k,m} is the most accurate forecast of y_m. This conditional PDF is assumed to be normal with mean f_{k,m} and standard deviation σ. For simplicity, σ is assumed equal for all individual forecasts, and is determined through the maximum likelihood estimation of the weights w_k, which serve as the superensemble weights (see Raftery et al.[
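This maximum likelihood estimation can be carried out with an EM algorithm, as in Raftery et al.'s BMA formulation. The following is a compact sketch under the equal-σ assumption stated above, not the authors' exact implementation:

```python
import numpy as np

def bma_weights(forecasts, observed, iters=200):
    """EM estimation of BMA weights w_k and a shared variance sigma^2.
    `forecasts` is (n_methods, n_training); `observed` is (n_training,)."""
    f = np.asarray(forecasts, dtype=float)
    y = np.asarray(observed, dtype=float)
    k, m = f.shape
    w = np.full(k, 1.0 / k)
    sigma2 = np.mean((y - f) ** 2)               # crude initial variance
    for _ in range(iters):
        # E-step: responsibility of each method for each training forecast
        dens = np.exp(-0.5 * (y - f) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        z = w[:, None] * dens
        z /= z.sum(axis=0, keepdims=True)
        # M-step: update weights and the shared variance
        w = z.mean(axis=1)
        sigma2 = (z * (y - f) ** 2).sum() / m
    return w, sigma2

# Toy training set: method 0 tracks the truth, method 1 carries a large bias
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = np.vstack([y + 0.1, y + 2.0])
w, s2 = bma_weights(f, y)
# w[0] ≫ w[1]: the accurate method receives nearly all of the weight
```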

Superensemble weights for season N are trained on forecasts of seasons 1 through N-1. The BWO training forecasts, f_{BWO,m}, are produced using a leave-one-out approach: training forecasts for each season from 1 through N-1 were constructed using candidate trajectories from all other seasons between 1 and N-1. The number and diversity of candidate trajectories increases over time, as each subsequent year adds 145 additional ILI+ trajectories to the pool. The model-filter forecasts do not use historical observations, and thus do not require a leave-one-out approach for training forecasts.

The superensemble weights w_k are then applied to the point estimates of the target metric from the 21 individual forecasts, f_k, for the time

The probability distribution function of the superensemble forecast is thus a weighted mixture of normal distributions, each centered on an individual forecast point estimate with variance σ^2 [

The baseline superensemble forecasts were made by applying a single set of superensemble weights across all locations and times within an influenza season. However, based on previous analyses of individual forecast system performance (e.g. [

A final set of forecasts was produced by stratifying superensemble weights by lead time relative to the actual peak (weeks between the week of forecast initiation and the week of the true peak). While this weighting scheme could not be implemented in an operational real-time forecast, it is useful to know how the superensemble would perform under this idealized condition, as this may represent an upper bound to improvements that can be achieved using a weighting scheme based on forecast timing.

The method for stratifying superensemble weights consisted of dividing forecasts into bins according to the variable of interest. We then obtained weights for each bin by including only training forecasts falling into that bin in the algorithm described in

Forecasts stratified by lead time (actual or predicted) were grouped using the following bin edges, where negative values indicate weeks prior to the peak and positive numbers are weeks after the peak: [<-8, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5–8, 9–12, >12]. We selected a fine resolution of 1 week around peak, with wider bins at either end, as fewer outbreaks have lead times in these categories. Actual lead time is simply the week of the forecast minus the week that peak is eventually observed. Forecast predicted lead time is calculated by taking the mean prediction of peak week from the 21 individual forecasts, and subtracting this mean value from the week of forecast.
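This binning can be implemented with a simple digitize step. The sketch below assumes the open-ended first and last bins described above, with bin indices ordered from the "<-8" bin upward; the helper name is ours.

```python
import numpy as np

# Interior bin edges for lead time relative to peak (weeks; negative =
# before peak). The 17 resulting bins are: <-8, -8, -7, ..., 4, 5-8,
# 9-12, >12, matching the grouping described in the text.
edges = np.array([-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 9, 13])

def lead_bin(lead_weeks):
    """Map an actual or predicted lead time to a stratification bin index
    (0 = the '<-8' bin, 16 = the '>12' bin)."""
    return int(np.digitize(lead_weeks, edges))

# lead_bin(-9) → 0 (the '<-8' bin); lead_bin(0) → its own 1-week bin
```

Weights are then fit separately within each bin, using only the training forecasts whose lead time falls in that bin.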

Geographic regions were grouped according to the ten US Health and Human Services Regions, as these are the standard geographical groupings used by the CDC to describe influenza activity. Calendar week groupings were delineated by individual weeks. Population density and population size were arbitrarily binned into quintiles for cities and terciles for states.

(PDF)

Each line shows the results of one forecast, with grey dotted lines representing the 21 individual forecasts and colored lines representing superensemble forecasts. SE-baseline refers to the baseline superensemble forecast, whereas SE-week, SE-region, SE-forlead and SE-actlead refer to superensemble forecasts with weights stratified by forecast week, HHS region, lead relative to predicted peak, and lead relative to observed peak, respectively.

(TIF)

The weekly SE-baseline and SE-week forecasts are shown for a sample outbreak. 95% credible intervals are indicated by the shaded areas.

(TIF)

The points on the graph show the percent of observations falling within the specified credible intervals of the superensemble forecasts.

(TIF)

Same as

(TIF)