The authors have declared that no competing interests exist.

Real-time forecasts based on mathematical models can inform critical decision-making during infectious disease outbreaks. Yet, epidemic forecasts are rarely evaluated during or after the event, and there is little guidance on the best metrics for assessment. Here, we propose an evaluation approach that disentangles different components of forecasting ability using metrics that separately assess the calibration, sharpness and bias of forecasts. This makes it possible to assess not just how close a forecast was to reality but also how well uncertainty has been quantified. We used this approach to analyse the performance of weekly forecasts we generated in real time for Western Area, Sierra Leone, during the 2013–16 Ebola epidemic in West Africa. We investigated a range of forecast model variants based on the model fits generated at the time with a semi-mechanistic model, and found that good probabilistic calibration was achievable at short time horizons of one or two weeks ahead but model predictions were increasingly unreliable at longer forecasting horizons. This suggests that forecasts may have been of good enough quality to inform decision making based on predictions a few weeks ahead of time but not longer, reflecting the high level of uncertainty in the processes driving the trajectory of the epidemic. Comparing forecasts based on the semi-mechanistic model to simpler null models showed that the best semi-mechanistic model variant performed better than the null models with respect to probabilistic calibration, and that this would have been identified from the earliest stages of the outbreak. As forecasts become a routine part of the toolkit in public health, standards for evaluation of performance will be important for assessing quality and improving credibility of mathematical models, and for elucidating difficulties and trade-offs when aiming to make the most useful and reliable forecasts.

During epidemics, reliable forecasts can help allocate resources effectively to combat the disease. Various types of mathematical models can be used to make such forecasts. In order to assess how good the forecasts are, they need to be compared to what really happened. Here, we describe different approaches to assessing how good forecasts were that we made with mathematical models during the 2013–16 West African Ebola epidemic, focusing on one particularly affected area of Sierra Leone. We found that, using the type of models we used, it was possible to reliably predict the epidemic for a maximum of one or two weeks ahead, but no longer. Comparing different versions of our model to simpler models, we further found that it would have been possible to determine the model that was most reliable at making forecasts from early on in the epidemic. This suggests that there is value in assessing forecasts, and that it should be possible to improve forecasts by checking how good they are during an ongoing epidemic.

Forecasting the future trajectory of cases during an infectious disease outbreak can make an important contribution to public health and intervention planning. Infectious disease modellers are now routinely asked for predictions in real time during emerging outbreaks [

The growing importance of infectious disease forecasts is epitomised by the growing number of so-called forecasting challenges. In these, researchers compete in making predictions for a given disease and a given time horizon. Such initiatives are difficult to set up during unexpected outbreaks, and are therefore usually conducted on diseases known to occur seasonally, such as dengue [

In theory, infectious disease dynamics should be predictable within the timescale of a single outbreak [

The most recent example of large-scale outbreak forecasting efforts was during the 2013–16 Ebola epidemic, which vastly exceeded the burden of all previous outbreaks with almost 30,000 reported cases resulting in over 10,000 deaths in the three most affected countries: Guinea, Liberia and Sierra Leone. During the epidemic, several research groups provided forecasts or projections at different time points, either by generating scenarios believed plausible, or by fitting models to the available time series and projecting them forward to predict the future trajectory of the outbreak [

Traditionally, epidemic forecasts are assessed using aggregate metrics such as the mean absolute error (MAE) [

We produced weekly sub-national real-time forecasts during the Ebola epidemic, starting on 28 November 2014. Plots of the forecasts were published on a dedicated web site and updated every time a new set of data were available [

Here, we apply assessment metrics that elucidate different properties of forecasts, in particular their probabilistic calibration, sharpness and bias. Using these methods, we retrospectively assess the forecasts we generated for Western Area in Sierra Leone, an area that saw one of the highest case counts in the region and where our model informed bed capacity planning.

This study has been approved by the London School of Hygiene & Tropical Medicine Research Ethics Committee (reference number 8627).

Numbers of suspected, probable and confirmed Ebola cases at sub-national levels were initially compiled from daily

We used a semi-mechanistic stochastic model of Ebola transmission described previously [. In this model, β_t is the time-varying transmission rate, which evolves according to a Wiener process on a log scale, and the basic reproduction number R_{0,t} at any time was obtained by multiplying β_t with the average infectious period. In fitting the model to the time series of cases we extracted posterior predictive samples of trajectories, which we used to generate forecasts.

Each week, we fitted the model to the available case data leading up to the date of the forecast. Observations were assumed to follow a negative binomial distribution. The estimated parameters were the basic reproduction number R_0 (uniform prior within (1, 5)), the initial number of infectious people (uniform prior within (1, 400)), the overdispersion of the (negative binomial) observation process (uniform prior within (0, 0.5)) and the volatility of the time-varying transmission rate (uniform prior within (0, 0.5)). We confirmed from the posterior distributions of the parameters that these priors did not set any problematic bounds. Samples of the posterior distribution of parameters and state trajectories were extracted using particle Markov chain Monte Carlo [

We used the samples of the posterior distribution generated using the Monte Carlo sampler to produce predictive trajectories, using the final values of estimated state trajectories as initial values for the forecasts and simulating the model forward for up to 10 weeks. While all model fits were generated using the same model described above, we tested a range of different predictive model variants to assess the quality of ensuing predictions. We tested variants where trajectories were stochastic (with demographic stochasticity and a noisy reporting process), as well as ones where these sources of noise were removed for predictions. We further tested predictive model variants where the transmission rate continued to follow a random walk (unbounded, on a log-scale), as well as ones where the transmission rate stayed fixed during the forecasting period. When the transmission rate remained fixed for prediction, we tested variants where we used the final value of the transmission rate and ones where this value was averaged over a number of weeks leading up to the final fitted point, to reduce the potential influence of the last time point, at which the transmission rate may not have been well identified. We tested variants where the predictive trajectory started at the last time point, based on the final fitted values, and ones where it started at the penultimate time point, which could, again, be expected to be better informed by the data. For each model and forecast horizon, we generated point-wise medians and credible intervals from the sample trajectories.
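The difference between the fixed and random-walk transmission-rate variants can be illustrated with a small simulation. This is a hedged sketch, not the code used at the time: the simple weekly branching process, the function name `forecast_cases` and all parameter values are illustrative assumptions.

```python
import numpy as np

def forecast_cases(cases0, beta0, volatility, n_weeks, vary_beta, rng):
    """Project weekly case counts forward, with the weekly reproduction
    factor either fixed or following a random walk on the log scale.
    A crude stand-in for forecasting from the fitted semi-mechanistic model."""
    cases, beta = cases0, beta0
    trajectory = []
    for _ in range(n_weeks):
        if vary_beta:
            # 'varying' variants: unbounded Gaussian random walk on log(beta)
            beta *= np.exp(rng.normal(0.0, volatility))
        # demographic stochasticity in the number of new cases
        cases = rng.poisson(beta * cases)
        trajectory.append(int(cases))
    return trajectory

rng = np.random.default_rng(1)
fixed_beta = forecast_cases(100, 0.9, 0.2, 10, vary_beta=False, rng=rng)
varying_beta = forecast_cases(100, 0.9, 0.2, 10, vary_beta=True, rng=rng)
```

Repeating such simulations many times yields a sample of predictive trajectories from which point-wise medians and credible intervals can be computed, as described above.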

To assess the performance of the semi-mechanistic transmission model we compared it to three simpler null models: two representing the constituent parts of the semi-mechanistic model, and a non-mechanistic time series model. For the first null model, we used a deterministic SEIR model with constant transmission rate, fitted to the observed case counts X_t at times t_1, …, t_n. The parameters of this model, including the basic reproduction number R_0, were inferred using Markov-chain Monte Carlo with the same priors as in the semi-mechanistic model.
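As a concrete sketch, a deterministic SEIR model with a constant transmission rate can be written as a small ODE system. The parameter values below (R_0, incubation and infectious periods, population size) are illustrative assumptions, not the values estimated in the study.

```python
import numpy as np
from scipy.integrate import solve_ivp

def seir(t, y, beta, sigma, gamma):
    """Deterministic SEIR equations with constant transmission rate beta."""
    S, E, I, R = y
    N = S + E + I + R
    new_inf = beta * S * I / N
    return [-new_inf, new_inf - sigma * E, sigma * E - gamma * I, gamma * I]

gamma = 1 / 7.0          # illustrative infectious period of ~1 week
sigma = 1 / 10.0         # illustrative incubation period of ~10 days
beta = 1.5 * gamma       # illustrative R0 of 1.5
y0 = [1e6 - 10, 0.0, 10.0, 0.0]  # illustrative population of one million
sol = solve_ivp(seir, (0.0, 300.0), y0, args=(beta, sigma, gamma))
final_size = sol.y[3, -1]  # cumulative recovered at the end of the run
```

Because the transmission rate is constant, a model of this kind produces a single rise-and-fall epidemic curve, which is what makes it a useful mechanistic baseline.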

For the second null model, we used an unfocused model, representing the stochastic component of the semi-mechanistic model without its mechanistic core.

Lastly, we used a null model based on a non-mechanistic Bayesian autoregressive AR(1) time series model, in which the latent expectation at each time step is a scaled, rounded copy of its value at the previous time step, with observed cases following a negative binomial distribution around it. The model parameters, including the autoregressive coefficient α and the parameters governing the latent level Y*, were estimated using Markov-chain Monte Carlo, with […] indicating rounding to the nearest integer. An alternative model with Poisson distributed observations was discarded as it yielded poorer predictive performance.
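The general shape of such a model can be illustrated by simulating from one hypothetical parameterisation; the functional form, the dispersion parameterisation and all values below are assumptions of the sketch, not the exact model that was fitted.

```python
import numpy as np

def simulate_ar1_counts(y0, alpha, size_nb, n_steps, rng):
    """Simulate an AR(1)-type count series: the latent level is a scaled,
    rounded copy of its previous value, observed with negative binomial noise."""
    y_star = float(y0)
    counts = []
    for _ in range(n_steps):
        y_star = np.round(alpha * y_star)   # rounding to the nearest integer
        mean = max(y_star, 1.0)
        p = size_nb / (size_nb + mean)      # numpy's (n, p) parameterisation
        counts.append(int(rng.negative_binomial(size_nb, p)))
    return counts

rng = np.random.default_rng(3)
series = simulate_ar1_counts(y0=80, alpha=0.95, size_nb=10, n_steps=12, rng=rng)
```

With α < 1 the latent level decays geometrically, so forecasts from such a model track the most recent observations rather than any transmission mechanism.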

The deterministic and unfocused models were implemented in

The paradigm for assessing probabilistic forecasts is that they should maximise the sharpness of predictive distributions subject to calibration [

If x_t is the observed data point at time t ∈ {t_1, …, t_n} and F_t is the (continuous) predictive cumulative probability distribution at time t, then the forecasts F_t are said to be ideal if the data are drawn from the predictive distribution, that is if x_t ~ F_t at all times t. In that case, the probability integral transform (PIT) values u_t = F_t(x_t) are distributed uniformly.

In the case of discrete outcomes such as the incidence counts that were forecast here, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:
u_t = P_t(k_t − 1) + v (P_t(k_t) − P_t(k_t − 1)), where k_t is the observed count, P_t(k) is the predictive cumulative probability of observing k cases at time t, P_t(−1) = 0 by definition, and v is standard uniform and independent of k_t. If P_t is the true cumulative probability distribution, then u_t is standard uniform [. We tested the uniformity of the randomised PIT using the Anderson-Darling test. The resulting p-value was a reflection of how compatible the forecasts were with the null hypothesis of uniformity of the PIT, or of the data coming from the predictive probability distribution. We calculated the mean p-value of 10 samples from the randomised PIT and found the corresponding Monte-Carlo error to be negligible (maximum standard deviation: σ_p = 0.003). We considered that there was no evidence to suggest a forecasting model was miscalibrated if the p-value found was greater than a threshold of 0.1, and strong evidence of miscalibration if it was below 0.01.
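A sample-based sketch of the randomised PIT is shown below. The Kolmogorov-Smirnov test from SciPy is used as a stand-in for the Anderson-Darling test (SciPy's `anderson` does not cover the uniform distribution directly), and the Poisson toy forecasts are purely illustrative.

```python
import numpy as np
from scipy import stats

def randomised_pit(samples, observed, rng):
    """Randomised PIT values for count forecasts given predictive samples.
    samples: array of shape (n_times, n_draws); observed: length n_times."""
    u = np.empty(len(observed))
    for t, k in enumerate(observed):
        p_k = np.mean(samples[t] <= k)        # P_t(k_t)
        p_km1 = np.mean(samples[t] <= k - 1)  # P_t(k_t - 1), zero when k_t = 0
        v = rng.uniform()
        u[t] = p_km1 + v * (p_k - p_km1)
    return u

rng = np.random.default_rng(42)
# toy example: observations drawn from the same distribution as the forecasts,
# so the PIT values should look approximately uniform
samples = rng.poisson(30, size=(40, 1000))
observed = rng.poisson(30, size=40)
u = randomised_pit(samples, observed, rng)
pval = stats.kstest(u, "uniform").pvalue
```

For a miscalibrated forecaster (for example, samples drawn with a shifted mean), the PIT values pile up at one end of the unit interval and the p-value collapses.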

All of the following metrics were evaluated at every single data point. To compare the forecast quality of models, the metrics were then averaged across the time series.

We assessed sharpness using the normalised median absolute deviation about the median (MADN) of the predictive distribution, S_t = median(|y − median(y)|) / 0.675, where y represents samples from F_t, and division by 0.675 ensures that if the predictive distribution is normal this yields a value equivalent to the standard deviation. The MAD (i.e., the MADN without the normalising factor) is related to the interquartile range (which, in the limit of infinite sample size, takes twice its value), a common measure of sharpness [. We used samples from the predictive distribution F_t to estimate sharpness.
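A sample-based sketch of the MADN, translating the definition above directly (the normal toy samples are illustrative):

```python
import numpy as np

def madn(samples):
    """Normalised median absolute deviation about the median (sharpness)."""
    med = np.median(samples)
    return np.median(np.abs(samples - med)) / 0.675

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, size=100_000)
sharpness = madn(y)  # close to the standard deviation (2.0) for normal samples
```

A point-mass forecast has a MADN of zero, so smaller values indicate sharper forecasts.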

We further assessed the bias of forecasts, defined for count data as B_t = 1 − (P_t(k_t) + P_t(k_t − 1)), where k_t is the observed count and P_t(k) the predictive cumulative probability of observing k cases at time t. For an unbiased model, the expected value of B_t = 0, whereas a completely biased model would yield either all predictive probability mass above (B_t = 1) or below (B_t = −1) the data.
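A sample-based sketch of this bias metric, using the definition above (the toy forecast samples are illustrative):

```python
import numpy as np

def bias(samples, k):
    """Forecast bias for a count outcome: 1 - (P(k) + P(k - 1)); in [-1, 1]."""
    p_k = np.mean(samples <= k)
    p_km1 = np.mean(samples <= k - 1)
    return 1.0 - (p_k + p_km1)

forecast = np.arange(100, 200)  # predictive samples spread over [100, 199]
print(bias(forecast, 50))   # all mass above the data -> 1.0
print(bias(forecast, 500))  # all mass below the data -> -1.0
```

When the observation sits in the middle of the predictive distribution, the two cumulative probabilities roughly sum to one and the bias is close to zero.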

We further evaluated forecasts using two proper scoring rules, which combine the assessment of calibration and sharpness, each estimated from samples of the predictive distribution F_t.

The ranked probability score (RPS) measures the sum of squared differences between the predictive cumulative distribution and the step-function cumulative distribution of the observation, RPS(P_t, k_t) = Σ_k (P_t(k) − 1(k ≥ k_t))², with smaller values indicating better forecasts. The Dawid-Sebastiani score (DSS) is defined as DSS(F_t, x_t) = ((x_t − μ_t)/σ_t)² + 2 log σ_t, where μ_t and σ_t are the mean and standard deviation of the predictive distribution F_t; again, smaller values are better.
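Sample-based estimators of both scores can be sketched as follows; the toy forecasts are illustrative, and truncating the RPS sum just beyond the sample maximum is an assumption of the sketch (terms beyond it vanish because both cumulative distributions equal one).

```python
import numpy as np

def rps(samples, k_obs):
    """Ranked probability score for a count forecast, estimated from samples."""
    k_max = max(int(samples.max()), int(k_obs)) + 1
    ks = np.arange(k_max + 1)
    cdf = np.array([np.mean(samples <= k) for k in ks])
    obs_cdf = (ks >= k_obs).astype(float)  # step function at the observation
    return float(np.sum((cdf - obs_cdf) ** 2))

def dss(samples, k_obs):
    """Dawid-Sebastiani score: ((x - mu) / sigma)^2 + 2 log(sigma)."""
    mu, sigma = samples.mean(), samples.std()
    return float(((k_obs - mu) / sigma) ** 2 + 2.0 * np.log(sigma))

point_forecast = np.full(1000, 7)  # all predictive mass on 7
print(rps(point_forecast, 7))      # a perfect point forecast scores 0.0
```

Note that the DSS only uses the first two moments of the predictive distribution, which is why the text describes it as evaluated "for comparison" alongside the full-distribution RPS.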

For comparison, we also evaluated forecasts using the absolute error (AE) of the median forecast, AE_t = |k_t − median(F_t)|, which ignores the uncertainty around the central forecast and depends only on the predictive median x̃_t.

All scoring metrics used are implemented in the

The semi-mechanistic model used to generate real-time forecasts during the epidemic was able to reproduce the trajectories up to the date of each forecast, following the data closely by means of the smoothly varying transmission rate (

(A) Final fit of the reported weekly incidence (black line and grey shading) to the data (black dots). (B) Corresponding dynamics of the reproduction number (ignoring depletion of susceptibles). Point-wise median state estimates are indicated by a solid line, interquartile ranges by dark shading, and 90% intervals by light shading. The threshold reproduction number (_{0} = 1), determining whether case numbers are expected to increase or decrease, is indicated by a dashed line. In both plots, a dotted vertical line indicates the date of the first forecast assessed in this manuscript (24 August 2014).

The epidemic lasted for a total of 27 weeks, with forecasts generated starting from week 3. For

(A) Calibration of predictive model variants (p-value of the Anderson-Darling test of uniformity) as a function of the forecast horizon. Shown in dark red is the best calibrated forecasting model variant (corresponding to the second row of

Calibration (p-value of the Anderson-Darling test of uniformity) of deterministic and stochastic predictive model variants starting either at the last data point or one week before, with varying (according to a Gaussian random walk) or fixed transmission rate either starting from the last value of the transmission rate or from an average over the last 2 or 3 weeks, at different forecast horizons up to 4 weeks. The p-values highlighted in bold reflect predictive models with no evidence of miscalibration. The second row corresponds to the highlighted model variant in

Stochasticity | Start | Transmission | Averaged | 1 week | 2 weeks | 3 weeks | 4 weeks
---|---|---|---|---|---|---|---
deterministic | at last data point | varying | no | 0.02 | <0.01 | |
deterministic | at last data point | fixed | no | 0.03 | <0.01 | |
deterministic | at last data point | fixed | 2 weeks | 0.03 | <0.01 | <0.01 |
deterministic | at last data point | fixed | 3 weeks | <0.01 | <0.01 | <0.01 |
deterministic | 1 week before | varying | no | 0.05 | 0.02 | <0.01 | <0.01
deterministic | 1 week before | fixed | no | 0.09 | 0.02 | <0.01 | <0.01
deterministic | 1 week before | fixed | 2 weeks | 0.09 | <0.01 | <0.01 | <0.01
deterministic | 1 week before | fixed | 3 weeks | 0.03 | <0.01 | <0.01 | <0.01
stochastic | at last data point | varying | no | 0.02 | 0.02 | <0.01 | <0.01
stochastic | at last data point | fixed | no | 0.02 | 0.02 | <0.01 | <0.01
stochastic | at last data point | fixed | 2 weeks | 0.01 | <0.01 | <0.01 | <0.01
stochastic | at last data point | fixed | 3 weeks | <0.01 | <0.01 | <0.01 | <0.01
stochastic | 1 week before | varying | no | <0.01 | <0.01 | <0.01 | <0.01
stochastic | 1 week before | fixed | no | <0.01 | <0.01 | <0.01 | <0.01
stochastic | 1 week before | fixed | 2 weeks | <0.01 | <0.01 | <0.01 | <0.01
stochastic | 1 week before | fixed | 3 weeks | <0.01 | <0.01 | <0.01 | <0.01

The calibration of the best semi-mechanistic forecast model variant (deterministic dynamics, transmission rate fixed and starting at the last data point) was better than that of any of the null models (

Metrics shown are (A) calibration (p-value of Anderson-Darling test, greater values indicating better calibration, dashed lines at 0.1 and 0.01), (B) bias (less bias if closer to 0), (C) sharpness (MAD, sharper models having values closer to 0), (D) RPS (better values closer to 0), (E) DSS (better values closer to 0) and (F) AE (better values closer to 0), all as a function of the forecast horizon.

The values shown are the same scores as in

Model | Calibration | Sharpness | Bias | RPS | DSS | AE
---|---|---|---|---|---|---
Semi-mechanistic | | 91 | 0.13 | 31 | 9.2 | 42
Autoregressive | | 61 | -0.17 | 31 | 9.1 | 43
Deterministic | 0.03 | 340 | 0.24 | 97 | 11 | 130
Unfocused | <0.01 | 41 | -0.024 | 35 | 13 | 47
Semi-mechanistic | | 150 | 0.2 | 50 | 12 | 65
Autoregressive | 0.03 | 77 | -0.18 | 43 | 9.9 | 60
Deterministic | <0.01 | 400 | 0.35 | 120 | 12 | 160
Unfocused | <0.01 | 42 | -0.044 | 48 | 16 | 61
Semi-mechanistic | 0.03 | 230 | 0.3 | 81 | 15 | 93
Autoregressive | 0.02 | 90 | -0.17 | 53 | 11 | 73
Deterministic | <0.01 | 490 | 0.45 | 160 | 13 | 210
Unfocused | <0.01 | 44 | -0.058 | 60 | 29 | 71

The semi-mechanistic and deterministic models showed a tendency to overestimate the predicted number of cases, while the autoregressive and unfocused models tended to underestimate (

Focusing purely on the median forecast (and thus ignoring both calibration and sharpness), the absolute error (AE,

We lastly studied the calibration behaviour of the models over time; that is, using the data and forecasts available up to different time points during the epidemic (

Calibration scores of the forecast up to the time point shown on the x-axis. (A) Semi-mechanistic model variants, with the best model highlighted in dark red and other model variants shown in light red. (B) Best semi-mechanistic model and null models. In both cases, 1-week (left) and 2-week (right) calibration (p-value of Anderson-Darling test) are shown.

Probabilistic forecasts aim to quantify the inherent uncertainty in predicting the future. In the context of infectious disease outbreaks, they allow the forecaster to go beyond merely providing the most likely future scenario and quantify how likely that scenario is to occur compared to other possible scenarios. While correctly quantifying uncertainty in predicted trajectories has not commonly been the focus in infectious disease forecasting, it can have enormous practical implications for public health planning. Especially during acute outbreaks, decisions are often made based on so-called “worst-case scenarios” and their likelihood of occurring. The ability to adequately assess the magnitude as well as the probability of such scenarios requires accuracy at the tails of the predictive distribution, in other words good calibration of the forecasts.

More generally, probabilistic forecasts need to be assessed using metrics that go beyond the simple difference between the central forecast and what really happened. Applying a suite of assessment methods to the forecasts we produced for Western Area, Sierra Leone, we found that probabilistic calibration of semi-mechanistic model variants varied, with the best ones showing good calibration for up to 2-3 weeks ahead, but performance deteriorated rapidly as the forecasting horizon increased. This reflects our lack of knowledge about the underlying processes shaping the epidemic at the time, from public health interventions by numerous national and international agencies to changes in individual and community behaviour. During the epidemic, we only published forecasts up to 3 weeks ahead, as longer forecasting horizons were not considered appropriate.

Our forecasts suffered from bias that worsened as the forecasting horizon expanded. Generally, the forecasts tended to overestimate the number of cases to be expected in the following weeks, as did most other forecasts generated during the outbreak [

There are trade-offs between achieving good outcomes for the different forecast metrics we used. Deciding whether the best forecast is the best calibrated, the sharpest or the least biased, or some compromise between the three, is not a straightforward task. Our assessment of forecasts using separate metrics for probabilistic calibration, sharpness and bias highlights the underlying trade-offs. While the best calibrated semi-mechanistic model variant showed better calibration performance than the null models, this came at the expense of a decrease in the sharpness of forecasts. Comparing the models using the RPS alone, the semi-mechanistic model of best calibration performance would not necessarily have been chosen. Following the paradigm of maximising sharpness subject to calibration, we therefore recommend treating probabilistic calibration as a prerequisite to the use of forecasts, in line with what has recently been suggested for post-processing of forecasts [

Other models may have performed better than the ones presented here. Because we did not have access to data that would have allowed us to assess the importance of different transmission routes (burials, hospitals and the community) we relied on a relatively simple, flexible model. The deterministic SEIR model we used as a null model performed poorly on all forecasting scores, and failed to capture the downturn of the epidemic in Western Area. On the other hand, a well-calibrated mechanistic model that accounts for all relevant dynamic factors and external influences could, in principle, have been used to predict the behaviour of the epidemic reliably and precisely. Yet, lack of detailed data on transmission routes and risk factors precluded the parameterisation of such a model and are likely to do so again in future epidemics in resource-poor settings. Future work in this area will need to determine the main sources of forecasting error, whether structural, observational or parametric, as well as strategies to reduce such errors [

In practice, there might be considerations beyond performance when choosing a model for forecasting. Our model combined a mechanistic core (the SEIR model) with non-mechanistic variable elements. By using a flexible non-parametric form of the time-varying transmission rate, the model provided a good fit to the case series despite a high level of uncertainty about the underlying process. Having a model with a mechanistic core came with the advantage of enabling the assessment of interventions just as with a traditional mechanistic model. For example, the impact of a vaccine could be modelled by moving individuals from the susceptible into the recovered compartment [

Epidemic forecasts played a prominent role in the response to and public awareness of the Ebola epidemic [

For forecast assessment to happen in practice, evaluation strategies must be planned before the forecasts are generated. In order for such evaluation to be performed retrospectively, all forecasts as well as the data, code and models they were based on should be made public at the time, or at least preserved and decisions recorded for later analysis. We published weekly updated aggregate graphs and numbers during the Ebola epidemic, yet for full transparency it would have been preferable to allow individuals to download raw forecast data for further analysis.

If forecasts are not only produced but also evaluated in real time, this can give valuable insights into strengths, limitations, and reasonable time horizons. In our case, by tracking the performance of our forecasts, we would have noticed the poor calibration of the model variant chosen for the forecasts presented to the public, and instead selected better calibrated variants. At the same time, we did not store the predictive distribution samples for any area apart from Western Area in order to better use available storage space, and because we did not deem such storage valuable at the time. This has precluded a broader investigation of the performance of our forecasts.

Research into modelling and forecasting methodology and predictive performance at times during which there is no public health emergency should be part of pandemic preparedness activities. To facilitate this, outbreak data must be made available openly and rapidly. Where available, combining multiple sources of data, such as epidemiological and genetic data, could increase predictive power. It is only on the basis of systematic and careful assessment of forecast performance during and after the event that the predictive ability of computational models can be improved and lessons be learned to maximise their utility in future epidemics.