Co-author Willy Aspinall is the owner of Aspinall & Associates. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Conceived and designed the experiments: WPA RMC AH TH SH. Performed the experiments: WPA RMC AH TH SH. Analyzed the data: WPA RMC TH. Contributed reagents/materials/analysis tools: WPA RMC AH TH SH. Wrote the paper: WPA RMC AH TH SH.

For many societally important science-based decisions, data are inadequate, unreliable or non-existent, and expert advice is sought. In such cases, procedures for eliciting structured expert judgments (SEJ) are increasingly used. This raises questions regarding validity and reproducibility. This paper presents new findings from a large-scale international SEJ study intended to estimate the global burden of foodborne disease on behalf of WHO. The study involved 72 experts distributed over 134 expert panels, with panels comprising thirteen experts on average. Elicitations were conducted in five languages. Performance-based weighted solutions for target questions of interest were formed for each panel. These weights were based on each expert’s statistical accuracy and informativeness, determined using between ten and fifteen calibration variables from the experts' field with known values. Equal weights combinations were also calculated. The main conclusions on expert performance are: (1) SEJ does provide a science-based method for attribution of the global burden of foodborne diseases; (2) equal weighting of experts per panel increased statistical accuracy to acceptable levels, but at the cost of informativeness; (3) performance-based weighting increased informativeness, while retaining accuracy; (4) due to study constraints individual experts’ accuracies were generally lower than in other SEJ studies; and (5) there was a negative correlation between experts' informativeness and statistical accuracy which attenuated as accuracy improved, revealing that the least accurate experts drive the negative correlation. It is shown, however, that performance-based weighting has the ability to yield statistically accurate and informative combinations of experts' judgments, thereby offsetting this contrary influence.
The present findings suggest that application of SEJ on a large scale is feasible, and motivate the development of enhanced training and tools for remote elicitation of multiple, internationally-dispersed panels.

In 2007, the World Health Organization (WHO) formed the Foodborne Disease Burden Epidemiology Reference Group (FERG) with the objective of estimating the global burden of diseases acquired from food [

The pathways by which foodborne disease hazards may reach humans are shown in

Attribution diagram: transmission routes to the point-of-exposure included in the expert elicitation [from ref. 2]. It is assumed that the pathways are mutually exclusive and exhaustive.

Food sub-pathways to the point-of-entry into household, included in the expert elicitation. It is assumed that the food subpathways are mutually exclusive and exhaustive.

WHO sub-regions: expert panels provided estimates for each sub-region, from [

Detailed knowledge of the pathways by which hazards reach humans affords the best opportunity for targeted intervention. However, the relative frequency with which a specific foodborne hazard exploits a given pathway will depend strongly on geography, local diet, sanitary conditions and general public health, among many other factors. Various source attribution methods exist, including microbiological and epidemiological methods, expert elicitation and methods that aim to integrate data from these approaches. Each approach has its own strengths and weaknesses [

This major exercise in international expert elicitation has provided a unique opportunity to gain new insights into the challenges of using SEJ for this purpose, and to assess the strengths and limitations of the Classical Model [

Companion articles [

Structured Expert Judgment (SEJ) is designed to convert diffuse and possibly conflicting sources of information into actionable signals of potential use to public health professionals, and to do so in a manner consistent with the best applicable scientific principles. In the WHO FERG SEJ elicitation, the Classical Model [

Shannon relative information is used because it is non-negative, scale invariant, tail insensitive, slow and familiar. Slowness implies that large variations in an expert's quantile values produce only modest changes in his/her informativeness score. A factor-of-two difference in informativeness is therefore very noticeable. Parenthetically, information measures with physical dimensions, such as the standard deviation, or the width of prediction intervals [
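As a minimal sketch of how such an information score can be computed, assume (as is common in Classical Model applications, although this section does not state it) that each expert supplies 5%, 50% and 95% quantiles and that the background measure is uniform on an intrinsic range. The function name, the range bounds and the numbers below are illustrative assumptions, not values from the WHO study.

```python
import math

# Assumed probability masses between the 5%, 50% and 95% quantiles.
PROBS = (0.05, 0.45, 0.45, 0.05)

def relative_information(quantiles, lower, upper, probs=PROBS):
    """Shannon relative information of a quantile assessment with respect
    to a uniform background measure on the interval [lower, upper]."""
    edges = [lower, *quantiles, upper]
    width = upper - lower
    info = 0.0
    for p, left, right in zip(probs, edges[:-1], edges[1:]):
        background_mass = (right - left) / width  # uniform mass of this bin
        info += p * math.log(p / background_mass)
    return info

# Hypothetical assessments on an intrinsic range of [0, 100].
vague = relative_information((5, 50, 95), 0, 100)     # matches the background
narrow = relative_information((49, 50, 51), 0, 100)   # concentrated assessment
rescaled = relative_information((490, 500, 510), 0, 1000)
```

In this sketch an assessment that coincides with the background measure scores zero, a concentrated assessment scores higher, and rescaling the units leaves the score unchanged, which illustrates the non-negativity and scale invariance mentioned above.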

Statistical accuracy is scored between 0 and 1 (higher is better) and is a very fast function: in contrast to informativeness, it can differ by orders of magnitude between experts. This means that the normalized product of statistical accuracy and informativeness is dominated by the statistical accuracy score. This is by design, as high informativeness should not counterbalance poor statistical accuracy. For the expert panels in the WHO data, the number of calibration variables is sufficient to distinguish good and poor statistical accuracy.
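The following sketch shows one way to compute a statistical accuracy score of this kind. It assumes the standard Classical Model formulation, which this section does not spell out: experts give 5%, 50% and 95% quantiles, and the score is the p-value of a likelihood-ratio statistic that is asymptotically chi-square distributed with three degrees of freedom. The function names and the example counts are hypothetical.

```python
import math

# Assumed probability masses between the 5%, 50% and 95% quantiles.
PROBS = (0.05, 0.45, 0.45, 0.05)

def bin_counts(assessments, realizations):
    """Count how many realizations fall in each inter-quantile bin."""
    counts = [0, 0, 0, 0]
    for (q05, q50, q95), x in zip(assessments, realizations):
        if x <= q05:
            counts[0] += 1
        elif x <= q50:
            counts[1] += 1
        elif x <= q95:
            counts[2] += 1
        else:
            counts[3] += 1
    return counts

def chi2_sf_3dof(x):
    """Survival function of the chi-square distribution with 3 degrees
    of freedom (closed form, so no external library is needed)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def statistical_accuracy(counts, probs=PROBS):
    """P-value of the hypothesis that realizations fall in the bins with
    the stated probabilities, given the observed bin counts."""
    n = sum(counts)
    # Relative information of the empirical bin frequencies w.r.t. probs.
    kl = sum((c / n) * math.log((c / n) / p) for c, p in zip(counts, probs) if c > 0)
    return chi2_sf_3dof(max(0.0, 2 * n * kl))

# Hypothetical bin counts over 20 calibration variables.
well_calibrated = statistical_accuracy([1, 9, 9, 1])
overconfident = statistical_accuracy([8, 2, 2, 8])
```

With these made-up counts the overconfident expert, whose realizations fall in the tails far more often than 5% of the time, receives a score many orders of magnitude below that of the well-calibrated expert, which is the sense in which the score is "fast".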

The product of statistical accuracy and informativeness constitutes the ‘combined score’, which is proportional to that expert’s performance-based weight, subject to optimization: an optimal statistical accuracy threshold is chosen, beneath which experts receive zero weight. The optimized weight satisfies a long-run proper scoring rule constraint whereby an expert maximizes his/her long-run expected score by, and only by, stating his/her true opinions.
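The weighting step described above can be sketched as follows. This is an illustration only: the cutoff is passed in as a parameter rather than optimized, and the function name and the scores are hypothetical rather than taken from the study.

```python
def performance_weights(accuracy, informativeness, cutoff):
    """Normalized combined scores, with zero weight for experts whose
    statistical accuracy falls below the cutoff."""
    combined = [
        a * i if a >= cutoff else 0.0
        for a, i in zip(accuracy, informativeness)
    ]
    total = sum(combined)
    if total == 0:
        raise ValueError("no expert passes the statistical accuracy cutoff")
    return [c / total for c in combined]

# Hypothetical scores for three experts.
accuracy = [0.5, 0.04, 0.001]
informativeness = [0.8, 1.5, 2.5]
weights = performance_weights(accuracy, informativeness, cutoff=0.01)
```

In this example the third expert is the most informative but falls below the cutoff and receives zero weight, while the first expert dominates because the accuracy score varies much faster than the information score. Choosing the cutoff that gives the best-scoring combination would require scoring each candidate combination on the calibration variables, which this sketch omits.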

In the Classical Model, expert responses can be combined using two forms of performance-based weights: 1) ‘global weights’, which assign an overall informativeness score to each expert based on all the calibration variables and all target questions; and 2) ‘item weights’, which modulate combination weights by using item-specific individual expert information scores. Global weights solutions were determined for the WHO study and these are referred to here as the ‘Performance-Weighted Decision Maker’ or PW DM (for details, see [
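To illustrate how a weighted combination of experts can be formed, the sketch below pools experts' distributions linearly, that is, as a weighted mixture of their cumulative distribution functions. It assumes 5%, 50% and 95% quantile assessments, a common intrinsic range, and piecewise-linear interpolation between quantiles; all names and numbers are hypothetical.

```python
from bisect import bisect_right

# Cumulative probabilities at: lower bound, 5%, 50%, 95% quantiles, upper bound.
LEVELS = (0.0, 0.05, 0.50, 0.95, 1.0)

def expert_cdf(x, quantiles, lower, upper):
    """Piecewise-linear CDF through an expert's quantiles on [lower, upper]."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    xs = (lower, *quantiles, upper)
    k = bisect_right(xs, x) - 1  # index of the bin containing x
    frac = (x - xs[k]) / (xs[k + 1] - xs[k])
    return LEVELS[k] + frac * (LEVELS[k + 1] - LEVELS[k])

def pooled_cdf(x, experts, weights, lower, upper):
    """Weighted mixture of the experts' CDFs (linear pool)."""
    return sum(w * expert_cdf(x, q, lower, upper) for q, w in zip(experts, weights))

def pooled_quantile(level, experts, weights, lower, upper, tol=1e-9):
    """Invert the pooled CDF by bisection."""
    lo, hi = lower, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pooled_cdf(mid, experts, weights, lower, upper) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Two hypothetical experts who disagree, on an intrinsic range of [0, 100].
experts = [(10, 20, 30), (60, 70, 80)]
equal = [0.5, 0.5]
ew_width = (pooled_quantile(0.95, experts, equal, 0, 100)
            - pooled_quantile(0.05, experts, equal, 0, 100))
pw_median = pooled_quantile(0.50, experts, [1.0, 0.0], 0, 100)
```

With these made-up inputs the equal-weight pool has a much wider 90% interval than either expert, whereas placing all weight on one expert simply reproduces that expert's quantiles. This is one way to see why equal weighting tends to lose informativeness when experts disagree.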

Any combination of experts’ assessments may be applied to the calibration variables and scored with respect to statistical accuracy and informativeness. When the calibration variables used for comparison are also used to derive the weights for PW DM, this is ‘in-sample’ testing. For ‘out-of-sample’ comparison, the weights for the PW DM are derived from a subset of variables, the ‘training set’, and PW DM performance is measured on a separate, distinct ‘test set’ of variables. The Classical Model has been extensively reviewed both with regard to in-sample and out-of-sample performance (see [
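The distinction between in-sample and out-of-sample testing can be made concrete with a small sketch that enumerates training and test sets of calibration variables. The splitting scheme and the sizes used here are assumptions for illustration; this section does not say which splits were used in the reviews cited.

```python
from itertools import combinations

def train_test_splits(n_variables, train_size):
    """Yield every (training set, test set) pair of calibration-variable
    indices with the given training-set size."""
    indices = range(n_variables)
    for train in combinations(indices, train_size):
        test = tuple(i for i in indices if i not in train)
        yield train, test

# Hypothetical case: 10 calibration variables, 8 used to derive weights.
splits = list(train_test_splits(10, 8))
```

For each split, weights would be derived from the training variables only and the resulting combination scored on the held-out test variables. In-sample testing corresponds to using all variables for both purposes.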

The WHO elicitation is a large-scale application of SEJ involving 72 experts from all parts of the world, distributed over 134 subject-matter panels. The sheer scale of this activity imposed many new constraints on the SEJ process whilst, at the same time, affording an opportunity to gain new insights into SEJ strengths and weaknesses.

The choice of the Classical Model was largely driven by its emphasis on empirical validation with calibration variables. A secondary consideration was that the Classical Model seemed scalable to the global disease burden problem under the study constraints. Plenary meetings of the expert panels were ruled out by the geographical dispersion of the experts. Although face-to-face interviews are preferred for capturing experts' reasoning, this was also not possible given location, calendar and budget constraints. Further, to be operationally functional the elicitation protocols had to be translated into French, Russian, Spanish and Chinese. Elicitors fluent in these languages had to be identified and trained (training was conducted by S. Hoffmann and W. Aspinall). The distributions of numbers of experts per panel and number of calibration variables per panel are shown in

Left: histogram of number of panels for number of experts; right: histogram of number of panels for number of calibration variables.

Elicitations for the calibration questions were conducted online and, for the variables of interest, experts were requested to fill in and return spreadsheets in a single online session. The goal of the exercise was to capture experts' immediate cognitive uncertainty judgments on the basis of the information provided in the questions, not to test their ability to research or access relevant information or data.

The calibration variables differed by panel and were aimed at two broad domains (biological and chemical hazards). Within these domains the questions differed by specific hazard. They also included questions on food supply, child mortality, water and sanitation, disease surveillance and dietary patterns. Because of the multiple factors affecting source attribution, panels included experts from different disciplines with different areas of expertise. Reference [

E.g. Among all WHO sub-regions, in 2010 what was the proportion of regional vegetable supply (tonnes) that was imported rather than produced domestically in the WHO sub-region with the highest such percentage?

E.g. Based on WHO’s estimates, think of the country in the WHO African Region that had the largest percentage point decrease from 2000 to 2010 in all-cause under-5 mortality that was due to diarrhea. What was that percentage point

E.g. What will be the rate per 100,000 population of laboratory confirmed human cases of campylobacteriosis in 2012 in all EU member states as reported in EFSA’s annual report?

E.g. What did the UNEP Final Review of Scientific Information on Lead report in 2010 as the mean blood lead level for children in Nigeria? Please express your answer as positive micrograms per deciliter (μg/dL)

E.g. Based on this FAO Food Balance Sheet data, in 2009 what was the mean percentage of rice in the national food supply available for human consumption for countries in the WHO South East Asia Region?

For all elicitors and for all WHO experts, this was their first exposure to this type of structured expert judgment elicitation. This inexperience, combined with the remote implementation of the elicitation, constituted the strongest departure from the preferred procedures for SEJ (Cooke et al., 2000). Other approaches [

In this section, we consider how the various expert panels performed in terms of Classical Model metrics for statistical accuracy and informativeness, and assess the properties of the resulting PW DMs and EW DMs, for aggregated judgments. A panel is a group of experts who answered target questions for a particular foodborne disease hazard, either for all regions or for one specified region. Target questions asked for attribution of all cases of foodborne disease caused via primary exposure pathways (e.g. water; food; air) and via specified food sub-pathways (e.g. beef; fruits; shellfish).

There was, as a consequence, considerable overlap in membership across the panels. Overall, there were 112 panels with distinct sets of experts, though many panels differed only with respect to a few expert members. In many of these, the optimal weighted combination would assign the same weights to the same set of experts, making the PW DMs identical, even though EW DMs might differ slightly. For some hazards, the same sets of panelists provided answers for multiple regions, so the same panels could be used multiple times. Many experts participated in several distinct panels. For these reasons, any statistical analysis of results that considers the panels as independent experiments is not possible.

A recent analysis of thirty-three disjoint panels in professionally contracted SEJ studies involving 321 experts in total post-2006 [

The WHO data are distinctive in that a large number of experts assessed variables that are similar, as all of them involved relative frequencies of the same pathways and sub-pathways. The statistical accuracy and informativeness scores can be compared across all experts and across all distinct panels. This affords a unique opportunity to study the interactions of these two scoring variables.

Statistical accuracy and informativeness for 72 WHO experts. The red vertical line demarcates the traditional 5% significance-level rejection threshold for statistical hypothesis testing; the horizontal blue line is solely for comparison with Figs

Statistical accuracy and informativeness of the Equal Weights panel EW DMs. The vertical red line denotes the traditional 5% significance-level rejection threshold; the horizontal blue line is solely for comparison with Figs

Comparing the horizontal (blue) lines denoting information score equal to unity,

Statistical accuracy and informativeness of Performance Weights (PW DM) and Equal Weights Panel (EW DM), and corresponding DM joint scores (symbol size); the thin grey lines join PW DM and EW DM solutions for individual Panels (see text). The vertical red line denotes the traditional 5% significance-level rejection threshold; the horizontal blue line is solely for comparison with Figs

The average informativeness score of the PW DM is 1.14, more than twice the average informativeness score of the EW DM (0.52). The statistical accuracy is degraded somewhat relative to the EW DM, but not catastrophically. For only one of the 112 panels is the PW DM's statistical accuracy score below 0.045, and for none is it below 0.01. Considering each PW DM panel as a statistical hypothesis, the statistical accuracy scores are the p-values for rejecting the hypothesis that a PW DM is statistically accurate. Rounding the p-values to two digits, only one panel would be rejected at the 5% level and none would be rejected at the 1% level. If the panels were independent (which they are not) and statistically accurate, we would expect about six panels’ p-values (5% of 112) to fall below 5%.
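The expected count quoted above follows from simple arithmetic, sketched below under the hypothetical assumption of independent panels (which, as noted, does not hold here). The function names are illustrative.

```python
from math import comb

def expected_rejections(n_panels, level):
    """Expected number of p-values below `level` among n independent,
    statistically accurate panels."""
    return n_panels * level

def prob_at_most(k, n_panels, level):
    """Binomial probability of observing at most k rejections."""
    return sum(
        comb(n_panels, j) * level**j * (1 - level) ** (n_panels - j)
        for j in range(k + 1)
    )

expected = expected_rejections(112, 0.05)   # 5% of 112 panels
p_one_or_fewer = prob_at_most(1, 112, 0.05)
```

Under independence, seeing only one rejection among 112 panels would itself be unlikely. Because the panels overlap heavily in membership, this calculation serves only as a reference point and not as a formal test.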

This large set of experts assessing multiple similar items allows a more detailed examination than hitherto of the relationship between information and statistical accuracy. The data presented in ^{-4}), thereby confirming that statistical accuracy and informativeness are negatively correlated. Such negative correlations have often been observed in individual studies, but in those cases the relatively small numbers of experts, combined with the intrinsic non-comparability of informativeness scores, have precluded reliable quantitative conclusions.

Because of this negative association, simple weighting schemes that consider only experts' informativeness will tend to produce combinations which are very inaccurate, statistically [

Another feature emerges from

Running rank correlation between informativeness and statistical accuracy for Experts k = 1 … 72, with experts ordered by increasing statistical accuracy.

The negative rank correlation attenuates as the selection is restricted to experts who are more accurate statistically. In other words, the observed negative association between informativeness and statistical accuracy is driven by the

This simple observation explains why the Classical Model, which restricts weighting to the most accurate experts, can produce PW DM combinations that are simultaneously informative and statistically accurate.
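A running rank correlation of the kind described above can be sketched as follows. The sketch assumes Spearman's rank correlation (the text says only "rank correlation") and recomputes it after successively dropping the least accurate experts. The data are invented for illustration and are not the WHO scores.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def running_rank_correlation(accuracy, informativeness, min_size=3):
    """Rank correlation over experts k..N, ordered by increasing accuracy."""
    order = sorted(range(len(accuracy)), key=accuracy.__getitem__)
    acc = [accuracy[i] for i in order]
    inf = [informativeness[i] for i in order]
    return [spearman(acc[k:], inf[k:]) for k in range(len(acc) - min_size + 1)]

# Invented scores for six experts (not the WHO data).
accuracy = [1e-6, 1e-4, 1e-2, 0.1, 0.3, 0.6]
informativeness = [3.0, 2.5, 2.0, 0.9, 1.1, 1.0]
running = running_rank_correlation(accuracy, informativeness)
```

In this invented example the correlation over all six experts is strongly negative, and it becomes less negative once the least accurate experts are excluded, mirroring the qualitative pattern reported for the WHO experts.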

The WHO FERG study on the global burden of foodborne diseases provides new perspectives on the efficacy of SEJ. A very large number of elicitations were conducted with experts distributed around the world, by rapidly trained elicitors with no prior experience. Elicitations were not conducted face-to-face. Calibration variables drawn from the experts' fields were used to gauge expert performance and to enable performance-based scoring combinations of judgments to be applied to the target items of interest. The statistical accuracies of these experts were generally lower than is typical in a dedicated Classical Model SEJ study, a fact plausibly explained by the operational limitations of the present global elicitation process compared with more intensive approaches.

In spite of these limitations, the statistical accuracy of both the Performance Weights and Equal Weights DM combinations was much better than that of the experts themselves, and well within the range of acceptability. Informativeness of the Equal Weights combination was strongly degraded, but informativeness of the performance-based combinations was comparable to that of the experts themselves.

Most significant in this dataset is the negative rank correlation between informativeness and statistical accuracy, and the finding that this correlation weakens when expert selection is restricted to the statistically more accurate experts.

These results motivate the development and deployment of enhanced elicitor and expert training, and advanced tools for remote elicitation of multiple, internationally-dispersed panels, demand for which is growing in many disciplines.

The demonstration that SEJ applications on this scale are feasible and potentially successful offers new options for many scientific and technical areas in which the inchoate information embedded in widely dispersed experts can be actively accessed for the benefit of decision-makers or policy-makers.

The research questions addressed in this study may be answered as follows:

Yes. The Classical Model enables empirical control, however this control is

Partially. As with other studies, equal weighting of experts per panel raised statistical accuracy to acceptable levels, but at the cost of informativeness. Performance-based weighting increased informativeness without sacrificing accuracy. For more than 95% of the expert panels, the hypothesis that these Performance Weights combinations yield statistically accurate probability statements would not be rejected at the 5% level. Whereas Equal Weights combinations were much less informative than the experts themselves, the informativeness measures of Performance Weights solutions were comparable to those of the experts. This pattern is consistent with comparable SEJ studies. On the other hand, the overall statistical accuracy of the experts in this study was lower than that found in comparable Classical Model studies.

Yes. This study finds that the negative correlation between informativeness and statistical accuracy is attenuated as statistical accuracy improves. This augurs well for performance-based combination methods that restrict weighting to subsets of statistically accurate experts and, among these, reward informativeness.

The Excel file contains experts’ names and scores per panel. “pwg” denotes global performance weights, “pwi” denotes item-specific performance weights, and “ew” denotes equal weights.

(XLSX)

The authors are very grateful to the specialists who contributed their expertise and time to this study (see

WPA was supported in part at Bristol University by the Natural Environment Research Council (Consortium on Risk in the Environment: Diagnostics, Integration, Benchmarking, Learning and Elicitation—CREDIBLE; grant number NE/J017450/1).

This study was commissioned and paid for by the World Health Organization (WHO). Copyright in the original work on which this article is based belongs to WHO. The authors have been given permission to publish this article. The authors alone are responsible for the views expressed in this publication and they do not necessarily represent the views, decisions or policies of the World Health Organization, the U.S. Department of Agriculture or the Centers for Disease Control and Prevention, Atlanta, USA.