The Alda score is commonly used to quantify lithium responsiveness in bipolar disorder. Most often, this score is dichotomized into “responder” and “non-responder” categories. This practice is often criticized as inappropriate, since continuous variables are thought to be invariably “more informative” than their dichotomizations. We therefore compared the informativeness of the raw and dichotomized versions of the Alda score, using data from a published study of the scale’s inter-rater reliability (n = 59 raters of 12 standardized vignettes each). After learning a generative model for the relationship between observed and ground truth scores (the latter defined by a consensus rating of the 12 vignettes), we show that the dichotomized scale is more robust to inter-rater disagreement than the raw 0-10 scale. Further theoretical analysis shows that when a measure’s reliability is stronger at one extreme of the continuum (a scenario that has received little to no statistical attention, but which likely occurs for Alda scores ≥ 7), dichotomization of a continuous variable may be more informative concerning its ground truth value, particularly in the presence of noise. Our study suggests that research employing the Alda score of lithium responsiveness should continue using the dichotomous definition, particularly when data are sampled across multiple raters.

The Alda score is a validated index of lithium responsiveness commonly used in bipolar disorder (BD) research [

A common criticism that arises from this practice is that continuous variables should not be discretized by virtue of “information loss.” Indeed, discretizing continuous variables is widely viewed as an inappropriate practice [

Although the Manchia et al. [

Detailed description of data and collection procedures is found in Manchia et al. [

The total number of raters (n_{r}) was 59.

Site | n_{r} | Vignette 1 | Vignette 2 | Vignette 3 | Vignette 4 | Vignette 5 | Vignette 6 | Vignette 7 | Vignette 8 | Vignette 9 | Vignette 10 | Vignette 11 | Vignette 12
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Gold standard | | 8 | 9 | 6 | 7 | 9 | 3 | 5 | 9 | 3 | 9 | 5 | 1
Halifax | 9 | 8.4 | 8.6 | 6.6 | 6.9 | 9.2 | 3 | 3.9 | 8.8 | 3.1 | 9.1 | 4.7 | 1.2
NIMH | 4 | 7.8 | 8.2 | 6.2 | 7 | 8.8 | 3.2 | 4 | 8.5 | 2.2 | 8.5 | 3.2 | 1.8
Poznan | 2 | 9 | 8.5 | 6.5 | 5.5 | 9 | 4 | 7.5 | 9 | 5 | 8 | 4.5 | 4.5
Dresden | 2 | 8.5 | 7.5 | 6 | 5 | 8.5 | 1.5 | 6 | 9 | 3.5 | 8.5 | 4 | 1.5
Japan | 4 | 8 | 8.2 | 4.8 | 6.5 | 8.5 | 2 | 3 | 8.5 | 1 | 8.2 | 4.5 | 1.5
Wuerzburg | 2 | 7.5 | 7.5 | 4 | 6.5 | 8 | 1.5 | 3 | 9 | 0 | 7 | 3 | 0.5
Cagliari | 3 | 7.7 | 9 | 4.3 | 7 | 5.7 | 4 | 1.3 | 9 | 0.7 | 7.3 | 4 | 2
San Diego | 2 | 7.5 | 8.5 | 7.5 | 7 | 9 | 5 | 7.5 | 8.5 | 3.5 | 8.5 | 6 | 3.5
Boston | 2 | 8.5 | 8.5 | 6 | 7 | 9 | 3 | 3.5 | 8.5 | 1.5 | 9 | 4 | 1
Gottingen | 2 | 9.5 | 9 | 4 | 6 | 9 | 1 | 1 | 9 | 1.5 | 9 | 4 | 3
Berlin | 1 | 7 | 9 | 4 | 6 | 9 | 2 | 3 | 8 | 0 | 7 | 0 | 2
Taipeh | 1 | 8 | 8 | 5 | 8 | 9 | 5 | 6 | 9 | 4 | 9 | 8 | 1
Prague | 1 | 7 | 9 | 4 | 8 | 9 | 3 | 6 | 9 | 3 | 9 | 6 | 1
Johns Hopkins | 7 | 8 | 8.7 | 5.3 | 5.9 | 8.3 | 2.7 | 2.4 | 9.1 | 2 | 8.3 | 4.4 | 1.1
Mayo | 6 | 8 | 8.2 | 6 | 8 | 9 | 4.2 | 3 | 9 | 4.2 | 8.8 | 3.7 | 0.3
Brasil | 3 | 8 | 8.3 | 5.3 | 6.3 | 8.7 | 2 | 4 | 9 | 4.3 | 8 | 4.7 | 0.7
Medellin | 4 | 7.5 | 9 | 5.5 | 6.5 | 5 | 2.5 | 4 | 7.2 | 4.8 | 8.8 | 1.2 | 2
Geneve | 3 | 7.7 | 8.7 | 6.7 | 5.3 | 9.7 | 5 | 6 | 8.7 | 1.3 | 9 | 3.7 | 0.3

In this analysis, we seek to evaluate whether discretization of the Alda score under the existing inter-rater reliability values preserves

Let ^{(k)} is multinomial with parameter vector ^{(k)} ∼ Dir(^{(k)} (i.e. the ratings become more “noisy”).

The posterior of ^{(k)} given ^{(k)} and

The dichotomized Alda scores are defined as ^{(k)} ∼ Multinomial(_{k}), and _{k} ∼ Dir(

Mutual information is a general measure of dependence that expresses the degree to which uncertainty about one variable is reduced by observation of another. Whereas the correlation coefficient depends on the existence of a linear association, MI can detect nonlinear relationships between variables by comparing their joint probability against the product of their marginal distributions.
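As an illustration of this definition, the following minimal sketch computes MI for a discrete joint distribution as the sum of p(x, y) log[p(x, y) / (p(x) p(y))] over all cells. The function name `mutual_information` and the three toy distributions are hypothetical and are not taken from the study's code. The third example (Y = X² with X uniform on {-1, 0, 1}) has zero linear correlation but nonzero MI.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    `joint` is a list of rows, with joint[i][j] proportional to P(X = i, Y = j).
    """
    total = sum(sum(row) for row in joint)
    joint = [[p / total for p in row] for row in joint]  # accept counts or probabilities
    px = [sum(row) for row in joint]            # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]      # marginal distribution of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                         # 0 * log(0) is taken to be 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Joint equals the product of its marginals: no dependence.
independent = [[0.25, 0.25], [0.25, 0.25]]
# X determines Y exactly: MI equals the entropy of a fair coin, log(2).
dependent = [[0.5, 0.0], [0.0, 0.5]]
# Rows are X in {-1, 0, 1}; columns are Y = X**2 in {0, 1}.
nonlinear = [[0.0, 1 / 3], [1 / 3, 0.0], [0.0, 1 / 3]]

for name, table in [("independent", independent),
                    ("dependent", dependent),
                    ("nonlinear", nonlinear)]:
    print(name, round(mutual_information(table), 4))
```

The logarithm base only sets the units (nats here, bits with base 2); it does not affect comparisons between representations.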

Let

For the binarized classes, we have a prior of

The MI for these distributions can be computed as functions of the prior pseudocounts

Our primary hypothesis—that the dichotomized Alda score is more informative with greater observation uncertainty—is evaluated by determining whether _{ξ}[_{o}||_{*}] exceeds _{α}[_{o}||_{*}] as we increase the

The previous experiment regarding dichotomization of the raw Alda score did not fully capture the effect of dichotomization of a continuous variable, since the raw Alda score is still discrete (albeit with a larger domain of support). Thus, we sought to investigate whether dichotomization of a truly continuous, though asymmetrically reliable, variable would show a similar pattern of preserving MI and statistical power under higher levels of observation noise and agreement asymmetry.

The simplest synthetic dataset was a sample of regularly spaced points across the [0, 10] interval in both the x and y directions. It served only as a “sanity check” that our methods for computing MI correctly returned a value of 0. This was necessary since data with uniform random noise over the same interval yield an MI of 0 only in the limit of large sample sizes.
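The rationale for this check can be sketched with a simple plug-in (histogram) estimator, which is an assumption made here for brevity and not necessarily the estimator used in the study. The function name `histogram_mi`, the 5 x 5 binning, and the sample sizes are all hypothetical. On a full regular grid, every cell count factorizes into its row and column counts, so the estimate is exactly 0; a finite uniform random sample of the same size gives a small positive value.

```python
import math
import random
from collections import Counter

def histogram_mi(points, bins=5, lo=0.0, hi=10.0):
    """Plug-in MI estimate (nats) from a bins x bins histogram on [lo, hi]^2."""
    width = (hi - lo) / bins

    def bin_index(v):
        return min(int((v - lo) / width), bins - 1)  # v == hi goes in the last bin

    cells = Counter((bin_index(x), bin_index(y)) for x, y in points)
    rows, cols = Counter(), Counter()
    for (i, j), c in cells.items():
        rows[i] += c
        cols[j] += c
    n = len(points)
    # (c/n) / ((rows/n) * (cols/n)) simplifies to c * n / (rows * cols)
    return sum((c / n) * math.log(c * n / (rows[i] * cols[j]))
               for (i, j), c in cells.items())

# Regularly spaced 20 x 20 grid over [0, 10] in both directions.
ticks = [10 * k / 19 for k in range(20)]
grid = [(x, y) for x in ticks for y in ticks]

# Uniform random noise with the same number of points.
rng = random.Random(0)
noise = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(len(grid))]

mi_grid = histogram_mi(grid)
mi_noise = histogram_mi(noise)
print(mi_grid, mi_noise)
```

The positive value for random noise reflects the finite-sample bias of the estimator, which shrinks as the sample grows.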

The main synthetic dataset accepted “ground truth” values ^{th} sample:
_{i}) (data are entirely uniform random noise when _{i}) when _{i}) governing the agreement between ground truth and observed is essentially a 1:1 correspondence between

We simulated two forms of diagonal spread. The first is constant across all values _{i}) is defined as
_{(l, u)}(⋅) is a function to ensure that all points remain within the [_{(l, u)}(⋅) reflects points at the [0, 10] bounds. In the symmetrical case, the data are all simply rescaled to lie in the [0, 10] interval.
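One way to keep simulated points inside the bounds by reflection is sketched below. This assumes that "reflects" means mirroring a point about whichever bound it has crossed; the function name `reflect` and its default bounds are hypothetical and not taken from the study's code.

```python
def reflect(value, lower=0.0, upper=10.0):
    """Fold `value` back into [lower, upper] by mirroring it at the bounds."""
    span = upper - lower
    # Position within one full back-and-forth period of length 2 * span.
    offset = (value - lower) % (2 * span)
    # The first half of the period maps forward, the second half maps backward.
    return lower + (span - abs(offset - span))

for v in (5.0, 10.5, -0.3, 23.0):
    print(v, "->", round(reflect(v), 6))
```

Unlike clipping, reflection does not pile points up exactly on the boundary, so it distorts the density near 0 and 10 less.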

Demonstration of the simulated synthetic data are shown in ^{th} synthetic dataset (given parameters

The x-axes all represent the ground truth value of the variable, and the y-axes represent the “observed” values. Data are depicted based on different values of a uniform noise parameter (0 ≤

Mutual information was computed for both continuous and dichotomized probability distributions on the data. Mutual information for the continuous distribution was computed by first performing Gaussian kernel density estimation (using Scott’s method for bandwidth selection) on the simulated dataset, and then approximating the following integral using Markov chain Monte-Carlo sampling:
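A minimal sketch of an estimator of this kind follows, assuming the integral is the usual definition of MI, the integral of p(x, y) log[p(x, y) / (p(x) p(y))] over x and y. Two simplifications are made and should not be read as the study's method: the KDE uses a product Gaussian kernel with per-dimension Scott bandwidths (sigma_j · n^(-1/6) in two dimensions), and the integral is approximated by plain Monte Carlo draws from the fitted KDE in place of Markov chain sampling. All function names and the toy data are hypothetical.

```python
import math
import random
import statistics

def scott_bandwidths(xs, ys):
    """Scott's rule for d = 2 dimensions: h_j = sigma_j * n ** (-1 / (d + 4))."""
    factor = len(xs) ** (-1.0 / 6.0)
    return statistics.stdev(xs) * factor, statistics.stdev(ys) * factor

def gaussian_kernel(u, h):
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def kde_mutual_information(xs, ys, n_draws=300, seed=0):
    """Monte Carlo estimate (nats) of MI under a product-kernel Gaussian KDE."""
    rng = random.Random(seed)
    n = len(xs)
    hx, hy = scott_bandwidths(xs, ys)
    total = 0.0
    for _ in range(n_draws):
        # Draw from the KDE: pick a data point, then add kernel noise.
        i = rng.randrange(n)
        x, y = rng.gauss(xs[i], hx), rng.gauss(ys[i], hy)
        kx = [gaussian_kernel(x - a, hx) for a in xs]
        ky = [gaussian_kernel(y - b, hy) for b in ys]
        p_xy = sum(a * b for a, b in zip(kx, ky)) / n   # joint density at (x, y)
        p_x, p_y = sum(kx) / n, sum(ky) / n             # marginal densities
        total += math.log(p_xy / (p_x * p_y))
    return total / n_draws

rng = random.Random(1)
truth = [rng.uniform(0, 10) for _ in range(300)]
observed_close = [t + rng.gauss(0, 0.5) for t in truth]     # tight agreement
observed_random = [rng.uniform(0, 10) for _ in truth]       # no agreement

mi_dependent = kde_mutual_information(truth, observed_close)
mi_independent = kde_mutual_information(truth, observed_random)
print(round(mi_dependent, 3), round(mi_independent, 3))
```

With a product kernel the marginals of the joint KDE are themselves one-dimensional KDEs with the same bandwidths, so the three densities in the log ratio are mutually consistent.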

Conversely, discrete MI was computed by first creating a 2-dimensional histogram by binning data based on a dichotomization threshold
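The dichotomized computation can be sketched as follows, assuming the same threshold is applied to both the ground truth and the observed values. For illustration it uses the gold standard row and the Halifax site-level row from the table of vignette ratings (site-level values, not individual ratings), so the result is not one of the study's reported quantities. The function name `dichotomized_mi` is hypothetical.

```python
import math

def dichotomized_mi(truth, observed, threshold):
    """MI (nats) between 1[truth >= threshold] and 1[observed >= threshold]."""
    n = len(truth)
    counts = [[0, 0], [0, 0]]          # 2 x 2 histogram: rows truth, columns observed
    for t, o in zip(truth, observed):
        counts[int(t >= threshold)][int(o >= threshold)] += 1
    row = [sum(r) for r in counts]
    col = [sum(c) for c in zip(*counts)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if counts[i][j] > 0:       # empty cells contribute nothing
                mi += counts[i][j] / n * math.log(counts[i][j] * n / (row[i] * col[j]))
    return mi

gold_standard = [8, 9, 6, 7, 9, 3, 5, 9, 3, 9, 5, 1]
halifax = [8.4, 8.6, 6.6, 6.9, 9.2, 3, 3.9, 8.8, 3.1, 9.1, 4.7, 1.2]

# Recompute the dichotomized MI across candidate thresholds.
mi_by_threshold = {tau: dichotomized_mi(gold_standard, halifax, tau)
                   for tau in range(1, 11)}
for tau, mi in mi_by_threshold.items():
    print(tau, round(mi, 4))
```

At a threshold of 7, eleven of the twelve vignettes fall in matching classes (vignette 4 has a gold standard of 7 but a Halifax value of 6.9), which gives an MI of about 0.45 nats.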

Note that continuous MI will remain constant across

Association between the observed (^{(k)} observations and two-tailed statistical significance threshold ^{−1}(⋅) are the cumulative distribution function and quantile functions for a standard normal distribution, and
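A common way to approximate the power of a two-tailed test of a Pearson correlation uses the Fisher z-transformation together with the standard normal cumulative distribution and quantile functions. The sketch below assumes that approximation; whether it matches the exact expression used in the study cannot be confirmed from the text here. The function name `pearson_power` and the example values are hypothetical.

```python
import math
from statistics import NormalDist

def pearson_power(rho, n, alpha=0.05):
    """Approximate two-tailed power to detect a true correlation `rho`
    with `n` observations, using the Fisher z-transformation."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)        # quantile function
    shift = math.atanh(rho) * math.sqrt(n - 3)        # mean of the z statistic
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (10, 30, 100):
    print(n, round(pearson_power(0.5, n), 3))
```

Under the null (rho = 0) the expression reduces to alpha, as expected for a test of the stated size.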

Under a dichotomization of ^{(k)} is the total number of observations in sample

The statistical power of Fisher’s exact test under this setup and a two-tailed significance threshold of
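The power of Fisher's exact test can be computed exactly for small samples by enumerating every possible 2 x 2 table. The sketch below assumes that the four cells follow a multinomial distribution with known cell probabilities, which may differ from the sampling setup used in the study. The function names, the cell probabilities, and the sample size are hypothetical.

```python
import math

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = math.comb(n, col1)

    def prob(k):  # hypergeometric probability of k in the top-left cell
        return math.comb(row1, k) * math.comb(row2, col1 - k) / denom

    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return min(1.0, sum(prob(k) for k in support if prob(k) <= p_obs * (1 + 1e-9)))

def fisher_power(cell_probs, n, alpha=0.05):
    """Exact power: multinomial probability of all tables with p <= alpha."""
    (p11, p10), (p01, p00) = cell_probs
    power = 0.0
    for a in range(n + 1):
        for b in range(n - a + 1):
            for c in range(n - a - b + 1):
                d = n - a - b - c
                if fisher_exact_p(a, b, c, d) <= alpha:
                    ways = math.factorial(n) / (math.factorial(a) * math.factorial(b)
                                                * math.factorial(c) * math.factorial(d))
                    power += ways * p11 ** a * p10 ** b * p01 ** c * p00 ** d
    return power

# Rows: ground truth class; columns: observed class.
null_power = fisher_power([[0.25, 0.25], [0.25, 0.25]], n=20)   # no association
alt_power = fisher_power([[0.45, 0.05], [0.05, 0.45]], n=20)    # strong agreement
print(round(null_power, 4), round(alt_power, 4))
```

Because the test is conditional and discrete, its rejection rate under independence stays at or below the nominal threshold.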

The central aspect of this analysis is comparison of the dichotomized and continuous MI across values of the dichotomization threshold

Statistical power of the Pearson correlation coefficient and Fisher’s exact test were computed across symmetrical (

Mutual information experiments were conducted in Mathematica v. 12.0.0 (Wolfram Research, Inc.; Champaign, IL). Experiments evaluating the statistical power under classical tests of continuous and dichotomous association were conducted in the Python programming language. Code for the analyses is also provided in

Histograms of the observed Alda scores for each of the gold standard vignette values are depicted in

Each histogram represents the distribution of ratings (n_{r} = 59) for one of the twelve assessment vignettes. The gold standard (“ground truth”) Alda score, obtained by the Halifax consensus sample, is depicted as the title for each histogram. Plots in blue are those for vignettes with gold standard Alda scores less than 7, which would be classified as “non-responders” under the dichotomized setting. Vignettes with gold standard Alda scores ≥ 7 are shown in red, and represent the dichotomized group of lithium responders.

Panels A-C show the inferred joint distributions of the observed (_{o} for raw, _{o} for discrete) and gold standard (_{*} for raw, _{*} for discrete) values at different levels of observation noise (

X-axes represent the dichotomization thresholds at which we recalculate the dichotomized MI. Mutual information is depicted on the y-axes. Plot titles indicate the different diagonal spread (

Columns correspond to the level of uniform “overall” noise (

The present study makes two important contributions. First, using a sample of 59 ratings obtained using standardized vignettes compared to a consensus-defined gold standard [

Some have argued that the existence of categorical structure in one’s data [

The Alda score is more broadly used as a target variable in both predictive and associative analyses, and not as a predictor variable, which is an important departure from most analyses against dichotomization. Since there is no valid and reliable biomarker of lithium response, these cases must rely on the Alda score-based definition of lithium response as a “ground truth” target variable. In the case of predicting lithium response, where these ground truth labels are collected from multiple raters across different international sites, variation in lithium response scoring patterns across centres might further accentuate the extant between-site heterogeneity.

To this end, inter-individual differences in subjective rating scales may be more informative about the raters than the subjects, and one may wish to use dichotomization to discard this nuisance variance [

An important criticism of continuous variable dichotomization is that it may impede comparability of results across studies, both in terms of diminishing power and inflating heterogeneity [

Our study thus provides a unique point of support for the dichotomized Alda score insofar as we show that the retention of MI and frequentist statistical power is likely due to asymmetrical reliability across the range of scores. Our analyses show that there is a range of Alda scores (those identifying good lithium responders; scores ≥ 7) for which scores correspond more tightly to a consensus-defined gold standard in a large scale international consortium. Conversely, this asymmetry implies that Alda scores at the lower end of the range will carry greater uncertainty (

More generally, our study showed that if the reliability of a measure is particularly high at one tail of its range, then a “tail split” dichotomization can outperform even the continuous representation of the variable. This presents an important counterexample to previous authors, such as Cohen [

Our study has several limitations. First, our sample size for the re-analysis of the Alda score reliability was relatively small, and sourced from highly specialized raters involved in lithium-specific research. However, one may consider this sample as representative of the “best case scenario” for the Alda score’s reliability. Further expansion of the subject population would likely introduce more noise into the relationship between ground truth and observed Alda scores. Most of this additional disagreement would probably be observed for lower Alda scores, since (A) there are simply more potential item combinations that can yield an Alda score of 5 than an Alda score of 9, for example, and (B) unambiguously excellent lithium response is a phenomenon so distinct that some question whether lithium responsive BD may constitute a unique diagnostic entity [
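Point (A) can be checked by direct enumeration. The sketch below assumes the usual structure of the scale, which is not restated in this section: an A score from 0 to 10, five B items each scored 0, 1, or 2, and a total equal to A minus the sum of the B items, with negative totals set to 0. The function name `count_item_combinations` is hypothetical.

```python
from collections import Counter
from itertools import product

def count_item_combinations():
    """Count the (A, B1..B5) item combinations that yield each total score.

    Assumed scoring rule: total = max(A - (B1 + ... + B5), 0).
    """
    counts = Counter()
    for a_score in range(11):                            # A score: 0..10
        for b_items in product(range(3), repeat=5):      # each B item: 0, 1, or 2
            counts[max(a_score - sum(b_items), 0)] += 1
    return counts

combinations = count_item_combinations()
for total in (9, 7, 5):
    print(total, combinations[total])
```

Under these assumptions a total of 9 can arise from only 6 item combinations, whereas a total of 5 can arise from 147, so there are far more ways for raters to disagree on individual items and still land near the middle of the range.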

Our study is also limited by the fact that theoretical analysis was largely simulation-based, and thus cannot offer the degree of generalizability obtained through rigorous mathematical proof. Nonetheless, our study offers sufficient evidence—in the form of a counterexample—to show that there exist scenarios in which dichotomization is statistically superior to preserving a variable’s continuous representation. Furthermore, we used well controlled experiments to isolate asymmetrical reliability as the cause of dichotomization’s superiority across simulated conditions.

In conclusion, we have shown that a dichotomous representation of the Alda score for lithium responsiveness is more robust to noise arising from inter-rater disagreement. The dichotomous Alda score is therefore likely a better representation of lithium responsiveness for multi-site studies in which lithium response is a target or dependent variable. Through both re-analysis of the Alda score’s real-world inter-rater reliability data and careful theoretical simulations, we were able to show that asymmetrical reliability across the score’s domain was the likely cause for superiority of the dichotomous definition. Our study is important not only for future research on lithium response, but also for other studies using subjective and potentially unreliable measures as dependent variables. Practically speaking, our results suggest that it might be better to classify something we can all agree upon than to regress something upon which we cannot.

Histograms of ratings for each value of the ground truth Alda A-score. This figure was generated identically to

(PDF)

Mutual information between gold standard and observed Alda A-scores in relation to observation noise and the scale’s “raw” or dichotomized form. This figure was generated identically to

(PDF)

Inter-rater reliability data for the total Alda score.

(CSV)

Inter-rater reliability data for the Alda A-score.

(CSV)

Mathematica notebook containing the empirical evaluation of the Alda Score of Lithium response. This notebook also contains additional analysis of the A-score alone.

(NB)

Mathematica notebook containing the theoretical analyses of discrete vs. continuous mutual information in asymmetrically reliable data.

(NB)

Jupyter notebook containing the theoretical analyses of the statistical power of classical associative tests under asymmetrically reliable data.

(IPYNB)

PDF version of

(PDF)

PDF version of

(PDF)

The authors wish to acknowledge those members of the Consortium on Lithium Genetics (ConLiGen) who contributed ratings for the vignettes herein: Mirko Manchia, Raffaella Ardau, Jean-Michel Aubry, Lena Backlund, Claudio E.M. Banzato, Bernhard T. Baune, Frank Bellivier, Susanne Bengesser, Clara Brichant-Petitjean, Elise Bui, Cynthia V. Calkin, Andrew Tai Ann Cheng, Caterina Chillotti, Scott Clark, Piotr M. Czerski, Clarissa Dantas, Maria Del Zompo, J. Raymond DePaulo, Bruno Etain, Peter Falkai, Louise Frisén, Mark A. Frye, Jan Fullerton, Sébastien Gard, Julie Garnham, Fernando S. Goes, Paul Grof, Oliver Gruber, Ryota Hashimoto, Joanna Hauser, Rebecca Hoban, Stéphane Jamain, Jean-Pierre Kahn, Layla Kassem, Tadafumi Kato, John R. Kelsoe, Sarah Kittel-Schneider, Sebastian Kliwicki, Po-Hsiu Kuo, Ichiro Kusumi, Gonzalo Laje, Catharina Lavebratt, Marion Leboyer, Susan G. Leckband, Carlos A. López Jaramillo, Mario Maj, Alain Malafosse, Lina Martinsson, Takuya Masui, Philip B. Mitchell, Frank Mondimore, Palmiero Monteleone, Audrey Nallet, Maria Neuner, Tomás Novák, Claire O’Donovan, Urban Ösby, Norio Ozaki, Roy H. Perlis, Andrea Pfennig, James B. Potash, Daniela Reich-Erkelenz, Andreas Reif, Eva Reininghaus, Sara Richardson, Janusz K. Rybakowski, Martin Schalling, Peter R. Schofield, Oliver K. Schubert, Barbara Schweizer, Florian Seemüller, Maria Grigoroiu-Serbanescu, Giovanni Severino, Lisa R. Seymour, Claire Slaney, Jordan W. Smoller, Alessio Squassina, Thomas Stamm, Pavla Stopkova, Sarah K. Tighe, Alfonso Tortorella, Adam Wright, David Zilles, Michael Bauer, Marcella Rietschel, and Thomas G. Schulze.

