Conceived and designed the experiments: TAF. Performed the experiments: TAF SL. Analyzed the data: TAF SL. Contributed reagents/materials/analysis tools: TAF SL. Wrote the paper: TAF SL.

In the last three years TAF has received honoraria for speaking at CME meetings sponsored by Astellas, Dai-Nippon Sumitomo, Eli Lilly, GlaxoSmithKline, Janssen, Kyorin, MDS, Meiji, Otsuka, Pfizer, Shionogi and Yoshitomi. He is on the advisory board for Sekisui Chemicals and Takeda Science Foundation. He has received royalties from Igaku-Shoin, Seiwa-Shoten, Nihon Bunka Kagaku-sha and American Psychiatric Publishing. SL has received fees for consulting and/or lectures from Actelion, Bristol-Myers Squibb, Sanofi-Aventis, Eli Lilly, Essex Pharma, AstraZeneca, GlaxoSmithKline, Janssen/Johnson & Johnson, Lundbeck, and grant support from Eli Lilly. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

In the literature we find many indices of size of treatment effect (effect size: ES). The preferred index of treatment effect in evidence-based medicine is the number needed to treat (NNT), while the most common one in the medical literature is Cohen's d when the outcome is continuous. There is confusion about how to convert Cohen's d into NNT.

We conducted meta-analyses of individual patient data from 10 randomized controlled trials of second generation antipsychotics for schizophrenia (n = 4278) to produce Cohen's d and NNTs for various definitions of response, using cutoffs of 10% through 90% reduction on the symptom severity scale. These actual NNTs were compared with NNTs calculated from Cohen's d according to two proposed methods in the literature (Kraemer, et al.,

NNTs from Kraemer's method overlapped with the actual NNTs in 56% of the examined instances, while those based on Furukawa's method fell within the observed ranges of NNTs in 97%. For definitions of response corresponding with 10% through 70% symptom reduction, where the number of responders was not small, the degree of agreement for the former method was at chance level (ANOVA ICC of 0.12, p = 0.22), whereas that for the latter method was an ANOVA ICC of 0.86 (95%CI: 0.55 to 0.95, p<0.01).

Furukawa's method allows more accurate prediction of NNTs from Cohen's d. Kraemer's method gives the wrong impression that NNT is constant for a given d even when the event rate differs.

When a clinician and a patient jointly decide on a treatment, they need to know how much better the treatment in question is than an alternative treatment, and in what respect. Effect size (ES) is an index, preferably a single number, that expresses this "how much".

Clinical decision-making is facilitated by consideration of the difference in risk of important beneficial (e.g. remission of an episode) or adverse (e.g. suicide) events or the reciprocal of this risk difference, the number needed to treat (NNT)

When the outcome is continuous, however, the most common summary ES index in the medical literature is Cohen's d

Recently Kraemer and Kupfer

Furukawa's method and Kraemer's method to convert Cohen's d into NNT are therefore at odds with each other. This paper aims to empirically examine and compare these two approaches, based on the individual patient data of randomized controlled trials of second generation antipsychotics in the acute phase treatment of patients with schizophrenia.
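The two conversions being compared can be sketched in a few lines of Python. The formulae themselves are not reproduced in this section, so the forms below are assumptions: Kraemer's conversion is taken to be NNT = 1/(2Φ(d/√2) − 1) and Furukawa's to be NNT = 1/(Φ(d + Φ⁻¹(CER)) − CER), where Φ is the standard normal cumulative distribution function, CER is the control event rate, and a positive d favours the treatment on a favourable event. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution, used for Phi and its inverse.
_STD_NORMAL = NormalDist()


def nnt_kraemer(d):
    """Assumed form of Kraemer's conversion: depends on d only.

    NNT = 1 / (2 * Phi(d / sqrt(2)) - 1). Undefined for d == 0.
    """
    return 1.0 / (2.0 * _STD_NORMAL.cdf(d / sqrt(2.0)) - 1.0)


def nnt_furukawa(d, cer):
    """Assumed form of Furukawa's conversion: depends on d and the CER.

    The control event rate is mapped to a cutoff on the normal scale,
    the treated distribution is shifted by d, and the NNT is the inverse
    of the resulting difference in event rates. Requires 0 < cer < 1.
    """
    expected_eer = _STD_NORMAL.cdf(d + _STD_NORMAL.inv_cdf(cer))
    return 1.0 / (expected_eer - cer)


# d = 0.34 with CER = 0.42 (olanzapine vs placebo, 10% reduction cutoff)
print(round(nnt_kraemer(0.34), 1))         # approximately 5.3
print(round(nnt_furukawa(0.34, 0.42), 1))  # approximately 7.4
```

Under these assumed forms, the first function returns the same NNT for a given d whatever the event rate, while the second changes with the CER; both sample values agree with the converted NNTs reported for olanzapine vs placebo in the results.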

Individual patient data from 10 trials comparing olanzapine vs placebo (2 comparisons, baseline n = 502)

Study | Antipsychotic drugs and daily dosage (mg) | Sample size (n) | Mean BPRS at baseline |

Beasley et al 1996 | Olanzapine 10 / Placebo | 50 / 50 | 55.2 |

Beasley et al 1996 | Olanzapine 10–15 / Haloperidol 15 / Placebo | 133 / 69 / 68 | 59.9 |

Beasley et al 1997 | Olanzapine 10–15 / Haloperidol 15 | 175 / 81 | 59.1 |

Tollefson et al 1997 | Olanzapine 5–20 / Haloperidol 5–20 | 1337 / 659 | 51.5 |

Lieberman et al 2003 | Olanzapine 5–20 / Haloperidol 2–20 | 131 / 132 | 46.8 |

Keefe et al 2006 | Olanzapine 5–20 / Haloperidol 2–19 | 159 / 97 | 48.4 |

Möller et al 1997 | Amisulpride 600–800 / Haloperidol 15–20 | 95 / 96 | 61.7 |

Puech et al 1998 | Amisulpride 400–1200 / Haloperidol 16 | 194 / 64 | 61.3 |

Colonna et al 2000 | Amisulpride 200–800 / Haloperidol 5–20 | 368 / 118 | 56.2 |

Carrière et al 2000 | Amisulpride 400–1200 / Haloperidol 10–30 | 97 / 105 | 65.4 |

All studies were randomized and all but one

For fixed-dose studies, we selected only those arms with optimum doses of second-generation antipsychotic drugs as reported in dose-finding studies (amisulpride 400–800 mg/day, olanzapine 10–20 mg/day and risperidone 4–6 mg/day)

The mean BPRS total score of the included participants was 54.3 (SD = 10.8) at baseline. There were 2895 men and 1383 women. Their mean age was 36.6 (SD = 10.5) years, weight 75.5 (SD = 16.4) kg and height 171.6 (SD = 9.6) cm.

We first conducted meta-analyses of the BPRS or PANSS total score at 4 weeks for the three comparisons of olanzapine vs haloperidol, amisulpride vs haloperidol and olanzapine vs placebo, using Review Manager software by the Cochrane Collaboration

We next calculated the numbers of responders defined as 10% through 90% reduction on the BPRS or PANSS total score at 4 weeks. The percentage reduction was calculated according to the formulae B% = (B_{0} − B_{4LOCF}) × 100/(B_{0} − 18) for BPRS and P% = (P_{0} − P_{4LOCF}) × 100/(P_{0} − 30) for PANSS, where B_{0} and P_{0} are the BPRS and PANSS scores at baseline and B_{4LOCF} and P_{4LOCF} are the respective scores at 4 weeks with the last observation carried forward (LOCF); 18 and 30 are subtracted because they are the minimum scores for BPRS and PANSS, respectively, according to the original rating system. We then ran meta-analyses of response rates defined as 10% through 90% reduction for each comparison in terms of risk difference. The pooled NNT was obtained by taking the inverse of this pooled risk difference, because the response rates for a certain cutoff did not differ substantively among the trials included in the meta-analysis,
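The percentage-reduction formulae above can be illustrated with a minimal Python sketch. The function names and the example scores are hypothetical, and the sketch assumes that a participant counts as a responder when the percentage reduction is at least the cutoff; the text does not state whether the comparison is inclusive.

```python
# Minimum possible total scores under the original rating systems.
BPRS_MIN = 18
PANSS_MIN = 30


def percent_reduction(baseline, week4_locf, scale_min):
    """Percentage reduction from baseline, corrected for the scale minimum."""
    if baseline <= scale_min:
        raise ValueError("baseline must exceed the scale minimum")
    return (baseline - week4_locf) * 100.0 / (baseline - scale_min)


def is_responder(baseline, week4_locf, scale_min, cutoff_pct):
    """Assumption: response means a reduction of at least cutoff_pct."""
    return percent_reduction(baseline, week4_locf, scale_min) >= cutoff_pct


# Hypothetical participant: BPRS 54 at baseline, 36 at week 4 (LOCF).
reduction = percent_reduction(54, 36, BPRS_MIN)
print(reduction)                              # 50.0
print(is_responder(54, 36, BPRS_MIN, 50))     # True
print(is_responder(54, 36, BPRS_MIN, 60))     # False
```

Subtracting the scale minimum matters: without it, the same participant would show a reduction of only 33% (18/54), because a score of 18 on the BPRS already means no symptoms.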

These actual NNTs were then compared with NNTs converted from Cohen's d according to Kraemer's method and to Furukawa's method using the formulae discussed in the

No statistical heterogeneity was observed for any of the meta-analytic summaries.

Olanzapine vs placebo (BPRS), d = 0.34

Definition of response | CER | Actual NNT (95%CI) | Kraemer's method | Furukawa's method |

10% | 0.42 | 5.9 (3.4 to 20) | 5.3 | 7.4 |

20% | 0.35 | 7.1 (4.0 to 50) | 5.3 | 7.6 |

30% | 0.26 | 6.7 (4.0 to 25) | 5.3 | 8.3 |

40% | 0.21 | 9.1 (4.5 to 100) | 5.3 | 9.1 |

50% | 0.16 | 11.1 (5.3 to ∞) | 5.3 | 10.4 |

60% | 0.11 | 16.7 (−∞ to −50, 7.7 to ∞) | 5.3 | 12.9 |

70% | 0.06 | 25.0 (−∞ to −50, 9.1 to ∞) | 5.3 | 19.1 |

80% | 0.04 | −100 (−∞ to −17, 25 to ∞) | 5.3 | 25.5 |

90% | 0.01 | −100 (−∞ to −25, 50 to ∞) | 5.3 | 74.1 |

Olanzapine vs haloperidol (BPRS), d = 0.17

Definition of response | CER | Actual NNT (95%CI) | Kraemer's method | Furukawa's method |

10% | 0.64 | 12.5 (9.1 to 25) | 10.5 | 16.3 |

20% | 0.52 | 11.1 (7.7 to 16.7) | 10.5 | 14.9 |

30% | 0.40 | 11.1 (7.7 to 16.7) | 10.5 | 15.0 |

40% | 0.29 | 12.5 (9.1 to 25) | 10.5 | 16.5 |

50% | 0.19 | 14.3 (10 to 25) | 10.5 | 20.2 |

60% | 0.11 | 25.0 (14.3 to 50) | 10.5 | 28.3 |

70% | 0.06 | 33.3 (20 to 100) | 10.5 | 43.4 |

80% | 0.02 | 33.3 (25 to 100) | 10.5 | 102.0 |

90% | 0.005 | 100 (50 to ∞) | 10.5 | 326.0 |

Olanzapine vs haloperidol (PANSS), d = 0.17

Definition of response | CER | Actual NNT (95%CI) | Kraemer's method | Furukawa's method |

10% | 0.61 | 12.5 (8.3 to 25) | 10.5 | 15.8 |

20% | 0.47 | 11.1 (7.7 to 20) | 10.5 | 14.8 |

30% | 0.34 | 11.1 (7.7 to 20) | 10.5 | 15.6 |

40% | 0.23 | 14.3 (10 to 25) | 10.5 | 18.2 |

50% | 0.15 | 20.0 (14.3 to 50) | 10.5 | 23.2 |

60% | 0.09 | 33.3 (20 to 100) | 10.5 | 33.6 |

70% | 0.04 | 58.8 (28 to 200) | 10.5 | 62.4 |

80% | 0.01 | 47.6 (31 to 100) | 10.5 | 162.7 |

90% | 0.004 | 125 (71 to ∞) | 10.5 | 384.5 |

Amisulpride vs haloperidol (BPRS), d = 0.21

Definition of response | CER | Actual NNT (95%CI) | Kraemer's method | Furukawa's method |

10% | 0.78 | 16.7 (9.1 to 100) | 8.5 | 17.5 |

20% | 0.67 | 10.0 (6.3 to 20) | 8.5 | 13.9 |

30% | 0.57 | 9.1 (5.9 to 20) | 8.5 | 12.4 |

40% | 0.49 | 10.0 (5.9 to 25) | 8.5 | 12.0 |

50% | 0.38 | 7.7 (5.3 to 14.3) | 8.5 | 12.2 |

60% | 0.26 | 9.1 (5.9 to 20) | 8.5 | 13.8 |

70% | 0.17 | 14.3 (8.3 to 50) | 8.5 | 17.1 |

80% | 0.09 | 25.0 (12.5 to ∞) | 8.5 | 25.6 |

90% | 0.02 | 33.3 (20 to 100) | 8.5 | 79.3 |
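Several intervals in the tables above consist of two pieces, such as "16.7 (−∞ to −50, 7.7 to ∞)". A plausible reading, sketched below in Python, is that the NNT and its limits are the reciprocals of the pooled risk difference and its confidence limits, so that an interval for the risk difference that includes zero maps onto two disjoint NNT ranges. The function names are hypothetical, and the risk-difference values in the example are illustrative numbers chosen to reproduce that table entry, not figures taken from the meta-analysis.

```python
import math


def nnt_from_risk_difference(rd):
    """NNT as the inverse of the risk difference (infinite when rd is 0)."""
    return math.inf if rd == 0 else 1.0 / rd


def nnt_interval(rd_lower, rd_upper):
    """Invert a risk-difference interval into one or two NNT ranges.

    Returns a list of (low, high) tuples. Negative values correspond to
    the control arm doing better (a number needed to harm).
    """
    if rd_lower > rd_upper:
        raise ValueError("rd_lower must not exceed rd_upper")
    if rd_lower > 0 or rd_upper < 0:
        # Interval excludes zero: the limits simply swap on inversion.
        return [(1.0 / rd_upper, 1.0 / rd_lower)]
    # Interval includes zero: the NNT passes through infinity.
    pieces = []
    if rd_lower < 0:
        pieces.append((-math.inf, 1.0 / rd_lower))
    if rd_upper > 0:
        pieces.append((1.0 / rd_upper, math.inf))
    return pieces


# Illustrative values: risk difference 0.06 with limits -0.02 and 0.13.
print(round(nnt_from_risk_difference(0.06), 1))  # 16.7
print(nnt_interval(-0.02, 0.13))  # roughly [(-inf, -50), (7.7, inf)]
```

When the lower limit of the risk difference is exactly zero, only the upper piece remains, which corresponds to entries of the form "11.1 (5.3 to ∞)".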

The ANOVA ICC of absolute agreement between the actual NNTs and those estimated by Kraemer's method was 0.06 (95%CI: −0.34 to 0.43, p = 0.39) and that for Furukawa's method was 0.33 (95%CI: −0.01 to 0.62, p = 0.03). When the response is defined at thresholds as high as 80% or 90% reduction, the CER becomes extremely low and the NNT may be considered degenerate, with negative numbers and with 95% confidence intervals extending to infinity. We therefore calculated the ANOVA ICC for the range from 10% through 70% reduction, where we observed relatively constant odds ratios (OR) for these different definitions of response

Each meta-analysis comparing olanzapine vs placebo, olanzapine vs haloperidol, and amisulpride vs haloperidol produces a single Cohen's d. This single effect size was converted into NNTs according to Kraemer's method and Furukawa's method, and compared with the actual NNTs using various cutoffs to define response. NNTs from Kraemer's method overlapped with the observed NNTs in 56% of the examined instances, but the degree of agreement was at chance level (ANOVA ICC of 0.12, p = 0.22 at best). Those based on Furukawa's method fell within the observed plausible ranges of NNTs in 97% of the instances, and the degree of agreement was an ANOVA ICC of 0.86 (95%CI: 0.55 to 0.95, p<0.01) for definitions of response corresponding with 10% through 70% reduction on the rating scale, where we expect the number of responders not to be small.
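The agreement statistic can be illustrated with a short Python sketch. The text specifies only an "ANOVA ICC of absolute agreement", so the exact variant is an assumption here: the code uses the two-way, single-measure form often written ICC(A,1), computed from the row, column and residual mean squares. The function name is hypothetical. The example uses only the olanzapine vs placebo rows for the 10% through 70% cutoffs, so it will not reproduce the pooled values of 0.12 and 0.86 quoted above.

```python
def icc_absolute_agreement(x, y):
    """Two-way, single-measure ICC for absolute agreement of two columns.

    Assumed form: (MSR - MSE) / (MSR + (k - 1) * MSE + k / n * (MSC - MSE)),
    with n paired observations (rows) and k = 2 methods (columns).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 rows")
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    return (ms_rows - ms_error) / denominator


# Olanzapine vs placebo, response cutoffs 10% through 70%.
actual = [5.9, 7.1, 6.7, 9.1, 11.1, 16.7, 25.0]
kraemer = [5.3] * 7
furukawa = [7.4, 7.6, 8.3, 9.1, 10.4, 12.9, 19.1]

print(icc_absolute_agreement(actual, kraemer))
print(icc_absolute_agreement(actual, furukawa))
```

Under this assumed form, a prediction that is the same for every cutoff yields an ICC of essentially zero within a single comparison, because the row and residual mean squares coincide, whereas predictions that track the actual NNTs give a clearly positive value.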

The reason for this difference in performance is that the latter method takes into account the fact that, for a given d on a continuous outcome measure, the response rate can vary depending on the cutoff one adopts to define response. This individualized consideration in assessing clinical importance of Cohen's d is extremely important. For example, d of olanzapine over haloperidol in the acute phase treatment of schizophrenia is approximately 0.17. On the other hand, olanzapine causes more significant weight gain than haloperidol, with an NNH estimated to be around 6 (95%CI: 4–11)

Converting Cohen's d into NNT is also very important when we argue at the population level. For example, Cohen's d of 0.2 is usually regarded as a small effect

One possible drawback of Furukawa's method is that it requires estimation of the control event rate in order to predict NNTs accurately. However, we argue that this is more of a strength than a weakness of the method, because this is what EBM practitioners normally do when they apply group-level evidence to individuals

Conversely, one can argue that Kraemer's method turned out to be less accurate because it subtly redefines NNT for a continuous outcome as the inverse of the difference between the probability that a patient in the treatment group has an outcome preferable to that of a patient in the control group and the probability that a patient in the control group has an outcome preferable to that of a patient in the treatment group. This definition is slightly different from the conventional definition of NNT in EBM

The interpretation of a quantified effect size is inherently difficult and variable