
The authors have declared that no competing interests exist.

Conceived and designed the experiments: SV RP AEP. Performed the experiments: SV. Analyzed the data: SV RP PA. Contributed reagents/materials/analysis tools: SV AEP. Wrote the manuscript: SV RP TF AEP PA. Critically reviewed submission: TF RP.

The use of standardised tools is an essential component of evidence-based practice. Reliance on standardised tools places demands on clinicians to understand their properties, strengths, and weaknesses in order to interpret results and make clinical decisions. This paper makes a case for clinicians to consider measurement error (ME) indices, the Coefficient of Repeatability (CR) or the Smallest Real Difference (SRD), over relative reliability coefficients such as Pearson’s (r) and the Intraclass Correlation Coefficient (ICC) when selecting tools to measure change and when inferring that an observed change is true. The authors present statistical methods that are part of the current approach to evaluating the test–retest reliability of assessment tools and outcome measurements. Selected examples from a previous test–retest study are used to elucidate the added advantage that knowledge of the ME of an assessment tool offers in clinical decision making. The CR is computed in the same units as the assessment tool and sets the boundary of the minimal detectable true change that the tool can measure.

Reliability refers to the reproducibility of measurements [

Perfect test–retest reliability scores are rare, as all instruments measure with some error. Thus, any observed score (O) can be assumed to comprise a true score (T) and an error component (E) [O = T ± E] [

Test–retest reliability is concerned with the repeatability of observations made on or by individuals. It is assumed that O is an accurate measurement of T. When a standardised tool is used to measure an outcome, clinicians rely on the tool’s published test–retest reliability coefficient to guide their confidence in the results.

Test–retest reliability can be estimated using relative and absolute indices [

The main drawback of Pearson’s (r) value is that it does not provide clinicians with any insight into systematic errors that may be inherent in the measurement obtained with a specific assessment tool. For example, as shown in the hypothetical data set in

Although it is still not a measure of absolute agreement, the Intraclass Correlation Coefficient (ICC) is often reported in place of Pearson’s (r).

Unlike Pearson’s (r), the ICC is sensitive to systematic differences between the two administrations.

Absolute reliability is concerned with variability due to random error [

The ‘coefficient of repeatability’ (CR), also referred to as the Smallest Real Difference (SRD), is a useful index that quantifies absolute reliability (ME) in the same units as the measurement tool [


The CR is the value below which the absolute difference between two measurements would lie with 0.95 probability [. It can be calculated by multiplying the within-subject standard deviation (S_{w}) or the Standard Error of Measurement (SEM) by 2.77 (√2 × 1.96). Thus, CR = 2.77S_{w} [
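As a minimal sketch of this calculation (an illustration using made-up scores, not the authors’ analysis code), the CR can be estimated from paired test–retest data:

```python
import math

def coefficient_of_repeatability(test, retest):
    """CR = 2.77 * S_w, where S_w is the within-subject standard deviation.

    With two administrations, S_w (equivalently the SEM) can be estimated
    from the paired differences: Var(d) = 2 * S_w**2, so S_w = SD(d)/sqrt(2).
    """
    n = len(test)
    diffs = [t - r for t, r in zip(test, retest)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    s_w = math.sqrt(var_d / 2)   # within-subject SD (= SEM)
    return 2.77 * s_w            # 2.77 = sqrt(2) * 1.96

# Hypothetical scores for six participants on a 0-20 subscale
test = [12, 15, 14, 10, 18, 13]
retest = [13, 14, 15, 11, 17, 14]
print(round(coefficient_of_repeatability(test, retest), 2))  # → 2.02
```

For these hypothetical scores the CR is about 2 scale units, meaning a retest difference smaller than ±2 units cannot be distinguished from measurement error.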

To further test the case for the CR over Pearson’s (r)

A ‘4-week’ test–retest design was used. The secondary level student version of the SSRS-SSF was administered to 187 year 7 students (Mean age = 12 years 3 months,

This study is based on secondary data analysis of a prior submission entitled Internal consistency, test–retest reliability and measurement error of the self-report version of the Social Skills Rating System in a sample of Australian adolescents [

Analyses were undertaken using the SPSS version 17 and SAS version 9.2 software packages. Test–retest reliability estimates, such as Pearson’s correlation coefficient (r), the ICC_{(2,1)}, and the CR, were computed using standard formulae [
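These standard formulae can be sketched from scratch as follows (an illustration, not the SPSS/SAS syntax the authors used); the ICC_{(2,1)} here is the two-way random effects, absolute agreement, single-measure form computed from the ANOVA mean squares:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def icc_2_1(test, retest):
    """ICC(2,1): two-way random effects, absolute agreement, single measure."""
    n, k = len(test), 2
    grand = (sum(test) + sum(retest)) / (n * k)
    subj_means = [(t + r) / k for t, r in zip(test, retest)]
    occ_means = [sum(test) / n, sum(retest) / n]
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_occ = n * sum((m - grand) ** 2 for m in occ_means)
    ss_total = sum((s - grand) ** 2 for s in test + retest)
    ss_err = ss_total - ss_subj - ss_occ
    ms_subj = ss_subj / (n - 1)            # between-subjects mean square
    ms_occ = ss_occ / (k - 1)              # between-occasions mean square
    ms_err = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (ms_subj - ms_err) / (
        ms_subj + (k - 1) * ms_err + k * (ms_occ - ms_err) / n
    )

# A uniform one-unit shift on retest: r ignores it, ICC(2,1) does not
test, retest = [1, 2, 3, 4, 5], [2, 3, 4, 5, 6]
print(pearson_r(test, retest))   # ≈ 1.0
print(icc_2_1(test, retest))     # ≈ 0.83
```

The constructed one-unit shift illustrates the distinction drawn earlier: Pearson’s r remains 1.0, while the ICC_{(2,1)} falls to about 0.83 because its absolute agreement definition penalises the systematic offset.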

For purposes of illustration, retest estimates from the empathy subscale for girls and assertion subscale for boys are discussed. These subscales were chosen for purposes of graphical emphasis, as participants’ mean scores differed significantly across administrations.

As shown in


| Subscale | Sex | n | M_{1} | SD_{1} | M_{2} | SD_{2} | r | r² | ICC_{(2,1)} | M_diff | MS between subjects | t | p | 95% LOA LB (95% CI) | 95% LOA UB (95% CI) | MS within subjects | SEM | CR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Assertion | M | 84 | 13.24 | 3.11 | 13.90 | 3.10 | 0.89 | 0.78 | 0.77 | 0.66 | 17.20 | 2.90 | 0.005 | -3.4 (-4.1 to -2.6) | 4.7 (3.9 to 5.5) | 2.30 | ±1.52 | ±4.21 |
| Assertion | F | 74 | 12.86 | 3.07 | 13.27 | 3.07 | 0.84 | 0.72 | 0.72 | 0.40 | 16.22 | 1.52 | 0.13 | -4.1 (-5.0 to -3.2) | 4.9 (4.0 to 5.8) | 2.66 | ±1.63 | ±4.52 |
| Empathy | M | 98 | 14.44 | 2.95 | 13.95 | 3.06 | 0.78 | 0.62 | 0.62 | -0.49 | 14.64 | -1.86 | 0.06 | -5.6 (-6.5 to -4.7) | 4.6 (3.7 to 5.5) | 3.49 | ±1.87 | ±5.18 |
| Empathy | F | 92 | 16.66 | 1.93 | 16.27 | 2.04 | 0.71 | 0.54 | 0.53 | -0.38 | 6.07 | -1.89 | 0.06 | -4.1 (-4.8 to -3.4) | 3.4 (2.7 to 4.1) | 1.89 | ±1.37 | ±3.81 |

ICC_{(2,1)} = Intraclass correlation coefficient: two-way random effects model (absolute agreement definition)

95% LOA LB (95% CI of the LOA) = Bland and Altman 95% Limits of agreement Lower Boundary (95% Confidence intervals of the limits of agreement)

95% LOA UB (95% CI of the LOA) = Bland and Altman 95% Limits of agreement Upper Boundary (95% Confidence intervals of the limits of agreement)

CR = 2.77 × SEM [

As shown in

As presented in

Staying with the above example, if one looks closely at the SDs of the empathy and assertion subscales, girls’ empathy scores were less spread around the mean (M_{1}= 16.66, SD_{1}= 1.93; M_{2}= 16.27, SD_{2}= 2.04) than boys’ assertion scores (M_{1}= 13.24, SD_{1}= 3.11; M_{2}= 13.90, SD_{2}= 3.10). As a group, girls scored more homogeneously on empathy than boys did on assertion behaviours (more heterogeneous). The wider spread of boys’ scores on the assertion subscale resulted in a greater magnitude of Pearson’s (r) and the ICC [
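This dependence on between-subject spread can be demonstrated with constructed (hypothetical) data: applying an identical error pattern to a heterogeneous and a homogeneous set of “true” scores changes the correlation substantially, while the CR, which depends only on the paired differences, is unchanged:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cr(test, retest):
    """CR = 2.77 * S_w, with S_w estimated from the paired differences."""
    n = len(test)
    d = [t - r for t, r in zip(test, retest)]
    md = sum(d) / n
    var_d = sum((x - md) ** 2 for x in d) / (n - 1)
    return 2.77 * math.sqrt(var_d / 2)

# Identical error patterns applied to both groups
e1 = [0.4, -0.6, 0.1, 0.8, -0.3, 0.5, -0.7, 0.2]
e2 = [-0.2, 0.3, -0.5, 0.1, 0.6, -0.4, 0.2, -0.1]
wide = [2, 4, 6, 8, 10, 12, 14, 16]            # heterogeneous group
narrow = [8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5]  # homogeneous group

def pair(true_scores):
    """Simulate a test and retest administration with the fixed errors."""
    test = [t + a for t, a in zip(true_scores, e1)]
    retest = [t + b for t, b in zip(true_scores, e2)]
    return test, retest

r_wide, cr_wide = pearson_r(*pair(wide)), cr(*pair(wide))
r_narrow, cr_narrow = pearson_r(*pair(narrow)), cr(*pair(narrow))
# r_wide > r_narrow, yet cr_wide == cr_narrow: the error is identical
```

Here r drops from roughly 0.98 to 0.78 purely because the second group is more homogeneous; the measurement error, and hence the CR, is exactly the same in both groups.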

We computed the coefficient of repeatability (CR), or the Smallest Real Difference (SRD), to index the measurement error: the smallest possible change in subscale and total social skills scale scores that represents true/real change [

The statistically significant difference in mean score as a marker of change (as measured by the

To date, there exists no consensus on what the acceptable value of a correlation coefficient ought to be to inform tool selection [

Unlike relative reliability indices, to date there is no formulaic approach to benchmarking ME. This means that there exists no statistical method to decide whether an ME of ±4.21 relative to the range of scores on the assertion subscale (0–20 units) is large or small. Thus, although ME sets the boundary of the minimal detectable true change of an outcome measure, it holds limited clinical importance beyond that function.

ME helps clinicians decide on a best practice level whether the observed change in a client’s performance is true [
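The resulting decision rule is simple. As a hypothetical illustration using the CR of ±4.21 reported for the boys’ assertion subscale above:

```python
def is_true_change(baseline, follow_up, cr):
    """An observed change exceeds measurement error only if |change| > CR."""
    return abs(follow_up - baseline) > cr

CR_ASSERTION_BOYS = 4.21  # CR for the assertion subscale (boys), from the table

print(is_true_change(12, 15, CR_ASSERTION_BOYS))  # False: within error bounds
print(is_true_change(12, 17, CR_ASSERTION_BOYS))  # True: exceeds the CR
```

A 3-point gain falls within the bounds of measurement error and cannot be interpreted as real improvement, whereas a 5-point gain exceeds the CR and can.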

It is vital that the ME of an outcome tool be corroborated against its minimal clinically important difference (MCID) before clinicians decide to use the tool to measure change in an intervention study. For example, Schuling et al. [