
The authors have declared that no competing interests exist.

The Bland-Altman limits of agreement method is widely used to assess how well the measurements produced by two raters, devices or systems agree with each other. However, mixed effects versions of the method which take into account multiple sources of variability are less well described in the literature. We address the practical challenges of applying mixed effects limits of agreement to the comparison of several devices to measure respiratory rate in patients with chronic obstructive pulmonary disease (COPD).

Respiratory rate was measured in 21 people with a range of severity of COPD. Participants were asked to perform eleven different activities representative of daily life during a laboratory-based standardised protocol of 57 minutes. A mixed effects limits of agreement method was used to assess the agreement of five commercially available monitors (Camera, Photoplethysmography (PPG), Impedance, Accelerometer, and Chest-band) with the current gold standard device for measuring respiratory rate.

Results produced using mixed effects limits of agreement were compared to results from a fixed effects method based on analysis of variance (ANOVA) and were found to be similar. The Accelerometer and Chest-band devices produced the narrowest limits of agreement (-8.63 to 4.27 and -9.99 to 6.80 respectively) with mean bias -2.18 and -1.60 breaths per minute. These devices also had the lowest within-participant and overall standard deviations (3.23 and 3.29 for Accelerometer and 4.17 and 4.28 for Chest-band respectively).

The mixed effects limits of agreement analysis enabled us to answer the question of which devices showed the strongest agreement with the gold standard device with respect to measuring respiratory rates. In particular, the estimated within-participant and overall standard deviations of the differences, which are easily obtainable from the mixed effects model results, gave a clear indication that the Accelerometer and Chest-band devices performed best.

The Bland-Altman method of limits of agreement is a well-established method of analysing continuous data to assess how well the measurements produced by two raters or devices agree with each other to the extent that they could be used interchangeably without causing any practical problems.[

Approximately 328 million people are estimated to be living with COPD worldwide,[

The COPD Respiratory Rate study was approved by the South East Scotland Research Ethics Committee (references: 13/SS/0114, 13/SS/0206 and 14/SS/0043). Participants gave written informed consent to take part in the study. Respiratory rate was measured in 21 people living in Scotland with a range of severity of COPD using five different monitors, described as Camera, Photoplethysmography (PPG), Impedance, Accelerometer, and Chest-band according to their mode of action. The measurements were made simultaneously, with participants wearing all five monitors at the same time in addition to the current gold standard device. Direct comparisons between devices were therefore valid, because every device was measured at exactly the same time against the same gold standard device.

Participants were asked to perform eleven different activities, chosen to be representative of everyday tasks, during a laboratory-based standardised protocol of 57 minutes. These were sitting, lying, standing, slow walking, fast walking, sweeping, lifting objects, standing and walking, climbing stairs, treadmill (flat walking), and treadmill (4% slope). The activities were designed to test the devices across the full range of plausible measurements. Not everyone performed exactly the same number of activities because some tasks (e.g. the treadmill task) were too difficult for some participants. The number of valid observations per device also varied, because some devices experienced more technical problems than others and some were better at capturing respiratory rates during the standardised protocol. This is therefore an example of a completely unbalanced study design. Furthermore, activity was a potential source of variability in addition to participants and devices.
This study design necessitated the use of an advanced form of limits of agreement analysis that takes into account multiple sources of variability and a completely unbalanced study design.

The procedure to calculate limits of agreement involves first calculating the mean and standard deviation of the paired differences (e.g. differences in respiratory rate measured at the same time in the same participant using two different devices). The standard deviation is then multiplied by the 97.5% quantile of the standard normal distribution (1.96, usually rounded to 2), and this quantity is added to and subtracted from the mean to give the upper and lower limits respectively. If
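The basic calculation for independent observations can be sketched as follows. This is an illustration only (the study used R, and the values below are hypothetical, not taken from the study data):

```python
from statistics import mean, stdev

# Hypothetical paired differences (device minus gold standard),
# in breaths per minute -- illustrative values only.
diffs = [-1.2, 0.5, -2.3, 1.1, -0.4, -1.8, 0.9, -0.7, -1.5, 0.2]

bias = mean(diffs)            # mean bias
sd = stdev(diffs)             # sample standard deviation of the differences
lower = bias - 1.96 * sd      # lower 95% limit of agreement
upper = bias + 1.96 * sd      # upper 95% limit of agreement
print(f"bias={bias:.2f}, 95% LoA=({lower:.2f}, {upper:.2f})")
```

The same mean-plus-or-minus-1.96-standard-deviations construction carries over to the mixed effects setting, with the standard deviation replaced by one derived from the model's variance components.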

Limits of agreement methodology can be applied regardless of whether one of the devices or raters is a gold-standard. However, a necessary assumption of the method is that the observations are independent. When this assumption is violated, for example when we have multiple values recorded per individual, then it is necessary to use a repeated measures version of the method [

The 2007 article by Bland & Altman offers a step-by-step guide for applying the methodology in the case of multiple values per individual.[

To calculate the mixed effects limits of agreement, we analysed the paired differences of each device compared with the gold-standard using a mixed effects regression model, including participant as a random effect and activity as a fixed effect, using the nlme package [

The model was of the form

d_{ij} = μ + b_{i} + β_{k} + ε_{ij}

where d_{ij} represents the j-th paired difference for the i-th participant, μ is the mean bias, b_{i} is the random effect of the i-th participant, β_{k} is the fixed effect of the k-th activity (the activity during which the difference was recorded), and ε_{ij} is the error for paired difference d_{ij}.
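Once the variance components have been estimated, the limits follow from the combined standard deviation sqrt(σ_b² + σ_ε²), where σ_b is the between-participant and σ_ε the within-participant (residual) standard deviation. The study fitted the model in R with nlme; the arithmetic itself can be sketched in Python using the Accelerometer figures reported in the results tables below (mean bias -2.18, within-participant SD 3.23, combined SD 3.29):

```python
import math

# Figures taken from the results tables (Accelerometer comparison)
within_sd = 3.23     # residual (within-participant) SD
combined_sd = 3.29   # sqrt(between-participant variance + within-participant variance)
bias = -2.18         # mean bias, breaths per minute

# Between-participant SD implied by the two reported SDs
between_sd = math.sqrt(combined_sd**2 - within_sd**2)

# Mixed effects 95% limits of agreement use the combined SD
lower = bias - 1.96 * combined_sd
upper = bias + 1.96 * combined_sd
print(f"between-participant SD = {between_sd:.2f}")
print(f"95% LoA = ({lower:.2f}, {upper:.2f})")  # reproduces the reported -8.63 to 4.27
```

Note how small the between-participant component is here relative to the residual component, which is why the within-participant and combined SDs are so close for this device.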

However, to generate an appropriately weighted estimate of the mean bias we fitted a separate regression model including only a constant term and a random effect for participant (i.e. without adjusting for activity). Using the same notation as above, this random effects model was of the form

d_{ij} = μ + b_{i} + ε_{ij}

In calculating the mixed effects limits of agreement we assumed that participants were a representative random sample from the overall population of COPD patients, and that the random participant effects b_{i} were normally distributed. We also assumed a constant mean bias μ and errors ε_{ij} that were independent and normally distributed with a constant variance. It is important to check these assumptions since any violation of the assumptions may lead to biased variance components. Bland-Altman plots provide a quick visual check of these assumptions.[

Further checks of the model assumptions were conducted using (i) plots of the standardized residuals against fitted values, (ii) Q-Q plots of the residuals, and (iii) Q-Q plots of the random effect predictions. There was some evidence of violation of assumptions for all devices except Impedance. However, after removing clear outliers in the model residuals, model assumptions were found to be valid for all device comparisons.

Ninety-five per cent confidence intervals were constructed around the standard deviations using a parametric bootstrap-t (studentized pivotal) procedure [
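The bootstrap-t (studentized pivotal) idea can be sketched in a simplified, single-level form. This is an illustration with synthetic data, not the study's nested procedure: it substitutes a normal-theory standard error for the inner bootstrap, and the sample is drawn from a single normal distribution rather than the fitted mixed model:

```python
import random
import statistics as st

random.seed(42)

# Synthetic "observed" differences: normal with true SD 4 (illustrative only)
n = 100
data = [random.gauss(0, 4) for _ in range(n)]
sd_hat = st.stdev(data)

B = 1999  # number of bootstrap resamples, as used in the article
t_stats = []
for _ in range(B):
    # Parametric resample: draw from the fitted normal model
    boot = [random.gauss(0, sd_hat) for _ in range(n)]
    sd_b = st.stdev(boot)
    # Normal-theory approximation to the SE of an SD estimate,
    # used here in place of a nested (inner) bootstrap
    se_b = sd_b / (2 * (n - 1)) ** 0.5
    t_stats.append((sd_b - sd_hat) / se_b)

# Studentized pivotal interval: invert the bootstrap t-distribution
t_stats.sort()
t_lo, t_hi = t_stats[int(0.025 * B)], t_stats[int(0.975 * B)]
se_hat = sd_hat / (2 * (n - 1)) ** 0.5
ci = (sd_hat - t_hi * se_hat, sd_hat - t_lo * se_hat)
print(f"SD = {sd_hat:.2f}, 95% bootstrap-t CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

In the study itself, each resample would be generated from the fitted mixed effects model (redrawing both the participant effects and the residuals) so that the interval reflects both levels of variability.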

For comparative purposes, the resulting limits-of-agreement were compared to those obtained from a fixed effects approach based on an extension of the ANOVA method presented in Bland & Altman[

A total of 21 participants were recruited, but one participant recorded no observations on two devices (see

Mean bias and limits of agreement are shown by the dashed lines, while confidence intervals are shown by the dotted lines. (A) Camera: rate per second. (B) Camera: rate per minute. (C) PPG: raw. (D) PPG: median filtered.

Mean bias and limits of agreement are shown by the dashed lines, while confidence intervals are shown by the dotted lines. (A) Impedance. (B) Accelerometer. (C) Chest-band.

Device | No. of participants (total valid differences/total after removing outliers) | Mean bias (fixed effects 95% LoA) | Mean bias (mixed effects 95% LoA) | Mean bias (revised mixed effects 95% LoA)*
---|---|---|---|---
Camera (rate per second) | 21 (192/188) | -3.32 (-13.35 to 6.72) | -3.21 (-12.71 to 6.30) | -3.15 (-11.54 to 5.24)
Camera (rate per minute) | 21 (192/188) | -4.43 (-15.54 to 6.69) | -4.35 (-14.98 to 6.28) | -4.43 (-13.13 to 4.26)
PPG (raw) | 21 (378/377) | 3.53 (-10.30 to 17.35) | 3.53 (-10.30 to 17.35) | 3.46 (-10.06 to 16.98)
PPG (median filtered) | 21 (378/376) | 3.01 (-11.16 to 17.17) | 3.01 (-11.17 to 17.19) | 3.02 (-10.53 to 16.57)
Impedance | 20 (304/304) | -1.17 (-20.07 to 17.73) | -1.18 (-20.07 to 17.72) | -1.18 (-20.07 to 17.72)
Accelerometer | 20 (284/282) | -2.18 (-8.74 to 4.38) | -2.18 (-8.63 to 4.27) | -2.14 (-7.91 to 3.63)
Chest-band | 21 (385/384) | -1.61 (-9.99 to 6.78) | -1.60 (-9.99 to 6.80) | -1.67 (-9.64 to 6.30)

Mean bias = Average difference; LoA = Limits of Agreement.

*Outliers removed and emphasis is placed on model fitting rather than consistency in methods across devices.

Device | Within-participant SD (original model) | Combined SD (original model) | Maximum negative difference (original model) | Maximum positive difference (original model) | Within-participant SD (revised model*) | Combined SD (revised model*) | Maximum negative difference (revised model*) | Maximum positive difference (revised model*)
---|---|---|---|---|---|---|---|---
Camera (rate per second) | 4.25 (3.87 to 4.79) | 4.85 (4.44 to 5.76) | 35.00 | 14.71 | 3.92 (3.58 to 4.44) | 4.28 (3.93 to 4.94) | 35.00 | 9.65
Camera (rate per minute) | 5.18 (4.74 to 5.86) | 5.42 (4.97 to 6.15) | 35.00 | 10.63 | 4.31 (3.95 to 4.89) | 4.44 (4.07 to 4.97) | 35.00 | 10.63
PPG (raw) | 6.02 (5.64 to 6.53) | 7.05 (6.61 to 8.60) | 18.66 | 30.19 | 5.85 (5.46 to 6.37) | 6.90 (6.45 to 8.26) | 18.66 | 23.44
PPG (median filtered) | 6.20 (5.81 to 6.77) | 7.23 (6.76 to 8.63) | 25.00 | 32.39 | 5.84 (5.46 to 6.36) | 6.91 (6.44 to 8.29) | 22.00 | 23.65
Impedance | 8.94 (8.29 to 9.84) | 9.64 (8.97 to 10.91) | 32.00 | 34.00 | 8.94 (8.29 to 9.84) | 9.64 (8.97 to 10.91) | 32.00 | 34.00
Accelerometer | 3.23 (3.00 to 3.59) | 3.29 (3.05 to 3.62) | 24.84 | 12.99 | 2.85 (2.64 to 3.15) | 2.94 (2.73 to 3.25) | 14.82 | 12.99
Chest-band | 4.17 (3.91 to 4.53) | 4.28 (4.02 to 4.65) | 24.65 | 27.80 | 3.98 (3.73 to 4.33) | 4.07 (3.81 to 4.42) | 24.65 | 20.76

*Outliers removed and emphasis is given to model fitting rather than consistency in methods across devices.

Device | Mean bias | 95% repeated differences LoA | 95% bootstrap CI of lower limit | 95% bootstrap CI of upper limit
---|---|---|---|---
Camera (rate per second) | -3.21 | -12.71 to 6.30 | -14.84 to -11.42 | 5.00 to 8.39
Camera (rate per minute) | -4.35 | -14.98 to 6.28 | -16.70 to -13.65 | 4.96 to 8.00
PPG (raw) | 3.53 | -10.30 to 17.35 | -14.23 to -8.58 | 15.58 to 21.16
PPG (median filtered) | 3.01 | -11.17 to 17.19 | -15.10 to -9.39 | 15.35 to 20.79
Impedance | -1.18 | -20.07 to 17.72 | -23.84 to -17.81 | 15.49 to 21.65
Accelerometer | -2.18 | -8.63 to 4.27 | -9.45 to -7.96 | 3.62 to 5.21
Chest-band | -1.60 | -9.99 to 6.80 | -11.04 to -9.19 | 6.05 to 7.86

95% confidence intervals were calculated using a parametric bootstrap-t method based on 1999 resamples.

*No outliers

Two devices (Accelerometer and Chest-band) were regarded as having “acceptable” agreement with the gold standard device because their corresponding limits of agreement were within +/- 10 breaths per minute, although ideally, for a high level of agreement, we had hoped the limits would fall within +/- 5 breaths per minute. In any case, these two devices showed the narrowest limits of agreement, and this conclusion was robust to the inclusion or exclusion of outliers. This allowed us to select these two devices for further assessment, which involved testing the acceptability and reliability of the devices in a home setting (see Rubio

The Camera and Impedance devices returned 11 and 37 zero observations respectively. It could be argued that since a respiratory rate of zero is impossible, zero values should not be included in the analysis. However, these were real values returned by the devices, and we included them in order to take a conservative “intention-to-treat” approach consistent with the use of these devices in real life. As a sensitivity analysis, we re-ran the mixed effects analysis on all the data with zeros removed; the results were very similar to before and the conclusions unchanged (see

This article shows how mixed-effects limits of agreement analysis can be applied relatively easily to the comparison of different devices even when there may be multiple or complex sources of variation in the study design. We compared these limits to a fixed effects approach based on Bland & Altman’s true value varies method and the results were similar. Advantages of the mixed effects approach include the potential for stronger inference and greater generalisability of the results to the target population.[

For the mixed effects analysis we directly used the paired differences comparing the devices with the gold standard rather than the raw responses recorded on each device. Carstensen and colleagues [

Myles and Cui [ describe a related repeated measures approach, assuming random participant effects b_{i} with constant variance

Unlike in standard prediction intervals, the standard error of the mean bias is not included in the calculation of the limits of agreement. Instead, confidence intervals around the mean bias separately quantify the uncertainty in this estimate just as they do around each of the limits of agreement. In fact, there is no requirement for the mean to be derived from exactly the same model as is used to compute the limits of agreement. Olofsen and colleagues suggest that either the raw (or grand) mean or the mean of the participant-level means could be computed.[

On the other hand, if the number of multiple respiratory rate readings within each participant and activity was the same (i.e. if the problem was completely balanced), then it would not have mattered how we calculated the mean bias. This is because any fixed or random effects estimate of the mean would be equivalent to the raw mean due to equal weighting of observations in this context.[
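This equivalence in the balanced case is easy to verify numerically: with equal numbers of observations per participant, the grand mean and the mean of the participant-level means coincide, whereas in an unbalanced design they generally differ. A small illustration (hypothetical numbers, not study data):

```python
from statistics import mean

# Balanced: every participant contributes the same number of differences
balanced = {"p1": [1.0, 2.0, 3.0], "p2": [4.0, 5.0, 6.0]}
grand = mean([d for ds in balanced.values() for d in ds])
of_means = mean(mean(ds) for ds in balanced.values())
assert grand == of_means  # both equal 3.5: weighting does not matter

# Unbalanced: unequal numbers of differences per participant
unbalanced = {"p1": [1.0, 2.0, 3.0], "p2": [6.0]}
grand_u = mean([d for ds in unbalanced.values() for d in ds])
of_means_u = mean(mean(ds) for ds in unbalanced.values())
print(grand_u, of_means_u)  # differ, so the choice of weighting matters
```

In the unbalanced case the grand mean weights participants by their number of observations, while the mean of participant means weights participants equally; the random effects estimate of the mean bias sits between these two extremes.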

Hofman and colleagues[

In limits of agreement analysis we must assume a constant level of agreement across the range of measurement. In some of the Bland-Altman plots shown in Figs
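One simple numerical companion to the visual check is to regress the differences on the pairwise means; a slope clearly different from zero suggests proportional bias, i.e. agreement that varies with the magnitude of measurement. A sketch with hypothetical data (not study values):

```python
from statistics import mean

# Hypothetical (pairwise mean, difference) data showing a trend
means = [10, 15, 20, 25, 30, 35]
diffs = [-1.0, -0.5, 0.2, 0.8, 1.5, 2.1]

# Least-squares slope of differences on means
mx, md = mean(means), mean(diffs)
slope = (sum((x - mx) * (d - md) for x, d in zip(means, diffs))
         / sum((x - mx) ** 2 for x in means))
print(round(slope, 3))  # a clearly positive slope indicates proportional bias
```

If such a trend is present, one remedy is to model the differences as a function of the mean (or to log-transform the measurements) before constructing the limits.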

The reasons behind outliers in the model residuals, or zero values returned by the devices, were unknown, but we believe they may have been due to technical issues with some of the devices or problems with device fitting. This is why it was appropriate to present results both with and without outliers; excluding all outliers and zero values could give a falsely favourable impression of the agreement for some devices. Only a few of the outliers in the model residuals can be attributed to zero values produced by the devices; most of the others may have been caused by inaccurate readings that are difficult to detect from simply looking at the raw data.

In our mixed effects model we considered activities as a fixed effect. We could instead have treated the activities as random effects, which would have given us an immediate estimate of the variability between activities. However, the fixed effects assumption made more sense in this context than assuming the activities were a random sample from a larger population of activities.

The MOVER method developed by Zou [

To encourage future applications of the mixed effects limits of agreement method, the full R code we used is provided in the

In this article, we showed how the mixed effects limits of agreement method was ideally suited to answer the question of which device had the strongest agreement with the gold standard with respect to measuring respiratory rates in COPD patients. The superiority of the limits of agreement method over alternatives such as calculating correlation coefficients has been discussed elsewhere.[


We wish to thank the anonymous reviewer for their very helpful comments which have greatly improved our paper.