Conceived and designed the experiments: RZ AB RI NAI. Performed the experiments: RZ AB RI NAI. Analyzed the data: RZ AB RI NAI. Contributed reagents/materials/analysis tools: RZ AB RI NAI. Wrote the paper: RZ AB RI NAI.
The authors have declared that no competing interests exist.
Accurate measurements are essential in medicine. An important criterion of the quality of a medical instrument is its agreement with a gold standard. Various statistical methods have been used to test for agreement, and some of these methods have been shown to be inappropriate, which can result in misleading conclusions about the validity of an instrument. The Bland-Altman method is the most popular method, judging by the many citations of the article proposing it. However, the number of citations does not necessarily mean that this method has been applied in agreement research, and no previous study has looked into this. This is the first systematic review to identify the statistical methods used to test for agreement of medical instruments. The proportion of the various statistical methods found in this review will also reflect the proportion of medical instruments that have been validated using those particular methods in current clinical practice.
Five electronic databases were searched for agreement studies published between 2007 and 2009. A total of 3,260 titles were initially identified. Only 412 titles were potentially related, and 210 finally fitted the inclusion criteria. The Bland-Altman method was the most popular method, with 178 (85%) studies having used it, followed by the correlation coefficient (28%) and comparison of means (18%). Some of the inappropriate methods highlighted by Altman and Bland since the 1980s are still in use.
This study finds that the Bland-Altman method is the most popular method used in agreement research. There are still inappropriate applications of statistical methods in some studies. It is important for clinicians and medical researchers to be aware of this issue, because misleading conclusions from inappropriate analyses will jeopardize the quality of the evidence, which in turn will influence the quality of care given to patients in the future.
Most important variables in medicine are measured as numerical or continuous data, such as blood pressure, glucose level and oxygen level. In any clinical situation, we expect accurate readings of these variables. Numerous new techniques and tools have been developed with the aim of finding a cheaper, noninvasive, more convenient and safer way to test patients. It is important to be sure that a new tool or method of measurement is as accurate as the current or gold standard method. Therefore it is important to measure the agreement of the new method with the standard method. Agreement signifies the accuracy of the instrument.
Various statistical methods have been used to test for agreement of medical instruments with quantitative or continuous outcomes.
Bland and Altman proposed a method for the analysis of agreement (the Bland-Altman plot and limits of agreement) in 1983.
The purpose of this study is to review statistical methods used to assess agreement of medical instruments measuring the same continuous variable in the medical literature. The proportion of various statistical methods found in this review will also reflect the proportion of medical instruments that have been validated using those particular statistical methods in current clinical practice.
This review follows the reporting standards as suggested in the PRISMA statement; see PRISMA
Any method comparison study assessing the agreement of medical instruments or equipment was eligible. Only the agreement of continuous variables was considered. The instruments had to be applicable for use in humans.
In 2010, we searched Medline, Ovid, PubMed, Scopus and Science Direct for studies investigating the agreement of instruments or equipment in medicine published in journals between January 2007 and December 2009. A Boolean search was performed on each database using the search term: Agreement AND (validation OR “comparison study”). The search was limited to the medical field (including dentistry), studies involving human subjects, and articles in the English language.
All citations identified from the search were downloaded into the EndNote X1 software. The citations were organized and duplicates were identified and deleted. We excluded any studies with qualitative or categorical data, studies with different units of outcomes, and association studies. Unpublished articles were not considered in this review. Study selection was conducted by two independent researchers. There was no disagreement between the two reviewers at the stage of study selection.
We extracted characteristics from each article based on the year of publication and journal type. We categorized journal types into five areas: medicine (including obstetrics, gynecology, emergency and critical care medicine), surgery, radiology, nutrition and others.
We collected information on the statistical methods used to assess agreement from the methodology section or the statistical analysis section, and also by identifying which statistical methods influenced the author’s conclusion on the agreement.
Data extraction was performed by two researchers independently. Most of the time, the two researchers agreed on the extracted outcomes. In any case of disagreement, agreement was reached by consensus, and a third reviewer assisted when consensus could not be reached.
Descriptive analysis of the characteristics of the studies and the statistical methods used was performed. This is a descriptive review, and all results are displayed as percentages. Data were analyzed using the SPSS 15.0 software.
A total of 3,260 titles were initially identified, and after filtering for duplicates 3,134 records (titles and abstracts) were screened. Only 412 titles were potentially related and 285 full-text reports were reviewed. Seventy-five articles did not meet the inclusion criteria, and a total of 210 articles were finally included in this review.
Out of the 210 articles reviewed, 70 were published in 2007, 70 in 2008, and 70 in 2009. Eighty-eight (42%) of the articles were obtained from the Science Direct database, 51 (24%) from the Medline database, 48 (23%) from the Scopus database, and 23 (11%) from the PubMed database. The largest group of studies (72, 34%) were published in medical journals, 30 (14%) in nutrition-related journals, 29 (14%) in radiology journals, and 28 (13%) in surgical journals.
Overall, 117 articles (56%) used a single method to assess agreement while 93 articles (44%) used multiple (two or more) methods. The most popular statistical methods used to assess agreement in the 210 reviewed articles and according to specialty are summarized in
Statistical Method Used                    Number of articles using the method, x (%), n = 210
1. Bland-Altman Limits of Agreement        178 (85%)
2. Correlation coefficient (r)              58 (28%)
3. Compare means/significance test          38 (18%)
4. Intraclass correlation coefficient       14 (7%)
5. Compare slopes and/or intercepts         13 (6%)
Area of specialty  Statistical Method Used  Number of articles using the method (x)

1. Bland-Altman Limits of Agreement         24
2. Correlation coefficient (r)               6
3. Compare slopes and/or intercepts          4
4. Intraclass correlation coefficient        3
5. Compare means/significance test           2

1. Bland-Altman Limits of Agreement         21
2. Correlation coefficient (r)               8
3. Compare means/significance test           5
4. Intraclass correlation coefficient        4
5. Percentage of error                       1

1. Bland-Altman Limits of Agreement         26
2. Correlation coefficient (r)               6
3. Compare means/significance test           6
4. Intraclass correlation coefficient        3
5. Compare slopes/intercepts                 2

1. Bland-Altman Limits of Agreement         25
2. Correlation coefficient (r)              13
3. Coefficient of determination (r^2)        4
4. Compare means/significance test           4
5. Compare slopes and/or intercepts          4
Twenty articles
Author (year)  Study objective  Results & author’s conclusion

Ten 2007
To compare four different commercial activated partial thromboplastin time (aPTT) reagents to detect shortened aPTT.
Correlation coefficients among the four methods ranged from 0.51 to 0.83 (all P values <0.001). Acceptable agreement between the different commercial reagents was found with respect to detection of short aPTT. Good agreement was found between the Instrumentation Laboratory and bioMerieux reagents (r = 0.74–0.83).

Reis 2007
To validate a method for the quantification of the very low levels of urinary human chorionic gonadotropin (hCG).
Equation from regression analysis: y = 0.99x + 8.55. A correlation coefficient of 0.993 demonstrates very good immunoassay accuracy for the studied range of hCG concentrations.

Satia 2007
To assess the degree of agreement between three instruments measuring dietary fat consumption.
Pearson’s correlation coefficients among the three methods ranged from 0.18 to 0.58 (all P values <0.0001). There was good concordance among the three methods.

Mündermann 2008
To compare three-dimensional position capture with skin markers against radiographic measurement for measuring mechanical axis alignment.
The mechanical axis alignment from position capture correlated well with the gold standard of measurement using radiographs (R^2 = 0.544, P<0.001). The proposed method allows the measurement of the mechanical axis alignment without exposure to radiation.

Anderst 2009
To compare the bead-based method of tracking bone motion in vivo with the model-based method.
Agreement between the two systems was quantified by comparing bias (mean difference). All bias measures were not significantly different from zero. The new model-based tracking achieves excellent accuracy without the necessity for invasive bead implantation.

Naidu 2009
To evaluate the validity of the Hand Assessment Tool (HAT) against the Disabilities of the Arm, Shoulder, and Hand Questionnaire (DASH).
Strong positive correlation between DASH and HAT (r = 0.91). The HAT may serve as a useful alternative to the DASH.
Our study is the first systematic review on this topic. It provides evidence that the Bland-Altman method (limits of agreement) is the most popular method used to measure agreement. The majority (85%) of the agreement studies in this review applied the Bland-Altman method, with more than half (56%) of them using only the Bland-Altman method (i.e., without combining it with any other method). Our study shows that there are still inappropriate applications of statistical methods to assess agreement in the medical literature.
Bland and Altman introduced the limits of agreement to quantify agreement as early as 1983.
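The computation behind the limits of agreement is straightforward: the bias is the mean of the paired differences, and the 95% limits are the bias plus or minus 1.96 standard deviations of those differences. A minimal sketch in Python, using hypothetical paired readings from two instruments:

```python
from statistics import mean, stdev

def limits_of_agreement(a, b):
    """Bland-Altman bias and 95% limits of agreement for paired readings."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)                 # mean difference between the two methods
    sd = stdev(diffs)                  # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings from an established and a new instrument
old = [10.2, 8.2, 8.7, 9.6, 9.6, 8.2, 9.4, 7.0, 6.6, 10.8]
new = [10.2, 8.0, 8.05, 9.7, 9.05, 8.15, 8.8, 6.55, 6.55, 10.5]

bias, lower, upper = limits_of_agreement(old, new)
# The new instrument is judged acceptable if (lower, upper) is
# narrow enough to be clinically unimportant.
```

In practice the differences are also plotted against the pair means (the Bland-Altman plot) to check for systematic trends across the measurement range.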
In this review, the correlation coefficient was also found to be a statistical method used to measure agreement. The correlation coefficient (r) reflects the strength and direction of a linear relationship.
Reading    A        B        C (twice of B)
1          10.20    10.20    20.40
2           8.20     8.00    16.00
3           8.70     8.05    16.10
4           9.60     9.70    19.40
5           9.60     9.05    18.10
6           8.20     8.15    16.30
7           9.40     8.80    17.60
8           7.00     6.55    13.10
9           6.60     6.55    13.10
10         10.80    10.50    21.00
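The pitfall is easy to demonstrate in code. In the sketch below, instrument C reads exactly twice instrument B (as in the readings above): the correlation is perfect even though the two instruments never agree.

```python
from statistics import mean

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

b = [10.2, 8.0, 8.05, 9.7, 9.05, 8.15, 8.8, 6.55, 6.55, 10.5]
c = [2 * v for v in b]          # instrument C reads exactly twice instrument B

r = pearson_r(b, c)             # essentially 1.0: perfect linear correlation
bias = mean(x - y for x, y in zip(c, b))  # yet the mean difference is about 8.6 units
```

Correlation measures whether the readings move together, not whether they coincide, so a perfect r is fully compatible with gross disagreement.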
Some researchers proceed to regression analysis as an extension of correlation analysis to answer their question of agreement, using the coefficient of determination (r^2) as a measure of agreement. Again, this is inappropriate: the coefficient of determination, being related to the correlation coefficient, relies on a similar concept and is thus not suitable for assessing agreement. The coefficient of determination (r^2) states the proportion of variance in the dependent variable that is explained by the regression equation or model.
The third most popular method found in this review is comparing the means of readings from two instruments. A paired t-test is usually used to test for a significant difference between the means of the two sets of readings as a way of assessing agreement.
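A short sketch of why a non-significant paired t-test does not imply agreement: in the hypothetical readings below, every pair differs by 3 units, but the differences alternate in sign, so the mean difference, and hence the t statistic, is exactly zero.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for the mean difference between two sets of readings."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical readings: every pair differs by 3 units, alternating in sign
a = [10.0, 8.0, 9.0, 7.0, 11.0, 6.0, 12.0, 9.5, 8.5, 10.5]
b = [13.0, 5.0, 12.0, 4.0, 14.0, 3.0, 15.0, 6.5, 11.5, 7.5]

t = paired_t(a, b)   # 0.0: "no significant difference", yet no single pair agrees
```

The test compares average levels only; large individual disagreements that cancel out are invisible to it.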
Another method used to assess agreement found in this review is the intraclass correlation coefficient. The intraclass correlation coefficient (ICC) was initially devised to assess reliability.
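As a sketch, assuming the simple one-way random-effects form ICC(1,1) (other ICC variants exist and differ in their underlying ANOVA model), with hypothetical paired readings:

```python
from statistics import mean

def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list of tuples, one tuple of k measurements per subject.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), from a one-way ANOVA.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, row_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical paired readings from two instruments on ten subjects
pairs = [(10.2, 10.2), (8.2, 8.0), (8.7, 8.05), (9.6, 9.7), (9.6, 9.05),
         (8.2, 8.15), (9.4, 8.8), (7.0, 6.55), (6.6, 6.55), (10.8, 10.5)]
icc = icc_oneway(pairs)   # close to 1, since within-pair differences are small
```

Unlike Pearson's r, the ICC penalizes systematic differences between the instruments, because the within-subject variability enters the denominator.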
Often, in testing for agreement, the gradient of the regression line relating the two variables is tested against one.
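A sketch of how that test is typically carried out, assuming ordinary least squares: the fitted slope is compared with 1 using its standard error, giving a t statistic with n - 2 degrees of freedom. (Note that a slope near 1 with a large intercept still indicates systematic disagreement.)

```python
from math import sqrt
from statistics import mean

def slope_vs_one(x, y):
    """OLS slope of y on x and the t statistic for H0: slope = 1."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)       # standard error of the slope
    return slope, (slope - 1) / se       # compare |t| with the critical value, df = n - 2

# Hypothetical paired readings whose regression slope is close to 1
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.05, 4.0, 5.1, 5.95, 7.0, 8.05, 8.95, 10.0]
slope, t = slope_vs_one(x, y)            # small |t|: slope not distinguishable from 1
```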
The proportion of various statistical methods found in this review probably reflects the proportion of medical instruments that have been validated using those particular statistical methods in current clinical practice. Almost all methods have received criticism, including the Bland-Altman method. However, the correlation coefficient, coefficient of determination, regression coefficient, and comparison of means are clearly inappropriate for assessing agreement. Although Altman and Bland have been highlighting the inappropriateness of these statistical methods in method comparison studies since the 1980s, some of them were still in use in the studies we reviewed. This study found that 20 (10%) of the reviewed articles used only these inappropriate methods to assess agreement. Equipment that has been tested using these methods may not be valid, and consequently may produce inaccurate readings. It makes uncomfortable reading that as many as one out of ten supposedly validated instruments currently used in clinical practice may not be accurate. This has the potential to affect the management of patients and the quality of care given to them, and, worse still, could cost lives.
In 2009, Essack et al.
It is imperative that all medical instruments are accurate and precise. Otherwise, a failure in this regard may lead to critical medical errors. Therefore there is a necessity for proper evaluations of all medical instruments, and it is important to be sure that the appropriate statistical method has been used. The inappropriate application of statistical methods in the analysis of agreement is cause for concern in the medical field and cannot be ignored. It is important for medical researchers and clinicians from all specialties to be aware of this issue because inappropriate statistical analyses will lead to inappropriate conclusions, thus jeopardizing the quality of the evidence, which may in turn influence the quality of care given to the patient.
Of the 210 reviewed articles, only six were coauthored by someone working in a statistics or biostatistics department. The other studies did not state whether any assistance was sought from a statistician. One of the six studies used the correlation coefficient and comparison of means to study agreement, whereas the other five used the Bland-Altman method (either alone or in combination with another method). Medical researchers might need to consider assistance from a statistician when analyzing data from agreement studies. This could reduce errors in data analysis, avoid the use of inappropriate methods, and improve the interpretation of results.
Recently, the guidelines for reporting reliability and agreement studies (GRRAS) have been proposed
This systematic review has several strengths. It is the first study specifically designed to retrieve information on the statistical methods used to test for agreement of instruments measuring the same continuous variable in the medical literature. It also provides evidence confirming the anecdotal claim that the Bland-Altman method is the most popular method used to assess agreement. A broad search term was used in order to capture the largest possible number of publications on this topic. We also tried to reduce bias by using two independent reviewers during article selection and data extraction. However, the results may have limited generalizability due to selection bias: the review was limited to five electronic databases (Medline, Ovid, PubMed, Science Direct and Scopus) and to articles published in English. The search was performed only on online databases, so unpublished articles were not considered. However, these databases have very wide coverage of published medical journals, including high quality and high impact journals.
In conclusion, various statistical methods have been used to measure agreement in validation studies. This study concludes that the Bland-Altman method is the most popular method used to assess agreement between medical instruments measuring continuous variables. Some inappropriate applications of statistical methods to assess agreement were also found in the recent medical literature. It is important for clinicians and medical researchers to be aware of this issue because erroneous and misleading conclusions from inappropriate statistical analyses may lead to the application of inaccurate instruments in clinical practice. The issue of inappropriate analyses in agreement studies needs to be highlighted to prevent future researchers from repeating the same mistakes.