Abstract
Bland and Altman described approximate methods in 1986 and 1999 for calculating confidence limits for their 95% limits of agreement, approximations which assume large subject numbers. In this paper, these approximations are compared with exact confidence intervals calculated using two-sided tolerance intervals for a normal distribution. The approximations are compared in terms of the tolerance factors themselves but also in terms of the exact confidence limits and the exact limits of agreement coverage corresponding to the approximate confidence interval methods. Using similar methods, the 50th percentile of the tolerance interval is compared with the k values of 1.96 and 2, which Bland and Altman used to define limits of agreement (i.e.
1 Introduction
Bland–Altman analysis is a group of descriptive statistical techniques used for analysing the repeatability of measurements, or for comparing different methods of measuring the same clinical variable. The method was first formally elaborated by Altman and Bland in 1983,1 but it was Bland and Altman’s 1986 paper2 that has probably been the most influential on the topic, with over 26,000 citations.3 Their subsequent 1999 paper describes more advanced aspects of Bland–Altman analysis and, with over 3000 citations, is the most widely cited paper in Statistical Methods in Medical Research.3
This paper investigates some of the statistical properties of 95% limits of agreement (95% LoAs) as used in Bland–Altman analysis. In particular, this paper assesses the confidence intervals for the LoAs, with emphasis on how robust the underlying assumptions are for small sample sizes. Bland and Altman defined 95% LoAs as
Thus, the 1986 and 1999 approximations for LoA confidence intervals may not be appropriate for small sample sizes. This is unfortunate, because such confidence intervals will be largest for the smallest sample sizes and are more likely to be of practical importance. The motivation for this paper is to examine this issue. If we define a 95% LoA as a symmetrical interval around the sample mean that contains 95% of the population, it is possible to precisely estimate confidence intervals for LoAs, based solely on the assumption that data are normally distributed, using two-sided tolerance factors for a normal distribution.6–11 Two-sided tolerance intervals address the question: for data drawn from a normally distributed population, given a sample with a mean
Although this has not previously been described by others, the equations used for calculating two-sided tolerance factors can also be used to calculate exact γ and P values corresponding to Bland and Altman’s approximate confidence limits. In this paper, we use each of these measures to assess how well Bland and Altman’s approximate confidence limits for 95% LoAs match the exact confidence intervals calculated using two-sided tolerance intervals. We use similar techniques to investigate how well the definitions for LoAs (
This may then provide researchers with guidelines as to how acceptable different approximations for confidence intervals are, and at what sample size the approximations would approach acceptable levels.
2 Background on Bland–Altman analysis
Figure 1 illustrates a Bland–Altman plot for method comparison data taken from a study of refractive error in eyes.12 The data are from 10 participants (n = 10) and show measurements of spherical equivalent refraction in dioptres (D) made using two automated instruments: WR 5100K and ITrace. A typical Bland–Altman analysis has the differences d between the two instruments (in this case WR 5100K − ITrace) plotted on the y-axis and the mean xave of the two measurements plotted on the x-axis. The mean of differences
Bland–Altman plot comparing two methods of measuring ocular refractive error, data taken from 10 subjects.12 Error bars represent 95% confidence limits for LoAs calculated using exact two-sided tolerance factors10 (asymmetric limits) and by Bland and Altman’s 1999 approximation (symmetrical limits).4

Bland–Altman analysis typically involves the calculation of upper and lower 95% LoA. Bland and Altman’s 1986 paper gave the LoA as
$$\bar{d} \pm 2s \qquad (1)$$
where $\bar{d}$ is the sample mean of the differences and $s$ is their sample standard deviation.
Bland and Altman in 1986 acknowledged that the coefficient 2 was an approximation for 1.96, and most authors would use the slightly different definition for LoA given by Bland and Altman (1999)4
$$\bar{d} \pm 1.96s \qquad (2)$$
In Figure 1, equation (2) has been used to calculate the upper and lower LoAs. The upper LoA is −0.20 D + 1.96 × 0.26 D = +0.31 D and the lower LoA is −0.20 D − 1.96 × 0.26 D = −0.71 D.
The 95% LoAs are meant to represent the limits between which one would expect 95% of the inter-method differences to lie.2,13 It is likely that many researchers think of 95% LoAs in this fashion, but it is actually a simplification. LoAs are calculated using sample statistics
This issue is important because researchers and readers will use LoAs to assess whether two techniques (or repeated measures) match well enough to give the same measurement from a functional point of view. Some authors argue that researchers using Bland–Altman analysis should explicitly establish acceptable bounds for LoAs before conducting the experiment.14,15 But even if such a priori bounds are not established, readers are likely to interpret LoAs with their own, often unstated, opinions on what constitutes acceptable agreement. Such a judgement is difficult to make without an estimate of the potential uncertainty in LoA measurements.
As a means of addressing this issue, various authors have recommended calculating confidence intervals for LoAs.2,4,6,10,14–18 In this regard they are treating LoAs as most descriptive statistics should be treated (e.g. sample means, sample proportions, sample standard deviations).15,19
This was discussed in one of the very early Bland and Altman papers,2 which described an approach whereby confidence intervals for LoAs (95% CLLoA) are approximated by a t distribution
Bland and Altman (1999), in deriving this approximate method for LoA confidence limits, acknowledged that the approach assumes that n is not small. Likewise there have been other approximate methods for estimating LoA confidence limits17,20 which also rely on the same assumption. This is unfortunate because it is for small samples that confidence intervals are likely to be the largest.
But there is an approach suggested by some authors6,9–11,21,22 which can work for any sample size, provided the inter-method differences can be assumed to be drawn from a normally distributed population. This approach to calculating confidence limits for LoAs involves using two-sided tolerance factors for a normal distribution. In the case of Bland–Altman analysis the problem can be expressed as: given a sample mean (for Bland–Altman analysis,
Ludbrook6 was the first to recommend this technique for outer confidence limits, using partial tables of two-sided tolerance factors derived from a very close approximation by Wald and Wolfowitz.7 Other authors have also used approximate two-sided tolerance intervals as descriptors of Bland–Altman LoAs.11 Carkeet10 provided a more precise approach using an iterative method and the exact equations for two-sided tolerance factors24,25
Carkeet 10 provided tables of k values for γ values of 0.025, 0.05, 0.50, 0.95, 0.975. Although previous tolerance factor tables8,24–26 have been published for γ values of 0.50, 0.95, 0.975, Carkeet’s tables 10 for γ values of 0.025, 0.05 would appear to be unique in the literature, and they are necessary for calculating the inner confidence limits for LoAs.
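As a minimal sketch of how such exact factors can be obtained numerically, the Python code below evaluates the standard integral expression for the confidence γ attached to a given k, n and P, and then solves for k by bisection. It is an illustration under stated assumptions rather than the procedure used for the published tables: the function names, the Simpson integration grid and the bisection brackets are all choices made for this example, and only the Python standard library is used.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def chi2_sf(x, nu):
    """P(chi-square with integer nu d.f. exceeds x), using the finite series."""
    if x <= 0.0:
        return 1.0
    h = 0.5 * x
    if nu % 2 == 0:  # even d.f.: exp(-h) * sum of h^j / j!
        term = math.exp(-h)
        total = term
        for j in range(1, nu // 2):
            term *= h / j
            total += term
        return min(1.0, total)
    # odd d.f.: normal tail term plus a finite series
    total = math.erfc(math.sqrt(h))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-h)
    for r in range(1, (nu - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return min(1.0, total)


def half_width(z, P):
    """Solve Phi(z + r) - Phi(z - r) = P for r (z >= 0) by bisection."""
    lo, hi = 0.0, z + STD.inv_cdf(0.5 * (1.0 + P))
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if STD.cdf(z + mid) - STD.cdf(z - mid) < P:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def grid(n, P, m=400):
    """Simpson nodes over the standardised sample mean z ~ N(0, 1/n), z >= 0.
    Returns (r(z), weight) pairs; the factor 2 accounts for symmetry in z."""
    step = 8.0 / math.sqrt(n) / m  # integrate out to 8 standard errors
    nodes = []
    for i in range(m + 1):
        z = i * step
        w = 1 if i in (0, m) else (4 if i % 2 else 2)
        dens = math.sqrt(n / (2.0 * math.pi)) * math.exp(-0.5 * n * z * z)
        nodes.append((half_width(z, P), 2.0 * w * step / 3.0 * dens))
    return nodes


def confidence(k, n, P=0.95, nodes=None):
    """Confidence gamma that mean +/- k*SD covers at least P of the population."""
    nu = n - 1
    if nodes is None:
        nodes = grid(n, P)
    return sum(wt * chi2_sf(nu * (r / k) ** 2, nu) for r, wt in nodes)


def tolerance_factor(n, P=0.95, gamma=0.975):
    """Two-sided tolerance factor k, found by bisection on confidence()."""
    nodes = grid(n, P)
    lo, hi = 0.01, 1000.0  # assumed bracket, wide enough for n >= 2
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if confidence(mid, n, P, nodes) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for g in (0.025, 0.50, 0.975):
    print(g, round(tolerance_factor(10, 0.95, g), 4))
```

With n = 10 and P = 0.95 the printed factors should agree closely with the tabulated values of 1.3915, 2.1239 and 3.7706 quoted in this paper for γ = 0.025, 0.50 and 0.975. The same `confidence` function can be called with a fixed k (for example 1.96 or 2) to obtain the confidence attached to a given coefficient.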
By way of example, for the data in Figure 1 (n = 10, i.e. ν = n − 1 = 9), Table 2 of Carkeet10 gives a k value of 1.3915 for γ = 0.025.
So the inner confidence bounds for LoAs can be calculated as −0.20 D + 1.3915 × 0.26 D = +0.16 D and −0.20 D − 1.3915 × 0.26 D = −0.56 D.
Outer confidence limits can also be calculated from Table 2 of Carkeet,10 in which the appropriate k value for γ = 0.975 is 3.7706.
Using equation (10) the limits would be −0.20 D + 3.7706 × 0.26 D = +0.78 D and −0.20 D − 3.7706 × 0.26 D = −1.18 D. These bounds are also shown in Figure 1.
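The arithmetic of this worked example can be sketched in a few lines of Python. The mean difference (−0.20 D), standard deviation (0.26 D) and k values are those quoted above; the function name `limits` is simply a label chosen for this illustration.

```python
def limits(mean_diff, sd_diff, k):
    """Return (lower, upper) limits at mean_diff -/+ k * sd_diff."""
    return mean_diff - k * sd_diff, mean_diff + k * sd_diff


MEAN_D, SD_D = -0.20, 0.26  # dioptres, Figure 1 data (n = 10)

# k values: 1.96 for the LoAs, and the tabulated two-sided tolerance
# factors for gamma = 0.025 (inner) and gamma = 0.975 (outer).
for label, k in (("95% LoA", 1.96),
                 ("inner confidence limit", 1.3915),
                 ("outer confidence limit", 3.7706)):
    lower, upper = limits(MEAN_D, SD_D, k)
    print(f"{label}: {lower:+.2f} D to {upper:+.2f} D")
```

The same k value is applied above and below the mean difference, so the inner and outer limits are symmetric about the mean but asymmetric about each LoA.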
To better compare the exact confidence limits found using two-sided tolerance factors with the Bland and Altman approximate confidence intervals, equations (4) and (6) and (10) can be used to calculate k values corresponding to Bland–Altman’s approximate confidence limits. For inner confidence limits they are
In addition, one can use two-sided tolerance intervals to evaluate Bland and Altman’s definitions of LoAs themselves. From Bland and Altman’s definitions 95% LoAs are
‘If the differences are normally distributed, we would expect 95% of differences to lie between
This would only be true if the sample statistics
How large this bias is can be assessed using two-sided tolerance intervals. The problem might be considered by setting k = 1.96 (or k = 2) and asking the question: With what confidence, γ, can we state that at least that 95% (P) of the population lies between the limits
Alternatively, one could use equations (7) to (9) to address the question, what value of k meets the criterion that at least 95% (P) of the population lies between the limits
Yet another way of assessing this problem is (again using equations (7) to (9)) by setting k = 1.96 or (k = 2) and asking what proportion P of the population can be expected to lie between the limits
All three of these measures are used below to evaluate the definitions of LoAs.
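The first of these questions can also be explored by simulation. The sketch below is a Monte Carlo illustration only, not the exact method used in this paper: it draws repeated samples of size n from a standard normal population, forms limits at the sample mean ± k sample standard deviations, and reports the fraction of samples whose limits contain at least a proportion P of the population. That fraction estimates γ. The function name, seed and number of repetitions are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # the population is taken to be standard normal


def simulate_confidence(k, n, P=0.95, reps=20000, seed=1):
    """Estimate gamma: the probability that mean +/- k*SD covers at least P."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        d = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(d) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
        # True proportion of the population lying between the sample limits
        coverage = STD.cdf(mean + k * sd) - STD.cdf(mean - k * sd)
        if coverage >= P:
            hits += 1
    return hits / reps


print("k = 2.1239, n = 10:", simulate_confidence(2.1239, 10))
print("k = 1.96,   n = 10:", simulate_confidence(1.96, 10))
```

For n = 10 the estimate for k = 2.1239 (the tabulated factor for γ = 0.50) should be close to 0.5, whereas the estimate for k = 1.96 should fall clearly below 0.5.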
3 Calculations
3.1 Approximations for the outer confidence limits
Figure 2 shows, at different scales for clarity, how the k values for Bland and Altman’s approximate LoA outer confidence limits, calculated using equations (13) and (14), vary with sample size n. It also shows k values calculated using exact two-sided tolerance factors (γ = 0.975) based on equations (7) to (9). For low values of n the approximate method of Bland and Altman underestimates k, i.e. confidence limits actually lie further from
k values for outer LoA confidence limits calculated by exact two-sided tolerance limits (P = 0.95, γ = 0.975) (solid line) and corresponding to the 1986 approximation (equation (13)) (dashed line) and the 1999 approximation (equation (14)) (dotted line). A and B are the same curves presented at different scales to highlight different aspects of the data.
Using equations (9) to (11), it is possible to calculate actual confidence values for the 1986 and 1999 approximate outer confidence limits shown in Figure 2. These are shown at different scales in Figure 3, for different values of n. The actual confidence for these limits should be γ = 0.975. For low values of n the confidence limits are too permissive, but even for n = 2 the actual confidence levels achieved by the Bland–Altman approximations are γ = 0.895 (1999 approximation) and γ = 0.896 (1986 approximation). The confidence limits become too conservative (i.e. γ > 0.975) when n >= 40 for the 1986 approximation and n >= 76 for the 1999 approximation, and γ approaches 1 as n approaches infinity.
Confidence levels for the approximate outer confidence limit for 95% LoAs. The dotted line is for the 1986 approximation (k calculated using equation (13)) and the dashed line is for the 1999 approximation (k calculated using equation (14)). The solid line shows the goal confidence level of 0.975. A and B show the same curves to different scales.
Finally, we have calculated what probability coverage for LoA would correspond to the Bland–Altman approximations for outer confidence limits if γ = 0.975. This can be done using equations (9) to (11). The results are shown at different scales in Figure 4. For n = 2, the Bland–Altman approximations for 97.5% confidence limits apply for 32.6% LoAs (1986 approximation) and 32.2% LoAs (1999 approximation), but the approximate LoAs exceed 90% coverage at n = 10, with 90.7% LoAs (1986 approximation) and 90.1% LoAs (1999 approximation). For γ = 0.975, LoAs will be conservative (covering greater than 95% of the population) for n >= 40 (1986 approximation) and n >= 76 (1999 approximation). Maximum probability coverage is 95.21% for the 1999 approximation (n = 306), decreasing to 95.0004% as n approaches infinity. For the 1986 approximation, maximum probability coverage is 95.63% (n = 326), decreasing to 95.45% as n approaches infinity.
LoA proportions covered for the outer (γ = 0.975) confidence limit for the 1986 approximation (dotted line) (k calculated using equation (13)) and for the 1999 approximation (dashed line) (k calculated using equation (14)). The solid line shows the goal LoA of 0.95 (i.e. 95% LoAs). A and B show the same curves to different scales.
3.2 Discussion: Outer confidence limit approximations for LoAs
The Bland and Altman approximations for 95% LoA outer confidence limits are reasonable for larger values of n, becoming slightly conservative for values of n >= 40 for the 1986 approximation and n >= 76 for the 1999 approximation. For smaller values than this, the approximations will be permissive, but whether the approximation is reasonable will be a matter of judgement for researchers and readers. As a guideline, if n >= 22 for the 1986 approximation and if n >= 27 for the 1999 approximation, then the Bland–Altman estimates of outer confidence limits for LoAs would be conservative for 94% LoAs, and this may be an acceptable approximation for researchers. For n values less than approximately 10, the Bland and Altman approximations for outer confidence limits for LoAs will be poor, and exact methods of estimating confidence intervals for LoAs may be preferable.
3.3 Approximations for the inner confidence limits
Bland and Altman’s approximations for the inner confidence intervals (γ = 0.025) for 95% LoAs have some unusual properties. For low values of n (<= 5) the approximations for k can be negative, but exact two-sided tolerance factors can never be negative, by definition. This is shown in Figure 5.
k values for inner LoA confidence limits calculated by exact two-sided tolerance limits (P = 0.95, γ = 0.025) (solid line) and for the 1986 approximation (equation (11)) (dashed line) and the 1999 approximation (equation (12)) (dotted line). A and B are the same curves presented at different scales.
This would give the result that the approximate inner confidence limit for the upper LoA would lie below
For sample sizes > 5, the approximate confidence limits will be too permissive (i.e. the confidence limits will be too close to
Even at moderate values for n there is a reasonable difference between k values obtained by approximate methods and exact methods. For example, when n = 30 for a 95% LoA with γ = 0.025, the exact value for k is 1.508, but the equivalent values for the Bland and Altman approximations are 1.3532 (1986 approximation) and 1.3219 (1999 approximation).
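The approximate k values quoted above can be reproduced with a short sketch. It assumes the usual forms of the Bland and Altman approximations, in which the standard error of a LoA is taken as √(3/n)·s around a coefficient of 2 (1986) and √(2.92/n)·s around a coefficient of 1.96 (1999), multiplied by the 97.5th percentile of the t distribution with n − 1 degrees of freedom. These forms are an assumption of this example rather than a restatement of equations (11) to (14), although they do reproduce the n = 30 values given in the text. The t quantile is computed with a simple numerical routine so that only the Python standard library is needed; all function names are illustrative.

```python
import math


def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2))
    return c / math.sqrt(nu * math.pi) * (1 + x * x / nu) ** (-(nu + 1) / 2)


def t_cdf(x, nu, steps=2000):
    """CDF for x >= 0 by Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + total * h / 3.0


def t_quantile(p, nu):
    """Upper quantile (p > 0.5) by bisection on an assumed bracket [0, 200]."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def approx_k(n, year, side):
    """k implied by the approximate confidence limit for a 95% LoA.
    year: 1986 or 1999; side: 'inner' or 'outer'."""
    t = t_quantile(0.975, n - 1)
    if year == 1986:
        centre, se = 2.0, math.sqrt(3.0 / n)
    else:
        centre, se = 1.96, math.sqrt(2.92 / n)
    return centre + t * se if side == "outer" else centre - t * se


for n in (5, 6, 30):
    print(n, round(approx_k(n, 1986, "inner"), 4),
          round(approx_k(n, 1999, "inner"), 4))
```

Under these assumptions the inner k values are negative for n = 5 and positive for n = 6, and for n = 30 they are approximately 1.3532 (1986) and 1.3219 (1999), compared with the exact value of 1.508.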
The actual confidence values for Bland and Altman’s approximate inner confidence intervals for 95% LoAs are shown in Figure 6 for values of n up to 1000. For n values of 5 or less, the actual confidence is zero (i.e. such LoAs can never occur in the population, because k < 0). Over the range shown (n <= 1000), the confidence level increases but remains much too permissive for the 1999 approximation (e.g. γ = 0.0051 for n = 1000), while for the 1986 approximation confidence increases more rapidly with n, so that at n = 489 confidence becomes greater than 0.025. Both approximations (1986 and 1999) will have a confidence (γ) which will approach unity as sample size approaches infinity.
Confidence levels for the approximate inner confidence limit for 95% LoAs. The dotted line is for the 1986 approximation (k calculated using equation (11)) and the dashed line is for the 1999 approximation (k calculated using equation (12)). The solid line shows the goal confidence level of 0.025. A and B show the same curves to different scales.
If one assesses the probability coverage for the LoA at γ = 0.025 for the Bland–Altman approximations (Figure 7), then for sample sizes <= 5 coverage probabilities would have impossible values below 0% (i.e. <0% LoAs) and are not plotted. At n = 6, for γ = 0.025, the approximate inner confidence limits from Bland and Altman would correspond to 20% LoAs for the 1999 approximation and 21.9% LoAs for the 1986 approximation. For moderate sample sizes the coverage will still be too permissive, e.g. for n = 30 the Bland–Altman approximate inner bounds would correspond to γ = 0.025 for 89.9% LoAs (1999 approximation) and 90.7% LoAs (1986 approximation).
LoA proportions covered for the inner (γ = 0.025) confidence limit for the 1986 approximation (dotted line) (k calculated using equation (11)) and for the 1999 approximation (dashed line) (k calculated using equation (12)). The solid line shows the goal LoA of 0.95 (i.e. 95% LoAs). A and B show the same curves at different scales.
Eventually, as n increases, the approximate confidence limits for LoAs will become conservative (>95%) so that as n approaches infinity the Bland–Altman approximate inner bounds would correspond to γ = 0.025 for 95.0004% LoAs (1999 approximation) and 95.45% LoAs (1986 approximation).
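These limiting coverages follow directly from the standard normal distribution: as n approaches infinity the sample mean and standard deviation converge on the population values and the approximate k values converge on 1.96 or 2, so the limits cover a proportion 2Φ(k) − 1 of the population. A brief sketch using Python's standard library (the function name is illustrative):

```python
from statistics import NormalDist


def asymptotic_coverage(k):
    """Proportion of a normal population within k standard deviations of its mean."""
    return 2.0 * NormalDist().cdf(k) - 1.0


for k in (1.96, 2.0):
    print(f"k = {k}: {100 * asymptotic_coverage(k):.4f}% of the population")

# The exact 97.5th percentile of the standard normal, for comparison with 1.96
print(NormalDist().inv_cdf(0.975))
```

The coverage for k = 1.96 is marginally above 95% because 1.96 is slightly larger than the exact 97.5th percentile of the standard normal distribution (about 1.95996).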
3.4 Discussion: Inner confidence limit approximations for LoAs
Compared with outer confidence limits, Bland and Altman’s approximations for LoA inner confidence limits are a poorer approximation for the exact inner confidence limit based on two-sided tolerance intervals. This reflects the asymmetry of two-sided tolerance intervals, especially for low values of k. The probability density function of k will be positively skewed, with a long tail for high values of k. But low values of k are limited to values larger than zero, and this section of the probability density function is compressed. Bland and Altman’s approximations give negative k values for sample sizes <= 5, for which it is difficult to find a meaningful interpretation.
But even for larger sample sizes the approximations might not be adequate. As a guide, Bland and Altman’s inner confidence intervals would be appropriate (γ = 0.025) for 94% LoAs when n >= 103 for the 1986 approximation and when n >= 181 for the 1999 approximation. The 1986 Bland and Altman approximations for inner confidence intervals will be permissive for values of n < 490. For all practical sample sizes, the 1999 Bland and Altman approximation for inner confidence intervals will be too permissive, but for very large sample sizes (not shown in Figures 5 to 7) the 1999 approximation for inner confidence intervals will be conservative. This occurs because the 1.96 coefficient is only an approximation (slightly too large) for the 97.5th percentile of the normal distribution. The exact transition sample size, where the 1999 approximation becomes conservative, is so large that it is difficult to calculate exactly with our software, but it is larger than 308,000,000 and smaller than 310,000,000.
Given that the Bland and Altman approximations for inner confidence intervals for 95% LoAs are permissive for sample sizes that are large in a practical sense, researchers may prefer the approach of using the exact two-sided tolerance factors to calculate the inner confidence limits.
3.4.1 LoA as estimates
Figure 8 shows k values (for P = 0.95 and γ = 0.50) for different sample sizes (Table 2 of Carkeet10). For small sample sizes these k values are significantly larger than the LoA coefficients of 1.96 and 2 used in the Bland–Altman approximations.2,4 By way of example, for n = 2, the smallest sample size, k = 3.3756, while for n = 10, k = 2.1239. The k values decrease as n increases, becoming less than the Bland–Altman approximation of 2 for sample sizes greater than 40, and becoming less than the Bland–Altman approximation of 1.96 for sample sizes greater than 45,349.
k values for the median of the LoA confidence interval calculated by exact two-sided tolerance limits (P = 0.95, γ = 0.50) (solid line) and for the 1986 LoA (k = 2) (dashed line) and the 1999 LoA (k = 1.96) (dotted line).
If k values are fixed at 2 or 1.96, and P = 0.95 (for 95% LoAs), then confidence levels γ can be calculated for different values of n, using equations (7) to (9). These γ values are shown, plotted against n, in Figure 9. For the k value of 2, confidence values are less than 0.50 for n values of less than 40, i.e. too permissive. But for the more commonly used approximation of k = 1.96 confidence values are too permissive for sample sizes that might not be considered small from a practical point of view. For example, confidence is less than 0.45 for n values less than 85; when n = 1000 confidence is 0.4855. For k = 1.96 confidence levels finally become greater than 0.5 when n > 45,349.
Confidence levels for the Bland–Altman 95% LoAs. The dotted line is for the 1986 approximation (k = 2) and the dashed line is for the 1999 approximation (k = 1.96). The solid line shows the 0.5 confidence level. A and B show the same curves to different scales.
If confidence levels are set at γ = 0.50, and k values are fixed at 2 or 1.96, then P values can be calculated for different values of n, using equations (7) to (9). These P values are shown, plotted against n, in Figure 10. Such P values are permissive for small sample sizes, but P coverage may be acceptable for some researchers. For example, for n = 2, P = 0.7358 (i.e. 73.58% LoAs) for k = 2 and P = 0.7255 (i.e. 72.55% LoAs) for k = 1.96. For n = 5, P becomes greater than 0.9 (i.e. wider than 90% LoAs) for k = 2 and for k = 1.96, and P becomes greater than 0.94 (i.e. wider than 94% LoAs) for n >= 14 for k = 2, and for n >= 20 for k = 1.96.
LoA proportions covered for the median of the confidence interval (γ = 0.5) for the 1986 approximation (dotted line) (k = 2) and for the 1999 approximation (dashed line) (k = 1.96). The solid line shows the goal LoA of 0.95 (i.e. 95% LoAs). A and B show the same curves at different scales.
3.5 Discussion on LoA approximations
LoAs as defined by Bland and Altman cannot exactly summarize the possible range that contains 95% of the differences in the population. One can however describe a confidence interval for k values such that
Adopting a confidence of 50% as our criterion, Bland and Altman’s definitions of 95% LoAs as
4 General discussion
The theoretical framework behind two-sided tolerance factors is well established and appears to be appropriate for considering Bland–Altman 95% LoAs. The basic data available are a sample mean
Why then have approximate methods of describing confidence intervals for LoAs been predominantly used and described in the existing literature? One reason may be the availability of tables of two-sided tolerance factors. Approximate tables based on the Wald and Wolfowitz method have been available since just after the Second World War,7,27 but it was not until the early 1980s that extensive tables based on exact computational methods24,25 became widely available.8,26 It was not until 2015 that tables were published for two-sided tolerance factors for γ = 0.025 (and γ = 0.05),10 which are useful for determining inner confidence limits for LoAs. Thus, Bland and Altman were using the tools which were available to address the problem at that time.2,4
A second issue may be one of the conceptual framework on which confidence intervals are based. Two-sided tolerance factors are used for addressing what portion of a population lies between
A third issue may be that confidence intervals for LoAs are almost never reported in the literature. This is despite the fact that Bland and Altman first described how to calculate approximate confidence limits as early as 1986.2 In a review of papers in one journal, such confidence limits had been reported in only 0.7% of papers using Bland–Altman analysis.10 Another literature review of laboratory research papers reporting Bland–Altman analysis published later than 2012 found that 6% of these reported confidence intervals on their LoAs.15 Thus, researchers may not be used to thinking about LoA confidence intervals, exact or approximate, and there may be little incentive to develop different methods of calculating the confidence intervals.
Although it is not currently common practice, we think that reporting confidence intervals for LoAs should be a standard component of Bland–Altman analysis. It is important because the LoAs are used for comparison with clinically acceptable criteria. It is good practice to establish such criteria and state them a priori, and a recent review found this done in 74% of papers assessed.15 But even if such criteria are not made explicit by authors, readers are likely to have their own reference criteria for acceptability. Researchers and readers should make this comparison with an understanding of the intrinsic uncertainty of LoAs, and confidence limits are a way of doing this. Providing such confidence intervals is considered good practice for many other statistics, such as sample means, odds ratios and risk ratios,5,19,28 and it is reasonable to treat LoAs in the same way. We note that other authors have also suggested that researchers should include a calculation of confidence limits for LoAs.2,6,14–16,18,29–31 These authors have usually (but not always6) recommended the approximate methods of Bland and Altman.2,4 Our results show that even for relatively large sample sizes the approximate method will not be sufficiently accurate.
This will be especially true for inner confidence limits, in which the approximation will give confidence limits on the wrong side of
We note that, based on similar reasoning, one might also question the usual definition of LoAs as
Finally, it is important to acknowledge that the calculation of Bland–Altman LoAs has an underlying assumption that data are drawn from a population with a normal distribution. This assumption also underlies the two-sided tolerance factors used above. If this assumption is incorrect for a data set then it is unlikely that 95% of a population will be contained between the LoAs (or described by the tolerance factors with a given confidence). Unfortunately, for small sample sizes it is difficult to assess this assumption and so, as with many parametric statistics,5 the LoAs should be viewed with some caution. Moreover, non-parametric estimates of 95% LoAs require larger sample sizes to achieve a reasonable level of confidence.
This may be less of a problem for Bland–Altman analysis because it assesses differences between methods, which will tend to be distributed in approximately normal distributions.2,4 However, researchers should be sensitive to theoretical issues in their data (e.g. sampling artefacts, or floor and ceiling effects) which may lead to the differences not being normally distributed. The robustness of Wald and Wolfowitz’s approximate two-sided tolerance limits 7 has been assessed by Canavos and Koutrouvelis, 32 using gamma and t-distributions as the parent populations for Monte Carlo studies. They found two-sided tolerance factors were robust for P values of 0.90 unless distributions are very skewed or kurtotic, but this robustness failed for P = 0.95 (the P values which would apply for 95% LoAs). Canavos and Koutrouvelis did not test robustness for confidence levels of γ = 0.025 and γ = 0.975, or γ = 0.50, and to be directly applicable to Bland–Altman LoAs it is worth doing a more complete robustness study using these confidence limits and the exact two-sided tolerance factors for P = 0.95.
5 Conclusions
If researchers are content that their data are normally distributed then Bland–Altman LoAs should be accompanied by confidence limits and in our view, especially for smaller samples, the more conservative exact two-sided tolerance factors should be used for their calculation, rather than Bland and Altman’s approximate method.
Footnotes
Acknowledgements
The first author thanks Kevin Hanley for his helpful instruction on the integration techniques used in computing values from equations (7) to (9).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
