Abstract
In view of the importance placed on the first intermetatarsal angle in the assessment of surgical intervention in hallux valgus, we assessed the reliability with which this angle is measured. The study involved 10 observers of varying experience measuring the angle using a standard technique on 10 weightbearing AP X-rays of the foot on three separate occasions. The margin of error in measuring the angle was ±3.6° at the 95% confidence level. Increasing the number of readings per observer, or the number of observers, and averaging the readings reduces the error. Experience did not improve reliability. In conclusion, the reliability of the measurements can be improved by careful technique, performing the measurements at least twice, and averaging them.
INTRODUCTION
Various angles are used in an attempt to classify the severity of a hallux valgus deformity. Postoperatively, these angles serve as one of the objective means for assessing the success of surgical correction.
The first intermetatarsal angle is commonly used as a component of the measurement of severity of hallux valgus. This angle has been used to classify hallux valgus into three categories: mild, moderate and severe (Table 1). It is generally agreed that a first intermetatarsal angle of less than 10° is normal. 1,3,4,9 The difference in the angle between the upper limit of normal and the lower limit of a severe deformity is only 7° (9°–16°). Thus accurate, reproducible measurements are required to classify the deformity correctly.
Table 1. Classification of hallux valgus on an AP weightbearing X-ray
All measurements are subject to error, and therefore knowledge of the amount of the inherent error allows better clinical decision-making.
The objectives of the study were fourfold:
to estimate the inter- and intra-observer variation in making this measurement;
to compare the observers in terms of their measurement reliability;
to assess the effect of experience; and
to explore procedures to reduce the variability and hence improve measurement reliability.
MATERIALS AND METHODS
Ten observers measured 10 angles on three separate occasions, giving a total of 300 measurements. X-rays with a wide range of angles were selected for this study.
Each observer was given the 10 X-rays to read once per week for three weeks. The order of the X-rays was randomized to prevent recall bias but was the same for each observer. Five of the observers were less experienced than the other five in making these measurements. The observers were told the purpose of the study and shown a standardized technique 12,13 to measure the angle (Fig. 1), at the beginning of the study and again each time subsequently. This technique involves identifying accurately the longitudinal axes of the first and second metatarsal bones on standard weightbearing antero-posterior X-ray views of the foot, using a light source, a transparency and an indelible pen. The angle subtended by these two lines is then measured with either a goniometer or a protractor and recorded.

Fig. 1. Demonstration of the standardized technique used in the study to measure the first intermetatarsal angle.
The ‘experienced’ five observers had all measured this angle in this way before, but they also were reminded how to measure it, and all the observers had a reference example on display each time if required. The same equipment was available to all the observers and adequate time was provided for the measurement. The ‘inexperienced’ group comprised two Senior House Officers in Orthopaedics (SHO), a theatre nurse, a medical student, and an anaesthetics registrar. The ‘experienced’ group were all orthopaedic registrars.
The analysis was carried out principally by analysis of variance, using mainly the statistical software JMP (version 3). 5
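As an illustration of how variance components of this kind can be estimated, the following minimal Python sketch applies a method-of-moments analysis of variance to a balanced design. It is not the JMP procedure used in the study: the function name, the additive model, and the pooling of all residual variation into the intra-observer term are assumptions made for the example.

```python
def variance_components(readings):
    """Estimate variance components from a balanced design.

    readings[i][j][k] is the angle for X-ray i, observer j, repeat k.
    Assumption: additive random effects for X-ray and observer, with all
    remaining variation pooled into the intra-observer (error) term.
    """
    n_xray = len(readings)
    n_obs = len(readings[0])
    n_rep = len(readings[0][0])
    n_total = n_xray * n_obs * n_rep

    values = [v for xray in readings for obs in xray for v in obs]
    grand = sum(values) / n_total

    # Mean reading per X-ray and per observer.
    xray_means = [
        sum(v for obs in xray for v in obs) / (n_obs * n_rep)
        for xray in readings
    ]
    obs_means = [
        sum(readings[i][j][k] for i in range(n_xray) for k in range(n_rep))
        / (n_xray * n_rep)
        for j in range(n_obs)
    ]

    # Sums of squares for the two main effects and the pooled error.
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_xray = n_obs * n_rep * sum((m - grand) ** 2 for m in xray_means)
    ss_obs = n_xray * n_rep * sum((m - grand) ** 2 for m in obs_means)
    ss_err = ss_total - ss_xray - ss_obs

    ms_xray = ss_xray / (n_xray - 1)
    ms_obs = ss_obs / (n_obs - 1)
    ms_err = ss_err / (n_total - n_xray - n_obs + 1)

    # Method-of-moments estimates; negative estimates are truncated at zero.
    return {
        "inter_xray": max(0.0, (ms_xray - ms_err) / (n_obs * n_rep)),
        "inter_observer": max(0.0, (ms_obs - ms_err) / (n_xray * n_rep)),
        "intra_observer": ms_err,
    }
```

For a design like the one described here, the input would be a 10 × 10 × 3 nested list of readings in degrees, and the three returned values would be in degrees squared.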
RESULTS
Components of Variance
All 10 observers were included in the initial analysis, comprising a total of 300 individual angle measurements. This resulted in the following values of the three components of the total variance, all measured in degrees squared: inter-X-ray (15.2), inter-observer (0.348), and intra-observer (3.36). The inter-X-ray variance was very large, as expected, given the deliberate diversity in the angles as part of the study design. The effect of training made no contribution to the variation.
When the outlying observers, that is the least and most reliable (see further below), were excluded, the three components of variance (based now on eight observers and 240 measurements) were respectively 16.13, 0.359, and 2.86° squared. There is a consequent reduction in the intra-observer component of variance, but it is not large because the two outlying observers had opposing effects on this component. The standard error of measurement per X-ray (SEM) is given by the square root of the total observer variance, i.e. SEM = √(0.359 + 2.86) = 1.794°, and thus the margin of error per measurement is approximately twice that, i.e. 3.6°. This means that for any given measurement there is approximately a 95% chance that the true measurement is contained in the range of the observed angle ±3.6°.
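The arithmetic in this paragraph can be checked with a short Python sketch. The function name is hypothetical; the two variance components are the eight-observer values quoted above, and the factor of 2 is the approximate 95% multiplier used in the text.

```python
import math


def sem_and_margin(inter_observer_var, intra_observer_var):
    """Return (SEM, margin of error) in degrees for a single reading.

    SEM is the square root of the total observer variance; the margin of
    error is taken as approximately twice the SEM, as in the text.
    """
    sem = math.sqrt(inter_observer_var + intra_observer_var)
    return sem, 2.0 * sem


sem, margin = sem_and_margin(0.359, 2.86)
print(f"SEM = {sem:.3f} deg, margin of error = {margin:.2f} deg")
# prints "SEM = 1.794 deg, margin of error = 3.59 deg"
```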
Comparison of Observers
The total variation of each observer's 30 measurements can be broken down into two components of variation, inter-X-ray and intra-observer. An observer's reliability (also termed precision or repeatability) is defined as the proportion of the total variation of his or her measurements that is due to inter-X-ray variation rather than to intra-observer error. Reliability expresses an observer's ability to discriminate among X-rays, and it can range from 0 (no discrimination, intra-observer variation masks all inter-X-ray variation) to 1 (all measurements perfectly consistent, perfect discrimination). An acceptable value of reliability is recommended to be at least 0.85. 14 Observers can thus be usefully compared by their reliabilities, and these are shown, together with the components of variance (Table 2). The two outlying observers, 3 and 4, are shown in italics.
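A minimal sketch of the reliability calculation follows. It takes reliability as the share of an observer's total variance that is not due to intra-observer error, which is the reading consistent with the 0 to 1 scale described above. The function name and the example variances are hypothetical and are not taken from Table 2.

```python
ACCEPTABLE_RELIABILITY = 0.85  # minimum acceptable value quoted in the text


def reliability(inter_xray_var, intra_observer_var):
    """Proportion of one observer's total variance due to inter-X-ray variation."""
    total = inter_xray_var + intra_observer_var
    if total == 0:
        raise ValueError("readings show no variation")
    return inter_xray_var / total


# Hypothetical variance components in degrees squared (illustration only).
r = reliability(15.0, 3.0)
print(f"reliability = {r:.2f}, acceptable = {r >= ACCEPTABLE_RELIABILITY}")
# prints "reliability = 0.83, acceptable = False"
```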
Table 2. Components of variance and reliability by observer

Figure 2. Measurements for observers 3 and 4.
The reliability of observer 4, for example, is poor not only due to her high intra-observer variance as compared with the other observers, but because it is high compared with her inter-X-ray variance. The converse applies to observer 3, who achieved extremely high reliability. The notion of reliability can also be expressed in terms of the additional information gained through measurement over pure guessing. For example, observer 4, who had the lowest reliability, improved her discrimination by only 20% over pure guessing, whereas observer 3 improved his discrimination by 84%.
To illustrate these points, the measurements for observers 3 (left) and 4 (right) are shown in Figure 2.
This figure displays three dots denoting the three readings per X-ray for each observer, superimposed by vertical boxes to enhance readability. The top and bottom of each box correspond to the highest and lowest reading respectively, and the line inside the box corresponds to the middle reading. Where the middle line is missing, two readings coincide. Three coincident readings did not occur. The horizontal line in each plot denotes the grand mean of all 30 measurements for that observer. The two parts of the figure demonstrate clearly why observer 4 is less successful in discriminating X-rays than observer 3. For example, six of observer 4's X-ray ranges of readings crossed her grand mean, as contrasted with only one for observer 3.
Observers 3 and 4 achieved the best and worst reliability of all the observers, respectively. The five orthopaedic registrars ranged from 0.83 to 0.91, so even the worst of them was just about acceptable by the criteria for reliability presented earlier. One of the two Senior House Officers had substantially worse reliability than the rest of the doctors, but the other SHO, by contrast, was second best. The anaesthetic registrar scored in about the middle of the range of the orthopaedic registrars. These results suggest that factors other than experience may be more influential in determining the reliability of these measurements.
Table 3. Permutations of measurements, SEM, and margin of error

Figure 3. The number of measurements and the probability of misclassification. The legend shows the permutations of measurements represented by the various plots. For example, a = 1, n = 1 represents a single measurement by one observer; a = 1, n = 2, two measurements by one observer; a = 2, n = 1, one measurement each by two observers; etc.
We adopted the analysis excluding observers 3 and 4 as the more valid one, not only because these two observers represent extreme cases of measurement reliability, but also because neither represents the population of interest: experienced orthopaedic surgeons.
Measurement Strategy
The standard error of measurement, SEM, introduced under Components of Variance above, can be reduced if an X-ray is read repeatedly, either by the same observer or by different observers, and the results averaged. The total observer variance per averaged measurement when several measurements are made is given by the expression Var = Va/a + Vn/(an), where Va and Vn are the inter- and average intra-observer components of variance, respectively, while a and n are the number of observers and the number of repeat measurements per observer respectively.
SEM is √Var as before. The formula shows that, since Vn (2.86) is so much larger than Va (0.359), and it alone is affected by repeat measurements per observer, increasing those rather than the number of observers is more efficient in reducing the total variance.
Some permutations of the numbers of observers and repeat measurements per observer, and the resulting SEM and margin of error, are shown in Table 3. This table shows the very large reductions in SEM that can be made by increasing the number of measurements from 1 to just 2 (shown in italics), and the progressively diminishing benefits with further increases. It also shows that if a very small SEM is needed, requiring a large number of observations, then it becomes efficient to engage several observers.
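Permutations of this kind can be recomputed from the variance formula with a short Python sketch. The function names and the particular (a, n) pairs are chosen here for illustration; the variance components are the eight-observer values reported in the Results, and the margin of error is taken as approximately twice the SEM.

```python
import math

V_INTER = 0.359  # inter-observer variance, degrees squared
V_INTRA = 2.86   # average intra-observer variance, degrees squared


def total_variance(a, n):
    """Variance of the average of n repeat readings by each of a observers."""
    return V_INTER / a + V_INTRA / (a * n)


def sem(a, n):
    """Standard error of the averaged measurement, in degrees."""
    return math.sqrt(total_variance(a, n))


def margin_of_error(a, n):
    """Approximate 95% margin of error (about twice the SEM), in degrees."""
    return 2.0 * sem(a, n)


# Illustrative permutations of observers (a) and repeats per observer (n).
for a, n in [(1, 1), (1, 2), (2, 1), (1, 10), (6, 1), (10, 1)]:
    print(f"a={a:2d} n={n:2d}  SEM={sem(a, n):.2f}  margin={margin_of_error(a, n):.2f}")
```

With these inputs, going from one reading to two by the same observer lowers the margin of error from about 3.59° to about 2.68°, in line with the figures quoted in the text.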
The margin of error can be used to calculate approximately the probability of misclassification of an X-ray into an incorrect category from normal through mild and moderate deformity to severe deformity (Table 1). This has obvious implications for clinical decision-making. Figure 3 shows the probabilities of misclassifying X-rays for selected measurement strategies, based on this study. For example, with only one measurement, the measured value needs to be at least 3.59° from a defined cut-off, in order for the misclassification risk to be less than 5%. With one repeat measurement this offset reduces to 2.68°. In practice, however, these measurements need to be rounded off to the nearest half degree, the maximum resolution of the measurement instrument (goniometer/protractor) used in the study.
The figure shows again the rapidly diminishing benefit, so that even on averaging the readings of 10 observers, an offset of 1° at 5% error risk is not achieved. When a very small error risk is required, necessitating a large number of repeat measurements, it is more efficient to use more than one observer. This is apparent from the figure which shows that six observers with one measurement each discriminate better than one observer making 10 repeats.
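The relationship between offset and misclassification risk can be approximated with the following Python sketch. The exact calculation behind Figure 3 is not given, so the model here is an assumption: measurement error is treated as normally distributed with the SEM derived from the reported variance components, and the risk is taken as the two-sided tail probability, which reproduces the rule of roughly twice the SEM at 5% risk. The function names are hypothetical.

```python
import math

V_INTER = 0.359  # inter-observer variance, degrees squared
V_INTRA = 2.86   # average intra-observer variance, degrees squared


def sem(a, n):
    """SEM of the average of n repeat readings by each of a observers."""
    return math.sqrt(V_INTER / a + V_INTRA / (a * n))


def misclassification_risk(offset_deg, a=1, n=1):
    """Approximate risk that an error exceeds the offset from a cut-off.

    Assumption: normal errors, two-sided tail probability.
    """
    z = offset_deg / sem(a, n)
    return math.erfc(z / math.sqrt(2.0))


# One reading by one observer, 3.59 degrees from the cut-off.
print(f"{misclassification_risk(3.59):.3f}")
# Ten observers with one reading each, 1 degree from the cut-off.
print(f"{misclassification_risk(1.0, a=10, n=1):.3f}")
```

Under these assumptions the first case falls just below 5% and the second stays above it, consistent with the statement that an offset of 1° at 5% error risk is not achieved even with 10 observers.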
DISCUSSION
The measurement of the first intermetatarsal angle is determined using the long axes of the first and second metatarsal bones. 13 The configuration of the metatarsals may make it difficult to decide on two mid-diaphyseal points, and lead to error. In an effort to reduce this error, some authors use the centre of the articular surface of the metatarsal head and the centre of the proximal articulation to draw the long axis of the bone. 9,12
Our results show that the intra-observer measurement error is very much larger than the inter-observer error. This may seem surprising, since the latter is potentially influenced by variations of the observers' measurement technique, experience, and ability to read X-rays. The inter-observer error expresses, however, systematic differences among the observers, that is the extent to which some observers were reading the angles as consistently bigger (or smaller) than the other observers. Our result makes sense in the context of this particular measurement, especially following the procedures used to standardize the measurement technique in this study. Resch et al. found a higher inter-observer error. 10 However, their definition and calculation of this error were different from ours, and after making relevant adjustments to the calculations, the two results converged quite closely. Their intra-observer error was similar to that found in our study. That the inter-observer error subsumes the intra-observer error is often not appreciated, as for example by Saltzman et al. 11 in their analysis of multiple angle measurements in the normal foot.
The observers in our study not only did not show systematic differences, but were also very similar in their reliabilities. These results suggest that the measurement of the first intermetatarsal angle has low intrinsic measurement reliability, largely independent of characteristics of the observers. One aspect of this is the instrument itself (i.e., the goniometer or protractor), capable of resolution to at most 0.5°, which is a substantial proportion of the margin of error, at 3.6° per reading.
Our analysis combined the intra- and inter-observer errors into a total measurement error, since in practice the two cannot be dissociated. This allowed us to model the effects of alternative measurement strategies, involving several measurements and observers, on the margins of error and on the probabilities of misclassifying a reading. Such misclassification bears both on decisions to choose one type of surgery over another and on the evaluation of the effect of surgery.
Our results demonstrate the considerable reduction in measurement error, and hence the improvement in reliability, that can be achieved by making even only one repeat reading, and the progressively smaller benefits of increasing the number of averaged readings further.
The repeat readings can be made either by the same observer or by different observers. Although the latter approach reduces the margin of error more, the benefit is small in practice, especially when the total number of readings is small, because of the dominant size of the intra-observer error. Hence it is effectively immaterial whether the repeat readings are made by the same or by different observers. One benefit of the latter approach is avoidance of possible recall bias. One simple strategy might be an initial reading by two observers followed by a repeat by both in the event of a wide discrepancy.
The major issue may well be the maximum resolution capabilities of the instruments used in the study and explicit evaluation of these instruments is required to identify their contribution to the measurement error. A controlled study, incorporating in the study design prototypes of instruments with a higher resolution, may answer this question.
We have shown that considerable improvement in the reliability of the measurements can be achieved by exercising care in the measurement technique, performing the measurements at least twice, and averaging the measurements.
