Abstract
In personality research, trait–outcome associations are often studied by correlating scale sum scores with an outcome. For example, an association between the NEO Impulsiveness scale and body mass index (BMI) is often interpreted to pertain to underlying trait Impulsiveness. We propose that this expectation can be corroborated by testing for Spearman's theorem of indifference of indicator. Namely, an underlying trait–outcome association should not depend on the specific items (i.e. indicators) used to measure the trait. To test this theorem, we outline an indicator exclusion procedure and demonstrate its viability using a simulation design. We then apply this procedure to test personality–BMI associations for indifference of indicator in a large population–based sample of adult Estonians (N = 2581) using self–ratings and informant ratings obtained with the NEO Personality Inventory–3. Our results show that the N5: Impulsiveness–BMI association mostly depends on two eating–related items, suggesting that the trait associated with BMI may be narrower than the trait the N5: Impulsiveness scale is supposed to measure. Associations between BMI, E3: Assertiveness and C2: Order seem to pertain to the trait. In sum, testing for indifference of indicator provides a potentially useful method to clarify trait–outcome relationships. R scripts are provided that implement the indicator exclusion procedure. Copyright © 2015 European Association of Personality Psychology
Introduction
A common method to understand human behaviour is relating various outcomes of interest to personality traits (Ozer & Benet–Martínez, 2006; Roberts, Kuncel, Shiner, Caspi, & Goldberg, 2007). In such analyses, a trait is often operationalised as sum scores of a personality scale consisting of individual items. These individual items of a scale have been chosen to best represent and quantify an ostensibly underlying latent trait—the true personality attribute. Correlations of the scale sum scores with outcomes of interest are then interpreted as pertaining to the underlying trait. However, we argue that such interpretations may sometimes merit more careful consideration, particularly if traits are indeed conceptualised as latent variables.
In a latent variable framework, the true variance of a personality trait is manifested in the common variance of the scale's items (described further below). If so, to the extent that a correlation between a scale's sum score and an outcome reflects the correlation of the outcome with an underlying trait, the association should be independent of how the underlying trait is measured. In particular, the association should not depend on which items happen to be sampled for the measurement of the trait. Otherwise, the association would not pertain to the underlying trait but to the specific characteristics reflected in the particular items driving the observed correlation. This may sound like a trivial assumption, but it is often neglected in practice.
For example, the N5: Impulsiveness facet of the Revised NEO Personality Inventory (NEO–PI–R) has been related to body mass index (BMI) in several studies (Sutin, Costa, et al., 2013; Sutin, Ferrucci, Zonderman, & Terracciano, 2011; Sutin et al., 2015; Terracciano et al., 2009). However, not all N5: Impulsiveness items relate to BMI. Terracciano et al. (2009) specify that
the Impulsiveness scale effect [on BMI] was due to two items specific to the eating domain (‘When I am having my favorite food, I tend to eat too much’ and ‘I sometimes eat myself sick’) (p. 685).
It should be noted that the same logic has already been applied to the Big Five domains and their facets. In particular, there is ample evidence that different facets of the same Big Five domain correlate with outcomes in different ways (e.g. Costa & McCrae, 1995; Judge, Rodell, Klinger, Simon, & Crawford, 2013). The reasoning is that facets reflecting the same trait should relate to outcomes in similar ways if the associations pertain to the underlying domain–level trait. If they do not, the associations pertain to the facet(s) and not to the domain–level trait. This logic is commonly accepted, as can be seen in Sutin et al. (2011): although they demonstrate that BMI relates to trait Neuroticism, they appropriately interpret the effect as being driven by the N5: Impulsiveness facet. We propose that the same logic should be extended to the indicators of single traits or facets, that is, to single items.
There is increasing evidence that the personality trait hierarchy may be meaningfully extended below facets. A recent analysis of cross–rater agreement across broad domains, facets and single items concluded that a significant amount of descriptive variance that raters agreed on was left in NEO–PI–3 items once the effects of broad domains and facets had been removed (Mõttus, McCrae, Allik, & Realo, 2014). Similarly, another paper attempted to build unidimensional confirmatory factor analysis models of NEO–PI–3 facets; to achieve reasonable fit indices, almost all facets required additional residual correlations between items (S2 in Mõttus et al., 2015), demonstrating additional covariance, and thereby ‘signal’, below the facets. McCrae (2014) suggested that trait–like characteristics below the level of facets should be called nuances. As these nuances have not yet been documented, we restrict our analysis to items. Like nuances, single items might reflect different small trait–like features.
The benefit of understanding what causes a scale–outcome association is shown schematically in Figure 1. When there is a domain–level association with an outcome, all the sub–indicators should be correlated with the outcome as well (Figure 1A). When the association pertains to the facet instead of the domain (Figure 1B), there might still be a weaker domain–level correlation. However, it is more precise to claim that the facet, not the broad domain, is causing the trait–outcome association. Similarly, if there is a strong true association between a single item (or nuance) and an outcome, a diluted effect might be visible at the facet or even at the domain level (Figure 1C). However, it would be more precise to attribute the association to the nuance.

Trait–outcome associations at different levels of personality measurement. Filled circle = trait's association with an outcome that is indicator indifferent. Semi–filled circle = trait's association with an outcome that depends on a certain indicator. Empty circle = no association between trait and outcome. (A) Outcome relates to the domain–level trait. Therefore, the outcome also relates to all of the domain's indicators (i.e. facets) and to all indicators of the facets (i.e. items, also interpretable as nuances (McCrae, 2014)). (B) Outcome relates to the facet–level trait. The outcome could also correlate with the domain–level sum score. However, because the outcome correlates with only a single indicator of the domain, the association pertains to the facet level as such. (C) Outcome relates to a single nuance. The association might also be visible in domain– and facet–level sum scores. However, in both cases, no other indicators relate to the outcome. Therefore, the association pertains to the nuance level.
Indifference of indicator: Theoretical background
The logic outlined earlier originates from Spearman's (1927) theorem called indifference of indicator (ION¹). He introduced ION in the context of measuring general intelligence (g):
for the purpose of indicating the amount of g possessed by a person, any test will do just as well as any other, provided only that its correlation with g is equally high (Spearman, 1927, p. 197).
Here, we assume that personality traits are of the reflective type, whereby the trait is a latent common cause of its indicators (e.g. items; Bollen & Lennox, 1991; Borsboom, Mellenbergh, & van Heerden, 2003). Although we acknowledge, and embrace, other views of the nature of personality traits, treating traits as reflective seems to be the most common interpretation (Borsboom, 2006; DeYoung, 2014; McCrae, 2014). According to this interpretation, the true variance of a trait is manifested in the common variance of its indicators. The common variance can be explicitly modelled, as in latent trait modelling, or assumed to emerge via aggregation, as in applications of classical test theory. The indicators are ‘noisy’, and the degree to which they reflect the underlying latent trait can be expressed in terms of factor loadings or item–total correlations.
Consequently, if an outcome correlates with a trait, none of the trait's indicators (i.e. items) should correlate with the outcome more than the aggregate trait score. This is because the associations between single indicators and the outcome are indirect, mediated by the underlying trait. Single indicator–outcome correlations should then equal the products of the trait–outcome correlation and the factor loadings of the respective indicators. Therefore, the single indicator–outcome correlations should be proportional to their factor loadings, if it is only the latent trait that correlates with the outcome. However, this might not always be the case, as seen in the N5: Impulsiveness–BMI association addressed earlier. It seems that personality scales have not always been explicitly designed to be unidimensional (cf. Gerbing & Anderson, 1988; Gorsuch, 1997). Therefore, the items in a scale might have other sources of covariance than the trait they have been intended to measure. When such a scale is related to the outcome, it might not be the main trait but these other sources of covariance that cause the correlation between a scale and an outcome. In the case of N5: Impulsiveness, it seems that only eating–related items and not generic impulsivity items are associated with BMI. The question then is this: How can we test if an outcome relates to the trait itself or only to a certain subset of its indicators?
Methods for detecting indifference of indicator
A natural test of ION would be to measure a purported trait with multiple scales that sample different indicators of the trait and have roughly similar reliabilities or determinacies: If ION holds, their correlations with the outcome should be similar (Mõttus, Luciano, Starr, Pollard, & Deary, 2013). However, researchers can rarely afford to measure traits with multiple scales.
One may then estimate the presence of ION by correlating the outcome of interest with single items. This is a very straightforward procedure but appears to be rarely employed in practice. If most items of a scale correlate with the outcome in the same direction, this may be consistent with the presence of ION; preferably, no single item–outcome correlation should exceed the scale–outcome correlation, as discussed earlier. However, this method does not provide a formal test for deciding whether some items violate the principle of ION. For example, some variability across items in their outcome correlations is expected owing to differential factor loadings or measurement noise.
To obtain a more formal test of ION, one could employ the ‘method of correlated vectors’ (Jensen, 1998). According to this method, there should be a positive correlation between the factor loadings and outcome correlations of trait indicators. Indicators deviating from the regression line can be considered to violate ION. However, this method has been criticised on several grounds. For example, it appears to be severely biased in favour of observed associations pertaining to the underlying trait (Lubke, Dolan, & Kelderman, 2001) and also sensitive to the selection of trait indicators (Ashton & Lee, 2005).
Therefore, a procedure devoted to testing ION is needed. We propose a simple indicator exclusion approach that makes it possible to test statistically whether a scale–outcome relationship meets the ION assumption and, if it does not, which items are causing the lack of ION. Following the principles of ION, it should not matter which items happen to be used to capture the trait. Therefore, one can systematically exclude each single item from a scale and observe the resulting changes in the scale–outcome correlation. If it is the underlying trait that is related to the outcome, the original scale–outcome correlation and the correlations based on the reduced scales should not differ much from each other. However, these correlations are likely to vary more if only certain items of a scale relate to the outcome. For instance, in the example of N5: Impulsiveness, if it is trait Impulsiveness that relates to BMI, the scale–outcome correlation should not change much when the eating–related items are removed from the scale. However, if such removal causes the scale–outcome correlation to change or vanish altogether, this implies that the scale's association with BMI is to some extent, or entirely, specific to certain items.
The detailed procedure is as follows. For each item, the correlation between the scale's sum score and the outcome is calculated with that particular item excluded from the sum score.² Each of the obtained correlations is then compared with the original scale–outcome correlation (based on the sum score of all items). This comparison can be conducted with Williams's test for two dependent correlations that share one variable (Steiger, 1980); here, the shared variable is the outcome. Williams's test characterises the difference between the correlations with a p–value: a small p–value indicates that a difference of the observed size would be unlikely if the two correlations were equal in the population, so the difference could be considered real. Thus, each item receives a p–value characterising the ‘significance’ of the difference between correlations, here called ‘significance of indicator exclusion’ (SONE).²
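Williams's statistic can be written down directly from Steiger (1980): with j the shared variable (the outcome), k the full sum score and h the item–excluded sum score, T2 = (r_jk − r_jh)·√[(n − 1)(1 + r_kh) / (2((n − 1)/(n − 3))|R| + r̄²(1 − r_kh)³)], with |R| = 1 − r_jk² − r_jh² − r_kh² + 2·r_jk·r_jh·r_kh, r̄ = (r_jk + r_jh)/2 and n − 3 degrees of freedom. The Python sketch below is a minimal stand–alone version; the function names are hypothetical, and, as an assumption made to stay within the standard library, the two–tailed p–value uses a normal approximation to the t distribution, which is close for samples of several hundred or more.

```python
import math
from statistics import NormalDist


def williams_t2(r_jk, r_jh, r_kh, n):
    """Williams's T2 for H0: rho_jk = rho_jh, where variable j is shared.

    j = outcome, k = full sum score, h = item-excluded sum score,
    n = sample size; the reference distribution is t with n - 3 df.
    """
    det_r = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    denom = 2 * (n - 1) / (n - 3) * det_r + r_bar**2 * (1 - r_kh) ** 3
    return (r_jk - r_jh) * math.sqrt((n - 1) * (1 + r_kh) / denom)


def sone(r_full, r_reduced, r_full_reduced, n):
    """Two-tailed p-value for one item exclusion (a SONE value).

    Assumption: t(n - 3) is approximated by the standard normal.
    """
    t = williams_t2(r_full, r_reduced, r_full_reduced, n)
    return 2 * NormalDist().cdf(-abs(t))


print(f"T2={williams_t2(0.4, 0.2, 0.5, 103):.3f}, "
      f"p={sone(0.4, 0.2, 0.5, 103):.4f}")
```

In R, the psych package's `r.test()`, called with named arguments `r12`, `r13` and `r23`, should provide the same test with exact t–based p–values.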
The items are then ordered by SONE. If the lowest SONE is above a given p–criterion (defined later), the trait–outcome relationship can be considered indicator indifferent. Conversely, if an item has a SONE below the criterion, a partial or complete lack of ION is likely due to that particular item. Therefore, the item with the lowest SONE should be excluded from the scale in order to establish an indicator–indifferent trait–outcome relationship. The whole procedure is then repeated on the remaining set of items to determine whether any additional items compromise ION. The procedure stops when no remaining item obtains a SONE below the criterion. As a result, the excluded items are those whose correlation with the outcome is independent of the scale's ostensible underlying trait.
As always, there is the question of an optimal criterion p–value. The significance criterion has to be a trade–off between Type I and Type II errors, given the sample size. In smaller samples, the power to detect a difference between correlations is lower, and therefore a more lenient criterion is needed (which obviously increases the Type I error rate). In large samples, the procedure is likely to flag very small changes in correlation as significant, even when they signal only a trivial lack of ION. To establish an optimal significance criterion for a given sample size, we conducted a simulation. The simulation modelled scale–outcome associations that either had or did not have ION. The optimal cut–off criterion was selected to minimise the likelihood of excluding items when the association was indicator indifferent and to maximise the likelihood of excluding items when the association lacked ION.
Aims of the paper
The current paper outlines the method and the usefulness of testing trait–outcome associations for ION. Study 1 used simulations to demonstrate and test the viability of the proposed indicator exclusion method and to establish appropriate p–value criteria. In Study 2, the procedure was applied to empirical data by replicating BMI–personality trait associations (Terracciano et al., 2009) in a large population–based sample of adult Estonians. Scales that were significantly correlated with BMI were tested for ION. The BMI–personality relationship was chosen because some papers have explicitly tested whether the Impulsiveness–BMI association depends on a few items of the N5: Impulsiveness scale (Iacovino et al., 2014; Terracciano et al., 2009), whereas other papers have not (Sutin, Costa, et al., 2013; Sutin et al., 2011, 2015). The BMI–personality dataset includes both self–reports and informant reports. Given that informant ratings are considered as reliable as self–ratings of personality (Kolar, Funder, & Colvin, 1996), we assume that any robust association will replicate across both rating types. This can be regarded as a case of constructive replication (Lykken, 1968).
Study 1: Simulation Study
The goal of the simulation study was to show that ION could be tested with the indicator exclusion approach. The procedure was tested on simulated scales and outcomes, whose relations were or were not indifferent of indicator. We expected that the procedure would correctly highlight items that caused lack of ION between scales and outcomes. The secondary goal of the simulation study was to provide relevant significance criteria for Study 2.
Simulation methods
A trait scale and an outcome were simulated according to two sets of scenarios (Figure 2). First, consistent with ION, scores of a single normally distributed trait (t) were simulated (N = 1000, μ = 0, σ = 1). This trait was allowed to contribute to an outcome (o) with a standardised regression weight β, such that o = β·t + ε, where ε was a normally distributed random error term (μ = 0, σ = 1). We tested three β values: 0.15, 0.25 and 0.35. The main focus is on β = 0.25, as this appears to be the effect size of the N5: Impulsiveness–BMI association in Study 2. The simulated trait was manifested in a scale consisting of eight items. The items had factor loadings (λ) ranging from 0.4 to 0.7; arguably, these may be seen as rather desirable parameter values (e.g. Ford, MacCallum, & Tait, 1986). We chose eight items per scale, as in the NEO–PI scales, so that the results of Study 1 could be applied in Study 2. This scale is referred to as ‘8 + 0’, indicating that the scale does not have any items reflecting a secondary trait.

Diagrams of the different scales modelled. In 8 + 0 scenario, the scale–outcome association is indicator indifferent. In other scenarios, the scale–outcome association pertains to a subset of items reflecting a separate trait (T2). Rectangles are observed items or outcomes, and circles denote latent variables. Shapes filled with black denote outcome–related traits and indicators, and white shapes are not related to an outcome. Model parameters are shown once, although they apply in several scenarios. β = trait–outcome association; λ = factor loading; i = item; k = number of items in a scale, here k = 8; o = outcome; r = correlation between traits; T = trait; T1 = trait not related to outcome; T2 = trait related to outcome.
A second set of scenarios reflected a lack of ION owing to a lack of unidimensionality in the scale. Instead of a single underlying trait, a composite scale reflecting two correlated underlying traits (T1 and T2) was created. Here, T1 represented the main trait that the scale purported to measure, whereas T2 represented a trait related to but distinct from T1. Crucially, only T2 was allowed to contribute to the outcome o, which mimics the situation where only a subset of a scale's items drives the correlation with an outcome. We tested three association strengths: β = 0.15, 0.25 and 0.35. The correlation between T1 and T2 was set at 0.3, reflecting a rather typical inter–facet correlation in personality questionnaires (Ostendorf & Angleitner, 1994). Three versions of an eight–item scale lacking ION were simulated, in which seven, six or five items were manifestations of T1 and, respectively, one, two or three items were manifestations of T2. Otherwise, the same procedure and item generation as earlier were used. These three scales lacking ION are referred to as ‘7 + 1’, ‘6 + 2’ and ‘5 + 3’.
All scenarios were simulated 10 000 times. Within each simulation, scale–outcome associations were analysed with the indicator exclusion procedure. In the scale with ION (8 + 0), no item was excluded; however, we recorded the lowest SONE value, which was needed for deriving the optimal p–criterion (see below).
A similar procedure was conducted for the scales without ION (7 + 1, 6 + 2, 5 + 3). Besides recording the lowest SONE value, the procedure removed the item with the lowest SONE and then repeated itself until, in principle, no T2 items were left. For instance, in the case of 7 + 1, we first calculated the lowest SONE, then excluded the respective item and calculated the lowest SONE again based on the remaining seven items (referred to as ‘7 + 0’). To estimate the accuracy of the procedure, we inspected whether the excluded item did in fact belong to T2. A similar procedure was conducted for 6 + 2, where one item was excluded first (6 + 1) and another thereafter, leaving a scenario in which no more items should be removed (6 + 0).
As stated in the section on Methods for Detecting Indifference of Indicator, the optimal p–criterion should be low enough to minimise the likelihood of excluding items when ION in fact holds. Therefore, the p–criterion has to be lower than the lowest SONE value in scenarios where no items should be removed from the scale (8 + 0, 7 + 0, 6 + 0, 5 + 0). At the same time, the criterion should be high enough to maximise the likelihood of excluding items when the association lacks ION. Therefore, the p–criterion has to be higher than the lowest SONE value in scenarios lacking ION (7 + 1, 6 + 2 and 5 + 3, and also 6 + 1, 5 + 2 and 5 + 1).
For all mean SONE values, we also depicted 95% confidence intervals. All simulations were conducted for sample sizes of 100, 250, 500, 750, 1000, 2500 and 5000 and with three different β values. Analyses were conducted in the R environment (R Core Team, 2013), occasionally relying on the ‘psych’ package (Revelle, 2014).
Results and Discussion
Table 1 shows how the item exclusion's accuracy depends on both the effect size in question and sample size. For larger effects (β = 0.35), violations of ION can be reasonably accurately (95%) detected in samples of 500 or more, whereas for β = 0.15, one would need a sample size of 2500 or more to achieve a comparable level of accuracy. Although the item exclusion procedure requires relatively large samples if expected effect sizes are small, it may be a promising method to detect items causing lack of ION.
Percent of correctly excluded items across scenarios lacking ION
Note: 10 000 simulations per cell. β = association strength between trait (T2) and outcome. See text for detailed explanation of scenarios.
However, in real data, we do not know a priori how many items need to be removed. Therefore, p–criteria are needed to decide how many and which items to exclude. To define the maximum and minimum criteria, the average SONE values were calculated across simulations for all sample sizes. The results for β = 0.25 are depicted in Figure 3. Mean SONE was greater for scales with ION (solid lines in Figure 3B–D) than for scales without ION (dashed lines in Figure 3B–D) at all sample sizes. This suggests that an optimal p–value can be found between those two extremes. Further, once samples reached 1000 or more, the confidence intervals for the scales that lacked ION (dashed lines in Figure 3B–D) became narrower, suggesting that a lack of ION could be detected more clearly in larger samples. Similar figures for β = 0.15 and β = 0.35 can be found in the Supporting Information (Figures S1 and S2).

Mean minimum ‘significance of indicator exclusion’ (SONE) values across different simulation conditions and different sample sizes. (A) The scale was designed to have an indicator–indifferent (ION) relationship with an outcome. (B–D) The scale–outcome relationship was not indicator indifferent, such that most items correlated with the main trait (T1), but the outcome related to a sub–trait (T2) represented by one to three items only. The T2–related items were then iteratively removed (e.g. 5 + 2 refers to a scale from which one item had been removed). Trait–outcome association (β) = 0.25.
To define the maximum values of the p–criteria, we first identified the lowest mean SONE across the scenarios in which no items had to be removed, which for β = 0.25 corresponded to the lowest solid line across Figure 3 (i.e. Figure 3A). Thereafter, to define the minimum values of the p–criteria, we identified the highest SONE values among the scenarios lacking ION, which corresponded to the highest dashed line across Figure 3 (i.e. the 5 + 1 scenario in Figure 3D). These two scenarios are plotted again in Figure 4.

Detecting optimal p–criteria. Minimum criteria: highest ‘significance of indicator exclusion’ (SONE) values that excluded items causing lack of indifference of indicator (scenario 5 + 1 from Figure 3D). Maximum criteria: lowest SONE values that did not exclude items that belonged to a trait related to outcome (scenario 8 + 0 from Figure 3A). Optimal criteria: geometric mean between minimum and maximum. Trait–outcome association (β) = 0.25.
Apparently, in smaller samples it was impossible to find an optimal p–value, as the minimum criteria were larger than the maximum criteria (Figure 4). For β = 0.25, 750 was the minimum sample size from which onwards it was possible to obtain optimal p–values (Figure 4). A likely reason is that with smaller sample sizes it is very hard to reliably exclude the correct item in the 5 + 1 scenario; the uncertainty of this scenario is also illustrated by the relatively wide confidence intervals in Figure 3(D) and by the lowest percentage of correctly excluded items in Table 1. Where it was possible to find an optimal p–value, this value was set to the geometric mean of the maximum and minimum criteria. Optimal p–values for different effect sizes and sample sizes are provided in Table 2. For example, with a sample size of 2500 and β = 0.25, if removal of an item changes the scale–outcome correlation with a significance below 0.0037, then this item is causing a violation of ION and has a separate relationship with the outcome.
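The rule for choosing the criterion is simple enough to state as a small Python function. The name `optimal_p_criterion` is hypothetical, and the input values in the example are invented for illustration rather than read from Figure 4 or Table 2.

```python
import math


def optimal_p_criterion(min_criterion, max_criterion):
    """Geometric mean of the minimum and maximum p-criteria, or None.

    min_criterion: highest mean lowest-SONE value among ION-violating
                   scenarios (in the text, the 5 + 1 scenario).
    max_criterion: lowest mean lowest-SONE value among scenarios in which
                   no item should be removed (in the text, 8 + 0).
    """
    if min_criterion >= max_criterion:
        return None  # no p-value separates the scenarios at this sample size
    return math.sqrt(min_criterion * max_criterion)


print(optimal_p_criterion(1e-4, 0.1))   # hypothetical inputs
print(optimal_p_criterion(0.2, 0.01))   # criteria overlap: None
```

Taking the geometric rather than the arithmetic mean places the criterion halfway between the two bounds on a log(p) scale, which suits p–values that can span several orders of magnitude.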
Optimal p–values for different trait–outcome associations and sample sizes
Note: β = association between trait and outcome; – = optimal p–value cannot be determined.
Study 2: Personality–BMI Relationships
Next, we sought to apply the tools outlined in the simulation analysis to study the presence or absence of ION in personality scale–BMI associations. We first screened for personality traits that related to BMI in both self–report and informant report. This also enabled us to verify whether ION testing would replicate.
Methods
Participants
Participants were drawn from the Estonian Genome Center, University of Tartu (EGCUT). The EGCUT was launched as an initiative of the Estonian Government in 2001 to create a database of health, genealogical and genome data representing 5% of the Estonian population (Leitsalu et al., 2014). EGCUT participants were randomly selected from individuals visiting general practitioners (GPs) and hospitals and were recruited by GPs and hospital physicians. All participants gave informed consent. In addition to donating blood samples and answering a medical questionnaire, participants were asked to complete the self–report version of a comprehensive personality test and to find a knowledgeable informant who could complete the same questionnaire about them.
In total, the sample used in the present study included 2581 people (of whom 1398 were women) with a mean age of 44.0 years (SD = 17.3, ranging from 18 to 90 years) and a mean BMI of 26.08 (SD = 4.9, ranging from 15.9 to 54.1). Participants’ weight and height were measured at recruitment. The percentage of overweight participants (BMI > 25) in this sample is 52.9%, which is close to a survey–based prevalence estimate of 49% based on 5000 Estonians (Tekkel & Veideman, 2013). However, another estimate, based on objectively weighing 495 participants representing the population, suggested that the prevalence of overweight might reach 67% (Eglit, Ringmets, & Lember, 2013). Of the 2581 participants, 8.2% had elementary education, 24.5% had secondary school education, 27.9% had secondary specialised education and 39.4% had a higher education degree. Of the informants, 52% were spouses or partners, 15% friends, 15% parents, 6% children or grandchildren, 6% siblings, 3% acquaintances and 3% other relatives. The informants were 72% female, and the mean age of informants was 42.4 years (SD = 16.1, ranging from 11 to 89 years). Overall, 5.6% of informants had elementary education, 25% had secondary education, 27.1% had secondary specialised education and 42.2% had a higher education degree.
Measures
The NEO Personality Inventory–3 (NEO–PI–3; McCrae & Costa, 2010) is a slightly modified version of the NEO PI–R questionnaire (Costa & McCrae, 1992) that was translated into Estonian by Kallasmaa and colleagues (Kallasmaa, Allik, Realo, & McCrae, 2000). Like the original NEO PI–R, the NEO–PI–3 has 240 items, which measure 30 personality traits grouped into the five–factor model domains. The NEO–PI–3 has excellent psychometric properties in a wide range of countries, including Estonia (De Fruyt, De Bolle, McCrae, Terracciano, & Costa, 2009). Participants themselves completed the self–report form of the NEO–PI–3. Informants completed the observer report form. In line with typical findings (Connolly, Kavanagh, & Viswesvaran, 2007), the correlations between the respective scale scores based on self–reports and informant reports were 0.53, 0.66, 0.61, 0.47 and 0.53 for Neuroticism, Extraversion, Openness, Agreeableness and Conscientiousness, respectively, and ranged from 0.39 to 0.62 (median = 0.46) for the 30 facet scales. For single items, the self–report–informant report correlations ranged from 0.13 to 0.56 (median = 0.30). All reported correlations were significant at p < 0.01.
Analytic strategy
Body mass index was log–transformed owing to its skewed distribution and residualised for the effects of age, gender and education, as these variables might confound personality–obesity relationships (Armon, Melamed, Shirom, Shapira, & Berliner, 2013; Brummett et al., 2006; Ogden et al., 2006; Rolls, Fedoroff, & Guthrie, 1991; Sutin, Costa, et al., 2013; Tekkel & Veideman, 2013). First, as a replication of earlier studies (Sutin et al., 2011; Terracciano et al., 2009), BMI was correlated with the NEO–PI–3 domain and facet scales. Only scales significantly (p < 0.01) correlated with BMI in both self–reports and informant reports were taken forward to the ION testing procedure. For ION testing, the optimal p–criterion was chosen on the basis of the simulation results described earlier, considering our sample size and the assumed trait–outcome association (β), which was set close to the highest single item–outcome correlation. We preferred single item correlations because, if a scale violates ION, the sum score–outcome correlation could be misleadingly low (Figure 1). In the indicator exclusion procedure, indicators causing a violation of ION were excluded iteratively until no SONE value was below the criterion. The results were plotted as single item–outcome correlations, with the excluded items highlighted (Figure 5, ‘single indicator–outcome correlation’). To further demonstrate that the excluded items might have their own relationship with BMI, the scale–BMI relationship was graphed with and without the excluded items (Figure 5, ‘Scale–outcome correlation’). Some scale–BMI correlations were too small to be properly tested for ION; in these cases, only the single item–outcome correlations were plotted for a preliminary assessment of ION (Figure 6). Bar plots were produced with the ‘gplots’ package (Warnes et al., 2014).
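The BMI preprocessing step can be sketched in Python (standard library only): log–transform BMI, then keep the residuals of an ordinary least squares regression on an intercept, age, gender and education. The coding of the covariates is an assumption, since the text does not specify it: gender as 0/1 and education as a single numeric code (dummy–coding the four education levels would instead add three indicator columns). The hand–written solver only keeps the example dependency–free; in R, `residuals(lm(...))` does the same job. The demo data are simulated and hypothetical.

```python
import math
import random


def _solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination (partial pivoting)."""
    k = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))) / m[r][r]
    return x


def residualise(y, covariates):
    """Residuals of y from an OLS regression on an intercept plus the covariates."""
    rows = [[1.0] + [cov[p] for cov in covariates] for p in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yp for r, yp in zip(rows, y)) for i in range(k)]
    coef = _solve(xtx, xty)  # normal equations: (X'X) b = X'y
    return [yp - sum(c * v for c, v in zip(coef, r)) for r, yp in zip(rows, y)]


def bmi_outcome(bmi, age, gender, education):
    """log(BMI) residualised for age, gender (0/1) and an education code (assumed coding)."""
    return residualise([math.log(v) for v in bmi], [age, gender, education])


# Hypothetical demo data.
rng = random.Random(3)
n = 300
age = [rng.uniform(18, 90) for _ in range(n)]
gender = [rng.randint(0, 1) for _ in range(n)]
education = [rng.randint(1, 4) for _ in range(n)]
bmi = [math.exp(3.1 + 0.003 * a + 0.02 * g + rng.gauss(0, 0.15))
       for a, g in zip(age, gender)]
res = bmi_outcome(bmi, age, gender, education)
```

By construction the residuals sum to zero and are uncorrelated with each covariate, so any remaining personality–BMI correlation cannot be carried by linear effects of these three variables.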

Figure 5. Testing the relationship between body mass index (BMI) and N5: Impulsiveness for indifference of indicator with the indicator exclusion procedure.

Figure 6. Single item–outcome associations of E3: Assertiveness and C2: Order items with body mass index (BMI).
Results
Correlations between BMI and personality scales
Table 3 lists the generally weak correlation coefficients between BMI and personality scales in both self–ratings and informant ratings. Three facets correlated significantly (p < 0.01) with BMI in both self–reports and informant reports and were thus selected for further analysis: BMI related positively to N5: Impulsiveness and E3: Assertiveness and negatively to C2: Order.
Table 3. Residualised correlation coefficients between personality traits and BMI, across self–ratings and informant ratings
Note: Body mass index (BMI) has been residualised for the effects of age, gender and education. Bold = significant at p < 0.01. rp = residualised correlation coefficient. Facets in italics showed significant effects in both self–ratings and informant ratings and were used in further analysis.
Testing scale–BMI associations for ION
The optimal p–criterion for N5: Impulsiveness was 0.0037, as some of its items approached a correlation of 0.25 with BMI (Table 2, Figure 5). The N5: Impulsiveness–BMI association failed to meet ION (Figure 5). In particular, the indicator exclusion procedure suggested that the scale's relationship with BMI depended on two eating–related items (#111, ‘I tend to eat too much of my favourite food’, and #171, ‘Sometimes I am not able to control my appetite’), as well as on a more general impulse control item (#51, ‘It is hard for me to control my impulses’).
The effects of the other scales were too small to be properly tested for ION. Simulations with β = 0.10 and n = 2500 had revealed that the percentage of correctly excluded items could range from 60% to 85% and that the optimal p–criterion could not be determined. However, for a preliminary assessment, we plotted the single item–outcome correlations. Figure 6 suggests that, in contrast to N5: Impulsiveness, the associations of E3: Assertiveness and C2: Order scores with BMI were more likely to pertain to the core traits of the scales, as over half of the items related to the outcome to a roughly equal degree. At the same time, there were small inconsistencies, as some items had very low correlations; future studies with higher power can formally test these associations for ION.
Discussion
Testing personality–BMI relationships revealed that the well–documented correlation between BMI and Impulsiveness depended on a subset of items, mostly those relating to eating. This association therefore fails to meet the assumption of ION. The effects of E3: Assertiveness and C2: Order were more likely to be indicator indifferent, but we were unable to test these associations formally owing to their small effect sizes. All these effects manifested in both self–reports and informant reports, supporting the robustness of the findings.
There could be two explanations for the lack of ION in the case of the N5: Impulsiveness–BMI association. First, the two eating–related items could reflect eating–related impulsivity, a construct that has been suggested to be more relevant to BMI than domain–general impulse control across various measures (Houben, Nederkoorn, & Jansen, 2013; Rasmussen, Lawyer, & Reilly, 2010; Tsukayama, Duckworth, & Kim, 2012; also Vainik, Dagher, Dubé, & Fellows, 2013). For instance, the three BMI–related items from Impulsiveness relate strongly to other eating–related scales (Vainik, Neseliler, Konstabel, Fellows, & Dagher, 2015). Second, items asking about overeating may be logically so close to BMI as to make any meaningful conclusion difficult: people might first observe that they are overweight and then conclude that they are unable to control themselves. Consistent with this, N5: Impulsiveness is known to change in parallel with weight status (Sutin, Costa et al., 2013). Whichever stance is taken, either interpretation is more precise than claiming that trait Impulsiveness is the underlying attribute relating to BMI. Based on these results, there is no basis for claiming that it is impulsiveness, as some sort of underlying trait, that correlates with BMI, and this is an important finding in its own right.
The sum score of N5: Impulsiveness has been related to several other interesting outcomes, including BMI change, eating behaviours and disorders, leptin levels, white blood cell counts, drug and alcohol consumption, gambling and brain activity, such as dopamine secretion and reward responsiveness (Bagby et al., 2007; Elfhag & Morey, 2008; Jen, Saunders, Ornstein, Kamali, & McInnis, 2013; Oswald et al., 2007; Ruiz, Pincus, & Dickinson, 2003; Sutin, Costa et al., 2013; Sutin, Evans, & Zonderman, 2013; Sutin, Zonderman et al., 2013; Sutin et al., 2011, 2012; Villafuerte et al., 2012). It would be interesting to reanalyse these effects for ION to understand whether these outcomes pertain to trait Impulsiveness as the underlying attribute or to something more specific (cf. Iacovino et al., 2014; Terracciano et al., 2009).
The effects of E3: Assertiveness and C2: Order on BMI are very small, but they have been found repeatedly across several studies (Sutin et al., 2011; Terracciano et al., 2009). The effect of Order suggests that more organised persons have a lower risk of obesity. A potential mechanism could be consistency in eating patterns: having similar meals across eating episodes has been shown to relate positively to successful weight maintenance and other health indices (Gorin, Phelan, Wing, & Hill, 2003; Pachucki, 2012; Wing & Phelan, 2005; see Vainik, Dubé, Lu, & Fellows, 2015, for further discussion). The effect of assertiveness seems to be instrument specific; studies with other personality instruments suggest instead that lack of assertiveness relates to maladaptive eating behaviours (Elfhag, 2005; Elfhag & Erlanson–Albertsson, 2006). This points to a potential lack of ION, which could be tested by linking items from multiple assertiveness scales to BMI.
General Discussion
The current paper has outlined a procedure to test whether a trait–outcome association pertains to the whole scale or to its particular items. Guided by Spearman's (1927) theorem of ION, we suggest that all indicators of an underlying trait should relate similarly to the outcome. To apply this theorem in a personality context, we designed an indicator exclusion procedure that tests whether excluding an item significantly changes the scale–outcome correlation.
Testing for ION is likely to be widely applicable, given that most personality trait–outcome research has so far relied exclusively on linking outcomes with aggregate scores. We want to emphasise that the test is a win–win. If the observed trait–outcome associations appear indicator indifferent, the findings appear even more robust. Otherwise, testing for ION may yield a more detailed description of the personality characteristics relating to focal outcomes. Therefore, while the sum–score approach remains usable, supplementing it with ION testing provides either greater confidence in the observed effects or a better understanding of them.
Are some personality scales more likely to lack ION with outcomes? There is no clear answer, as various personality tests have been built using very different psychometric standards (Borsboom, 2006). Some scales have been constructed using rigorous standards (Gerbing & Anderson, 1988), which has tended to result in unidimensional scales. Unidimensional scales may be more likely to have an indicator–indifferent relationship with an outcome, as they have been designed to reflect a single trait. In contrast, a scale–outcome association is more likely to lack ION if the scale has been constructed using other popular psychometric approaches that do not guarantee unidimensionality, such as reliance on Cronbach's alpha (e.g. Dunn, Baguley, & Brunsden, 2014; Green, Lissitz, & Mulaik, 1977) and principal component analysis (e.g. Gorsuch, 1997). Such scales are more likely to incorporate multiple underlying traits, of which only one or some may relate to a particular outcome.
The indicator exclusion procedure provides a statistical method to test whether removal of an indicator causes the trait–outcome correlation to change significantly. This is a very straightforward test of ION, both conceptually and methodologically. The scripts to estimate a p–criterion and run the analysis in the R environment are available at http://www.ut.ee/uku.vainik/ion/. An even simpler approach is to correlate single items with the outcome, which provides preliminary hints as to which items drive the scale–outcome association.
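The iterative logic of the indicator exclusion procedure can be sketched in Python as below. This is not the authors' published R script and does not reproduce their SONE statistic. As an assumption, each item's influence is assessed by comparing the outcome's correlation with the full sum score against its correlation with the sum score omitting that item, using Williams' t test for dependent correlations that share a variable, with a normal approximation to the p–value (reasonable only for large samples). At each step, the item with the smallest p–value below the criterion is excluded. All function names and the simulated data are hypothetical.

```python
import math

import numpy as np


def williams_t(r_jk, r_jh, r_kh, n):
    """Williams' t for H0: rho_jk == rho_jh, where variable j is shared."""
    det_r = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    denom = 2 * (n - 1) / (n - 3) * det_r + r_bar**2 * (1 - r_kh) ** 3
    return (r_jk - r_jh) * math.sqrt((n - 1) * (1 + r_kh) / denom)


def two_sided_p(t):
    """Normal approximation to a two-sided p-value (t has n - 3 df)."""
    return math.erfc(abs(t) / math.sqrt(2))


def exclude_indicators(items, outcome, p_criterion):
    """Iteratively drop items whose removal significantly changes r(scale, outcome).

    items: n x k array, already keyed in the scale direction.
    Returns (excluded, kept) lists of column indices.
    """
    items = np.asarray(items, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    n = len(outcome)
    kept = list(range(items.shape[1]))
    excluded = []
    while len(kept) > 2:
        full = items[:, kept].sum(axis=1)
        r_full = np.corrcoef(outcome, full)[0, 1]
        p_values = {}
        for i in kept:
            reduced = full - items[:, i]  # sum score without item i
            r_reduced = np.corrcoef(outcome, reduced)[0, 1]
            r_overlap = np.corrcoef(full, reduced)[0, 1]
            t = williams_t(r_full, r_reduced, r_overlap, n)
            p_values[i] = two_sided_p(t)
        worst = min(p_values, key=p_values.get)
        if p_values[worst] >= p_criterion:
            break  # no remaining item changes the correlation significantly
        kept.remove(worst)
        excluded.append(worst)
    return excluded, kept


# Hypothetical data: 8 items share a common factor, but only the specific
# parts of items 0 and 1 relate to the outcome (a deliberate ION violation).
rng = np.random.default_rng(1)
n, k = 2500, 8
factor = rng.normal(size=n)
specific = rng.normal(size=(n, k))
items = factor[:, None] + 1.5 * specific
outcome = 0.3 * specific[:, 0] + 0.3 * specific[:, 1] + rng.normal(size=n)

item_r = [np.corrcoef(items[:, i], outcome)[0, 1] for i in range(k)]
print([round(r, 2) for r in item_r])  # single item-outcome correlations
excluded, kept = exclude_indicators(items, outcome, p_criterion=0.0037)
print(excluded, kept)
```

In this toy example, the two outcome–related items (0 and 1) should be the first to be excluded. The criterion 0.0037 is borrowed from the N5: Impulsiveness analysis purely for illustration; with real data it would come from simulations matched to the sample size and assumed β. Excluding only the single most influential item per step is a design choice that keeps one strong item from masking the influence of another.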
Regarding limitations, it is important to note that all these methods are based on correlations and hence share the usual assumptions of correlational analyses regarding the data and the sample size (e.g. Schönbrodt & Perugini, 2013). Further, the effectiveness of ION testing depends on the sample size and the effect size in question. Hopefully, future optimisations of ION testing can reduce the sample size requirements. In Study 2, we were able to test our main association of interest, N5: Impulsiveness–BMI. However, smaller trait–BMI associations remain to be tested in larger studies. For instance, one could employ the 1958 National Child Development Study, which includes over 9000 persons (TNS BMRB, 2014). At the same time, larger effect sizes can be studied well in smaller samples (Table 2). The R script published alongside this article can be used to check beforehand which sample sizes are sufficient for a given effect size.
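As a rough companion to that simulation–based check, the Fisher z approximation gives the sample size needed merely to detect a scale–outcome correlation of a given size at α = 0.01, the screening threshold used in Study 2. This is a swapped–in analytic shortcut, not the authors' R script: it says nothing about the power to detect a violation of ION, which, judging from the simulations reported for β = 0.10 and n = 2500, can require considerably larger samples. The function name n_to_detect_r and the 80% power target are assumptions.

```python
import math
from statistics import NormalDist


def n_to_detect_r(rho, alpha=0.01, power=0.80):
    """Approximate n needed to detect a correlation rho with a two-sided test.

    Fisher z approximation: n = ((z_{1-alpha/2} + z_{power}) / atanh(rho))^2 + 3.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(rho)) ** 2 + 3)


for rho in (0.05, 0.10, 0.25):
    print(rho, n_to_detect_r(rho))
```

The required n grows roughly with 1/ρ², which is why small facet–BMI correlations are hard to study even before ION testing is considered.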
There is good reason to expect that outcomes other than BMI could also be linked to item–specific variance in the NEO–PI. As highlighted in the Introduction, a few recent studies propose that there is considerable meaningful variance in the NEO–PI below the facet level (McCrae, 2014; Mõttus et al., 2014; Mõttus et al., 2015). Interpreting item– or nuance–level variance might provide benefits similar to those that interpreting facets has provided over domains (Briley & Tucker–Drob, 2012; Costa & McCrae, 1995; Judge et al., 2013). The conventional wisdom may be that single items are infused with (random) measurement error and should not be used for substantive research, whereas aggregates reduce (random) measurement error and are therefore better suited for being linked with other variables. This assumption may need re–assessment. In this study, single items often predicted BMI as well as or better than the aggregate scores they were part of; in fact, this was true for all three facet scales considered in more detail. This suggests that single items may be worth considering as substantive variables in their own right rather than as mere measurement devices.
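Why a single item can predict an outcome as well as or better than its own scale follows from the formula for the correlation of a sum. Assuming k standardised items z_i with sum score S and a standardised outcome Y, with r_{iY} the item–outcome correlations, r_{ij} the inter–item correlations and \bar{r} their mean:

```latex
r_{SY} = \frac{\operatorname{cov}(S, Y)}{\sigma_S}
       = \frac{\sum_{i=1}^{k} r_{iY}}{\sqrt{\sum_{i=1}^{k}\sum_{j=1}^{k} r_{ij}}}
       = \frac{\sum_{i=1}^{k} r_{iY}}{\sqrt{k + k(k-1)\,\bar{r}}}

% Hypothetical 8-item facet: \bar{r} = 0.30, two items with r_{iY} = 0.20, six with r_{iY} = 0
r_{SY} = \frac{0.20 + 0.20}{\sqrt{8 + 8 \cdot 7 \cdot 0.30}}
       = \frac{0.40}{\sqrt{24.8}} \approx 0.080
```

With these illustrative values (not estimates from this study), the sum score correlates only about 0.08 with the outcome, well below the 0.20 of each of the two relevant items, because the six unrelated items inflate the variance of S without adding covariance with Y.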
In conclusion, we have proposed a method that clarifies whether trait–outcome associations are caused by scales or by particular items. The study illustrates that Spearman's theorem of ION is well suited for better understanding personality trait–outcome relationships. We outlined the principles of testing indicator indifference and demonstrated the indicator exclusion procedure in a simulation study. We then applied the procedure to clarify the relationship between personality facets and BMI using a large sample with diverse demographic backgrounds and detailed personality data from self–reports and informant reports. We hope that testing for ION will lead to a more precise understanding of personality trait–outcome relationships.
Supporting information: per2009-sup-0001-supplementary for ‘Are Trait–Outcome Associations Caused by Scales or Particular Items? Example Analysis of Personality Facets and BMI’ by Uku Vainik, René Mõttus, Jüri Allik, Tõnu Esko and Anu Realo, European Journal of Personality.
Footnotes
Acknowledgements
We would like to thank Delaney Michaell Skerret, Kenn Konstabel, Lesley Fellows, Maarika Paaver, Tom Booth and anonymous reviewers for their valuable comments on earlier versions of the paper, as well as Andres Metspalu for his support.
The Estonian Genome Center of the University of Tartu was financed by two FP7 grants (201413, 245536). It also received targeted financing from the Estonian Government (SF0180142s08), from the University of Tartu within the framework of the Center of Translational Genomics and from the European Union through the European Regional Development Fund within the framework of the Centre of Excellence in Genomics. This study was also supported by research funding from the University of Tartu (SP1GVARENG) and by institutional research funding (IUT2–13) from the Estonian Ministry of Education and Science. Anu Realo was supported by a grant from the Netherlands Institute for Advanced Study (NIAS) during the preparation of this article.
Notes
References
