Abstract
The psychometric properties of the English-language NIH Toolbox for Assessment of Neurological and Behavioral Function (NIH Toolbox) have been examined in numerous populations. This study evaluated the reliability and validity of the Spanish-language NIH Toolbox. Participants were children aged 3 to 7 years and adults aged 18 to 85 years who took part in the NIH Toolbox norming study in Spanish. Results supported the internal consistency reliability of included measures. Test–retest reliability was strong for most tests, though it was weaker for the test of olfaction among children and the test of locomotion among adults. Spearman’s correlations and general linear models showed Spanish tests were often associated with age, sex, and education. Convergent validity for the two language measures that underwent more intensive development, evaluated via Spearman’s correlations with legacy measures, was strong. Results support using the Spanish-language NIH Toolbox to measure neurological and behavioral functioning among Spanish-speaking individuals in the United States.
Over the past half century, the Latinx population in the United States has increased ninefold, currently representing approximately 18% of the overall population (U.S. Census Bureau, 2018a). Future projections indicate that by 2060, Latinxs will represent 27.5% of the population, reflecting a near doubling of the current number of Latinxs living in the United States (U.S. Census Bureau, 2018b). While English is the dominant language of the United States, as of 2011 only approximately half of Latinxs living in the United States reported being able to speak English “very well” (Ryan, 2013). Moreover, of the more than 60 million individuals aged 5 years and older living in the United States who reported speaking a language other than English at home, nearly two thirds (62%) spoke Spanish (Ryan, 2013). Given these increases, consideration of the linguistic needs of the Latinx population in the United States is of particular importance for clinicians and researchers who wish to evaluate different components of neurological and behavioral functioning in multicultural contexts.
In recognition of this, the NIH Toolbox for Assessment of Neurological and Behavioral Function® (NIH Toolbox®) was developed as a collection of neurobehavioral measures designed to quickly assess sensory, motor, emotional, and cognitive functioning across the life span (aged 3 to 85 years) in both English and Spanish. It was created to be appropriate for use in longitudinal studies by measuring the same constructs at different stages of human development, and can be used to track individual performance over time. Additionally, it was designed to provide a “common currency” in epidemiologic research, enabling the pooling and sharing of large data sets across studies. The full NIH Toolbox measurement system can be administered in 2 hours or less and, although it was not originally developed for clinical use, a growing body of evidence demonstrates its applicability for evaluating functioning in various normative and clinical populations; for example, cancer (Sinha et al., 2018), neurologic conditions (Tulsky & Heinemann, 2017), and social anxiety (Troller-Renfree et al., 2015).
The psychometric properties of the English-language version of the NIH Toolbox have been examined in the general U.S. population and numerous clinical populations. For example, prior studies have demonstrated support for the reliability and validity of the majority of the NIH Toolbox measures in healthy samples in English (Coldwell et al., 2013; Dalton et al., 2013; Reuben et al., 2013; Rine et al., 2013; Salsman et al., 2013; Varma et al., 2013; Weintraub et al., 2013). Additionally, the clinical utility and psychometric strength of the tests of cognitive functioning have been demonstrated among individuals with disabilities (Tulsky & Heinemann, 2017), and among children aged 3 to 15 years (Bauer & Zelazo, 2013). Although normative standards for the Spanish-language tests of cognitive (Casaletto et al., 2016) and emotional (Babakhanyan et al., 2018) functioning have been published, a comprehensive psychometric evaluation of the Spanish NIH Toolbox measures has not. In the present study, we explore the reliability and validity of the Spanish-language version of the NIH Toolbox for children and adults across four domains of functioning: sensory, motor, emotional, and cognitive.
Method
Participants and Procedures
Data from the NIH Toolbox norming study were used. These data are publicly available and can be accessed via the HealthMeasures Dataverse data repository (https://dataverse.harvard.edu/dataverse/HealthMeasures). Participants in the NIH Toolbox norming study were community-dwelling children and adults who were neurologically healthy and capable of following instructions. A sampling strategy stratified by age, sex, and primary language was followed in accordance with the published NIH Toolbox norming plans (Beaumont et al., 2013). Participants were recruited by the market research company Delve, Inc. (now known as Focus Pointe Global) from 10 locations throughout the United States (Atlanta, Chicago—Oak Brook, Cincinnati, Columbus, Dallas, Los Angeles, Minneapolis, Philadelphia, Phoenix, and St. Louis) that were selected, among other reasons, to maximize access to subjects from varied Spanish-speaking communities. To be eligible for inclusion, potential participants had to have adequate visual, auditory, vestibular, and motor functioning, either independently or with support from assistive devices, to enable completion of all items included in the NIH Toolbox testing battery. The present analysis included those participants aged 3 to 7 years (n = 496) and aged 18 to 85 years (n = 408) who elected to participate in the NIH Toolbox norming study in Spanish. Spanish-speaking children between age 8 and 17 years were not included in the NIH Toolbox norming study, as census data indicated that less than 2% of children in this age range living in the United States at the time of study enrollment used Spanish as their primary language (Beaumont et al., 2013).
Potential participants met with trained research personnel who administered structured interviews and standardized questionnaires to ensure eligibility prior to enrollment. Informed consent was obtained from all adult participants. Parental informed consent was obtained for children aged 3 to 7 years, and assent was obtained from children aged 7 years. A subset of the individuals who participated in the NIH Toolbox norming study repeated the NIH Toolbox measures 5 to 14 days following initial test administration to enable evaluation of test–retest reliability. The NIH Toolbox norming study was approved by the institutional review board at Northwestern University through a protocol that covered all testing sites and was completed in accordance with the Helsinki Declaration.
Measures
The NIH Toolbox for Assessment of Neurological and Behavioral Function (Gershon et al., 2013)
As outlined above, the NIH Toolbox is composed of assessments targeting four primary domains and contributing to four batteries: sensation, motor, emotion, and cognition. The full battery includes all assessments, while an early childhood battery includes a subset thereof, as described next. Most measures were originally developed in English and then translated into Spanish using the Functional Assessment of Chronic Illness Therapy translation methodology (Bonomi et al., 1996; Cella et al., 1998; Eremenco et al., 2005; Lent et al., 1999), although the Cognition Battery language measures were effectively developed from scratch in Spanish. Detailed information regarding the cultural adaptation, linguistic translation, and overall development of the Spanish-language version of the NIH Toolbox is available elsewhere (Gershon et al., 2020).
NIH Toolbox Sensation Battery
The Sensation Battery (Coldwell et al., 2013; Cook et al., 2013; Dalton et al., 2013; Varma et al., 2013; Zecker et al., 2013) is composed of six measures assessing five domains: (1) audition, (2) gustation, (3) vision, (4) olfaction, and (5) pain. Audition is assessed using the NIH Toolbox Words-in-Noise Test (Zecker et al., 2013), which tests hearing in a noisy environment and is appropriate for individuals aged 6 to 85 years. Assessment of speech perception in noise is considered an ecologically valid measure of auditory functioning, as real-world communication often occurs in noisy environments. A Spanish-language version of the Words-in-Noise test was developed separately and was adapted for use in the NIH Toolbox (McArdle et al., 2009). Scores on this measure are presented as decibels of signal-to-noise ratio (dB S/N), reflecting the quietest signal correctly perceived amid noise. Lower scores reflect better hearing. Gustation is assessed with the NIH Toolbox Regional Taste Intensity Test (Coldwell et al., 2013), which measures the perceived intensity of quinine (i.e., bitter taste) and salt as administered in liquid. This test is appropriate for individuals aged 12 to 85 years, and scores are presented on a generalized labeled magnitude scale. Higher scores reflect greater perceived taste intensity. Vision is assessed with the NIH Toolbox Visual Acuity Test (Varma et al., 2013), which measures distance vision at three meters and is appropriate for individuals aged 3 to 85 years. Scores are presented as LogMAR units and in Snellen format; higher scores reflect worse vision. Olfaction is assessed with the NIH Toolbox Odor Identification Test (Dalton et al., 2013), which assesses a person’s ability to identify different odors using scratch-and-sniff cards, and is appropriate for individuals aged 3 to 85 years. Nine odorants are presented to individuals aged 10+ years, and five are presented to individuals aged 3 to 9 years. 
Scores reflect the number of odorants correctly identified, with higher scores reflecting better smell. Pain is assessed with the Pain Intensity Survey and the Pain Interference Survey (Cook et al., 2013), which are appropriate for individuals aged 18 to 85 years. The Pain Intensity Survey is a single-item numerical rating scale assessing pain severity over the prior week; higher scores indicate more severe pain. The Pain Interference Survey is administered as a computer adaptive test and assesses the degree to which pain interferes with engagement in normal activities. Scores are presented as T-scores, with higher scores indicating more pain interference.
NIH Toolbox Motor Battery (Reuben et al., 2013)
The Motor Battery is composed of five measures assessing five domains: (1) dexterity, (2) strength, (3) balance, (4) locomotion, and (5) endurance. The NIH Toolbox Motor Battery for individuals aged 7 to 85 years includes all five of these measures. The NIH Toolbox Early Childhood Motor Battery for individuals aged 3 to 6 years includes all of these measures except the test of locomotion. Dexterity is assessed using the 9-hole Pegboard Test. Scores reflect the number of seconds needed to accurately place and remove nine plastic pegs into a plastic pegboard using the dominant hand. Higher scores reflect worse manual dexterity. Strength is assessed using the Grip Strength Test, which evaluates the number of pounds of force generated using the dominant hand on a hand dynamometer. Higher scores indicate greater strength. Balance is assessed using the Standing Balance Test, which evaluates anterior–posterior postural sway. Raw normalized path length scores are calculated based on time and acceleration, and are then converted to theta values using an item response theory (IRT) model. Higher scores reflect better balance. Locomotion is assessed using the 4-Meter Walk Gait Speed Test. Scores are based on the number of seconds it takes to walk 4 meters at one’s usual pace, using the better of two trials, which are then converted to meters/second. Higher scores are indicative of better locomotion. Endurance is assessed using the 2-Minute Walk Test. Scores reflect the number of feet walked in 2 minutes, with higher scores reflecting better endurance.
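The gait speed conversion described above can be illustrated with a minimal sketch. It assumes that the "better" trial is the one with the shorter time and that speed is simply distance divided by time; the function name and the example times are hypothetical and are not taken from the NIH Toolbox scoring software.

```python
def gait_speed_m_per_s(trial_seconds, distance_m=4.0):
    """Convert timed walk trials (seconds) into gait speed in meters/second.

    Assumption: the better trial is the fastest one (shortest time).
    """
    if not trial_seconds:
        raise ValueError("at least one trial time is required")
    if any(t <= 0 for t in trial_seconds):
        raise ValueError("trial times must be positive")
    best_time = min(trial_seconds)  # shorter time = better trial
    return distance_m / best_time   # higher value = better locomotion


# Hypothetical example: two trials over the 4-meter course.
speed = gait_speed_m_per_s([3.5, 3.2])
print(f"{speed:.2f} m/s")
```

Because a shorter time yields a larger quotient, higher values correspond to better locomotion, which matches the scoring direction stated for this test.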
NIH Toolbox Emotion Battery (Salsman et al., 2013)
The Emotion Battery broadly assesses four domains: (1) negative affect, (2) social relationships, (3) psychological well-being, and (4) stress and self-efficacy. The Spanish-language battery includes 17 self-report scales for adults aged 18 to 85 years, and 10 parent-report scales for children aged 3 to 7 years. In addition to these 27 measures, the English-language battery also includes 15 self-report scales for children aged 8 to 17 years and 11 parent-report scales for children aged 8 to 12 years. Spanish translations of these additional measures are available; however, no norms exist given that children aged 8 to 17 years were not included in the Spanish-language norming sample. Higher scores indicate more of the construct being assessed across all measures included in the Emotion Battery. The negative affect subdomain includes six assessments for adults (Anger-Affect, Anger-Hostility, Anger-Physical Aggression, Sadness, Fear-Affect, and Fear-Somatic Arousal) and four assessments for children (parent-reported assessments of Anger, Fear-Over Anxious, Fear-Separation Anxiety, and Sadness). The psychological well-being subdomain includes three assessments for adults (Positive Affect, Life Satisfaction, and Meaning) and two assessments for children (parent-reported assessments of Positive Affect and Life Satisfaction). The parent-reported assessment of Positive Affect was not administered in Spanish during the norming study and was therefore not included in the present analysis; however, a Spanish-language version of this measure is available without norms. The social relationships subdomain includes six assessments for adults (Friendship, Loneliness, Emotional Support, Perceived Hostility, Instrumental Support, and Perceived Rejection) and four assessments for children (parent-reported assessments of Social Withdrawal, Positive Peer Interactions, Peer Rejection, and Empathic Behaviors). 
The stress and self-efficacy subdomain includes two assessments for adults (Perceived Stress and Self-Efficacy). There are no assessments for children included in the stress and self-efficacy subdomain. All measures excepting the parent-reported assessments of Life Satisfaction (five items), Social Withdrawal (four items), Positive Peer Interactions (four items), and Peer Rejection (nine items) are scored using IRT methods to yield a theta value. For these four parent-reported measures, scores reflect a raw sum of the included items.
NIH Toolbox Cognition Battery (Weintraub et al., 2013)
The Cognition Battery is composed of seven measures assessing five domains: (1) executive function and attention, (2) episodic memory, (3) working memory, (4) processing speed, and (5) language. Executive function and attention are assessed using the Dimensional Change Card Sort Test and the Flanker Inhibitory Control and Attention Test, which are appropriate for individuals aged 3 to 85 years. Scores on these measures reflect a combination of accuracy and reaction time, where each of these components receives a score between 0 and 5. These scores are then summed to yield a computed score ranging from 0 to 10, with higher scores reflecting better performance. Episodic memory is assessed using the Picture Sequence Memory Test, which is appropriate for individuals aged 3 to 85 years. Scores are computed as IRT-estimated thetas, and reflect the ability to accurately recall an increasingly lengthy series of illustrations. Higher scores reflect better episodic memory. Working memory is assessed using the List Sorting Working Memory Test, which is appropriate for individuals aged 7 to 85 years, with a parallel supplemental test available for children aged 3 to 6 years. Scores are computed as the total number of foods and animals correctly recalled and reordered from smallest to largest, by category. Higher scores reflect better working memory. Processing speed is assessed using the Pattern Comparison Processing Speed Test, which is appropriate for individuals aged 7 to 85 years. Scores reflect the number of times in an 85-second window that an individual can accurately determine if two side-by-side images are identical, with higher scores indicating better processing speed. Language ability is assessed using the Picture Vocabulary Test, which is appropriate for individuals aged 3 to 85 years, and the Oral Reading Recognition Test, which is appropriate for individuals aged 7 to 85 years. 
Both language measures are administered as computer adaptive tests and scored using IRT methodology to yield a theta score. The score for the Picture Vocabulary Test reflects an individual’s ability to correctly select one of four images to match the meaning of an audio recorded word, and the score for the Oral Reading Recognition Test reflects the ability to read and correctly pronounce letters and words, shown one at a time on a screen. Higher scores reflect better language abilities. The NIH Toolbox Cognition Battery for individuals aged 7 to 85 years includes all seven core measures. The NIH Toolbox Early Childhood Cognition Battery for individuals aged 3 to 6 years includes only the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, Picture Sequence Memory Test, and Picture Vocabulary Test.
Sociodemographic Variables
In addition to completing the NIH Toolbox measures, participants self-reported sociodemographic information including age, sex, and years of education, as well as race/ethnicity and Latinx status as measured by the 2010 Census (Humes et al., 2011).
Validity Measures
Convergent validity was not evaluated for most of the Spanish-language versions of the NIH Toolbox measures. However, the two language measures included in the Cognition Battery involved significantly more development than the other measures. Therefore, convergent validity for these two measures was evaluated by correlating scores on these measures with scores on two well-established legacy Spanish-language instruments, as described below.
The Batería-III Woodcock-Muñoz Vocabulario Sobre Dibujos Test (Muñoz-Sandoval et al., 2005)
To enable assessment of the convergent validity of the Picture Vocabulary Test, the Batería-III Woodcock-Muñoz Vocabulario Sobre Dibujos Test, the parallel Spanish-language version of the Woodcock-Johnson® III Picture Vocabulary test (Woodcock et al., 2001), was administered to a subset of participants. The Vocabulario Sobre Dibujos test measures oral language development and word knowledge.
The Word Accentuation Test
To enable assessment of the convergent validity of the Oral Reading Recognition Test, a 48-item version of the Word Accentuation Test, a Spanish-language word recognition task, was administered to a subset of participants. The Word Accentuation Test requires respondents to read words aloud, which are then scored as correct or incorrect based on the accentuation of the word. The 48-item version used in this analysis was created by supplementing the 40 items included in the Word Accentuation Test—Chicago (Krueger et al., 2006), which was developed for use with Spanish speakers in the United States, with eight additional items included in the original Word Accentuation Test (Del Ser et al., 1997), which was developed in Madrid, that are not in the Chicago version.
Analytic Plan
For the present analysis, z scores were first computed for all measures based on the mean and standard deviation for the full sample (i.e., children and adults combined). This placed scores on a common metric, thus facilitating analysis and interpretation of results. Normality of the data was evaluated via a combination of visual inspection and statistical evaluation (i.e., Kolmogorov–Smirnov and Shapiro–Wilk tests). Nonparametric analytic approaches were implemented because numerous variables were nonnormally distributed.
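As an illustration of the standardization step, the following sketch computes z scores from the mean and standard deviation of the combined child and adult sample. It is not the analysis code used in the study: the function name is hypothetical, the example scores are invented, and the use of the sample (n − 1) standard deviation is an assumption, as the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def pooled_z_scores(child_scores, adult_scores):
    """Standardize two groups against the combined-sample mean and SD."""
    combined = list(child_scores) + list(adult_scores)
    if len(combined) < 2:
        raise ValueError("at least two scores are required")
    grand_mean = mean(combined)
    grand_sd = stdev(combined)  # assumption: sample SD with n - 1 denominator
    if grand_sd == 0:
        raise ValueError("scores have zero variance")

    def to_z(score):
        return (score - grand_mean) / grand_sd

    return [to_z(s) for s in child_scores], [to_z(s) for s in adult_scores]


# Invented raw scores on a single measure.
child_z, adult_z = pooled_z_scores([1, 2, 3], [4, 5, 6])
print([round(z, 2) for z in child_z + adult_z])
```

Standardizing against the combined sample, rather than within each age group, keeps group differences visible on the common metric: a child and an adult with the same raw score receive the same z score.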
Reliability
Different types of reliability were calculated for different measures within the NIH Toolbox, as appropriate. Cronbach’s coefficient alphas were calculated to assess internal consistency reliability for those self-report and proxy-report measures within the Emotion Battery that were administered as fixed forms. Marginal reliability coefficients were estimated with graded response models (IRT) for measures within the Emotion Battery that were administered as computer adaptive tests. Spearman’s correlations with associated p values were calculated to evaluate test–retest reliability for measures within the Sensation, Motor, and Cognition Batteries. Intraclass correlations (ICCs) were initially calculated as well; however, results were unreliable due to low power and nonnormality of the data. The strength of Spearman’s correlations was evaluated with cutoffs of .10, .30, and .50 indicating small, medium, and large effects, respectively (Cohen, 1992). Test–retest reliability was not evaluated for the Emotion Battery because most of its measures are structured to assess a 7-day recall period, the average time span between administrations for this battery was greater than 8 days, and emotions would be expected to fluctuate during that interval. Additionally, to reduce overall burden, measures from the Motor Battery assessing dexterity, strength, and balance were not readministered to the Spanish-speaking participants. Accordingly, the test–retest reliability of these measures was not evaluated. Within the Sensation Battery, the gustation and pain assessments were only administered to Spanish-speaking participants aged 18 years and older, per NIH Toolbox administration guidelines, and only two participants aged 3 to 7 years repeated the Words-in-Noise audition measure. Therefore, test–retest reliability was not evaluated for these tests in children.
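For readers unfamiliar with the internal consistency index used for the fixed-form measures, a minimal sketch of Cronbach’s coefficient alpha follows. It uses the standard formula, alpha = k / (k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical and serve only as illustration; this is not the study’s analysis code.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a fixed-form scale.

    responses: list of respondents, each a list of k item scores.
    """
    if len(responses) < 2:
        raise ValueError("at least two respondents are required")
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != n_items for row in responses):
        raise ValueError("every respondent must answer every item")

    # Variance of each item across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Variance of the summed scale score across respondents.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Invented data: four respondents answering a two-item form.
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 4], [4, 3]]), 2))
```

Alpha approaches 1 as items covary more strongly relative to their individual variances, which is why it is interpreted as an index of internal consistency.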
Validity
Convergent validity: Language measures
Scores on the gold-standard language measures were converted to z scores using the same approach outlined above, so scores on the validity measures and the NIH Toolbox language measures would be on the same metric. Spearman’s correlations between these gold-standard measures and the NIH Toolbox language measures were then calculated to evaluate convergent validity. These comparisons were cross-sectional.
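A brief sketch of the correlational approach may be useful. The code below computes Spearman’s rho as the Pearson correlation of average ranks and labels its magnitude with the cutoffs of .10, .30, and .50 given in the Reliability section (Cohen, 1992). The function names and example scores are hypothetical; treating values exactly at a cutoff as belonging to the higher category, and labeling values below .10 as "negligible," are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


def effect_label(rho):
    """Label |rho| with the .10 / .30 / .50 cutoffs (boundary handling assumed)."""
    size = abs(rho)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"


# Invented scores on a Toolbox language measure and a legacy measure.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
print(round(rho, 2), effect_label(rho))
```

Because Spearman’s rho depends only on rank order, converting scores to z scores does not change its value; the standardization serves to place measures on a common metric for reporting.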
Comparison of scores with demographic characteristics
In addition to the convergent validity analyses conducted for the two NIH Toolbox language measures, the validity of scores on the Spanish-language versions of all measures was evaluated with Spearman’s correlations and general linear models relating scores to demographic subgroup membership, controlling for age and sex as appropriate. These comparisons were cross-sectional. For Cognition Battery measures, these analyses also controlled for education level for adults and mother’s education level for children. For general linear models, effect sizes were reported as Cohen’s ds, and cutoffs of .20, .50, and .80 indicated small, medium, and large effects, respectively (Cohen, 1992). To account for multiple testing, a Bonferroni correction (α = .001) was used in this analysis.
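The effect size conventions above can be illustrated with a simplified sketch. It computes an unadjusted Cohen’s d from two groups using the pooled standard deviation and labels it with the cutoffs of .20, .50, and .80; it does not reproduce the covariate-adjusted general linear models used in this study. The Bonferroni helper simply divides a familywise alpha by the number of tests. The function names, the example data, the "negligible" label, and the values of .05 and 50 tests are hypothetical, as the text reports only the corrected threshold of .001.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Unadjusted Cohen's d using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    if pooled_var == 0:
        raise ValueError("pooled variance is zero")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def d_label(d):
    """Label |d| with the .20 / .50 / .80 cutoffs (boundary handling assumed)."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"


def bonferroni_alpha(familywise_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return familywise_alpha / n_tests


# Invented z scores for two demographic groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(round(d, 2), d_label(d))

# Hypothetical illustration only: 50 tests at a familywise alpha of .05.
print(bonferroni_alpha(0.05, 50))
```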
Results
Participant Characteristics
The NIH Toolbox Spanish-language norming sample was composed of 496 children aged 3 to 7 years and 408 adults aged 18 to 85 years. Details regarding participant characteristics are presented in Table 1. Slightly more than 20% of the adult sample was born in the United States, including 2.9% reporting that they were born in Puerto Rico. The majority of adult respondents (77.5%) were born outside of the United States, similar to the sample included in the Pew Research Center’s 2013 National Survey of Latinos and Religion, in which 88.5% of Spanish-dominant respondents reported having been born outside of the United States (Pew Research Center, 2013). Foreign-born participants in the present sample reported having been born in Mexico (43.4%), Central America (i.e., Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama; 16.4%), the Caribbean (i.e., Cuba, Dominican Republic; 4.2%), South America (i.e., Brazil, Chile, Colombia, Ecuador, Peru, Uruguay, Venezuela; 12.7%), and Europe (i.e., Spain; 0.2%). Information regarding country of origin was not available for the 3.0% of child respondents who were not born in the United States.
Table 1. Sample Characteristics.
Note. a = mean (standard deviation); b = percentage (n); c = response options reflective of the question regarding Latinx origin as asked in the 2010 Census.
Reliability
Internal Consistency Reliability
As shown in Table 2, Cronbach’s alphas demonstrated acceptable to strong internal consistency reliability for all measures within the Emotion Battery that were administered as fixed forms among both adults (αs ranged from .77 to .95) and children (αs ranged from .66 to .82).
Table 2. Internal Consistency and IRT Reliability for NIH Toolbox Emotion Battery Scores.
Note. IRT = item response theory; CAT = computer adaptive test; PR = parent-report.
IRT Reliability
All Emotion Battery measures administered as computer adaptive tests had excellent reliability exceeding .90, with the exception of the Psychological Well-being—Meaning measure, which had an estimated reliability of .87.
Test–Retest Reliability
Of the 904 individuals who participated in the NIH Toolbox norming study and completed questionnaires in Spanish, 73 to 86 participants, depending on the domain, repeated these measures 5 to 14 days following initial test administration (sensation domain: n = 85, M = 8.12 days, SD = 2.63; motor domain: n = 85, M = 8.12 days, SD = 2.63; emotion domain: n = 73, M = 8.45 days, SD = 2.37; cognition domain: n = 86, M = 8.08 days, SD = 2.64). Of note, not all participants completed all measures included in each NIH Toolbox domain, as indicated in Table 3, Table 4, and Table 5.
Table 3. Spearman’s Test–Retest Correlations for NIH Toolbox Sensation Battery Measures.
Note. WIN = Words-In-Noise Test; Odor ID = Odor Identification; tongue = tip of tongue; mouth = whole mouth. ns reflect number of participants included in test–retest analysis for each NIH Toolbox measure.
Table 4. Spearman’s Test–Retest Correlations for NIH Toolbox Motor Battery Measures.
Note. Assessments of strength, dexterity, and balance were only administered to Spanish-speaking participants once. ns reflect number of participants included in test–retest analysis for each NIH Toolbox measure.
Table 5. Spearman’s Test–Retest Correlations for NIH Toolbox Cognition Battery Measures.
Note. DCCS = Dimensional Change Card Sort Test; Flanker = Flanker Inhibitory Control and Attention Test; List Sort = List Sorting Working Memory Test; PSM = Picture Sequence Memory Test; Pattern Comp = Pattern Comparison Processing Speed Test; PVT = Picture Vocabulary Test; ORRT = Oral Reading Recognition Test. ns reflect number of participants included in test–retest analysis for each NIH Toolbox measure.
Sensation Battery
Among adults, analyses supported test–retest reliability based on large effects for the assessments of visual acuity and pain intensity (ρs ranged from .78 to .87). Reliability was lower for assessments of audition, gustation, olfaction, and pain interference, though effects were still medium to large (ρs ranged from .48 to .63). Among children, test–retest reliability was supported for vision based on a large effect (ρ = .69), but was poor for olfaction based on a small effect (ρ = .20; see Table 3).
Motor Battery
Among both children and adults, test–retest reliability was supported for the assessment of endurance based on large effects (ρs ranged from .62 to .71), although it was poor for the assessment of locomotion among adults based on a small effect (ρ = .26; see Table 4).
Cognition Battery
Although all seven measures are appropriate for individuals aged 7 years and older, only one 7-year-old respondent repeated the Cognition Battery. Therefore, test–retest reliability was not evaluated among children for the measures not included in the Early Childhood Battery (i.e., List Sorting Working Memory Test, Pattern Comparison Processing Speed Test, and Oral Reading Recognition Test). Among adults, test–retest reliability was supported by large effects for all tests (ρs ranged from .63 to .88). Similarly, among children, test–retest reliability was demonstrated with large effects for tests of executive function and attention, language, and episodic memory (ρs ranged from .59 to .75; see Table 5).
Convergent Validity: Language Measures
Of the 904 Spanish-language respondents, 385 (ages 3 to 7 years: n = 263; ages 18 to 85 years: n = 122) completed the Batería-III Woodcock-Muñoz Vocabulario Sobre Dibujos test (Muñoz-Sandoval et al., 2005), the legacy measure for the Spanish Picture Vocabulary Test, and 195 (ages 3 to 7 years: n = 103, although only those aged 7 years were included in the analysis [n = 56]; ages 18 to 85 years: n = 92) completed the 48-item version of the Word Accentuation Test (Del Ser et al., 1997; Krueger et al., 2006), the legacy measure for the Spanish Oral Reading Recognition Test. Among adults, Spearman’s correlations demonstrated good convergent validity based on large effects between the NIH Toolbox Spanish language measures and legacy measures for both the Picture Vocabulary Test (ρ = .76, p < .001) and the Oral Reading Recognition Test (ρ = .65, p < .001). Among children, good convergent validity based on a large effect was found for the Picture Vocabulary Test (ρ = .60, p < .001). Convergent validity was lower for the Oral Reading Recognition Test, where the correlation fell between the small and medium cutoffs and did not reach statistical significance (ρ = .26, p = .053).
Comparison of Scores With Demographic Characteristics
Sensation Battery
Among adults, significant but small to medium age effects (ρs ranged from −.22 to .44) were found for the Words-in-Noise Test, Visual Acuity Test, Odor Identification Test, and Pain Intensity Survey after controlling for sex. As expected, older age was associated with worse hearing, worse vision, worse olfaction, and greater pain. Among children, after controlling for sex, older age was related to better olfaction based on a medium effect (ρ = .37), and better vision based on a large effect (ρ = −.52; see Table 6). No sex effects were found among adults or children. Although the Words-in-Noise Test of audition is appropriate for respondents aged 6 years and older, analyses were not conducted for this test among children because of the small sample in the appropriate age range (n = 10).
Table 6. Effect Sizes for Comparisons of NIH Toolbox Sensation Battery Among Demographic Groups.
Note. WIN = Words-In-Noise Test; Odor ID = Odor Identification; tongue = tip of tongue; mouth = whole mouth. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
a Adjusted for sex. b Adjusted for age.
p < .001.
Motor Battery
Among adults, both small to medium age effects (ρs ranged from .32 to −.42) and small to large sex effects (ds ranged from .24 to .99) were found for nearly all motor domain measures. Younger and male participants performed better. Among children, medium to large age effects (ρs ranged from .48 to −.77) were found for all measures, with older children performing better. Additionally, one small sex effect was found (d = .23), with female children demonstrating better dexterity than male children (see Table 7).
Table 7. Effect Sizes for Comparisons of NIH Toolbox Motor Battery Among Demographic Groups.
Note. D = dominant hand; ND = nondominant hand; M = male; F = female. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
a Adjusted for sex. b Adjusted for age.
p < .001.
Emotion Battery
No significant age or sex effects were found among adults. Significantly higher scores were found for older children as compared with younger children on the parent-report measure of Fear-Over Anxious and parent-report measures reflecting positive social interactions based on small effects (ρs ranged from .16 to .22). No other age or sex effects were found among children (see Table 8).
Effect Sizes for Comparisons of NIH Toolbox Emotion Battery Scores Among Demographic Groups.
Note. PR = Parent-report. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
a Adjusted for sex. b Adjusted for age.
p < .001.
Cognition Battery
Among adults, significant small to medium age effects (ρs ranged from −.36 to −.47) were observed on all measures except the language tests, on which no significant relationships were found. Additionally, significant medium to large associations (ρs ranged from .30 to .59) were found between all cognition measures and education after controlling for age and sex. Younger age and greater educational attainment were associated with better performance. No significant sex effects were found among adults after adjusting for age and education. Among children, significant and large age effects (ρs ranged from .55 to .72) were observed for all Cognition Battery measures after controlling for sex and mother’s education, with older children performing better. Additionally, among children, scores on the Dimensional Change Card Sort Test showed a positive but small relationship with mother’s education (ρ = .24) after controlling for age and sex, such that children of mothers with greater educational attainment performed better. Finally, females performed better than males on the Picture Vocabulary Test based on a small effect (d = .28; see Table 9).
Effect Sizes for Comparisons of NIH Toolbox Cognition Battery Scores Among Demographic Groups.
Note. Educ = Education; DCCS = Dimensional Change Card Sort Test; Flanker = Flanker Inhibitory Control and Attention Test; List Sort = List Sorting Working Memory Test; PSM = Picture Sequence Memory Test; Pattern Comp = Pattern Comparison Processing Speed Test; PVT = Picture Vocabulary Test; ORRT = Oral Reading Recognition Test; F = female; M = male. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
a Adjusted for sex and mother’s education. b Adjusted for age and sex. c Adjusted for age and mother’s education. d Adjusted for sex and education. e Adjusted for age and education.
p < .001.
Discussion
This study evaluated the reliability and validity of the Spanish-language version of the NIH Toolbox among children aged 3 to 7 years and adults aged 18 to 85 years who participated in the initial norming study. Such information is crucial for researchers who work with and interpret NIH Toolbox scores in this population. Compared with English-language tests, few well-validated Spanish-language testing batteries are available. Given the ongoing and rapid expansion of the Latinx population in the United States (U.S. Census Bureau, 2018a, 2018b), the Spanish-language version of the NIH Toolbox is poised to be a particularly valuable tool for assessing neurological and behavioral functioning among primary Spanish speakers.
Sensation Battery
For the Sensation Battery, test–retest reliability was strong for vision among both children and adults, and moderate for both olfaction and taste among adults. This is consistent with the English-language versions of the measures (see Supplemental Table 1, available online; Dalton et al., 2013; Rawal et al., 2015; Varma et al., 2013). English-language normative data are not available for the NIH Toolbox assessment of pain intensity because this measure was derived from the Patient-Reported Outcomes Measurement Information System® (PROMIS®), which underwent its own separate norming process. However, the test–retest reliability for this measure was somewhat lower for the present Spanish-language sample as compared with a general population sample of 100 individuals who completed the measure in English (Broderick et al., 2013). This may be because the single-item pain assessment is structured to assess a 7-day recall period, and the average time span between administrations for the Sensation Battery was greater than 8 days. Differences in reliability were also found for the Words-in-Noise Test, with stronger support having been found for the English-language version of the measure (Wilson & Burks, 2005), although this may be a function of low sample size in the present study. Finally, measures in the Sensation Battery generally related to age and sex in a manner consistent with prior effects found in English-speaking samples. For example, Friedman et al. (2004) found that older age was associated with increased prevalence of age-related macular degeneration in the U.S. population, consistent with the present finding that performance on the Visual Acuity Test decreased with age. Similarly, Kaneda et al. (2000) found that odor discrimination abilities decreased with age, and both Kaneda et al. and Mojet et al. (2001) found that taste decreased with age. Wilson (2011) found an inverse relationship between age and performance on the Words-in-Noise Test in English. 
Reported pain has also been shown to increase with age (Johannes et al., 2010). Thus, the present results provide support for the research-related use of the Sensation Battery to assess sensory functioning among Spanish-speaking adults residing in the United States. Additional research with larger sample sizes is needed to effectively evaluate the utility of this battery among Spanish-speaking children.
Motor Battery
For the Motor Battery, test–retest reliability was poor for the 4-meter walk test of locomotion. This measure also performed the worst of all measures in this domain among English-speaking participants in the normative sample (see Supplemental Table 1, available online; Reuben et al., 2013). A recently published analysis of this test conducted with the combined language adult sample from the NIH Toolbox norming study also identified disconcertingly low test–retest reliability for this measure (ICC = .41; Bohannon & Wang, 2019). It may be that the poor reliability is at least in part a function of human error, as the test is hand timed by an administrator and trials are generally brief. Automated timing may improve these results, although it is generally more burdensome and less practical than hand timing. Additionally, as was observed with the Sensation Battery, measures in the Motor Battery related to age and sex in a manner consistent with prior effects found in other samples. Similar to the present findings, published work conducted with children has reported greater grip strength and dexterity with increased age, and greater grip strength but worse dexterity in boys as compared with girls (Ervin et al., 2014; Omar et al., 2018). Greater grip strength has also been reported in adult men as opposed to adult women (Yorke et al., 2015). Also in adults, decreased dexterity has been found with increasing age (Ruff & Parker, 1993). Deficits in balance have also been observed for adult women relative to men (Wolfson et al., 1994), and have been found to worsen with increasing age in adults (Kalisch et al., 2011) and improve with age in children (Butz et al., 2015). Finally, endurance has been shown to be greater for men as compared with women, and to decrease with age among adults (Gibbons et al., 2001) but increase with age among children (Bohannon et al., 2018). Interestingly, Öberg et al. (1993) reported decreases in gait speed and step length with increasing age and for women as opposed to men; however, no relationships between sociodemographic variables and 4-meter walk time were found in the present study. These associations with demographic variables provide preliminary support for the validity of the Motor Battery for use in research with Spanish-speaking children and adults living in the United States; however, additional research with larger samples is needed to more confidently assess the psychometric strength of this battery for use with this population.
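The Motor Battery discussion above cites an intraclass correlation coefficient (ICC) for the 4-meter walk. To make concrete what a test–retest ICC summarizes, the sketch below computes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), from the mean squares of a participants-by-sessions table. The choice of this ICC form is an assumption, since neither this section nor the cited value specifies one. The function name and the example times are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table of n participants (rows) by k sessions (columns)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: participants, sessions, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical hand-timed walk times in seconds at test and retest.
walk_times = [[3.1, 3.6], [2.8, 2.7], [3.9, 3.2], [3.3, 3.8], [2.9, 3.4]]
print(f"ICC(2,1) = {icc_2_1(walk_times):.2f}")
```

Because this form penalizes both random timing error and systematic shifts between sessions, small absolute timing errors on a brief trial can lower the coefficient considerably when participants differ little from one another.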
Emotion Battery
The internal consistency reliability of the fixed form measures included in the Emotion Battery was supported by both Cronbach’s alpha values and IRT reliability statistics. The computed values were similar to the Cronbach’s alpha values reported for the English-language version of the NIH Toolbox Emotion Battery, for which alpha values ranged from .83 to .97 among adults and .73 to .92 among children (Salsman et al., 2013). Among adults, no relationships between measures in the Emotion Battery and sociodemographic variables were found. This is generally consistent with Babakhanyan et al. (2018), who evaluated the Emotion Battery data from the NIH Toolbox norming study in both English and Spanish. These authors explored the relationships of scores to the combined effects of numerous sociodemographic variables (i.e., age, education, gender, ethnicity, and household income), and found few significant relationships with only small effect sizes in both languages (English R2 ranged from .005 to .048, Spanish R2 ranged from .017 to .033). Of note, these authors used a less stringent alpha value of .01, while the present analysis imposed a limit of .001, which may explain the presence of significant relationships in their results but not the present results. In the present study, only three measures in the Emotion Battery were related to age and none to sex among children. However, notably more associations with age and sex were found among children in the English-language normative sample (Paolillo et al., 2018). This may be a function of larger sample size in the English study, thus yielding greater power to identify relationships. Additionally, Paolillo et al. (2018) did not impose any corrections to control for family-wise error, and they included a broader age range for many of the parent-report measures included in the Emotion Battery (i.e., ages 3 to 12 years). 
That is to say, scores of children aged 8 to 12 years, who were only represented in the English study, may have driven the relationships of scores on measures within the Emotion Battery to sociodemographic variables.
In aggregate, although the Spanish-language Emotion Battery self-report and parent-report measures do not appear to be robust to commonly encountered challenges in the assessment of emotion (e.g., reliable parent-report assessment), these results provide preliminary support for the use of the Emotion Battery in research with Spanish-speaking individuals in the United States.
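Internal consistency for the fixed form Emotion Battery measures was summarized above with Cronbach's alpha. As a brief illustration of that statistic, the sketch below applies the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), to a respondents-by-items table. The function name and the responses are hypothetical and are unrelated to the study data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point ratings from six respondents on a four-item form.
ratings = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 5],
]
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```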
Cognition Battery
For the Cognition Battery, test–retest reliability was strong across both adults and children, as was observed with the English language subset of the normative sample (see Supplemental Table 1, available online; Akshoomoff et al., 2013; Bauer & Zelazo, 2013; Heaton et al., 2014; Weintraub et al., 2013). Convergent validity of the language measures included in the NIH Toolbox Cognition Battery was strong among adults, further supporting the utility of these measures in Spanish-language adult respondents. Among children, good convergent validity was found for the Picture Vocabulary Test, although it was slightly lower for the Oral Reading Recognition Test. This may be a function of lower sample size, as the Oral Reading Recognition Test is only appropriate for respondents aged 7 years and older. Moreover, as was observed among English speakers, and consistent with expectations, performance on the Cognition Battery was generally associated with younger age and higher education among adults (Heaton et al., 2014; Weintraub et al., 2013), and with older age among children (Akshoomoff et al., 2013; Bauer & Zelazo, 2013). Unlike in the English-language comparison sample, mother’s education level was generally not significantly related to performance on the Cognition Battery among Spanish-speaking children; however, this may be due to the fact that children aged 8 to 15 years were included in the analysis in English (Akshoomoff et al., 2013). Additionally, in the English-language normative sample of children, sex was not significantly related to scores on the Picture Vocabulary Test (Akshoomoff et al., 2013), while it was significantly related, albeit weakly, to scores on this measure in Spanish. Although the causes of sociodemographic differences in cognitive test performance remain poorly understood (Heaton et al., 2014), it is important to note that these relationships are not equivalent for the English- and Spanish-language versions of these tests. 
This highlights the importance of considering language groups differently when establishing normative standards and evaluating changes in cognitive functioning over time, as the impact of these variables on scoring is likely to differ across language groups. Despite these differences, in total, the present results provide support for the use of the Cognition Battery in research among both children and adults living in the United States who speak Spanish as their primary language.
Limitations and Future Directions
The present study has limitations. Nonnormality of the data required nonparametric analytic approaches, which complicates comparison of findings with prior published studies. Additionally, while the present sample is representative of the areas in the United States where recruitment facilities were located, it is not necessarily representative of all Spanish-speaking individuals living in the United States.
Despite these limitations, the present results provide important support for the use of the Spanish-language version of the NIH Toolbox in research to measure sensory, motor, emotional, and cognitive functioning among Spanish-speaking individuals living in the United States. While the original goal of the NIH Toolbox was to develop measures for use in research studies, there are now numerous published articles providing clinical validation evidence for the English-language version. Publishers of most commercially available tests would use this type of evidence to also infer the clinical utility of translated versions of their tests. It is likely that the clinical appropriateness of the NIH Toolbox measures extends to the Spanish version as well. Individual clinicians and clinical researchers are encouraged to determine the appropriateness of the Spanish-language version of the NIH Toolbox for their own needs, and where possible to conduct research to directly clinically validate the Spanish-language measures. Pending appropriate identification of such support, the Spanish-language version of the NIH Toolbox is likely to be useful in tracking outcomes in clinical and epidemiological research across the life span, and in diverse samples including mixed-language participants.
Supplemental Material
Supplemental material, Spanish_NIHTB_Reliability_and_Validity_RR2_Supplemental_Table for Reliability and Validity of the Spanish-Language Version of the NIH Toolbox by Rina S. Fox, Jennifer J. Manly, Jerry Slotkin, John Devin Peipert and Richard C. Gershon in Assessment
Footnotes
Acknowledgements
The authors would like to thank Jennifer Beaumont for providing additional details to facilitate the development of this article. Additionally, the authors would like to thank the participants of the NIH Toolbox norming study for their important contributions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is funded in whole or in part with Federal funds from the Blueprint for Neuroscience Research, NIH, under contract number HHS-N-260-2006-00007-C, and by the Environmental Influences on Child Health Outcomes (ECHO) program, Office of the Director, NIH, under award number U24OD023319. PROMIS, Patient-Reported Outcomes Measurement Information System, and NIH Toolbox for Assessment of Neurological and Behavioral Function are marks owned by the U.S. Department of Health and Human Services.
Supplemental Material
Supplemental material for this article is available online.
References
