Abstract
Temperament is a core aspect of children's psychological functioning and is assumed to be at least somewhat stable across childhood. However, little research has assessed the stability of temperament from early childhood to early adolescence. Moreover, few studies have examined the influence of measurement and analytic methods on the stability of early temperament over periods of more than a few years. We obtained laboratory observations and mother and father reports of temperamental negative and positive emotionality and effortful control from 559 three–year–olds. Approximately nine years later, children and both parents completed questionnaire measures of similar temperament constructs. Zero–order correlations revealed greater within–informant than cross–informant stability. In addition, compared with parent reports, early childhood laboratory measures showed greater convergent and divergent validity with child, mother, and father reports at age 12. Finally, latent temperament variables at age 3 composed of laboratory and parent–report measures and latent variables at age 12 composed of parent and child reports showed moderate stability. There was also a weak but significant association of early effortful control with later negative and positive emotionality. Results have implications for assessing temperament and knowledge of the stability of temperament across childhood. Copyright © 2018 European Association of Personality Psychology
Temperament, or relatively stable individual differences in emotional reactivity and regulation, has long been recognized as a core component of children's psychological makeup (Rothbart, Ahadi, & Evans, 2000; Zentner & Bates, 2008). It is also a key determinant of children's psychosocial functioning including psychopathology (Klein, Dyson, Kujawa, & Kotov, 2012; Nigg, 2006), academic performance (Martin, 1994), and peer relationships (Sanson, Hemphill, & Smart, 2004). A core assumption underpinning conceptualizations of temperament is that individual differences in rank–ordering on traits are relatively stable over time. However, it is also now recognized that temperament can show substantial rank–order change over time.
Understanding the degree and nature of this change in temperament over time is important for several reasons. Until recently, there was long–standing debate over the stability of temperament in terms of whether it was primarily environmentally or genetically influenced (see Caspi, Roberts, & Shiner, 2005; Ferguson, 2010; Roberts & DelVecchio, 2000), although it is now generally recognized that both play a role and that there should be a moderate degree of both stability and change over time in temperament (Roberts & DelVecchio, 2000; Roberts & Mroczek, 2008). However, the degree of this stability varies depending on the developmental window examined, and rank–order stability from early childhood to early adolescence has been understudied. This issue is particularly important in children, who are sometimes assumed to show high levels of rank–order stability. For example, some may assume that a child prone to anger will likely always be more prone to anger than others their age.
The stability of temperament speaks to the heart of how temperament is conceptualized and measured. For instance, it can inform knowledge about whether developmental predictors are likely to affect rank–order stability, and whether such effects are likely to occur in specific developmental windows. Many researchers have examined the effects of parenting on rank–order change in temperament over time (e.g. Kopala–Sibley et al., 2017; see Kiff, Lengua, & Zalewski, 2011; Lipscomb et al., 2011). However, if temperament shows substantial rank–order stability across childhood, this diminishes the likelihood of finding such effects during this period. Understanding the stability of temperament also has implications for identifying those at risk for negative outcomes early on. For example, given that negative emotionality is proximally related to depression and other psychological difficulties (Klein et al., 2012), a child who is at risk based on their levels of negative emotionality may not remain at risk relative to others if their temperament is not stable over time. Finally, as Ferguson (2010) notes, this issue is not just relevant to psychological research and practice. There is a widespread belief that temperament is highly stable over childhood, which often leads to the assumption that any effort to change maladaptive temperament in a child is in vain.
While many studies have examined the stability of temperament in infancy and from early to middle childhood (e.g. Carranza, González–Salinas, & Ato, 2013; Durbin, Hayden, Klein, & Olino, 2007; Dyson et al., 2015; Komsi et al., 2008; Komsi et al., 2006; Rothbart, Derryberry, & Hershey, 2000), few have examined the longer term stability of temperament from early childhood to early adolescence. Such information is crucial to our understanding of the extent to which early temperament is stable over periods of more than a few years. Moreover, as the period from childhood through adolescence is characterized by critical changes in key psychological processes involved in individual differences (e.g. self–concept and identity, emotional reactivity, and executive functioning) (Rothbart, Ahadi, et al., 2000; Sanson et al., 2004) and by evolving developmental pressures from the environment (e.g. Bakermans–Kranenburg, Van Ijzendoorn, & Juffer, 2003), it is important to document how individual differences in traits manifest and change across this developmental span.
Another issue is that the vast majority of studies have relied exclusively on parent reports of child temperament (e.g. Lemery, Goldsmith, Klinnert, & Mrazek, 1999; Pedlow, Sanson, Prior, & Oberklaid, 1993; Roberts & DelVecchio, 2000; Rothbart, Derryberry, et al., 2000). Fewer studies have used other assessment methods, such as laboratory and home observations. As parent reports are only modestly associated with lab–based and home–based observations of child temperament (Durbin et al., 2007), it is important to examine stability using multiple assessment approaches. Moreover, even less is known about the relationships of parent reports and observational measures of early childhood temperament with later child self–reports of traits (Mangelsdorf, Schoppe, & Buur, 2000).
Finally, rank–order stability in this literature is typically assessed via test–retest zero–order correlations, which may be attenuated due to measurement error at each time point while also inflated via shared method variance in terms of both informant and self–report methods (Ferguson, 2010; Roberts, Caspi, & Moffitt, 2001). Studies examining the stability of latent temperament traits have been limited to toddlerhood or early to middle childhood (e.g. Dyson et al., 2015; Komsi et al., 2006, 2008; Neppl et al., 2010; Pedlow et al., 1993). Only one study of which we are aware has examined the rank–order stability of latent temperament traits based on both parent–report and lab–based observations (Majdandžić & Van Den Boom, 2007), and that was limited to a seven–month assessment interval with 94 young children. To address these limitations and provide a comprehensive examination of the stability of temperament from early childhood to early adolescence, we examined the rank–order stability of lab–based observations of temperament as well as mother and father reports in a large sample of three–year–olds with mother, father, and child reports of temperament in early adolescence.
Models of the structure of temperament
The content of temperament measures necessarily changes across developmental periods such that indicators in early childhood will be different from those in early adolescence (Neppl et al., 2010). As such, it is necessary to measure developmentally appropriate manifestations of the same traits at different ages. Theoretical models that specify core aspects of temperament are necessary to guide such work. The three–factor model proposed by Rothbart and colleagues (Rothbart, 1981; Putnam, Ellis, & Rothbart, 2001) is arguably the most widely used contemporary model of childhood temperament (De Pauw & Mervielde, 2010; Zentner & Bates, 2008). It comprises the higher order factors of surgency, including high–intensity pleasure, activity, impulsivity, and shyness (reversed); negative emotionality, including discomfort, anger/frustration, sadness, fear, and soothability (reversed); and effortful control (or constraint), including inhibitory control, attentional focusing, perceptual sensitivity, and low–intensity pleasure (Rothbart, Ahadi, Hershey, & Fisher, 2001). This model maps fairly closely onto taxonomies of personality and temperament derived from studies of adults (Caspi et al., 2005; Rothbart, Ahadi, et al., 2000), the best known of which are the five–factor and three–factor models (John, Naumann, & Soto, 2008; Markon, Krueger, & Watson, 2005; Tackett, Krueger, Iacono, & McGue, 2008; Watson, Clark, & Harkness, 1994). Tellegen's (1985) influential three–factor model comprises positive emotionality, or a tendency towards positive affect and engagement with the environment; negative emotionality, or a propensity towards anxiety, anger, and fearfulness; and effortful control/constraint, or, at the opposite pole, disinhibition, or a tendency toward impulsivity, risk–taking, and unconventional behaviour.
These higher order trait dimensions encompass and explain many of the narrower trait dimensions from the different models of child and adult temperament (Caspi et al., 2005; De Pauw & Mervielde, 2010; Markon et al., 2005; Zentner & Bates, 2008).
Temperamental stability in younger children
Although it is somewhat unclear how rank–order stability of temperament in early childhood will extend to adolescence, shorter term studies can inform expectations (Roberts & DelVecchio, 2000). A meta–analysis of longitudinal studies found average respective retest stabilities of .32, .52, and .45 across all traits in the periods from birth to 3 years, 3 to 6 years, and 6 to 12 years old (Roberts & DelVecchio, 2000). However, these estimates were not adjusted for measurement error, relied primarily on parent reports of childhood temperament, and did not examine stability from early childhood to early adolescence.
More recent studies have reached similar conclusions about the rank–order stability of temperament from early to middle childhood. Several studies have examined the stability of temperament via latent variables and reported modest (∼.20–.35) to moderate (∼.35–.50) stabilities from early to middle childhood for negative emotionality, positive emotionality, and effortful control (e.g. Carranza et al., 2013; Durbin et al., 2007; Dyson et al., 2015; Komsi et al., 2008; Komsi et al., 2006; Neppl et al., 2010; Rothbart et al., 2001). Most of these studies, too, have similar limitations as those in Roberts and DelVecchio's (2000) meta–analysis in that they rely exclusively on one method such as parent reports (Komsi et al., 2008; Komsi et al., 2006; Neppl et al., 2010; Rothbart et al., 2001) or lab reports (Dyson et al., 2015), do not adjust for measurement error by creating latent variables (Rothbart et al., 2001), and/or examine stability over relatively limited periods of development (Carranza et al., 2013; Durbin et al., 2007; Dyson et al., 2015; Komsi et al., 2008; Komsi et al., 2006; Rothbart et al., 2001).
We are aware of only two studies that have followed a sample from early childhood to early adolescence or beyond. Guerin and Gottfried (1994) found that parent reports of child temperament showed non–significant to moderate stability from age 2 to 12 years. Specifically, examining lower order facets of temperament, they reported test–retest correlations ranging from .00 (adaptability) and .09 (persistence) to .23 (mood) and .30–.40 (activity, approach, intensity, and distractibility). However, they did not incorporate child reports or behavioural assessments or create latent variables to adjust stability estimates for measurement error. Similarly, in what is to our knowledge the longest term study to examine continuity of early childhood temperament, Caspi and Silva (1995) used observers’ ratings of temperament in three–year–old children completing cognitive and motor tasks to cluster children into groups and found significant differences between subgroups on self–reports of positive and negative emotionality and constraint (versus disinhibition) at age 18 years. However, as noted by Caspi and Silva (1995) and Caspi, Henry, McGee, Moffitt, and Silva (1995), it is unclear whether these motor and cognitive tasks adequately elicited the range of behaviours of interest. Moreover, they did not use constructs derived from contemporary models of temperament, making it difficult to compare their findings with the current literature. For instance, their ‘inhibited’ cluster of children showed a range of behaviours that are consistent with low positive emotionality, high negative emotionality, and high effortful control. At the age 18 follow–up, they relied solely on participant reports, whereas parent reports may still have provided useful information about their late adolescent children's temperament.
Analyses in both studies also did not adjust for measurement error, and while the long–term follow–up in their study is a significant strength, it is unclear how results would extend from early childhood to preadolescence rather than late adolescence. Finally, neither of these studies integrated standardized lab–based assessments of early childhood temperament with parents’ reports. Thus, knowledge of the rank–order stability of temperament from early childhood to early adolescence is surprisingly limited.
The current study builds upon previous work by integrating, at age 3, standardized lab–based assessments of temperament as well as widely used questionnaires from both parents, both of which are grounded in well–established contemporary models of temperament. Further, we integrated child, mother, and father reports in early adolescence using a widely used, well–validated measure of temperament that was designed to map on to the most widely accepted three–factor models of child temperament.
Integrating multiple sources of information to study the stability of childhood temperament
Parents’ reports are immensely valuable because they draw on observations over extensive periods of time and in a variety of contexts. However, they may be confounded by a variety of reporting biases (Christensen, Margolin, & Sullaway, 1992; Durbin & Wilson, 2012; Jensen, Traylor, Xenakis, & Davis, 1988; Webster–Stratton, 1988; Youngstrom, Izard, & Ackerman, 1999), and stability may be inflated by shared method variance and stability of parental perceptions rather than child behaviour. Another approach to assessing temperament uses lab–based tasks designed to evoke the affects and behaviours characterizing different temperament traits (Goldsmith, Reilly, Lemery, Longley, & Prescott, 1995). However, this approach also has potential limitations, such as concerns about ecological validity, although parents consider their child's responses during laboratory temperament assessments as highly typical of their behaviour outside the laboratory (Lo, Vroman, & Durbin, 2015). There may also be transient influences such as mood states as well as restrictions in the range of affect and behaviour elicited in the child. As parent reports are only modestly associated with observational measures of child temperament, it is likely that both approaches provide unique perspectives (Durbin et al., 2007; Mangelsdorf et al., 2000).
In older youth and adults, temperament is often assessed via self–report. Self–reports can provide critical information about traits often unknown to other informants, such as parents and observers. Moreover, adolescent self–report instruments can map directly onto measures of adult personality and temperament, potentially providing a bridge between assessments of early childhood and adult temperament.
We propose that the most informative and comprehensive approach to understanding the stability of temperament is to integrate lab–based observations with both mother and father reports in early childhood and to examine their association with mother, father, and child reports in early adolescence. Each method provides important and informative but potentially incomplete information regarding children's temperament. This approach will also minimize shared method variance, thereby yielding more accurate estimates of the true rank–order stability of temperament over this span of development. Including both parents’ reports of their child's temperament also mitigates parent–specific biases. Lab–based measures also provide an objective measure, whereas parents often have no comparison or norms by which to answer questionnaires. By measuring temperament via latent constructs at each age, analyses also avoid attenuation of stability estimates because of measurement error. Finally, a latent variable modelling approach also permits the extraction of variance shared between each parent's report and lab–observations at age 3 and mother, father, and child reports at age 12, presumably providing a more valid measure of temperament.
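The attenuation argument above can be illustrated with the classical (Spearman) correction for attenuation. This is a minimal sketch only: it is not the latent variable procedure used in this study, the function name is hypothetical, and the input values are invented for illustration rather than taken from our data.

```python
import math


def disattenuate(r_observed: float, rel_time1: float, rel_time2: float) -> float:
    """Classical correction for attenuation.

    Divides an observed test-retest correlation by the square root of the
    product of the two measures' reliabilities. A simplified stand-in for
    what latent variable models accomplish; NOT the SEM approach itself.
    """
    if not (0 < rel_time1 <= 1 and 0 < rel_time2 <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(rel_time1 * rel_time2)


# Hypothetical illustration: an observed stability of .30 measured with
# reliabilities of .80 and .75 corresponds to a corrected value near .39.
corrected = disattenuate(0.30, 0.80, 0.75)
print(round(corrected, 3))
```

With perfectly reliable measures (both reliabilities equal to 1) the corrected and observed coefficients coincide; the lower the reliabilities, the more the observed coefficient understates the underlying stability.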
There are, however, potential limitations to this approach. It is assumed in structural equation modelling (SEM) that the non–shared variance that is not included in the latent variable is measurement error, rather than variance in the true temperament construct. However, it is also possible that the non–shared variance provides valid and informative informant–specific or method–specific information. That is, while SEM is widely considered an appropriate way to assess the variance in an underlying construct that is imperfectly measured by several observed constructs, there may be valid and important variance that is lost by this approach. In addition, the present study examined only two time points and, because of the lengthy developmental span, was forced to use different measures at each occasion. Hence, it is limited to examining rank–order stability. Identical measures and three or more observations would be required for examining growth trajectories over time.
Using confirmatory factor analysis (CFA) to construct temperament and personality models with multiple latent factors can also be challenging (Hopwood & Donnellan, 2010). Exploratory factor analyses of multi–factorial personality measures typically yield a good fit to the data as they allow cross–loadings of items and scales across multiple latent factors. However, because of these cross–loadings, CFA approaches to modelling multi–factorial personality measures typically show a poor fit to the data (Hopwood & Donnellan, 2010; Marsh, Hau, & Wen, 2004; Marsh, Scalas, & Nagengast, 2010). More complex models that explicitly incorporate latent methods factors (e.g. Podsakoff, MacKenzie, Lee, & Podsakoff, 2003) often fail to identify or converge. As such, in the current study, we constructed latent variables in separate models for each of the Big Three temperament traits.
Measurement of temperament in early adolescence
Given that the same measures of particular temperament traits cannot be used across widely spaced developmental periods, it is necessary to ensure that the different measures used validly assess those constructs. In the current study, at the age 12 assessment, the Schedule for Non–adaptive and Adaptive Personality for Youth (SNAP–Y; Linde, Stringer, Simms, & Clark, 2013) was administered. The adult SNAP and SNAP–Y temperament scales were derived from the General Temperament Survey (Watson & Clark, 1992), which was based on Tellegen's (1985) three–factor model of temperament, and contain the three factors of negative emotionality, positive emotionality, and disinhibition, with the latter mapping onto effortful control (Rothbart & Bates, 2006). Clark and Watson (2008) have shown that the Big Three factors meet all the criteria generally used to define temperament, in that they refer to patterns of affect and behaviour that are relatively stable over time and are largely heritable but also influenced by the environment. Moreover, Rothbart and Bates (2006) and Shiner and Caspi (2003) note the substantial continuity between three–factor models of temperament in children, as examined in the current study, and three–factor models of personality and temperament in adults. These theorists argue that these three temperament dimensions are the foundation for later personality and ultimately differentiate over development in a broader array of personality traits, such as the Big Five (Ready & Clark, 2002; Rothbart & Bates, 2006).
Overview and hypotheses
The goal of this study was to examine the stability of childhood temperamental negative emotionality, positive emotionality, and effortful control/constraint over a nine–year period from age 3 years to age 12. First, we examined zero–order correlations of each trait across different measurement methods and informants. We expected to find the strongest convergence when assessed both within–informant and within–construct. We also expected to observe significant but weaker convergence when the same construct was assessed via parent report but across mothers versus fathers. We expected the lowest, but still significant, convergence between lab–based measures at age 3 and parent reports at age 12. Given the lack of prior research on this topic, we had no a priori hypotheses for whether lab versus parent reports at age 3 would best predict age 12 child reports. Analyses also examined whether age 3 constructs correlated more strongly with their corresponding age 12 counterpart than with non–matching temperament traits (i.e. discriminant validity).
Second, we constructed latent temperament constructs at age 3 based on mother and father reports and lab–based observations. We also constructed latent age 12 temperament constructs based on mother, father, and child reports. We then examined rank–order stability between these latent constructs. We expected these stability coefficients to be moderate in strength but higher than the bivariate test–retest correlations. By incorporating multiple informants and multiple methods and following a large cohort of three–year–old children over a nine–year period, the current study represents the most comprehensive examination of the stability of early childhood temperament of which we are aware.
Methods
Open materials and open data statement
Although the data have not been deposited as they are being used in ongoing projects, relevant data, procedures, and syntax for models are available upon request from Dr. Klein.
Participants
Participants were 559 three–year–old children and their parents living in Long Island, New York, who were recruited as a part of a longitudinal study of children's temperament (see Olino, Klein, Dyson, Rose, & Durbin, 2010 for details). The mean age of the children at baseline was 43.5 months (SD = 2.8).
In 2004–2007, participants were recruited through a commercial mailing list and screened by phone. Eligible children had no significant medical problems or developmental disabilities and had at least one English–speaking biological parent who could participate. Most children were European American and non–Hispanic (86.9%), lived with both biological parents (94.6%), and came from middle–class families, as measured by the Four Factor Index of Social Status (M = 45.33, SD = 10.99; Hollingshead, 1975). Families in this study had an average annual household income of approximately $100 000, which is highly comparable with the average household income in the geographic area from which this sample was drawn.
The effective sample size was based on the number of children who completed the Laboratory Temperament Assessment Battery (Lab–TAB) at baseline. Of the 559 families whose children completed the Lab–TAB at baseline, 41 mothers and 158 fathers did not complete parent–report temperament measures. Experimenters also rated participants’ traits across the entire lab visit (‘Experimenter–Impressions’); these scores were unavailable for 22 children. Participants with missing data at baseline did not differ from those with complete data for mothers, fathers, and lab–based observations in terms of sex, socio–economic status, living in a two–parent home, race, ethnicity, or any measure of temperament at age 3 (all ps > .05). Participants were assessed again at age 12 (2014–2016, M age = 151.95 months, SD = 5.53). Of the 559 original participants, 423 children, 426 mothers, and 359 fathers completed questionnaires at the age 12 follow–up. Participants with missing data at the age 12 assessment did not differ from those with complete data at both time points on any variable assessed in this study (all ps > .05). Little's missing completely at random test also confirmed that missingness was not significantly related to any variable in our study: χ²(436) = 482.103, p = .06. Data can thus be viewed as missing at random for analyses. Full information maximum likelihood (FIML) procedures in AMOS 22.0 were used to estimate the means and intercepts in the presence of missing data. This approach is generally acknowledged to be preferable to listwise deletion or mean imputation, which are more likely to yield biased estimates (e.g. Schafer & Graham, 2002). Our effective sample size in latent models was therefore 559, although sample sizes for zero–order correlations varied.
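For readers who wish to see the available sample sizes implied by the counts above, the arithmetic can be sketched as follows. The counts are those reported in this section; the variable names are illustrative only.

```python
BASELINE_N = 559  # children who completed the Lab-TAB at age 3

# Counts reported above: measures missing at baseline
baseline_missing = {
    "mother CBQ": 41,
    "father CBQ": 158,
    "experimenter impressions": 22,
}

# Counts reported above: informants completing the age 12 questionnaires
followup_completed = {"child": 423, "mother": 426, "father": 359}

# Available n per baseline measure
baseline_available = {
    measure: BASELINE_N - n_missing
    for measure, n_missing in baseline_missing.items()
}

# Proportion of the baseline sample with each age 12 report
followup_rate = {
    informant: n / BASELINE_N for informant, n in followup_completed.items()
}

for measure, n in baseline_available.items():
    print(f"Age 3 {measure}: n = {n}")
for informant, rate in followup_rate.items():
    print(f"Age 12 {informant} report: {rate:.1%} of baseline sample")
```

These per-measure counts are why sample sizes for zero–order correlations vary, whereas the FIML latent models use all 559 cases.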
Procedure
At age 3, children participated in the Lab–TAB (Goldsmith et al., 1995). Mothers and fathers completed the Child Behaviour Questionnaire (CBQ; Rothbart et al., 2001). At age 12, mothers, fathers, and children completed the adaptive temperament subscales of the SNAP–Y (Linde et al., 2013). See Table 1 for a summary of the variables, measures, and constructs assessed at each time point.
Table 1. Summary of variables used to measure theoretical constructs
Note: Lab–TAB, Laboratory Temperament Assessment Battery; SNAP–Y, Schedule for Non–adaptive and Adaptive Personality for Youth; CBQ, Child Behaviour Questionnaire.
Materials
Lab–based observations of temperament
At age 3, child negative emotionality, positive emotionality, and impulsivity were coded by research assistants based on videotapes of the Lab–TAB as well as experimenter–impression ratings. Before coding independently, coders had to achieve 80% or higher agreement with an expert coder on all codes within an episode. Twelve age–appropriate tasks were used (Dyson et al., 2015; Olino et al., 2010). Most were adapted from tasks used in the developmental literature and were designed to elicit a range of temperament–relevant emotions and behaviours from the child (Gagne, Van Hulle, Aksan, Essex, & Goldsmith, 2011; Goldsmith et al., 1995). For a full description of all episodes at each assessment wave, see Appendix A.
During each task, each instance of children's bodily, vocal, and facial expressions of positive affect, sadness, anger, and fear was rated on a 3–point scale (i.e. low, moderate, and high intensity). Intensity ratings were then summed for each channel (i.e. facial, bodily, and vocal) within each episode. Following this, the intensity ratings were averaged within each channel across all episodes for each of the affective traits. Standardized scores for the three channels were then averaged to comprise the composite score for each affect. Interest/engagement was rated on a 4–point Likert scale based on each full episode and reflected the child's tendency to display interest in stimuli, ask questions, or make comments and their general level of engagement. Impulsivity/disinhibition was also rated on a 4–point Likert scale based on the entire episode and reflected the child's tendency to act or respond without reflection or hesitation, as indicated by quick changes in behaviour or shifts in attention. Impulsivity was reversed so that scores were in the same direction as effortful control subscales of the CBQ. Standardized ratings of sadness, anger, and fear were combined to create a total negative emotionality scale, and standardized ratings of positive affect and interest/engagement were composited to create a positive emotionality scale. Coefficient alphas for negative and positive emotionality were both .82, and the intraclass correlation coefficient for interrater reliability (n = 35) for negative and positive emotionality was .74 and .89, respectively. Alpha for impulsivity was .70, with an intraclass correlation coefficient of .75. The Lab–TAB shows moderate test–retest stability and construct validity in terms of associations with independent ratings by experimenters, unstructured home observations, and diagnostic interview assessments of child psychopathology (Dougherty et al., 2011; Durbin et al., 2007; Gagne et al., 2011).
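The compositing steps for the affect codes described above can be sketched as follows. This is an illustration under stated assumptions rather than the scoring syntax used in the study: the data layout, the function names, and the use of the population standard deviation for standardizing are all assumptions, and the toy ratings are invented.

```python
from statistics import mean, pstdev

CHANNELS = ("facial", "bodily", "vocal")


def zscores(values):
    """Standardize across children (population SD is an assumption here)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]  # requires some variation (s > 0)


def affect_composite(ratings):
    """ratings[child][episode][channel] is a list of intensity codes
    (1 = low, 2 = moderate, 3 = high), one per coded instance.

    1. Sum intensities within each channel for each episode.
    2. Average those sums across episodes, per channel.
    3. Standardize each channel across children.
    4. Average the three standardized channels into one composite.
    """
    per_channel = {
        ch: [mean(sum(episode[ch]) for episode in child) for child in ratings]
        for ch in CHANNELS
    }
    standardized = {ch: zscores(vals) for ch, vals in per_channel.items()}
    return [
        mean(standardized[ch][i] for ch in CHANNELS)
        for i in range(len(ratings))
    ]


# Invented toy data: three children, two episodes each, one affect (e.g. anger)
toy = [
    [{"facial": [1], "bodily": [], "vocal": []},
     {"facial": [], "bodily": [1], "vocal": []}],
    [{"facial": [1, 2], "bodily": [1], "vocal": [1]},
     {"facial": [2], "bodily": [1], "vocal": []}],
    [{"facial": [3, 2], "bodily": [2, 2], "vocal": [3]},
     {"facial": [3], "bodily": [2], "vocal": [2]}],
]
scores = affect_composite(toy)
print([round(s, 2) for s in scores])
```

In this toy example the third child, who shows the most frequent and intense expressions in every channel, receives the highest composite. The same routine would be applied separately to each affect before the standardized affect composites are combined into the negative and positive emotionality scales.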
The individual conducting the laboratory visit with the participant completed a set of global ratings about the child (Experimenter Impressions). These were adapted from the post–Lab–TAB rating scale developed by Gagne et al. (2011). The experimenter was with the child from the moment the family arrived on campus until they walked the participant and parent back to the parking garage after the assessment, including during the unstructured play breaks between each of the 12 Lab–TAB episodes. Thus, experimenter ratings are both more global and based on a larger sample of child behaviour. Experimenters rated the participant on 24 scales, 15 of which were used here, and from which we derived overall measures of negative emotionality, positive emotionality, and effortful control/constraint. Each variable was rated on a single 5–point Likert scale (1 = rarely, 2 = subtle or ambiguous signs, 3 = mild, 4 = moderate, 5 = extreme). We averaged ratings of relevant items to create temperament scales. Negative emotionality was composed of overall negative affect, fearfulness, frustration with tasks, anger or irritability, and sadness (Cronbach alpha = .79). Positive emotionality was composed of overall positive affect, interest in test materials and stimuli, enthusiasm toward tasks, initiative, and anticipatory positive affect (alpha = .89). Effortful control/constraint was composed of attention to tasks, persistence in completing tasks, and impulsiveness (alpha = .75). Interrater reliabilities are not available as there was only one experimenter per Lab–TAB assessment.
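As a reminder of how the internal consistency coefficients reported here are defined, the following sketch computes Cronbach's alpha from a respondents–by–items matrix using the standard formula. The function name and the toy ratings are illustrative assumptions; the alphas reported in this article were computed from the actual data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of k item ratings per child (k >= 2).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented ratings for four children on three 5-point items
toy_ratings = [
    [1, 2, 1],
    [2, 3, 2],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(toy_ratings)
print(round(alpha, 3))  # 22/23, approximately 0.957

# Scale scores are the mean of the relevant items, as described above
scale_scores = [sum(row) / len(row) for row in toy_ratings]
```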
Parent report of child temperament at age 3
Mothers and fathers completed the CBQ (Rothbart et al., 2001) at baseline. The CBQ is currently the most widely used parent–report measure of early childhood temperament (Putnam, Gartstein, & Rothbart, 2006; Rothbart et al., 2001). It is a 195–item questionnaire, with each item rated on a 7–point Likert scale ranging from ‘extremely untrue of your child’ to ‘extremely true of your child’. Responses are averaged to create a total score. The present study used the higher order effortful control and negative affectivity scales. However, surgency, the third higher order factor on the CBQ, has broader content than the construct of positive emotionality. Therefore, we used the CBQ lower order facets of approach/anticipation and smiling/laughter as indicators of positive emotionality as these are the two CBQ subscales that most closely correspond to this trait. In prior research, these scales have shown good internal consistencies and test–retest reliability, and correlations with expected outcomes such as social behaviour (Rothbart et al., 2001), although it should be noted that recent CFA–based research has not supported the proposed structural properties of the CBQ (Kotelnikova, Olino, Klein, Kryski, & Hayden, 2016). In the current sample, alphas for each subscale used in this report ranged from .63 (sadness) to .79 (anger/frustration).
Age 12 temperament
Age 12 temperament was assessed via mother reports, father reports, and self–reports about the child on the adolescent version of the Schedule for Non–adaptive and Adaptive Personality (SNAP–Y; Linde et al., 2013). The SNAP–Y is a 390–item factor–analytically derived measure with true/false response options. It contains three temperament scales, 15 lower order personality trait dimensions across the continuum from normal to abnormal personality functioning, and six validity scales. We administered only the three temperament scales: negative temperament (referred to here as negative emotionality), positive temperament (referred to here as positive emotionality), and disinhibition. Negative emotionality (28 items) assesses tendencies towards anger, sadness, fear, and distress (e.g. ‘I rarely get so angry that I lose control’, and ‘I often feel nervous and tense’). Positive emotionality (28 items) assesses tendencies towards positive affect, activity, enjoyment, and pleasure (e.g. ‘I lead an active life’, and ‘I get excited when I think about the future’). Disinhibition (35 items) assesses the extent to which behaviours are planned, thought through, and not impulsive (e.g. ‘I rarely, if ever, do anything reckless’, and ‘I never buy things on a whim or impulse’). Thus, these constructs correspond closely with the temperament constructs assessed at age 3 by the Lab–TAB and CBQ.
The SNAP and SNAP–Y show good convergent validity with a range of normal and abnormal personality traits and forms of psychopathology (e.g. Clark, McEwen, Collard, & Hickok, 1993; Linde et al., 2013; Melley, Oltmanns, & Turkheimer, 2002; Watson, 2000; Watson & Clark, 1992). In the current study, alphas for child, mother, and father reports of negative emotionality were .90, .88, and .89, respectively. For positive emotionality, child, mother, and father reports had alphas of .80, .85, and .86, respectively. For disinhibition, alphas for child, mother, and father reports were .80, .84, and .83, respectively. Correlations with other individual differences measures in this study support the convergent and discriminant validity of the SNAP–Y and are provided in Table S6.
The child and the parent attending the lab–visit completed the SNAP–Y on a computer during that lab visit, while the other parent completed it electronically at home. There were no time restrictions.
Data analyses
Analyses initially consisted of zero–order correlations between ages 3 and 12 temperament variables that were corrected for attenuation due to unreliability (Muchinsky, 1996), as measured by Cronbach alpha values of internal consistency. As noted by Ferguson (2010) in his meta–analysis, zero–order correlations as measures of the stability of temperament are attenuated by 19–26% because of unreliability. The formula for correcting for attenuation is $r_{xy} / \sqrt{r_{xx} r_{yy}}$, where $r_{xy}$ is the raw correlation between x and y, $r_{xx}$ is the Cronbach alpha of x, and $r_{yy}$ is the Cronbach alpha of y. Uncorrected correlations and their two–tailed p–values are shown in Table 2, with the corrected correlations in Table 3. Adjusting a correlation for unreliability does not alter its p–value. These correlations examined the convergence and divergence of measures of temperament across methods, informants, and variables from ages 3 to 12. Where an age 3 trait correlated with multiple informants’ reports at age 12 on the same trait, or where multiple informants’ reports of a specific age 3 trait correlated with a particular age 12 score, correlations were z–scored and compared in order to test the significance of the difference in the strength of the associations. Significant differences between correlations are shown in Table 4, while results of all comparisons between correlations are shown in Tables S3–S5.
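The correction for attenuation and the comparison of z–transformed correlations described above can be illustrated with a short sketch. The disattenuation function follows the formula given in the text. The comparison function is an assumption: it uses the standard Fisher r–to–z test for two independent correlations, whereas the text does not specify which test was used, and correlations drawn from overlapping samples would normally call for a dependent–correlations test. All function names and input values are hypothetical.

```python
import math


def disattenuate(r_xy, r_xx, r_yy):
    """Correct a raw correlation for unreliability in both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


def fisher_z(r):
    """Fisher r-to-z transformation."""
    return math.atanh(r)


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-tailed z test for the difference between two independent correlations.

    Assumption: the two correlations come from independent samples.
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed p from the normal distribution
    return z, p


# Hypothetical raw correlation of .30 between scales with alphas of .79 and .90
print(round(disattenuate(0.30, 0.79, 0.90), 3))
# Hypothetical comparison of two correlations from samples of 400 each
print(compare_independent_correlations(0.35, 400, 0.15, 400))
```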
Raw correlations between ages 3 and 12 temperament variables
Note: Sample size (N) is included as this varied for each correlation. Values in parentheses under each correlation are p–values. Significant correlations (p < .05) are bolded. LT, Laboratory Temperament Assessment Battery; EI, Experimenter Impressions during the Laboratory Temperament Assessment Battery; NE, negative emotionality; PE, positive emotionality; EC, effortful control; CBQ, Child Behaviour Questionnaire; NA, negative affectivity; SNAP, Schedule for Non–adaptive and Adaptive Personality for Youth; NT, negative temperament; PT, positive temperament; DIS, disinhibition; Mom, mother–rated; Dad, father–rated; Child, child–rated.
Unreliability–adjusted correlations between ages 3 and 12 temperament variables
Note: Sample size (N) is included as this varied for each correlation. See Table 2 for exact p–values. LT, Laboratory Temperament Assessment Battery; EI, Experimenter Impressions during the Laboratory Temperament Assessment Battery; NE, negative emotionality; PE, positive emotionality; EC, effortful control; CBQ, Child Behaviour Questionnaire; NA, negative affectivity; SNAP, Schedule for Non–adaptive and Adaptive Personality for Youth; NT, negative temperament; PT, positive temperament; DIS, disinhibition; Mom, mother–rated; Dad, father–rated; Child, child–rated.
p < .01.
p < .05.
p < .10.
Structural equation modelling was carried out using AMOS 22.0. We examined the rank–order stability between temperament at age 3 measured via a latent variable indicated by mother and father CBQ reports, and Lab–TAB and Experimenter Impressions scores, and at age 12 via a latent variable indicated by mother, father, and child reports on the SNAP–Y. Thus, ages 3 and 12 temperament were modelled simultaneously. At age 3, latent negative emotionality was indicated by mother and father reports of CBQ negative affectivity as well as negative emotionality from the Lab–TAB and Experimenter Impressions. Latent positive emotionality was indicated by mother and father reports of smiling–laughter and approach–anticipation on the CBQ, and Lab–TAB and Experimenter Impressions positive emotionality. Effortful control was indicated by mother and father reports of effortful control on the CBQ with Lab–TAB impulsivity (reversed) and Experimenter Impressions effortful control/constraint. At age 12, latent variables of negative emotionality, positive emotionality, and disinhibition were each composed of mother, father, and child reports on the relevant manifest indicator. Given that ratings from Experimenter Impressions and the Lab–TAB were based on overlapping samples of behaviour, error terms on Lab–TAB and Experimenter Impressions variables were covaried within each construct a priori in each model.
Ideally, the interrelationships between all three latent factors at both ages 3 and 12 should be examined in the same model. However, consistent with prior findings that CFA approaches to modelling the structure of personality or temperament typically show a poor fit to the data (Hopwood & Donnellan, 2010; Marsh et al., 2004; Marsh et al., 2010), a three–factor model of temperament at age 3 including covariances on error terms within each construct of lab–based variables showed a poor fit to the data, χ2 (71) = 694.70, p < .001, comparative fit index (CFI) = .70, root mean square error of approximation (RMSEA) = .13, 90% confidence interval (CI) [0.12, 0.13].
We also attempted to model all three age 3 temperament traits in one model following Podsakoff's proposed multi–trait multi–method CFA (see tables 4 and 5 in Podsakoff et al., 2003). Latent variables were indicated by separate mother, father, and lab–based scores, thereby partitioning variance attributable to the informant or method into separate latent factors, while retaining separate latent factors for negative emotionality, positive emotionality, and effortful control (Figure S1). We also computed a similar model but with only two latent methods factors, one for lab–based variables and another for all parent reports (Figure S2). However, consistent with the Podsakoff et al. (2003) observation that the most frequent problem with these models is that they are not identified, neither model was identified. Others have also noted frequent convergence problems with these models (Kenny & Kashy, 1992).
We also computed Podsakoff's correlated uniqueness model with all three traits at age 3 (table 4 in Podsakoff et al., 2003). This model accounts for methods effects by allowing error terms of specific measurement methods to be correlated. Thus, covariances were included between all parent–report variable error terms and between all lab–measured variable error terms (Figure S2). Again consistent with the Podsakoff et al. (2003) point that these models often have identification problems, this age 3 model was not identified. A three–factor model of SNAP–Y scales at age 12 similarly showed a poor fit to the data, χ2 (24) = 176.23, p < .001, CFI = .84, RMSEA = .11, 90% CI [0.09, 0.12]. Given that we were unable to model all three latent factors in one model at age 3, we did not compute similar models with age 12 variables. Therefore, we examined the stability of each trait in separate models.
In order to examine heterotypic continuity, we imputed factor score estimates from the model for each of the latent variables described earlier based on the factor loadings from our models that contained only one age 3 and one age 12 latent variable. That is, negative emotionality at ages 3 and 12, positive emotionality at ages 3 and 12, and effortful control at ages 3 and 12 were imputed separately, and each was used as a manifest (observed) variable in our models examining heterotypic continuity. AMOS uses regression imputation with FIML for estimating missing data. This is identical to applying the factor score weights to each individual's scores on the observed indicators. Estabrook and Neale (2013) note that when data are missing at random, FIML more accurately estimates individual factor scores with missing data compared with sum scores, mean imputation, or regression estimators that do not use FIML. We then tested a path model in which each age 12 latent temperament variable (i.e. negative and positive emotionality and disinhibition) was simultaneously regressed on all three age 3 latent temperament variables (i.e. negative emotionality and positive emotionality and effortful control). Thus, this analysis examined whether each age 3 trait predicted each age 12 trait, adjusting for that trait's respective age 3 levels.
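To illustrate what it means for each age 12 score to be regressed simultaneously on all three age 3 scores, the sketch below computes standardized path coefficients from a correlation matrix by solving the normal equations Rβ = r. With observed variables, complete data, and a saturated model, these standardized regression weights correspond to the path coefficients. The sketch is not the AMOS procedure used here, and the correlation values and function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def standardized_paths(r_predictors, r_with_outcome):
    """Standardized weights for one outcome regressed on all predictors at once."""
    return solve_linear(r_predictors, r_with_outcome)


# Hypothetical correlations among age 3 NE, PE, and EC scores (not study data)
r_age3 = [
    [1.00, -0.20, -0.30],
    [-0.20, 1.00, 0.25],
    [-0.30, 0.25, 1.00],
]
# Hypothetical correlations of each age 3 score with age 12 NE
r_with_ne12 = [0.35, -0.10, -0.25]
print([round(b, 3) for b in standardized_paths(r_age3, r_with_ne12)])
```

Each coefficient is the association of one age 3 trait with the age 12 outcome after adjusting for the other two age 3 traits, which is the sense in which a heterotypic path is estimated over and above the homotypic path.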
We are not aware of any studies that have used multiple methods and informants and created factor score estimates of the latent variables based on these observed indicators in separate models. However, it is routine to create constructs based on averages of either parent reports or behavioural episodes, and then to use those to examine heterotypic stabilities over time (e.g. Majdandžić & Van Den Boom, 2007; Pesonen, Räikkönen, Keskivaara, & Keltikangas–Järvinen, 2003; Putnam, Rothbart, & Gartstein, 2008). Komsi et al. (2006) examined heterotypic continuity of latent temperament constructs, but these constructs did not include observational measures as indicators, and so they were able to fit multi–factorial temperament models. As such, we are not aware of any studies that have taken an approach directly comparable with ours.
As measures of goodness of fit, we present chi–square, CFI, and RMSEA as well as the 90% CI around the RMSEA. Generally, CFI values greater than .90 (Hoyle & Panter, 1995) and an RMSEA less than .08 (Kline, 1998) indicate acceptable fit. Given that we expect a medium effect size in our latent stability models both in terms of loadings of observed indicators on latent variables as well as the relationships between ages 3 and 12 temperament traits, power analyses revealed that with two latent variables and nine indicators, which is the upper number of indicators in our models, a minimum sample of 90 would be required (Soper, 2017; Westland, 2010). Regression weights in our latent models are fully standardized and therefore represent effect sizes.
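The fit criteria stated above can be expressed as a simple check. The thresholds are those given in the text (CFI greater than .90 and RMSEA less than .08); the function name is hypothetical, and values that fall exactly on a threshold after rounding are treated as not meeting it, which is a choice made for this sketch only.

```python
def acceptable_fit(cfi, rmsea, cfi_min=0.90, rmsea_max=0.08):
    """Return True when both CFI and RMSEA meet the stated criteria."""
    return cfi > cfi_min and rmsea < rmsea_max


# Three-factor age 3 model reported in the text: CFI = .70, RMSEA = .13
print(acceptable_fit(0.70, 0.13))
# Latent negative emotionality stability model: CFI = .96, RMSEA = .054
print(acceptable_fit(0.96, 0.054))
```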
Results
Zero–order convergent and discriminant correlations
Descriptive statistics and within–time correlations are available in Tables S1 and S2. Zero–order correlations between age 3 lab–based and parent–reported temperament traits and age 12 mother, father, and child–reported traits are shown in Table 2, and associations adjusted for unreliability are presented in Table 3.
Negative emotionality correlations
Consistent with hypotheses, correlations were generally strongest between parent reports of negative emotionality at ages 3 and 12 compared with correlations of age 3 lab–based negative emotionality with age 12 negative emotionality, with some evidence of stronger correlations within compared with across informants (Tables 4 and S3). Also consistent with expectations, most correlations were modest in strength. Compared with parent reports of negative emotionality at age 3, Lab–TAB and Experimenter Impressions negative emotionality converged as well or better with child–reported negative emotionality at age 12. Lab–TAB negative emotionality also correlated negatively with mother–reported age 12 positive emotionality, while mother–reported age 3 negative emotionality correlated positively with mother–rated age 12 disinhibition.
Significant r–to–z zero–order correlation comparisons
Note: Each column indicates which correlation was used. Bolded correlation indicates which correlation was significantly larger. Positive emotionality correlation comparisons not listed here because none were significant. See Table 3 for correlations used to compute values in this table. LT, Laboratory Temperament Assessment Battery; EC, effortful control; Imp, impulsivity; EI, Experimenter Impressions; CBQ, Child Behaviour Questionnaire; NA, negative affectivity; SNAP–Y, Schedule for Non–adaptive and Adaptive Personality for Youth; NT, negative temperament; DIS, disinhibition; Mother, mother–rated; Father, father–rated; Child, child–rated.
Positive emotionality correlations
Also consistent with expectations of stronger within–informant than across–informant correlations, parent reports of positive emotionality at age 3 generally only converged with their own reports of positive emotionality at age 12 (Tables 4 and S4). These associations were also modest in strength. In contrast, Lab–TAB positive emotionality converged with both mother and father reports of positive emotionality at age 12. Interestingly, Lab–TAB positive emotionality showed somewhat better discriminant validity in that it was related only to age 12 positive emotionality variables, whereas parent reports of age 3 positive emotionality correlated with several measures of negative emotionality and disinhibition at age 12.
Effortful control correlations
Both lab–based measures and parent reports of effortful control at age 3 showed good convergence with all informants’ reports of age 12 effortful control (Tables 4 and S5). As expected, several of these correlations were stronger within informants than across informants or measures, although both the lab–based and parent–report measures converged highly with age 12 child reports of effortful control. However, lab–based measures of effortful control showed better discriminant validity in that they uniquely correlated with age 12 effortful control. In contrast, parent–reported age 3 effortful control showed multiple associations with measures of age 12 negative and positive emotionality.
Stability of latent constructs
Negative emotionality
Our model of latent age 3 negative emotionality predicting latent age 12 negative emotionality (Figure 1, top panel) showed a good fit to the data: χ2 (12) = 33.97, p = .001, CFI = .96, RMSEA = .054, 90% CI [0.034, 0.079]. All loadings of indicators on age 3 and age 12 negative emotionality were at least .20 and were significant at p < .001, although CBQ variables showed higher loadings than Lab–TAB or Experimenter Impressions negative emotionality. The standardized stability of age 3 to age 12 negative emotionality was .41 (p = .003), indicating moderate stability.

**p < .01. Factor loadings of observed indicators and stability of temperament from ages 3 to 12 years. All parameters are standardized estimates. Covariances on observed variables are covariances on the error terms of those variables. Error terms on endogenous variables are not depicted. Lab–TAB, Laboratory Temperament Assessment Battery; EC, effortful control; Imp, impulsivity; EI, Experimenter Impressions; CBQ, Child Behaviour Questionnaire; NA, negative affectivity; SNAP, Schedule for Non–adaptive and Adaptive Personality for Youth; NT, negative temperament; DIS, disinhibition; PT, positive temperament; Mother, mother–rated; Father, father–rated; Child, child–rated.
Positive emotionality
Our model of latent age 3 positive emotionality predicting latent age 12 positive emotionality (Figure 1, middle panel) had an adequate fit: χ2 (25) = 89.16, p < .001, CFI = .91, RMSEA = .068, 90% CI [0.053, 0.083]. All loadings of indicators on age 3 and age 12 positive emotionality exceeded .26 and were significant at or below p = .001. The standardized stability of age 3 to age 12 positive emotionality was .30 (p = .001), indicating modest stability.
Effortful control
Our model of latent age 3 effortful control predicting latent age 12 disinhibition (Figure 1, bottom panel) showed a good fit to the data, χ2 (12) = 36.21, p < .001, CFI = .98, RMSEA = .06, 90% CI [0.036, 0.083]. All loadings of indicators on age 3 effortful control and age 12 disinhibition exceeded .33 and were significant at or below p = .001. Higher scores on age 3 effortful control indicate higher levels of effortful control. The standardized stability of age 3 effortful control to age 12 disinhibition was −.53 (p < .001), indicating moderate to high stability. The stability of disinhibition was significantly greater than that of both positive emotionality (z = 2.41, p = .02) and negative emotionality (z = 2.82, p = .005). The stability of positive and negative emotionality did not significantly differ (p > .30).
Heterotypic continuity
Using the imputed ages 3 and 12 temperament scores from our latent models, we examined whether each age 3 temperament trait predicted each discriminant age 12 temperament trait after adjusting for each age 12 trait's respective level at age 3 (e.g. age 3 negative emotionality predicting age 12 disinhibition adjusting for age 3 effortful control). Given that our model examining heterotypic continuity was fully saturated, fit indices are perfect and are therefore not reported. In this model (Figure 2), in addition to significant homotypic stability coefficients, greater age 3 effortful control predicted increases in positive emotionality at age 12 and decreases in negative emotionality at age 12. Including heterotypic paths substantially attenuated the homotypic paths.

Relationships between imputed latent ages 3 and 12 temperament traits. **p < .01, *p < .05. Note: Variables are rectangles as they were used as observed variables in the model but were derived by extracting latent scores from the latent models in Figure 1. All parameters are standardized estimates. Parameters on curved arrows are correlations. Curved double–headed arrows on endogenous variables are correlations on the error terms of those variables.
Ancillary analyses
In order to ensure robustness of our results and stability estimates, latent stability models and the heterotypic continuity model were recomputed after deleting any participants who had missing data on any variable at either time point. This resulted in a sample of 260 with complete data on all variables. Results were quite similar to those from the full sample with missing data estimated. For positive emotionality, the model showed an adequate fit, χ2 (29) = 69.36, p < .001, CFI = .90, RMSEA = .08, 90% CI [0.060, 0.106], with a standardized stability estimate of .41. For negative emotionality, the model showed an adequate fit, χ2 (12) = 27.99, p = .006, CFI = .95, RMSEA = .07, 90% CI [0.037, 0.107], with a standardized stability estimate of .42. For effortful control, the model showed an adequate fit, χ2 (12) = 28.83, p = .004, CFI = .98, RMSEA = .07, 90% CI [0.039, 0.110], with a standardized stability estimate of −.57.
Results using listwise deletion for models examining heterotypic continuity were also similar to those originally reported using all available data. Age 3 effortful control predicted lower negative temperament at age 12 (β = −.19, p < .001) and greater positive temperament at age 12 (β = .15, p = .004). There was also a newly significant effect of greater negative emotionality at age 3 predicting decreases in age 12 positive temperament (β = −.14, p = .006), over and above age 3 positive emotionality. However, given that estimating missing data via FIML has been shown to reliably produce similar regression estimates as complete data sets (Schafer & Graham, 2002), we do not interpret this finding.
Discussion
This study examined the stability of temperament over almost a decade ranging from early childhood to early adolescence in a large sample that integrated both mother and father reports and lab–based observations of early childhood temperament, and mother, father, and child reports of temperament in early adolescence. At the zero–order level, mother–report and father–report measures generally showed stronger convergence within than across informants and stronger associations than did lab–based observations with age 12 parent reports. However, lab–based observations at age 3 generally showed greater specificity than parent reports in their associations with age 12 temperament, in that parent reports of early childhood temperament often correlated with non–corresponding traits at age 12. Stability estimates derived from latent models suggested at least moderate stability in temperament from ages 3 to 12. Finally, heterotypic analyses showed continuity from higher effortful control at age 3 to lower negative temperament and higher positive temperament at age 12. This suggests that effortful control, which is believed to facilitate regulation of emotional reactivity, influences the trajectories of negative and positive emotionality, both of which underpin children's emotional reactivity.
Zero–order stability and specificity of early to late childhood temperament
Mothers’ and fathers’ reports of child negative emotionality, positive emotionality, and effortful control at age 3 typically showed stronger convergence with their own report of the same trait at age 12 relative to the other parent's report. This is consistent with Neppl et al. (2010) who found stronger test–retest correlations within informants (mothers versus fathers) than across. Overall, results are consistent with the possibility that shared method variance, both in terms of method (questionnaires) and informant, increases estimates of stability. This is significant because most studies of the stability of temperament use the same measures and informants at each assessment.
Associations between age 3 parent–reported negative and positive emotionality with their respective age 12 child–reported counterparts showed some unexpected patterns. Father, but not mother, reports of negative emotionality at age 3 correlated with child–reported negative emotionality at age 12. Across mother and father reports of smiling–laughter and approach–anticipation, only father–reported smiling–laughter correlated with child reports of age 12 positive emotionality. Many would likely assume that a mother might have more detailed and accurate information regarding their young child's negative and positive emotionality. However, if child reports at age 12 are taken as an important marker of their age 12 temperament, this result suggests that fathers’ reports in early childhood provide better information in terms of their correlation with child reports in early adolescence.
The possibility that fathers have more detailed or accurate information compared with mothers regarding their young child's temperament, at least in terms of their convergence with children's reports in early adolescence, requires replication in future research. If verified, this may suggest that mothers should not be relied upon as the sole source of information regarding their child's temperament. These results may also suggest that fathers’ reports of temperament show greater utility compared with mothers’ reports in terms of their convergent validity with children's later reports. The reasons that fathers’ reports showed greater convergent validity than mothers’ with children's later self–reports cannot be gleaned from the current study. We speculate that fathers are more observant of, and attentive to, their young children's emotional expressions, which would be contrary to the oft–expressed position that mothers provide more accurate reports of child behaviour. Alternatively, fathers may be more likely to notice or attend to more extremes in children's emotional behaviour, which could lead to better discrimination between more and less emotional children. If so, this could be driving the greater convergence between early childhood father reports and early adolescent children's reports. Regardless, consistent with prior research highlighting that different informants provide unique sources of information (e.g. Achenbach, McConaughy, & Howell, 1987), these results underscore the importance of incorporating fathers’ reports in the comprehensive measurement of children's temperament, whether for research or practical/clinical purposes.
Mother–reported and father–reported age 3 effortful control correlated with age 12 child reports of disinhibition, although more weakly than with parents’ reports of disinhibition at age 12. Experimenter Impressions and Lab–TAB positive emotionality as well as Experimenter Impressions negative emotionality were not significantly associated with child reports of their corresponding constructs at age 12, although Lab–TAB negative emotionality and both lab–based observations of effortful control were. These results suggest that lab–based measures, in particular those measuring effortful control and to some extent negative emotionality, tap the early antecedents of child reports of temperament in early adolescence.
It is possible that the behaviours and affects associated with negative emotionality (e.g. sadness and anxiety) and effortful control (e.g. actively restraining an impulsive behaviour) are experienced internally to a greater degree than they are expressed externally by the child. Lab–based tasks may elicit anxiety (e.g. a stranger approaching), sadness (e.g. not receiving an expected gift), or effortful control (e.g. being forced to wait while building a tower) that may be indicative of these specific behaviours in naturalistic settings and thus better converge with child reports in early adolescence. As such, results further suggest the importance of including lab–based measures of child temperament in addition to parental reports.
Parent–rated age 3 temperament constructs often showed poor discriminant validity with age 12 temperament. This is consistent with some prior research that has used parent reports (e.g. Neppl et al., 2010). In the current study, father–reported age 3 negative emotionality showed the best convergent and discriminant validity, in that it correlated with mother, father, and child reports of negative emotionality at age 12 but not with any other age 12 variable. Mother reports of age 3 negative emotionality, however, correlated not only with mother and father age 12 negative emotionality but also with mother reports of disinhibition at age 12. While parent–reported CBQ smiling–laughter and approach–anticipation generally showed some specificity with age 12 positive emotionality, mother reports of smiling–laughter also correlated with mother–reported disinhibition at age 12. Parental reports of effortful control showed the poorest discriminant validity, in that they both correlated with parents’ reports of negative emotionality and positive emotionality at age 12, in addition to correlating with disinhibition.
Lab–TAB and Experimenter Impressions temperament variables, on the whole, showed good discriminant validity. Lab–TAB positive emotionality was uniquely related to father and mother reports of age 12 positive emotionality but was not significantly related to negative emotionality or disinhibition as rated by any informant. Similarly, Experimenter Impressions positive emotionality was uniquely related to age 12 mother–rated positive emotionality but was unrelated to any other age 12 variable. Lab–TAB impulsivity and Experimenter Impressions effortful control were uniquely related to mother, father, and child reports of disinhibition and were unrelated to any informant's report of positive or negative emotionality. Finally, Lab–TAB negative emotionality predicted child–reported age 12 negative emotionality, and Experimenter Impressions negative emotionality was specifically related to mother–reported age 12 negative emotionality. The one exception to the pattern of high discriminant validity for lab–based variables was that Lab–TAB negative emotionality at age 3 predicted lower levels of mother–rated positive emotionality at age 12.
Taken together, lab–based measures may show better discriminant validity in terms of their associations with age 12 temperament relative to parent reports of temperament in early childhood. One possibility is that parents may have somewhat misinterpreted the intent of the questions or their child's behaviour in everyday life. Similarly, the divergent correlations observed for parent reports may reflect shared method variance. Finally, we cannot rule out that what may appear to be a lack of discriminant validity is really greater sensitivity to true heterotypic continuity relative to lab–based measures. However, this would require systematic evidence across a variety of measures that particular traits predict other traits later in development. Regardless, these results further highlight the utility of incorporating lab–based measurements into the assessment of temperament in young children.
Stability of latent temperament constructs
Results from SEMs indicated that lab–based observations and mother and father reports of temperament in early childhood were viable manifest indicators of latent temperament constructs, although Lab–TAB negative emotionality and impulsivity and Experimenter Impressions negative emotionality and effortful control contributed relatively lower amounts of variance to their respective latent factors than the CBQ variables. Results suggest that assessing temperament via both parents and lab–based observations may yield a more comprehensive and nuanced measure than using only one of these methods.
Across all three temperament traits, SEMs indicated that temperament is moderately stable from ages 3 to 12 years, although effortful control showed somewhat greater stability than did negative or positive emotionality. Our results are consistent with Caspi and Silva (1995) who reported significant associations between temperament at age 3 and personality at age 18, as well as Guerin and Gottfried (1994) who reported moderate stability of maternal reports of children's mood (r = .23), distractibility (r = .37), and approach (r = .42) from ages 3 to 12.
Our stability estimates, based on latent variable modelling over a period of almost a decade, were comparable with or higher than those reported by Roberts and DelVecchio (2000), despite it being reasonable to expect higher stability estimates over shorter time periods. That suggests that if one uses multiple sources of information to reduce measurement error, there is a moderate to substantial degree of continuity in temperament from early childhood to early adolescence, even though this period is believed to be characterized by greater flux in individual traits than subsequent developmental periods (Posner & Rothbart, 2007; Rothbart & Bates, 2006). Interestingly, latent stability coefficients were higher than the highest within–informant correlations. This suggests that random error may have reduced the zero–order correlations more than informant/method variance inflated it. We are not aware of studies that have attempted to quantify the relative effects of informant/method variance versus random error variance on the associations between variables.
Taken together, these results provide strong evidence for homotypic continuity as well as moderate rank–order stability in temperament. These findings speak to the fundamental nature of temperament in children. Temperament is arguably at the core of human psychological identity; our results suggest that this identity begins to solidify over the course of childhood while also remaining malleable. These findings support the compromise perspective of temperamental stability and suggest that temperament should be conceptualized as only moderately stable, at least across childhood. This is consistent with the possibility that there are important environmental influences, such as parenting, culture (Ferguson, 2010), and non–shared environmental factors (Saudino, 2005), that affect the development of temperament. At the same time, the results suggest that early child temperament, at least when measured via multiple methods and informants, provides a marker with modest predictive utility for future temperament styles up to 9–10 years later.
Finally, the results from our path model also showed some evidence of heterotypic continuity. Specifically, greater effortful control predicted lower negative emotionality and higher positive emotionality at age 12. This is consistent with some previous research that found evidence of heterotypic continuity of temperament in early childhood. For instance, Putnam et al. (2008) found that infants high in surgency showed greater effortful control in toddlerhood, that high levels of surgency in toddlerhood predicted lower effortful control in preschool, and that toddler negative affect predicted lower effortful control in preschool. Evidence also suggests that effortful control predicts lower levels of negative emotionality in later childhood (Kochanska & Knaack, 2003). Similarly, Dyson et al. (2015) found that age 3 constraint, which is similar to effortful control, predicted lower age 6 fearfulness. The reasons that early childhood effortful control influences later negative and positive emotionality in early adolescence cannot be gleaned from the current study. However, this finding is consistent with prior evidence that effortful control is associated with more effective regulation of emotional reactivity and more generally with positive outcomes in later life (e.g. Eisenberg et al., 2009; Moffitt et al., 2011). Results suggest the need to search for potential mechanisms underlying homotypic and heterotypic continuity of temperament, such as specific genetic and environmental influences, gene–environment interactions, and gene–environment correlations (Caspi & Roberts, 2001; Plomin, Caspi, Pervin, & John, 1999).
Our results also highlight the importance of considering multiple sources of information regarding children's temperament in addition to parent reports. More generally, the results suggest that researchers should carefully consider the methods they use to assess temperament, as well as how they interpret their results based on those methods.
Limitations and future directions
This study had several notable strengths. It comprised a large sample of children who were followed for almost a decade and used fine–grained, standardized, lab–based measures of temperament along with both mother and father reports in early childhood, and mother, father, and child reports of temperament in early adolescence. However, some limitations should be noted. First, scores on the Lab–TAB and Experimenter Impressions variables may be influenced by situation–specific behaviours or affects. However, prior research suggests that the stability of lab–based observations of negative and positive emotionality is almost comparable with that of parent reports (Durbin et al., 2007).
Second, we were unable to assess the interrater reliability of the Experimenter Impressions measure. However, the integration of these variables with Lab–TAB ratings and parent reports in our latent models attenuates this concern. Experimenter Impressions ratings were also derived from observations that overlapped with the Lab–TAB, which showed good interrater reliability, and were moderately highly correlated with the Lab–TAB (Table S1).
Third, we could not use lab–based measures at age 12, given that there are no validated observational measures of temperament in older youth. This raises the concern that our estimates, even those derived from latent variables, may underestimate stability.
Fourth, we only examined rank–order stability. Future research should also examine mean–level stability and individual trajectories over time in temperament. However, this will be challenging, as these questions require the same measures to be administered at each time point, which is difficult given considerations of developmental appropriateness. The lack of similarity of indicators in our latent models also precludes examining measurement invariance over time. This is a fundamental challenge to the field, as measures need to be developmentally appropriate while also still measuring the appropriate construct. This typically precludes using identical measures when developmental samples are followed across different developmental periods, as in the current study.
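The distinction between rank–order and mean–level stability can be made concrete with a toy example. The sketch below uses invented scores and assumes the same scale is administered at both ages, which, as noted above, is rarely feasible across developmental periods; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Invented effortful control scores for five children (not study data).
age3 = [2.0, 3.5, 1.0, 4.0, 2.5]
# Every child gains the same amount, so relative standing is unchanged.
age12 = [score + 1.5 for score in age3]

rank_order_stability = pearson(age3, age12)  # relative ordering preserved
mean_level_change = mean(age12) - mean(age3)  # whole group shifted upward

if __name__ == "__main__":
    print(f"rank-order stability: {rank_order_stability:.2f}")
    print(f"mean-level change: {mean_level_change:.2f}")
```

Here the stability correlation is essentially perfect even though every child's score has changed, which is why a rank–order coefficient alone says nothing about mean–level change or individual trajectories.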
Fifth, although not a limitation per se, we did not examine predictors of change in temperament over time or moderators of the stability of temperament. While we hope to examine these issues in future research, prior to examining influences on stability, it is important to first understand normative stability. Sixth, future research would also benefit from examining stability of lower order temperament traits.
Seventh, the participants in our sample were predominantly White/European American and middle class and recruited through a commercial mailing list. Although the sample was demographically representative of the population in our geographic region (Suffolk County, Long Island, New York), this may constrain the generalizability of our findings.
Finally, we were unable to test heterotypic continuity in our full latent models and instead had to circumvent this by imputing latent scores and using them as observed variables. It is unclear whether results would change using a multi–factorial confirmatory model. For instance, estimating factor scores from separate models for each construct may bias results in favour of finding homotypic paths, as the estimated scores were based on homotypic latent models that examined only one trait each.
Conclusion
Understanding the extent to which childhood traits are stable has important implications for our knowledge of normative development of individual differences and core aspects of how we view ourselves and others. Few studies have examined the stability of early childhood temperament through to early adolescence and integrated lab–based observations of temperament with mother and father reports in early childhood and mother, father, and child reports of temperament in adolescence. Zero–order correlations suggested that associations over time are influenced by a variety of considerations, in particular the assessment method and informant at baseline and follow–up. Latent models suggested that stability estimates increase after measurement error is removed, and showed moderate stability (bs = .30–.55) of temperament traits from ages 3 to 12. Thus, despite this being a period of greater plasticity and change relative to most other developmental periods (Posner & Rothbart, 2007; Rothbart & Bates, 2006), temperament shows substantial stability over the 9– to 10–year period spanning early childhood to early adolescence.
Acknowledgements
Our research was supported by NIMH Grant R01 MH45757 (to D.N.K.) and by a postdoctoral fellowship (to D.K–S.) from the Social Sciences and Humanities Research Council of Canada.
Supporting info item
Supporting info item, per2151-sup-0001-Supplement_2, for The Stability of Temperament from Early Childhood to Early Adolescence: A Multi–Method, Multi–Informant Examination by Kopala–Sibley Daniel C., Olino Thomas, Durbin Emily, Dyson Margaret W., Klein Daniel N. and van Zalk Maarten in European Journal of Personality
