Evaluating the Consequential Validity of the Research-Based Early Mathematics Assessment

Abstract

Consequential validity (often referred to as “test fairness” in practice) is an essential aspect of educational measurement. This study evaluated the consequential validity of the Research-Based Early Mathematics Assessment (REMA). Data were collected from a sample of 627 children from PreK to second grade using the short form of the REMA. We conducted two sets of validation analyses with different foci (item level and scale level): differential item functioning (DIF) and consequential validity ratio (CVR) analyses. The analyses focused on the demographic subgroups of gender, English Language Learner status, and race/ethnicity. We found a low percentage of DIF items (less than 3%) and high CVRs (ranging from 96% to 98%). Both findings support the consequential validity and thus “fairness” of the REMA.

Keywords

early math assessment, consequential validity ratio, differential item functioning, test fairness

Introduction

Early mathematics cognition has been shown to be a fundamental cognitive ability that substantively influences students' success in multiple academic domains (e.g., science, technology, engineering, and math [STEM] as well as literacy skills; Clements & Sarama, 2011; Duncan et al., 2007; Purpura et al., 2017), including important long-term outcomes (e.g., high school achievement; Watts et al., 2014). Because of the increased attention devoted to early mathematics, researchers have recognized the need for high-quality math assessments for young children, especially assessments that provide a comprehensive and consistent evaluation from preschool through the primary grades. In response to this critical need, Clements and colleagues (Clements et al., 2008/2019a) have conducted empirical studies to develop and validate the Research-based Early Mathematics Assessment (REMA), an instrument that measures the early mathematical competency of children from 3 to 8 years of age. In addition to the original full version, a short-form version of the REMA (i.e., REMA-SF) has been developed and validated to shorten assessment time and make administration more efficient (Dong et al., 2021).

Consequential validity, or the degree to which score uses and interpretations support justified and fair decisions (Messick, 1995), has been recognized as an essential aspect of testing practices by many researchers and psychological professionals (e.g., AERA/APA/NCME, 2014; Dumas et al., 2022; Iliescu & Greiff, 2021). The REMA is a major measurement tool in early childhood research, and previous studies have demonstrated its internal and external validity using a variety of methods (Clements et al., 2008b; Dong et al., 2021). However, the consequential validity and fairness of the REMA have not been comprehensively investigated. Given that the justification of a measure’s consequential validity usually requires multiple sources of supporting evidence (Messick, 1995; Mislevy, 2018), the current study conducts two sets of analyses with different foci (item level and scale level) to produce evidence and evaluate the consequential validity of the REMA.

Methods

Participants

The data used in this study were derived from the Development and Research in Early Mathematics Education (DREME) Network study (Coburn et al., 2018). The analytic sample consists of longitudinal math data for 627 children from the spring of PreK to the fall of second grade. These children were students in two large public school districts in the western United States: 51.0% were female, and 30.5% were identified as English language learners (ELLs). Regarding their race or ethnicity, 66.1% of children were Hispanic, 9.3% were African American, 13.7% were Asian and Pacific Islander, 4.4% were non-Hispanic White, 0.8% were American Indian or Alaskan Native, and 5.6% were multiracial.

Measures

The data in the present research were collected using the REMA-SF. The previous development study of the REMA-SF (Dong et al., 2021) established multiple sources of validity evidence, such as sufficient content coverage, clear unidimensionality, good model-data fit indices, and satisfactory reliability and separation. Moreover, in the same investigation, this evidence replicated well in an external sample, which demonstrates that the validity of the REMA-SF is generalizable (i.e., cross-observer validity).

The REMA-SF consists of 80 items ordered by their Rasch difficulty parameters. Measuring young children’s math cognition can be challenging because children’s attention spans are relatively short; therefore, the administration of the REMA-SF applied a start, basal, and stop rule. Children began the assessment at the designated start point for their grade level. A basal level was established once a child answered at least three consecutive items correctly, and testing stopped after three consecutive errors. To avoid the confounding effect of language on children’s math scores, ELLs may receive the test in an alternative language version (e.g., Spanish). This also supports the consequential validity of the measure, particularly the interpretation of the construct scores.
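The start, basal, and stop rule can be illustrated with a minimal Python sketch. It is not the REMA-SF administration protocol: the function name `administer`, the 20-item response pattern, and the start points are hypothetical, items are indexed from 0, and the backward step taken when no basal is found (moving to easier items until three consecutive correct answers are observed) is an assumption, since the text does not describe that case.

```python
def administer(answers, start, run=3):
    """Return the indices of the items a child would see.

    answers: list of bools, ordered from easiest to hardest item
             (True = the child would answer that item correctly).
    start:   grade-specific start point (0-based index).
    run:     number of consecutive responses that defines the
             basal (all correct) and the stop rule (all errors).
    """
    administered = []

    # Forward pass: present harder and harder items and stop
    # after `run` consecutive errors.
    errors = 0
    for i in range(start, len(answers)):
        administered.append(i)
        errors = 0 if answers[i] else errors + 1
        if errors == run:
            break

    # Basal check: were there `run` consecutive correct answers?
    streak, basal_found = 0, False
    for i in administered:
        streak = streak + 1 if answers[i] else 0
        if streak >= run:
            basal_found = True
            break

    # Backward pass (assumed behavior): if no basal was found,
    # present easier items until `run` consecutive are correct.
    if not basal_found:
        streak = 0
        for i in range(start - 1, -1, -1):
            administered.append(i)
            streak = streak + 1 if answers[i] else 0
            if streak == run:
                break

    return sorted(administered)


# Hypothetical child on a hypothetical 20-item form.
answers = [True] * 10 + [False, True, False, False, False] + [True] * 5

easy_start = administer(answers, start=5)  # basal reached going forward
hard_start = administer(answers, start=9)  # needs the backward pass
```

With the lower start point the child establishes a basal immediately and sees items 5 through 14; with the higher start point the sketch backs up to item 6 before a basal is found. In both cases only a subset of the form is administered, which is why different children can respond to different items at different waves.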

The REMA-SF assesses correctness for each item and records students' strategies (i.e., problem-solving processes) when those are observable, and strategies were recoded into three levels of sophistication. Given that the strategies children use in solving mathematical problems are a critical component and reflection of their math learning (e.g., Biddlecomb & Carr, 2011; Fennema et al., 1998), both correctness and strategy codes (133 indicators in total) were used for scoring children’s math competency via a unidimensional partial credit Rasch model. We found a person reliability of .93 and a separation of 3.58 in this sample, which indicates that the REMA-SF scores were highly reliable (Wright & Masters, 1982). The strategy items allow a better differentiation among students with various levels of math competency (Clements et al., 2008b; Dong et al., 2021), and provide more effective and specific feedback regarding children’s learning to stakeholders (e.g., teachers) compared to item correctness only. Such cognitive-process feedback derived from children’s test scores can be perceived as major evidence for consequential validity (Iliescu & Greiff, 2021), because it may positively contribute to instruction and children’s future learning outcomes (e.g., amplify the effect of differentiated instruction or set more accurate learning goals for each child).
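The reported person reliability and separation are two expressions of the same quantity. In Rasch measurement the separation index G and the reliability R are related by R = G² / (1 + G²). The short Python sketch below applies that standard relation to the reported separation as a consistency check; the function names are ours and are not part of any Rasch software.

```python
import math


def reliability_from_separation(g):
    """Rasch person reliability implied by a separation index g."""
    return g ** 2 / (1 + g ** 2)


def separation_from_reliability(r):
    """Separation index implied by a reliability r (0 <= r < 1)."""
    return math.sqrt(r / (1 - r))


# Separation reported for this sample.
reported_separation = 3.58
implied_reliability = reliability_from_separation(reported_separation)
print(round(implied_reliability, 2))
```

The implied reliability rounds to the reported value of .93, so the two statistics are mutually consistent.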

Analysis Overview

The current study examines the consequential validity evidence at both the item and scale levels. The item-level evidence was generated from conventional differential item functioning (DIF) analyses with the Rasch model (Holland & Wainer, 1993). A DIF item, that is, an item that yields different score patterns across subgroups even when the overall latent score is held equal (e.g., ELLs perform worse than expected on specific items), is perceived as an indicator of measurement bias. In this study, we assessed DIF on the REMA-SF over the following demographic subgroups: gender, ELL status, and race/ethnicity (recoded into non-Hispanic White = 0 and racial/ethnic minorities = 1). All DIF analyses were performed in Winsteps 4.6 (Linacre, 2021).

In contrast, the scale-level evidence was produced via the Consequential Validity Ratio (CVR; Dumas et al., 2022). Although recently formalized, this index was inspired by past regression-based methods for examining test fairness (e.g., Millsap, 2011). The CVR is designed to capture how well the scores from a given test can predict a criterion free from the undue influence of examinees' demographics. Statistically, the CVR is the ratio of the effect size of a focal measure (e.g., REMA scores) to the total variance explained by the test scores and participant demographics combined:

$$\mathrm{CVR} = \frac{\eta^{2}_{\mathrm{Test}}}{R^{2}} \tag{1}$$

where R² is the total variance explained by the focal measure and demographics together, and η² is the effect size of the focal measure only. All effect sizes and associated CVRs were calculated in Stata 17 (StataCorp, 2021).

Results and Discussion

Item-level Consequential Validity Evidence from DIF Analyses

To investigate whether the REMA-SF items work similarly across different demographic groups, we first examined the DIF contrast values (i.e., the difference in item difficulty between the two groups, or the DIF effect size in logits) and their statistical significance. There are two common criteria for identifying DIF items, and we report the number and percentage of DIF items under both criteria (see Table 1). The first criterion treats a statistically significant DIF contrast greater than .5 logits as evidence of a DIF item (Draba, 1977). The second criterion (sometimes called the Educational Testing Service [ETS] rule) treats a statistically significant contrast larger than .43 logits as evidence of DIF (Zwick et al., 1999). Because its threshold is lower, the second criterion is the stricter of the two and can flag items that the first criterion does not.

Table 1.

Summary of DIF Analyses Results.

Groups	DIF Items (Correctness)	DIF Items (Strategy)	Number of DIF Items	Percentage of DIF Items (%)
Gender
Criterion one	#13, #14	#33	3	2.26
Criterion two	#13, #14	#33	3	2.26
ELL
Criterion one	None	None	0	0
Criterion two	#18	#22	2	1.5
Race/Ethnicity
Criterion one	#37	#16	2	1.5
Criterion two	#37	#16	2	1.5

Notes: Criterion one suggests a DIF contrast value greater than .5 logits with statistical significance as the evidence of a DIF item (Draba, 1977), and criterion two suggests a contrast value larger than .43 with statistical significance (Zwick et al., 1999).

Moreover, Student’s t tests were used to examine statistical significance, and the associated degrees of freedom were calculated via a joint approach (i.e., joint df; Satterthwaite, 1946; Welch, 1947). This study performed three DIF analyses (i.e., over gender, ELL, and racial/ethnic groups). Notably, there are a total of 133 scoring indicators in the REMA-SF, so each set of DIF analyses tests 133 null hypotheses, each stating that a given indicator (1, 2, 3, …, or 133) has the same difficulty in the two groups. Such a large number of hypothesis tests might lead to Type I error inflation. Instead of applying potentially overly conservative correction methods (e.g., Bonferroni correction), we set the alpha level to .01 to control Type I error.
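The decision rule described in this section can be sketched in a few lines of Python. This is an illustration only, not the Winsteps implementation. The function names, the example difficulties, standard errors, and group sizes are hypothetical. Applying the Welch-Satterthwaite formula to the two standard errors is our assumption about how the joint df is formed, and the p value uses a normal approximation to the t distribution, which is reasonable only when the df is large.

```python
import math
from statistics import NormalDist


def welch_dif(d1, se1, n1, d2, se2, n2):
    """DIF contrast, t statistic, and Welch-Satterthwaite df.

    d1, d2:   difficulty of one indicator in each group (logits)
    se1, se2: standard errors of those difficulties
    n1, n2:   number of children in each group
    """
    contrast = d1 - d2
    joint_se = math.sqrt(se1 ** 2 + se2 ** 2)
    t = contrast / joint_se
    df = (se1 ** 2 + se2 ** 2) ** 2 / (
        se1 ** 4 / (n1 - 1) + se2 ** 4 / (n2 - 1)
    )
    return contrast, t, df


def flag_dif(contrast, t, threshold, alpha=0.01):
    """Flag DIF when the contrast is large AND significant.

    The two-sided p value is a normal approximation (assumption).
    """
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return abs(contrast) > threshold and p < alpha


# Hypothetical indicator: 0.60 logits in one group, 0.12 in the other.
contrast, t, df = welch_dif(0.60, 0.12, 320, 0.12, 0.12, 307)

criterion_one = flag_dif(contrast, t, threshold=0.50)  # Draba (1977)
criterion_two = flag_dif(contrast, t, threshold=0.43)  # Zwick et al. (1999)
```

For this made-up indicator the contrast is 0.48 logits, so it is flagged under the .43 threshold but not under the .5 threshold, which is the kind of pattern that produces different counts under the two criteria in Table 1.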

An item was identified as showing DIF when its contrast value exceeded the criterion in question and was statistically significant. From Table 1, we found three DIF items over gender groups (2.26% of the 133 scoring indicators) and two DIF items over racial/ethnic groups (1.5%), and the DIF results based on the two criteria were consistent. There was no DIF item over ELL groups with criterion one, but two DIF items (1.5%) were identified based on the second criterion. Overall, the number and percentage of DIF items in the REMA-SF are minimal, which means that the overwhelming majority of items in this measure work similarly across different subgroups. Furthermore, the small number of DIF items does not necessarily indicate that those items are of poor quality; they could instead reflect differences in the construct meaning across populations (Church et al., 2011). For example, one strategy item showed DIF over gender groups. Many studies have documented substantial gender differences in strategy use in the context of early mathematics (e.g., Carr & Jessup, 1997; Fennema et al., 1998; Zhu, 2007). In other words, children from different gender groups, even with the same math ability, may have a different probability of using a certain type of strategy, which results in the occurrence of DIF.

Since the original version of the REMA was developed, it has been widely used in early mathematics research (e.g., Weiland & Yoshikawa, 2013). The item-level evidence demonstrates the consequential validity of the REMA, which can support the justification of its score interpretation and use (e.g., comparing children’s early math scores across gender, language, or racial groups). For the very small number of items that displayed DIF, applied researchers are advised to treat those REMA items with caution when testing students from the affected demographic groups. Almost no DIF items were detected over ELL groups, which shows the benefits of administering alternative language versions of the REMA to students with different language backgrounds. The DIF results from this work may provide essential clues and promising directions for future assessment revision (e.g., adapting detected DIF items for specific demographic groups or coding strategy sophistication level conditionally).

Scale-Level Consequential Validity Evidence from CVRs

To calculate the CVRs, we needed to choose focal and criterion measures. The measured math performance in this study is longitudinal, but the design of the REMA (e.g., start, basal, and stop rule) enables children to take different items of the measure at different timepoints, instead of taking the exact same items or parallel forms repeatedly. We therefore tested predictive relations between mathematics ability at two adjacent measurement occasions. In this way, the earlier math scores served as the focal measure, and the later scores served as the criterion measure.

Statistically, linear regression models were fitted to predict the math scores at a particular wave from the score at the prior time point and the student demographics, as well as the interactions among those predictors (e.g., predicting the K-spring score from the K-fall score and demographic variables, plus interactions). Effect sizes were generated for all predictors and their interaction terms. As shown in Table 2, four CVRs were calculated for these predictive relations. The CVR has a possible range of 0–1, and a higher CVR (i.e., a higher signal-to-noise proportion) indicates better consequential validity of the focal measure in general. From Table 2, the effect sizes of demographic variables and their interaction terms are very small (partial η² ≤ .02). The CVRs range from 96% to 98%, which means that almost all the explained variance in the criterion measures is accounted for by the corresponding focal measures, and the proportion of variance explained by demographics is negligible. Notably, many children were recruited in the fall of kindergarten, which led to a large proportion of missing data (84%) at the PK-spring wave. To avoid biased estimates and insufficient statistical power, we did not include the prediction of K-fall scores from PK-spring scores in the CVR analyses. However, the other waves of data from the children recruited in the PK-spring were still used for calculating all CVRs in Table 2.
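One way to write the predictive model implied by the terms listed in Table 2 is given below. The notation is ours: Y_t is the focal (earlier) math score, Y_{t+1} is the criterion (later) score, and E, G, and R stand for ELL status, gender, and race/ethnicity. Treating each demographic variable as a single indicator is a simplifying assumption made only to keep the expression short.

```latex
\begin{aligned}
Y_{t+1} ={}& \beta_0 + \beta_1 Y_t
            + \beta_2 E + \beta_3 G + \beta_4 R \\
          & + \beta_5 (E \times G) + \beta_6 (E \times R)
            + \beta_7 (G \times R) + \beta_8 (E \times G \times R)
            + \varepsilon
\end{aligned}
```

Under this reading, the numerator of the CVR is the effect size attached to the Y_t term, and the denominator is the R² of the whole model.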

Table 2.

Summary of Effect Sizes for Predictive Models and CVRs.

	K-Fall Predicts K-Spring	K-Spring Predicts 1st Grade-Fall	1st Grade-Fall Predicts 1st Grade-Spring	1st Grade-Spring Predicts 2nd Grade-Fall
	η²	η²	η²	η²
Total effect (model)	.47	.62	.64	.76
Focal measure	.45	.61	.63	.74
ELL	<.01	<.01	<.01	<.01
Gender	<.01	<.01	<.01	<.01
ELL × gender	<.01	<.01	<.01	<.01
Race/Ethnicity	<.01	.01	<.01	<.01
ELL × race/Ethnicity	<.01	<.01	<.01	<.01
Gender × race/Ethnicity	<.01	<.01	<.01	.02
ELL × gender × race/Ethnicity	<.01	<.01	<.01	<.01
CVR	96%	98%	98%	97%

Note. η² values for individual model terms are partial.
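The CVR row of Table 2 can be reproduced directly from Equation 1. The Python sketch below divides the focal-measure effect size by the total model effect for each of the four predictive relations. The numbers are copied from Table 2; the dictionary and function names are ours, and treating the "Total effect (model)" row as the R² in the denominator follows the definition given with Equation 1.

```python
def cvr(eta_sq_test, r_sq):
    """Consequential validity ratio: focal effect size / model R^2."""
    if not 0 < r_sq <= 1:
        raise ValueError("r_sq must be in (0, 1]")
    return eta_sq_test / r_sq


# (focal-measure eta squared, total model effect) from Table 2.
table2 = {
    "K-fall -> K-spring": (0.45, 0.47),
    "K-spring -> 1st grade-fall": (0.61, 0.62),
    "1st grade-fall -> 1st grade-spring": (0.63, 0.64),
    "1st grade-spring -> 2nd grade-fall": (0.74, 0.76),
}

cvr_percent = {
    label: round(100 * cvr(eta_sq, r_sq))
    for label, (eta_sq, r_sq) in table2.items()
}

for label, value in cvr_percent.items():
    print(f"{label}: {value}%")
```

The rounded results match the CVR row of the table. Because the inputs are themselves rounded to two decimals, the sketch is a consistency check on the published values rather than a re-analysis.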

Scale-level consequential validity evidence is often produced via structural equation modeling-based analyses (e.g., multigroup confirmatory factor analysis). However, it can be methodologically challenging to execute such procedures for measures with many scoring items (e.g., 133 scoring variables in the REMA-SF), which often results in the omission of scale-level evidence in general practice. The current study applies a new and efficient method (i.e., CVRs) to produce scale-level evidence for the justification of consequential validity. Notably, although CVRs are straightforward to calculate and relatively easy to interpret, they should not be used to replace or skip other necessary validation efforts (Dumas et al., 2022), such as the DIF analyses shown in this study.

In conclusion, both the item-level and scale-level validity evidence show that inferences drawn from REMA scores can be made in the same way across demographic groups, and the consequential validity and fairness of the measure are supported.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Heising-Simons Foundation (Grant # 2020-1777).

ORCID iD

Yixiao Dong

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. American Educational Research Association.

Biddlecomb & Carr (2011). A longitudinal study of the development of mathematics strategies and underlying counting schemes. International Journal of Science and Mathematics Education, 9(1), 1–24. https://doi.org/10.1007/s10763-010-9202-y

Carr & Jessup, D. L. (1997). Gender differences in first-grade mathematics strategy use: Social and metacognitive influences. Journal of Educational Psychology, 89(2), 318–328. https://doi.org/10.1037/0022-0663.89.2.318

Church, A. T., Alvarez, J. M., Mai, N. T. Q., French, B. F., Katigbak, M. S., & Ortiz, F. A. (2011). Are cross-cultural comparisons of personality profiles meaningful? Differential item and facet functioning in the revised NEO personality inventory. Journal of Personality and Social Psychology, 101(5), 1068–1089. https://doi.org/10.1037/a0025290

Clements, D. H., Dumas, Dong, Banse, H. W., Sarama, & Day-Hess, C. A. (2020). Strategy diversity in early mathematics classrooms. Contemporary Educational Psychology, 60, Article 101834. https://doi.org/10.1016/j.cedpsych.2019.101834

Clements, D. H., & Sarama (2011). Early childhood mathematics intervention. Science, 333(6045), 968–970. https://doi.org/10.1126/science.1204537

Clements, D. H., Sarama, Wolfe, C. B., & Day-Hess, C. A. (2008a). REMA—research-based early mathematics assessment. Kennedy Institute, University of Denver.

Clements, D. H., Sarama, J. H., & Liu, X. H. (2008b). Development of a measure of early mathematics achievement using the Rasch model: The research‐based early maths assessment. Educational Psychology, 28(4), 457–482. https://doi.org/10.1080/01443410701777272

Coburn, C. E., McMahon, Borsato, Stein, Jou, Chong, LeMahieu, Franke, Ibarra, & Stipek (2018). Fostering pre-k to elementary alignment and continuity in mathematics in urban school districts: Challenges and possibilities. Stanford Graduate School of Education. edpolicyinca.org.

Dong, Clements, D. H., Day-Hess, C. A., Sarama, & Dumas (2021). Measuring early childhood mathematical cognition: Validating and equating two forms of the Research-based Early Mathematics Assessment. Journal of Psychoeducational Assessment, 39(8), 983–998. https://doi.org/10.1177/07342829211037195

Dong & Dumas (2020). Are personality measures valid for different populations? A systematic review of measurement invariance across cultures, gender, and age. Personality and Individual Differences, 160, Article 109956. https://doi.org/10.1016/j.paid.2020.109956

Draba, R. E. (1977). The identification and interpretation of item bias. Educational Statistics Laboratory, University of Chicago. Memo 25. www.rasch.org/memo25.htm

Dumas, Dong, & McNeish (2022). How fair is my test?: A ratio statistic to help represent consequential validity. Advance online publication. https://doi.org/10.1027/1015-5759/a000724

Duncan, G. J., Dowsett, C. J., Claessens, Magnuson, Huston, A. C., Klebanov, Pagani, L. S., Feinstein, Engel, Brooks-Gunn, Sexton, Duckworth, & Japel (2007). School readiness and later achievement. Developmental Psychology, 43(6), 1428–1446.

Fennema, Carpenter, T. P., Jacobs, V. R., Franke, M. L., & Levi, L. W. (1998). A longitudinal study of gender differences in young children’s mathematical thinking. Educational Researcher, 27(5), 6–11. https://doi.org/10.2307/1176733

Holland, P. W., & Wainer (1993). Differential item functioning. Erlbaum.

Iliescu & Greiff (2021). On consequential validity. European Journal of Psychological Assessment, 37(3), 163–166. https://doi.org/10.1027/1015-5759/a000664

Linacre, J. M. (2021). Winsteps® Rasch measurement computer program. Winsteps.com.

Messick (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Millsap, R. E. (2011). Statistical approaches to measurement invariance. Routledge.

Mislevy, R. J. (2018). Sociocognitive foundations of educational measurement. Routledge.

Nguyen, Watts, T. W., Duncan, G. J., Clements, D. H., Sarama, J. S., Wolfe, & Spitler, M. E. (2016). Which preschool mathematics competencies are most predictive of fifth grade achievement? Early Childhood Research Quarterly, 36, 550–560. https://doi.org/10.1016/j.ecresq.2016.02.003

Purpura, D. J., Logan, J. A. R., Hassinger-Das, & Napoli, A. R. (2017). Why do early mathematics skills predict later reading? The role of mathematical language. Developmental Psychology, 53(9), 1633–1642. https://doi.org/10.1037/dev0000375

Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6), 110–114. https://doi.org/10.2307/3002019

StataCorp (2021). Stata statistical software: Release 17. StataCorp LLC.

Watts, T. W., Duncan, G. J., Siegler, R. S., & Davis-Kean, P. E. (2014). What's past is prologue: Relations between early mathematics knowledge and high school achievement. Educational Researcher, 43(7), 352–360. https://doi.org/10.3102/0013189X14553660

Weiland & Yoshikawa (2013). Impacts of a prekindergarten program on children's mathematics, language, literacy, executive function, and emotional skills. Child Development, 84(6), 2112–2130. https://doi.org/10.1111/cdev.12099

Welch, B. L. (1947). The generalisation of Student's problems when several different population variances are involved. Biometrika, 34(1-2), 28–35. https://doi.org/10.1093/biomet/34.1-2.28

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. MESA Press.

Zhu (2007). Gender differences in mathematical problem-solving patterns: A review of literature. International Education Journal, 8(2), 187–203.

Zwick, Thayer, D. T., & Lewis (1999). An empirical Bayes approach to Mantel‐Haenszel DIF analysis. Journal of Educational Measurement, 36(1), 1–28. https://doi.org/10.1111/j.1745-3984.1999.tb00543.x