Investigation of Science Inquiry Items for Use on an Alternate Assessment Based on Modified Achievement Standards Using Cognitive Lab Methodology

Abstract

This study evaluated the benefits of item enhancements applied to science-inquiry items for incorporation into an alternate assessment based on modified achievement standards for high school students. Six items were included in the cognitive lab sessions, which involved students both with and without disabilities. The enhancements (e.g., use of visuals, reading support) were intended to improve access to the items for students who had grade-level science content knowledge, but whose disability may impact their ability to answer the items in the original form. Students were asked to think aloud while answering items and to answer follow-up questions about specific item-enhancement features. Achievement showed little improvement, but students’ reported cognitive effort suggested that the enhanced items were perceived as less difficult.

Keywords

alternate assessment, test accommodations, cognitive labs, item development

Recent federal regulations have permitted states to develop assessments for students with documented disabilities whose disabilities have precluded them from achieving proficiency on the state’s general assessments, but whose disabilities are not severe enough to qualify for the state’s existing alternate assessment. Such assessments are referred to as alternate assessments based on modified achievement standards (AA-MAS). Many states have investigated the target population for the AA-MAS and the feasibility of developing such assessments. Development of multiple-choice items for use on an AA-MAS first involves identification of test items used on the general assessment. Then, items are reviewed by content-area and assessment specialists to determine changes that may improve access to content without changing the underlying construct. The intent is to increase the likelihood that students with content knowledge will answer the item correctly by removing barriers associated with students’ disabilities that may prohibit demonstration of mastery. Item development for this type of alternate assessment can be informed by literature on accommodations and modifications as well as theories of learning and cognition and item-writing guidelines. Cognitive lab or think-aloud methodology can also be useful in guiding item development for such an assessment.

Alternate Assessments

To ensure that students with disabilities are included in state accountability systems, the Individuals With Disabilities Education Act of 1997, reauthorized in 2004, requires that students with disabilities be included in state assessment programs. The No Child Left Behind Act of 2002 required states to develop accountability systems that included assessments where all students were tested in the content areas of English language arts, math, and science. Building on the 2003 regulations permitting the development of alternate assessment based on alternate achievement standards (AA-AAS) for students with the most severe cognitive disabilities, the United States Department of Education (USDOE) authorized the development of modified academic achievement standards for some students with disabilities in regulations published on April 9, 2007. Under the 2007 regulations, states may develop an AA-MAS for students who receive on-grade level instruction, but whose disabilities have precluded them from achieving grade-level proficiency and who are unlikely to achieve grade-level proficiency in a timely manner. States may count students achieving proficiency on the AA-MAS in their Adequate Yearly Progress calculation up to a maximum of 2% of all students assessed at the state and district levels. By constructing such alternate assessments, the intent is for eligible students to be included in policy decisions, ultimately increasing achievement among this group (Brower et al., 2003).

Several methods exist to assess achievement among students with disabilities, including performance assessments, observations, interviews or rating scales completed by individuals familiar with the student, examination of school records, reviews of Individualized Education Plans, and testing; of these, testing is the most common method (Quenemoen, Thompson, & Thurlow, 2003; Ysseldyke & Olson, 1999). In 2007, more than half of states within the United States reported considering developing an AA-MAS (National Center on Educational Outcomes [NCEO], 2007), and by 2009, 14 had made guidelines for participation in an AA-MAS available to the public (NCEO, 2010b). According to Lazarus and Thurlow (2009), as of November 2008, 8 states had developed an AA-MAS in various content areas, with only 5 states offering an AA-MAS in science at the high school level. In addition, none of these states had successfully completed the peer-review process for the AA-MAS established by the USDOE (Lazarus & Thurlow, 2009).

Accommodations and Modifications

The terms accommodation and modification are used to describe changes that are made to both instruction and assessment to allow students with disabilities to access content and show what they know (Haigh, 1999). Though the terms are often used interchangeably (Haigh, 1999; Kettler, Elliott, & Beddow, 2009), researchers have distinguished between testing accommodations and modifications. An accommodation is typically defined as a testing change intended to facilitate students’ access to test content while preserving test validity (Kettler et al., 2009; Thurlow, Lazarus, Thompson, & Blount Morse, 2005). Butler and Stevens (2001) identified two categories of testing accommodations, including changes to the test (e.g., reducing reading load) and to the testing procedure (e.g., providing an English dictionary). In contrast to accommodations, a modification occurs when the content of an item has changed and evidence that the original construct has been preserved is lacking (Koretz & Hamilton, 2006; Phillips & Camara, 2006). The focus of the current article is on changes to test items that are intended to increase students’ ability to demonstrate their knowledge, without changing the construct that was being assessed. Thus, work reported in this article is more aligned with literature on accommodations. However, the more general term enhancement is used to describe changes to an item intended to make it more accessible.

Research and Theory Supporting Item Enhancements

Universal Design for Learning (UDL; Center for Applied Special Technology, 2011) promotes access to educational curriculum for all students, including those with disabilities. This is accomplished by providing students with multiple means of representation, action and expression, and engagement. Research has been conducted on incorporating UDL principles into item development for student assessment (Johnstone, Thurlow, Moore, & Altman, 2006; Thompson, Johnstone, Anderson, & Miller, 2005). UDL principles, as applied to assessment, provide multiple ways for students to show understanding of concepts. As Dolan and Hall (2001) explained, assessments that rely on a single method of presentation or expression measure knowledge and skills that may not be included in the specific curriculum standards or instructional goals. The item enhancements used in this study incorporated multiple methods of presenting the information. Visuals were added to items that did not require a graphic to solve, permitting students to visualize the problem as another method of interpreting the information. Gaster and Clark (1995) recommend the use of visuals that are simple, labeled, and positioned adjacent to the text they support, to improve document readability when narrative describes a process.

UDL also supports the use of read-aloud support (items and answer choices are read to the student). This enhancement offers another way for students to obtain information needed to solve the problem. Read-aloud support is one of the most common testing accommodations used by states (NCEO, 2010a; Sireci, Li, & Scarpati, 2005; Thurlow et al., 2005). However, studies report mixed findings regarding the impact of read-aloud support on student achievement. Three of the 7 studies examined by NCEO (2010a) indicated that read-aloud support was related to increases in student achievement only for students with disabilities. Sireci et al. (2005) reviewed 10 studies on read-aloud support, 8 of which were able to compare the performance of students with and without disabilities. Six of the 8 studies supported read-aloud support, indicating that it improved validity or had no effect on the performance of either group.

Increasing white space is also supported by UDL. Smith and McCombs (1971) found that white space increases readability. White space is especially important for individuals with poor vision (Gaster & Clark, 1995). In general, research supports the use of UDL principles in enhancing test items. Johnstone (2003) examined the impact of packages of items enhanced using UDL by administering an original and enhanced test to 231 students. The study found that enhanced items had a positive effect on student performance.

In addition to UDL, Cognitive Load Theory (CLT) was also used to guide item enhancements. Conceptualized by Sweller (1988), CLT describes the ways learners allocate cognitive resources during learning and has been applied to classroom settings to improve student learning (Clark, Nguyen, & Sweller, 2006). Three types of short-term memory loads are posited by CLT—intrinsic load, germane load, and extraneous load. Intrinsic load refers to the interaction between the nature of the information to be learned (e.g., number of elements to be processed in working memory) and the learner’s expertise. Germane load refers to allocating working memory to develop and automate schema. Intrinsic load and germane load are both considered necessary in problem solving, whereas extraneous load (the way in which information is presented during instruction) can hinder problem solving. When enhancing items, the goal was to minimize extraneous load by eliminating construct-irrelevant information and simplifying language.

Item-Writing Guidelines

Haladyna, Downing, and Rodriguez (2002) synthesized test and item-writing research into 31 guidelines for writing multiple-choice items. These guidelines support the use of simplified language and removal of irrelevant information. As Case and Swanson (2001) noted, when irrelevant information is included, the item becomes a measure of reading speed. Gronlund (2003) noted that when items include complex wording and sentence structure, they are more likely to capture reading comprehension than content knowledge. Research in the area of testing accommodations used in mathematics also indicates that simplifying language assists special education students (Johnson & Monroe, 2004) in demonstrating their knowledge. Though the studies reviewed support the use of item enhancements, research conducted with students with disabilities is limited and some research suggests that item enhancements may not always be beneficial for these students (e.g., Helwig & Tindal, 2003).

Cognitive Lab Methodology

Cognitive lab methodology was used to evaluate various item enhancements considered in this study. Cognitive labs (i.e., think-aloud methods) require students to verbalize their thoughts in real time as they engage in problem-solving processes. Ericsson and Simon (1993) provided a rationale for developing think-aloud methods to obtain concurrent and retrospective verbal reports. They developed an approach, supported by research on information processing, for collecting concurrent and retrospective verbal reports that exhibited minimal effect on participants’ problem-solving and cognition abilities. Cognitive lab methodology can be useful in item writing because it allows researchers to explore process and product by analyzing both affective and cognitive processes (Lau, 2006). In particular, cognitive labs are useful for evaluating prominent item characteristics and testing procedures to assess the existence of construct-irrelevant variance that may prevent students from demonstrating their knowledge (Johnstone, Bottsford-Miller, & Thompson, 2006). As the methods involve extensive data collection at the individual level, cognitive labs generally are conducted with small numbers of participants to allow for in-depth analyses of how individuals process and use information.

Kettler et al. (2009) described methods for developing, modifying, and evaluating items for an AA-MAS, with an emphasis on the use of cognitive lab methods and experimental field tests. They noted that the Standards for Educational and Psychological Testing (Standards; American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) emphasize the value of information regarding student responses and perceptions in supporting the development of assessments for students in general and for students with disabilities specifically. The Standards indicate that questioning students about their use of problem-solving strategies is a recommended technique to inform item development, which supports the use of cognitive lab methods as a means to inform the development of an AA-MAS. However, Kettler et al. noted that relatively few studies on testing accommodations (e.g., Elliott & Marquart, 2004; Fulk & Smith, 1995; Kosciolek & Ysseldyke, 2000; Lang, Elliott, Bolt, & Kratochwill, 2008; McKevitt & Elliott, 2003) have utilized this methodology for the purpose of test construction or test validity. Two recently published studies utilized cognitive lab methods to inform AA-MAS item development (Kettler et al., 2009; Roach, Beddow, Kurz, Kettler, & Elliott, 2010). Both studies focused on the content areas of reading/English language arts and mathematics.

Purpose

This study used cognitive lab methodology to provide information about item enhancement features for use on an AA-MAS in high school science, an area about which little information is currently available. This study was part of the Operationalizing Alternate Assessment for Science Inquiry Skills (OAASIS) project, funded by the USDOE under the Enhanced Assessment Grant program. OAASIS was a collaborative effort among three states (South Carolina [lead state], South Dakota, and Wyoming). State-level OAASIS personnel included state department officials who specialize in assessment, special education, and science content. Consultants from university and private agencies were hired to provide guidance on the research design, develop and administer test items, and perform evaluation services.

Consistent with federal regulations, the target student population for this AA-MAS must have a documented disability, have received grade-level science instruction, and be unlikely to demonstrate proficiency on the general science assessment. A main objective of the OAASIS project was to design and evaluate assessment strategy formats based on items developed from essential constructs of high school science-inquiry standards. These standards were common across the three partner states. Major OAASIS research activities included a cognitive lab study to guide item development and a pilot assessment to provide generalizable results on the effectiveness of item enhancements on an AA-MAS in high school science.

This manuscript presents detailed information on the cognitive lab study that investigated student performance on enhanced versus original items and provided information on student preferences for item features. Data from this study were used to recommend item changes for the large-scale pilot test that took place in all three partner states in spring 2010 (see Dickenson et al., 2011). This study included three groups of students: (a) students with a disability considered eligible for AA-MAS (SWD-E), (b) students with a disability not considered eligible for AA-MAS (SWD-NE), and (c) students without a disability (SWOD). SWD-E represent the target population for an AA-MAS, whereas SWD-NE and SWOD represent comparison groups. The research questions addressed in this study include the following:

Research Question 1: How do student groups compare in performance on a measure of oral reading fluency (ORF)?

Research Question 2: Does the relationship between ORF and raw scores on the assessment differ between original and enhanced versions of the items?

Research Question 3: Do average outcome measures of raw test score, perceived cognitive effort, and time spent solving items differ by condition (original vs. enhanced) and group?

Research Question 4: Do average outcome measures of raw test score, perceived cognitive effort, and time spent solving items differ by item type (stand alone vs. passage based) and group?

Research Question 5: Do students perceive individual and whole package enhancements as helpful?

This study contributes to the literature on item development for the 2% assessment. Although a number of studies have been published about AA-MAS developed for English language arts and mathematics (NCEO, 2010a), research specific to enhancing science items or developing an AA-MAS for science is scant. This study addresses the need for additional information about the development and use of alternate assessments in high school science.

Method

Participants

In spring 2009, OAASIS personnel conducted school visits to recruit participants for the cognitive lab sessions. During these visits, school personnel were introduced to the purpose of the study and student assent and parent/guardian consent forms were provided. Once forms were returned, on-site testing sessions were scheduled. Think-aloud sessions were completed between May 5, 2009, and June 2, 2009, at four high schools in South Carolina. A total of 18 volunteer students who had completed a high school biology course and were recruited by the schools’ special education directors participated. The small sample size is typical of the item-development phase. All students with disabilities in the sample took the general state assessment with accommodations, which indicates that their disabilities were not severe enough to qualify for the state’s AA-AAS.

Data were collected from three groups (SWD-E, SWD-NE, and SWOD) of 10th- and 11th-grade students living in suburban and rural areas of South Carolina. Two thirds of the total sample was male and the majority of the students were White. The SWOD group had the same percentage distribution by gender as the whole sample (see Table 1). All students in the SWD-NE group were male, but there were slightly more female than male students in the SWD-E group. Only two students in the sample (one SWD-E and one SWOD) were African American.

Table 1.

Demographics by Eligibility Group.

	SWD-E	SWD-NE	SWOD	Total
Demographic	n (%)	n (%)	n (%)	n (%)
Gender
Male	3 (42.9)	5 (100.0)	4 (66.7)	12 (66.7)
Female	4 (57.1)	0 (0.0)	2 (33.3)	6 (33.3)
Ethnicity
White (not Hispanic)	6 (85.7)	5 (100.0)	5 (83.3)	16 (88.9)
Black or African American	1 (14.3)	0 (0.0)	1 (16.7)	2 (11.1)

Note. SWD-E = students with disabilities considered eligible for an alternate assessment based on modified achievement standards (AA-MAS); SWD-NE = students with disabilities not considered eligible for an AA-MAS; SWOD = students without disabilities.

Items and Enhancements

Six items were selected from a pool of available science-inquiry items to be enhanced and included on the AA-MAS pilot test for high school science. Items were chosen to reflect different item types (stand alone and passage based) and to assess multiple science content standards, which were previously identified and mapped across states. Availability of items assessing science-inquiry skills also influenced item selection. Table 2 presents the item number and the corresponding standard from the common state science standards.

Table 2.

Cognitive Lab Items With Corresponding Academic Standard.

Item number	Standard	Description of standard
5	B-2.2	Summarize the structures and functions of organelles found in a eukaryotic cell (including the nucleus, mitochondria, chloroplasts, lysosomes, vacuoles, ribosomes, endoplasmic reticulum [ER], Golgi apparatus, cilia, flagella, cell membrane, nuclear membrane, cell wall, and cytoplasm).
6	B-5.3	Explain how diversity within a species increases the chances of its survival.
7	B-3.3	Recognize the overall structure of ATP—namely, adenine, the sugar ribose, and three phosphate groups—and summarize its function (including the ATP-ADP cycle).
8	B-6.1	Explain how the interrelationships among organisms (including predation, competition, parasitism, mutualism, and commensalism) generate stability within ecosystems.
9	B-1.4	Design a scientific investigation with appropriate methods of control to test a hypothesis (including independent and dependent variables) and evaluate the designs of sample investigations.
10	B-1.4	Design a scientific investigation with appropriate methods of control to test a hypothesis (including independent and dependent variables) and evaluate the designs of sample investigations.

Note. ATP = adenosine triphosphate; ADP = adenosine diphosphate.

Test forms used during cognitive labs included six multiple-choice items assessing standards taught in biology/life science. Item enhancements included (a) removing a distractor, the least plausible answer choice; (b) increasing visual white space on the screen; (c) presenting a related picture or graphic; (d) simplifying language; and (e) presenting a read-aloud feature where a human voice recording, played through the computer, read items and answer choices to the student. Guided by CLT and UDL principles, these enhancements were designed to help students who consistently fail to meet proficiency by facilitating the accessibility and measurement of the students’ knowledge of science content without altering the construct being measured. Individual item enhancements were decided with input from personnel with expertise in the areas of special education, assessment, and science content to ensure that the original construct was not altered.

On the six-item assessment used in this study, the first three items were stand-alone items consisting of a short paragraph and/or item stem with answer choices. The original stand-alone items consisted of text only and offered four answer choices (one correct answer and three distractors) with each item. To enhance these items, pictures were added to the stem or answer choices. For two items, the least plausible distractor was removed. All items also included read-aloud support.

The last three items were passage-based items consisting of a paragraph describing an experiment to be used to answer all three items. For the original items, the paragraph was presented only with the first of the three related items. In the original form, each item consisted of an item stem and four answer choices. For the enhanced version, the language in the paragraph was simplified and the experimental steps were listed. A picture was added to illustrate the scenario described in the paragraph. A distractor was removed and reading support was added to all three items. The paragraph was read aloud to the students for the first passage-based item and the visual was presented with all three items. Table 3 displays the specific enhancement features that were associated with each item.

Table 3.

Enhancement Features by Item for OAASIS Think-Aloud Sessions.

		Enhancement feature
Item	Item description	Read-aloud	Eliminated answer choice	Added graphic	Simplified vocabulary	Shortened stem	Added white space
5	Animal/plant cell	•	•	•			•
6	Environment	•	•	•			•
7	ATP	•		•	•	•	•
8	Pond #1	•	•	•	•	•	•
9	Pond #2	•	•	•	•	•	•
10	Pond #3	•	•	•	•	•	•

Note. ATP = adenosine triphosphate.

Cognitive Lab Session Procedures

Each student’s session was recorded using video and audio equipment. Students were instructed to verbalize their thoughts as they solved the test items. Following the recommendations of Johnstone, Bottsford-Miller, et al. (2006), the primary researcher prompted students only when they were silent for more than 3 s by reminding them to “keep thinking aloud” or “keep talking.” Other than these prompts, researchers remained silent when students were thinking aloud to avoid disrupting their thought patterns (Ericsson & Simon, 1998). Other researchers recorded observations of the students and answers to the follow-up questionnaire, and monitored video and audio equipment.

To begin, the primary researcher demonstrated the think-aloud process on a practice item and then asked the student to engage in sample items to practice verbalizing his or her thoughts. Up to three additional items of lower grade-level science content were completed as practice on the computer-based assessment. Next, each student completed one of two test forms containing six high school biology test items, three original and three enhanced, while thinking aloud.

Table 4 displays the number of students who took each form by eligibility group. To ensure balance in the number of students in each eligibility group and test form condition, students were assigned to each test form systematically by alternating between the two forms within each eligibility group. Form 00 included three original items at the beginning and three enhanced items at the end. Form 01 included three enhanced items at the beginning and three original items at the end. The first three items on each form were stand-alone items and the second three were passage-based items. Therefore, each student took all stand-alone items in one condition (enhanced or original) and all passage-based items in the other condition. As a result, condition was confounded with item type within each form, which complicates evaluation of the effects of enhancements.

Table 4.

Number of Students by Eligibility Group and Test Form.

Eligibility group	Test Form 00	Test Form 01	Total
SWD-E	4	3	7
SWD-NE	2	3	5
SWOD	3	3	6
Total	9	9	18
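
The alternating assignment summarized in Table 4 can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name `assign_forms`, the placeholder student IDs, and the starting form chosen for each group are assumptions, selected only so that the resulting counts agree with Table 4.

```python
from collections import Counter
from itertools import cycle

FORMS = ("00", "01")  # Form 00: original items first; Form 01: enhanced items first


def assign_forms(student_ids, start_form="00"):
    """Alternate between the two forms, beginning with start_form."""
    order = FORMS if start_form == "00" else FORMS[::-1]
    return {sid: form for sid, form in zip(student_ids, cycle(order))}


# Group sizes come from Table 4; the IDs and starting forms are hypothetical.
groups = {
    "SWD-E": (7, "00"),
    "SWD-NE": (5, "01"),
    "SWOD": (6, "00"),
}

counts = {}
for group, (n, start) in groups.items():
    ids = [f"{group}-{i + 1}" for i in range(n)]
    counts[group] = Counter(assign_forms(ids, start).values())

for group, tally in counts.items():
    print(group, tally["00"], tally["01"])
```

With an odd group size, strict alternation leaves one form with one more student than the other, which is consistent with the 4/3 and 2/3 splits reported for the two groups of students with disabilities.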

Follow-Up Questions and Additional Measures

After the assessment, a follow-up questionnaire was orally administered to each student. First, students were asked to identify the perceived difficulty of each item that they completed, which provided one form of “cognitive effort” data. Specifically, students rated how hard they worked to solve each item. In addition, in follow-up coding, researchers recorded the amount of time students spent completing each test item as another measure of “cognitive effort,” which was used to determine item-enhancement effects on students’ efficiency in completing test items (Clark et al., 2006). The follow-up questionnaire also asked about preferences regarding the enhancements as a whole package, the inclusion of visuals, distractor removal, and read-aloud support. Participants were not asked about preferences regarding additional white space, simplified language, or shortened stem because students did not see both versions of each item and therefore could not evaluate these features. Participants’ perceptions of item enhancements were discussed as a whole rather than item by item.

Finally, to obtain information on each student’s reading skills, the researchers administered an ORF test to each student using a 9th-grade-level reading passage (Ohio Literacy Alliance, n.d.). Students were video and audio recorded while reading the passage aloud for 1 min. This provided a measure of reading fluency rate.

Variables and Analysis Methods

Data collected included ORF scores, item responses (graded correct or incorrect), answers to follow-up questions about item features, and time spent solving items. Definitions of variables that were examined in this study are provided in Table 5.

Table 5.

Variable Definitions for the Cognitive Lab Study.

Variable	Definition
Eligibility group	The student group, with three levels: students with disabilities considered eligible for the alternate assessment based on modified achievement standards (AA-MAS; SWD-E), students with disabilities not considered eligible for the AA-MAS (SWD-NE), and students without disabilities (SWOD). The schools’ special education coordinators determined eligibility based on provided federal guidelines.
Oral reading fluency (ORF) rate	The number of words read correctly in 1 min. Two independent researchers viewed the videotaped sessions and/or listened to audio recordings and counted the number of words read correctly. Interrater reliability using Pearson’s correlation coefficient was estimated at .999 and the second rating was used as ORF rate in the analysis.
Condition	The version of the item, with two levels: original and enhanced.
Item type	The type of item, with two levels: stand-alone and passage-based.
Raw test score	The number of items to which the student correctly responded. The score by condition was out of three for each condition: original and enhanced. The score by item type was out of three for each item type: stand-alone and passage-based.
Cognitive effort	The rating of how hard a student reported he or she worked when solving each item on a scale from 1 (not very hard) to 5 (very hard).
Time spent solving	The time between when the student first viewed the item until he or she selected the final response. Two independent researchers viewed the videotaped sessions and/or listened to audio recordings and recorded time spent solving each item for each student. Interrater reliability using Pearson’s correlation coefficient was estimated at .971 and the average of the two ratings was used as time spent solving in the analysis.
Perception of enhancements as a whole package	A student’s response to a question pertaining to ordering. Students were asked whether the items were easier at the beginning, easier at the end, or about the same throughout. Students received three enhanced items either first or second, depending on the test form. The test form was taken into account when summarizing the results.
Perception of specific enhancement features	A student’s response to questions about specific item features. Students were asked whether the addition of visuals, reduction of answer options, and read-aloud support helped them understand the question and prompted to respond yes, no, or it did not make a difference.

Means and standard deviations for ORF rates by eligibility group were computed to address Research Question 1. To address Research Question 2, rank correlations (Spearman’s ρ) were calculated between ORF rate and raw test score by condition. Means, standard deviations, and effect size estimates (Cohen’s d computed for independent samples) were computed by condition and eligibility group to address Research Question 3 and by item type and eligibility group to address Research Question 4. To address Research Question 5, percentages of perception ratings for whole package and specific enhancement features were calculated for the sample overall and by eligibility group and form.
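
As an illustration of the effect size computation, the sketch below calculates Cohen's d for independent samples from summary statistics. The pooled-standard-deviation formula and the function name `cohens_d` are assumptions; the article does not state which variant was used, and values recomputed from rounded summary statistics need not match the published effect sizes exactly.

```python
import math


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference (group 2 minus group 1) using a pooled SD."""
    pooled_var = ((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2)
    return (mean_2 - mean_1) / math.sqrt(pooled_var)


# Time spent (seconds) for the SWD-E group, as reported in Table 7:
# original M = 114.26 (SD = 83.35), enhanced M = 96.02 (SD = 56.78), n = 7.
d_time_swd_e = cohens_d(114.26, 83.35, 7, 96.02, 56.78, 7)
print(round(d_time_swd_e, 2))  # -0.26
```

The sign reflects enhanced minus original; Table 7 reports the magnitude of the effect size alongside a signed mean difference.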

Results

Results associated with each research question are presented below. When interpreting results, note that the cognitive labs utilized a small sample of students. The findings provide initial information on patterns that should be studied with a larger sample to make generalizations.

Research Question 1: ORF Rate by Eligibility Group

Table 6 presents the average ORF for each group of students. SWOD had the highest ORF rate whereas SWD-E had the lowest ORF rate. The average ORF for the SWD-E group fell below the threshold for mastery classification but remained in an instructional range.

Table 6.

Reading Fluency (Words per Minute) by Group.

Statistic	SWD-E	SWD-NE	SWOD	Total
M (SD)	84.43 (20.22)	109.20 (41.61)	164.33 (32.30)	117.94 (45.86)
N	7	5	6	18

Note. 0–49 = frustration level, 50–99 = instructional level, and ≥100 = mastery level (Deno & Mirkin, 1977). SWD-E = students with disabilities considered eligible for an alternate assessment based on modified achievement standards (AA-MAS); SWD-NE = students with disabilities not considered eligible for an AA-MAS; SWOD = students without disabilities.
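
The classification bands in the note to Table 6 can be expressed as a small lookup, sketched below. The function name `classify_orf` is hypothetical, and the handling of fractional values that fall between the published integer bands (e.g., 99.5) is an assumption, since group means are not whole numbers. The sketch also recomputes the weighted grand mean for comparison with the Total column.

```python
def classify_orf(words_correct_per_minute):
    """Map an ORF rate to the level named in the Table 6 note."""
    if words_correct_per_minute < 50:
        return "frustration"
    if words_correct_per_minute < 100:
        return "instructional"
    return "mastery"


# Group means and sizes as reported in Table 6.
group_means = {"SWD-E": (84.43, 7), "SWD-NE": (109.20, 5), "SWOD": (164.33, 6)}

levels = {group: classify_orf(mean) for group, (mean, _) in group_means.items()}

# Weighted grand mean, for comparison with the Total column.
total_n = sum(n for _, n in group_means.values())
grand_mean = sum(mean * n for mean, n in group_means.values()) / total_n

print(levels)
print(round(grand_mean, 2))
```

Applied to the reported means, only the SWD-E group falls in the instructional band, and the weighted mean agrees with the reported total of 117.94.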

Research Question 2: Relationships Between ORF Measure and Raw Test Scores

A significant correlation between ORF rate and raw test score would indicate that reading fluency is related to the score, whereas a nonsignificant correlation would provide no evidence of such an association. Because the test items were intended to measure science content, it is desirable that these measures not be associated. A statistically significant positive correlation was found between ORF rate and raw score for original items (ρ = .682, p = .002). The correlation between ORF rate and raw score for enhanced items was not significantly different from zero (ρ = .345, p = .161).
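For readers unfamiliar with the statistic, Spearman’s ρ is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The sketch below illustrates that definition only. The function names are hypothetical and the input values are invented for illustration; they are not the study’s data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Note: a constant input has zero rank variance and is not handled here.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for illustration only; NOT the study's data.
orf_rates = [62, 85, 90, 110, 150, 171]   # words per minute
raw_scores = [1, 1, 2, 2, 3, 3]           # raw test scores on a 0-3 scale
rho = spearman_rho(orf_rates, raw_scores)
print(round(rho, 3))
```

A rank-based coefficient suits raw scores on a 0 to 3 scale because many students share the same score, and the tie handling above accounts for that.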

Research Question 3: Outcome Measures by Condition and Eligibility Group

Table 7 summarizes results on the three outcome measures by testing condition and eligibility group as well as the total sample. Overall, the average raw score for the enhanced items was slightly higher than for the original items; however, this difference was only seen in the SWOD group. The SWOD group performed slightly better, on average, on enhanced test items as compared with the original test items with a medium effect size. Both the SWD-E and the SWD-NE groups exhibited the same average total raw score for original and enhanced items.

Table 7.

Testing Outcome Measures by Condition and Group.

	M (SD)
Outcome measure	Original	Enhanced	Mean difference	Effect size
Raw test score (0 to 3)
Total (n = 18)	2.06 (0.94)	2.11 (0.76)	0.05	0.06
SWD-E (n = 7)	1.71 (1.11)	1.71 (0.76)	0.00	0.00
SWD-NE (n = 5)	2.00 (1.00)	2.00 (0.71)	0.00	0.00
SWOD (n = 6)	2.50 (0.55)	2.67 (0.52)	0.17	0.59
Cognitive effort (1 to 5)
Total (n = 18)	2.98 (1.25)	2.41 (1.14)	−0.57	0.51
SWD-E (n = 7)	3.00 (1.41)	2.76 (1.22)	−0.24	0.21
SWD-NE (n = 5)	2.73 (1.39)	2.20 (1.09)	−0.53	0.57
SWOD (n = 6)	3.17 (0.92)	2.17 (1.04)	−1.00	1.25
Time spent (seconds)
Total (n = 18)	94.01 (63.10)	82.87 (50.54)	−11.14	0.20
SWD-E (n = 7)	114.26 (83.35)	96.02 (56.78)	−18.24	0.26
SWD-NE (n = 5)	89.03 (38.83)	74.20 (36.67)	−14.83	0.45
SWOD (n = 6)	74.53 (41.26)	74.75 (49.49)	0.22	0.08

Note. Effect size estimates were computed using Cohen’s d for independent samples. The numerator was the mean difference (enhanced minus original) and the denominator was the pooled standard deviation. SWD-E = students with disabilities considered eligible for an alternate assessment based on modified achievement standards (AA-MAS); SWD-NE = students with disabilities not considered eligible for an AA-MAS; SWOD = students without disabilities.
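The effect size described in the note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the pooled standard deviation uses the usual degrees-of-freedom weighting, and n = 18 is assumed for both conditions in the total-sample row. Because the inputs are the rounded summary statistics from Table 7, the result need not reproduce every published value.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent samples."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(mean_enh, sd_enh, n_enh, mean_orig, sd_orig, n_orig):
    """Cohen's d: (enhanced mean - original mean) / pooled SD."""
    return (mean_enh - mean_orig) / pooled_sd(sd_enh, n_enh, sd_orig, n_orig)


# Raw test score, total-sample row of Table 7 (rounded summary statistics).
# Assumption: n = 18 in each condition.
d_total = cohens_d(2.11, 0.76, 18, 2.06, 0.94, 18)
print(round(d_total, 2))  # 0.06
```

With equal group sizes the weights cancel, so the pooled standard deviation reduces to the square root of the mean of the two variances. The tables appear to report effect sizes as magnitudes, so a negative mean difference would correspond to the absolute value of d.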

All three student groups reported expending less effort on the enhanced versions of test items as compared with the original versions. SWD-E reported the greatest amount of cognitive effort required to solve enhanced items and also showed the smallest mean difference in level of cognitive effort for enhanced items versus original items. The effect sizes were small for SWD-E, medium for SWD-NE, and large for SWOD.

Little difference was seen in the amount of time SWOD spent on original items compared with enhanced items. Both groups of students with disabilities spent more time solving original items than enhanced items. The effect sizes were small for SWD-E, medium for SWD-NE, and minimal for SWOD.

Research Question 4: Outcome Measures by Item Type and Eligibility Group

Results on the three outcome measures were also summarized by item type (stand-alone or passage-based) and eligibility group (see Table 8). SWD-E performed better on the stand-alone items than on the passage-based items, with a large effect size estimate. SWD-NE and SWOD had similar mean performance on the two types of items, indicating that the passage-based items presented greater difficulty for AA-MAS-eligible students compared with the other two groups.

Table 8.

Testing Outcome Measures by Item Type and Group.

	M (SD)
Outcome measure	Stand-alone	Passage-based	Mean difference	Effect size
Raw test score (0 to 3)
Total (n = 18)	2.33 (0.69)	1.83 (0.92)	0.50	0.65
SWD-E (n = 7)	2.29 (0.76)	1.14 (0.69)	1.15	1.86
SWD-NE (n = 5)	2.00 (0.71)	2.00 (1.00)	0.00	0.00
SWOD (n = 6)	2.67 (0.52)	2.50 (0.55)	0.17	0.39
Cognitive effort (1 to 5)
Total (n = 18)	2.43 (1.28)	2.96 (1.12)	−0.53	0.47
SWD-E (n = 7)	2.62 (1.40)	3.14 (1.20)	−0.52	0.46
SWD-NE (n = 5)	1.87 (1.06)	3.07 (1.16)	−1.20	1.37
SWOD (n = 6)	2.67 (1.24)	2.67 (0.97)	0.00	0.00
Time spent (seconds)
Total (n = 18)	69.30 (36.37)	107.58 (67.39)	−38.29	0.75
SWD-E (n = 7)	79.86 (47.94)	130.43 (82.22)	−50.57	0.94
SWD-NE (n = 5)	70.10 (23.15)	93.13 (46.50)	−23.03	0.74
SWOD (n = 6)	56.31 (23.25)	92.97 (54.21)	−36.67	1.08

Note. Effect size estimates were computed using Cohen’s d for independent samples. The numerator was the mean difference (stand-alone minus passage-based, as shown in the Mean difference column) and the denominator was the pooled standard deviation. SWD-E = students with disabilities considered eligible for an alternate assessment based on modified achievement standards (AA-MAS); SWD-NE = students with disabilities not considered eligible for an AA-MAS; SWOD = students without disabilities.

SWD-NE had the greatest mean difference in perceived cognitive effort between passage-based items and stand-alone items, with a large effect size. SWD-E had the second highest mean difference, with a medium effect size, and SWOD had the same mean cognitive effort values for stand-alone and passage-based items. Unlike SWOD, students with disabilities reported the passage-based items as more challenging to solve than the stand-alone items.

SWD-E had the largest mean difference in time spent solving passage-based items compared with stand-alone items (mean difference of 51 s). This group also had the greatest variability in time spent solving for both item types. SWOD had the second greatest and SWD-NE had the lowest mean difference in time spent solving the two types of items. SWD-E tended to take longer to solve problems and performed worse on passage-based items compared with stand-alone items. Effect sizes for time differentials by item type were large for all three eligibility groups.

Research Question 5: Perceptions of Enhancement Features

Table 9 presents data on the participants’ perceptions of the helpfulness of the enhancements collapsed across eligibility groups. Preliminary analysis revealed similar patterns across all groups. The “whole package” results were obtained from students’ responses to the question pertaining to item order (i.e., were items easier at the beginning or at the end?). Participants reported that the enhancements, as a whole, helped them better understand the items. Most participants also indicated that adding visuals, removing one distractor, and providing a read-aloud feature specifically made it easier to understand the question. For all eligibility groups, the read-aloud feature was reported by the greatest percentage of students as being helpful in understanding the item.

Table 9.

Student Perceptions of Item Enhancement Features by Form.

	Form 00 (passage-based items enhanced)			Form 01 (stand-alone items enhanced)			Total
Item feature	Enhanced better	Same	Enhanced worse	Enhanced better	Same	Enhanced worse	Enhanced better	Same	Enhanced worse
Whole package	3 (33.3%)	3 (33.3%)	3 (33.3%)	6 (66.7%)	3 (33.3%)	0 (0.0%)	9 (50.0%)	6 (33.3%)	3 (16.7%)
Visuals^a	4 (44.4%)	3 (33.3%)	2 (22.2%)	6 (66.7%^a)	0 (0.0%)	0 (0.0%)	10 (55.6%)	3 (16.7%)	2 (11.1%)
Removed distractor	4 (44.4%)	4 (44.4%)	1 (11.1%)	5 (55.6%)	3 (33.3%)	1 (11.1%)	9 (50.0%)	7 (38.9%)	2 (11.1%)
Read-aloud support	6 (66.7%)	2 (22.2%)	1 (11.1%)	6 (66.7%)	3 (33.3%)	0 (0.0%)	12 (66.7%)	5 (27.8%)	1 (5.6%)

Note. Extra white space, simplified language, and shortened stem were not included in this table as participants were not asked about the benefit of these enhancement features. Enhanced better = easier/helpful; Enhanced worse = harder/not helpful.

^a Three students (33.3%) provided a response that was inconsistent with the answer options for the question about visuals. These responses are not reported in this table.

Table 9 also summarizes students’ perceptions of item-enhancement features by test form. More students who took the enhanced stand-alone items (Form 01) indicated that enhancements, as a whole package, made solving the problems easier. The fact that stand-alone items were presented at the beginning and passage-based items were presented at the end on both forms may have also influenced students’ responses to this item. In addition, students who took the enhanced stand-alone items (Form 01) more frequently indicated that visuals were helpful than students who took the enhanced passage-based items (Form 00). However, three responses to the question on the helpfulness of visuals were inconsistent with the choices for stand-alone items. Different visuals were presented with each of the stand-alone items whereas the same visual was presented for all three passage-based items. The students with inconsistent responses about visuals indicated that sometimes visuals were helpful, suggesting that the type of visual was more important than simply having a visual. Students on both forms reported that read-aloud support and removal of a distractor were helpful at similar rates.

Discussion

Historically, the population of students for whom the AA-MAS is intended has been largely overlooked in assessment. As a measure of their progress and achievement, they have had to take either grade-level assessments with the same format and achievement levels as students without disabilities or assessments based on content below their cognitive capacity. The structured item-development process undertaken in this study helps to ensure that an AA-MAS validly reflects the expectations for and the performance of these students, ultimately affecting the validity of the state’s overall accountability efforts.

Results from the cognitive lab study provided preliminary information on student performance on original versus enhanced items and on student preferences of item-enhancement features to inform item development for an AA-MAS in high school science. Data collected on item performance did not show much improvement. Only SWOD had a greater mean number of correct responses on the enhanced compared with the original items, whereas both groups of students with disabilities had the same mean scores in both conditions. However, the cognitive effort measures (perceived student rating and time spent solving) were lower among enhanced than original items. This suggests decreased perceived difficulty for the enhanced items. Students also perceived the enhancement features as helpful, both individually and as a whole. Among the enhancements, students tended to like the read-aloud support best. There was a reduction in the relationship between item scores and the ORF measure with the enhanced items compared with the original items. These findings suggest that enhancements increased access to science content, in part by minimizing the influence of reading ability on student performance.

Comparisons by item type (stand-alone or passage-based) revealed differences in performance between the two types of items. For AA-MAS-eligible students, average performance was lower on passage-based items compared with stand-alone items. The passage-based items were based on a complex scenario and, on average, required greater cognitive effort from students with disabilities and more time to solve for all student groups than the stand-alone items. As seen in the ORF measure results, reading ability tends to be lower for students in the eligible group. Therefore, these students may benefit from enhancements such as simplified language and read-aloud support. Collectively, these findings suggest that passage-based items may be particularly challenging for AA-MAS-eligible students and may require further enhancements.

The passage-based items presented more information and were more difficult for the AA-MAS-eligible students as compared with stand-alone items. Across the original and enhanced versions of the items, however, there was no difference in average raw test scores for the AA-MAS-eligible group. Although the effect of item condition cannot be separated from item type, there is some indication that the package of enhancements may be helpful on passage-based items for students who would take an AA-MAS in high school science.

In evaluating the benefits of enhancements, the decision to include a graphic and the type of graphic presented are important. Follow-up analysis revealed two situations where the graphic needed to be adjusted but for different reasons. For one item, adding a graphic may have increased the cognitive load for the item, thus making it more difficult rather than increasing access to the content. According to Haladyna et al. (2002), each item should reflect specific content and require a single specific mental behavior; however, the graphic in this case may have added a mental processing step, complicating the problem-solving process. This interpretation is consistent with CLT, which indicates that extraneous information increases cognitive load and may inhibit problem solving.

On another item, the supporting graphic may have provided unintentional clues that ruled out the distractors as reasonable answer options and made the item easier. When evaluating graphics for inclusion as a means to increase access to content, it is important that the graphic does not provide extraneous information. Rather, graphics should represent the information presented in the text or help visually define a process being described in the item (Gaster & Clark, 1995).

Despite the lack of evidence of improved performance in this study, the enhancements provided to students were perceived as useful, were associated with decreased ratings of cognitive effort, and were associated with less time spent solving items. In this case, the enhancements (e.g., read-aloud, visual, reduced reading load) may have had the desired impact of decreased cognitive load, but the students may not have had the appropriate content knowledge to answer correctly. In essence, these enhancements may have helped differentiate more clearly the students’ lack of content knowledge. Another possibility is that the confounding of condition and item type may have masked performance effects. Only six items were considered in this preliminary study and students received either all stand-alone items or all passage-based items in enhanced versions. The OAASIS pilot assessment consisted of 40 items with a balance between item types on original and enhanced sections, and revealed improved performance on enhanced items for all student groups (Dickenson et al., 2011).

Limitations

Because of the nature of cognitive lab studies for item development, data were collected on a small sample of students for a small number of test items. Small sample size was necessary due to the detailed information collected in the think-aloud format, and generalizations from this study are not possible. Rather, this study provided formative feedback on item enhancements to guide the development of items for an AA-MAS in high school science.

Another limitation was how the item types were organized across test forms. Test forms (Form 00, Form 01) used during the cognitive lab study enhanced only one item type per form. Students received all enhancements on either stand-alone items or all enhancements on passage-based items. Therefore, comparing the value of enhancements proved difficult due to confounding of item type with condition. The fact that all students received enhancements for a specific type of question may have masked the overall achievement impact of the enhancements.

Finally, the enhancement features were presented as an overall package of enhancements, including read-aloud support, removal of a distractor, presentation of a graphic, and so on. This packaging approach limited the ability to determine which enhancement(s) had the largest impact on students’ achievement and cognitive effort required to solve the problem. To further evaluate the benefit of individual enhancements, it would be necessary to isolate each type of enhancement. This was not practical for the small number of test items investigated in this study.

Recommendations

Although item enhancements did not improve performance, they did reduce cognitive effort and time spent solving items for all students. This provides evidence that benefits of the enhancement features intended to increase access to science content for AA-MAS-eligible students extend to other students as well. For all students, the ease and speed with which they process incoming information is a potential barrier to efficient learning or demonstration of content-area knowledge on an assessment. Therefore, the fact that students reported a decreased cognitive effort in solving problems would suggest that the item enhancements served to decrease extraneous load, allowing the student to focus on the construct and allocate more cognitive resources to problem solving. Decreased cognitive load may be particularly important when students are completing tests that include 40 items where fatigue may occur.

In creating items for an AA-MAS, enhancements should increase access to content and not just reduce item difficulty. That is, enhancements should increase the likelihood that students with content knowledge will answer correctly by removing barriers associated with the students’ disabilities that may prohibit students from demonstrating mastery. Also, although an enhancement may lower the reading level of a test question, the content measured should be retained so that the construct does not change. Although basic item-writing guidelines form the basis for effective item writing in general, it is necessary to better understand how students with disabilities process information and learn. The cognitive lab study was an initial step to this end and provided information on how enhancements may increase access to the content for students with disabilities. This study provided a framework for the systematic process of evaluating item enhancements that was founded in theory to inform the development of an AA-MAS.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the U.S. Department of Education under the Enhanced Assessment Grant program 84-368A, grant award number S368AO70012.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

Browder, D. M., Spooner, Algozzine, Ahlgrim-Delzell, Flowers, & Karvonen (2003). What we know and need to know about alternate assessment. Exceptional Children, 70, 45–61.

Butler & Stevens (2001). Standardized assessment of the content knowledge of English language learners K-12: Current trends and old dilemmas. Language Testing, 18, 409–427.

Case, S. M., & Swanson (2001). Constructing written test questions for the basic and clinical sciences (3rd ed.). Philadelphia, PA: National Board of Medical Examiners.

Center for Applied Special Technology. (2011). CAST: About UDL. Retrieved from http://www.cast.org/udl/index.html

Clark, Nguyen, & Sweller (2006). Efficiency in learning: Evidence-based guidelines to manage cognitive load. San Francisco, CA: Pfeiffer.

Dickenson, T. S., Bennett, H. L., Beddow, P. A., Kettler, R. J., Morgan, & Gilmore (2011, April). Effects of modification on science tests and items for high school students in three states. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans, LA.

Deno, S. L., & Mirkin, P. K. (1977). Data-based program modification: A manual. Reston, VA: Council for Exceptional Children.

Dolan, R. P., & Hall, T. E. (2001). Universal design for learning: Implications for large-scale assessment. IDA Perspectives, 27(4), 22–25.

Elliott, S. N., & Marquart, A. M. (2004). Extended time as a testing accommodation: Its effects and perceived consequences. Exceptional Children, 70, 349–367.

Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.). Cambridge, MA: MIT Press.

Ericsson, K. A., & Simon (1998). How to study thinking in everyday life: Contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, and Activity, 5, 178–186.

Fulk, C. L., & Smith, P. J. (1995). Students’ perceptions of teachers’ instructional and management adaptations for students with learning or behavior problems. Elementary School Journal, 95, 409–419.

Gaster & Clark (1995). A guide to providing alternate formats. West Columbia, SC: Center for Rehabilitation Technology Services. Retrieved from ERIC database. (ED405689)

Gronlund, N. E. (2003). Assessment of student achievement (7th ed.). Boston, MA: Allyn & Bacon.

Haigh (1999). Accommodations, modifications, and alternates for instruction and assessment (Maryland/Kentucky Report No. 5). Minneapolis, MN: University of Minnesota. Retrieved from http://www.cehd.umn.edu/nceo/OnlinePubs/archive/AssessmentSeries/MdKy5.html

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309–344.

Helwig & Tindal (2003). An experimental analysis of accommodation decisions on large-scale mathematics tests. Exceptional Children, 69, 211–225.

Johnson & Monroe (2004). Simplified language as an accommodation on math tests. Assessment for Effective Intervention, 29(3), 35–45.

Johnstone, C. J. (2003). Improving validity of large-scale tests: Universal design and student performance (Technical Report No. 37). Minneapolis: University of Minnesota.

Johnstone, C. J., Bottsford-Miller, N. A., & Thompson, S. J. (2006). Using the think-aloud methods (cognitive labs) to evaluate test design for students with disabilities and English language learners. Minneapolis: University of Minnesota. Retrieved from ERIC database. (ED495909)

Johnstone, C. J., Thurlow, Moore, & Altman (2006). Using systematic item selection methods to improve universal design of assessment (Policy Directions 18). Minneapolis: University of Minnesota. Retrieved from http://education.umn.edu/NCEO/OnlinePubs/Policy18/

Kettler, R. J., Elliott, S. N., & Beddow, P. A. (2009). Modifying achievement test items: A theory-guided and data-based approach for better measurement of what students with disabilities know. Peabody Journal of Education, 84, 529–551.

Koretz, D. M., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 531–578). Westport, CT: American Council on Education and Praeger Publishers.

Kosciolek, & Ysseldyke, J. E. (2000). Effects of a reading accommodation on the validity of a reading test (Tech. Report No. 28). Minneapolis: University of Minnesota. Retrieved from http://education.umn.edu/NCEO/OnlinePubs/Technical28.htm

Lang, S. C., Elliott, S. N., Bolt, D. M., & Kratochwill, T. R. (2008). The effects of testing accommodations on students’ performances and reactions to testing. School Psychology Quarterly, 14, 107–124.

Lau, K. L. (2006). Reading strategy use between Chinese good and poor readers: A think-aloud study. Journal of Research in Reading, 29, 383–399.

Lazarus, S. S., & Thurlow, M. L. (2009). The changing landscape of alternate assessments based on modified academic achievement standards: An analysis of early adopters of AA-MASs. Peabody Journal of Education, 84, 496–510.

McKevitt, B. C., & Elliott, S. N. (2003). The use of testing accommodations on a standardized reading test: Effects on scores and attitudes about testing. School Psychology Review, 32, 583–600.

National Center on Educational Outcomes. (2007). 2007 survey of states: Activities, changes, and challenges for special education. Minneapolis: University of Minnesota. Retrieved from http://www.cehd.umn.edu/NCEO/OnlinePubs/2007StateSurvey/2007StateSurveyReport.pdf

National Center on Educational Outcomes. (2010a). A summary of research on the effects of test accommodations: 2007–2008. Minneapolis: University of Minnesota. Retrieved from http://www.cehd.umn.edu/NCEO/OnlinePubs/Tech56/TechnicalReport56.pdf

National Center on Educational Outcomes. (2010b). States participation guidelines for alternate assessments based on modified academic achievement standards (AA-MAS) in 2009. Minneapolis: University of Minnesota. Retrieved from http://www.cehd.umn.edu/NCEO/OnlinePubs/Synthesis75/Synthesis75.pdf

Ohio Literacy Alliance. (n.d.). Becoming a “star sailor.” Retrieved from http://www.ohioliteracyalliance.org/fluency/

Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 733–757). Westport, CT: American Council on Education and Praeger Publishers.

Quenemoen, Thompson, & Thurlow (2003). Measuring academic achievement of students with significant cognitive disabilities: Building understanding of alternate assessment scoring criteria (Synthesis Report No. 50). Minneapolis: University of Minnesota.

Roach, A. T., Beddow, P. A., Kurz, Kettler, R. J., & Elliott, S. N. (2010). Incorporating student input in developing alternate assessment based on modified achievement standards. Exceptional Children, 77, 61–80.

Sireci, S. G., & Scarpati, S. E. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75, 457–490.

Smith, J. M., & McCombs, M. E. (1971). Research in brief: The graphics of prose. Visible Language, 5, 365–369.

Sweller (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12, 257–285.

Thompson, S. J., Johnstone, C. J., Anderson, M. E., & Miller, N. A. (2005). Considerations for the development and review of universally designed assessments (Technical Report No. 42). Minneapolis: University of Minnesota.

Thurlow, M. L., Lazarus, S. S., Thompson, S. J., & Blount Morse (2005). State policies on assessment participation and accommodations for students with disabilities. Journal of Special Education, 38, 232–240.

Ysseldyke & Olson (1999). Putting alternate assessments into practice: What to measure and possible sources of data. Exceptional Children, 65, 178–185.