Abstract
Background:
The concept of reliability is central to conducting and understanding research in psychology. Students’ understanding of concepts is strengthened when they learn by applying those concepts.
Objective:
This article describes initial evidence of an activity for teaching reliability.
Method:
Students watched a short video of a staged bank robbery. They then tested the reliability of two different forms of police instructions for eyewitness recall. In so doing, they gained practice at calculating and interpreting inter-rater reliability and test-retest reliability.
Results:
Data collected from N = 191 students indicate that the exercise has a statistically significant positive effect on student understanding of, and confidence about, reliability concepts, and that it contributes to a roughly 20% increase in performance when comparing responses on pre- and post-exercise multiple-choice questions.
Conclusion:
The activity gives students practice with the concept of reliability in a way that is engaging and memorable insofar as it demonstrates the implications of reliability for the real world.
Teaching Implications:
The activity is straightforward to implement and encourages students to learn by “doing.”
Reliability is a fundamental concept in research methodology. Broadly speaking, reliability refers to the extent to which something predictably repeats across time, across situations, and across and within individuals. Reliability is especially important in psychology. We deal with humans, and humans vary in the way they think and feel and behave. Accordingly, when we attempt to measure or induce thoughts, feelings, and behaviors, there will always be natural variability and random error. Thus, a central question for researchers and students in psychology revolves around reliability: Are the methods and measures we are using capable of producing similar outcomes in different contexts, at different time points, between different people, and within the same person? The more confident we can be about a method or measure, the more confident we can be that our findings reflect a ‘truth’ about humanity.
Given the centrality of reliability in psychological research, educators have developed many and varied exercises for teaching the concept to undergraduate students. For example, students have learnt about reliability by engaging in behavioral observation of a mouse colony (Herzog, 1988), developing personality scales (e.g., Camac & Camac, 1993; Miserandino, 2006), examining effects of position preference set on test responses (Buck, 1991), measuring vertical lines (Moore, 1981), unobtrusively observing strangers in public places (e.g., Wilson & Joye, 2017), and randomly generating sample sizes to examine effects on study outcomes (Strube, 1991).
This article describes a new exercise for teaching different aspects of reliability in a way that is highly engaging for students and easy for educators to adapt. The exercise draws upon several well-established pedagogical approaches. First—and as with the other exercises noted above—it is based upon the principles of active learning (e.g., Prince, 2004), which refers to learning by doing but also thinking about what one is doing when engaging in a task. Active learning has been shown to be more effective than passive learning, which is simply reading or hearing about a concept (for a meta-analysis of the effects of active over passive learning, see Freeman et al., 2014). Second, the exercise uses film as the teaching medium, and film has been shown to be an effective resource (e.g., Strelan, 2018) for its ability to bring the world into the classroom and its accessibility (e.g., Waller et al., 2013). Third, the exercise is rooted in a context which has salient real-world implications. There is evidence to suggest that deeper understanding of key concepts is more likely to eventuate when practical applications are highlighted (e.g., Brundiers et al., 2010).
The Exercise: Eyewitness Recall of a Bank Robbery
The exercise is designed to be used in tutorials but could easily be adapted to a lecture theater setting. The exercise presumes that students have already been introduced to fundamental concepts relating to reliability. Nonetheless, it would be useful for the instructor to remind students of definitions of different types of reliability at the beginning of the session and throughout where appropriate. Because reliability and validity are closely entwined, it would be prudent to remind students of the differences between these two concepts but emphasize that the focus in the present exercise is on reliability. Because the exercise features a video of a simulated bank robbery, the instructor should also provide a trigger warning prior to the class.
The exercise involves students recalling what they witnessed during the bank robbery. Students watch the video on three different occasions in order to provide data that enables them to get practice with inter-rater reliability and test-retest reliability. It is important to note that while reliable eyewitness memory is used as the entry point to engage students, the exercise is actually about the reliability of instructions that police may give to eyewitnesses when asking them to recall a crime. The exercise has five components. Materials are available from the author upon request.
Part 1: Provide context
First the instructor explains the important role that eyewitness memory plays in the justice system. After acknowledging that issues of validity are also vital, the instructor points out that police often rely on witnesses reliably recalling what they saw, across multiple witnesses (inter-rater reliability) and/or within witnesses across time (test-retest reliability). The extent to which judges and juries are able to rely upon eyewitness testimony can affect whether a person is found guilty or not guilty of committing a crime. The consequences can be profound. On one hand, if a perpetrator is indeed found guilty, justice has been served. On the other hand, lives can be ruined when innocent people are wrongfully convicted (for a review and discussion of eyewitness memory research and its implications, see Albright, 2017). While eyewitness memory is obviously an important issue in and of itself, the aim in providing such context is to make it clear to students that reliability is not an abstract concept. Reliability can literally have life or death implications, as exemplified when some people found guilty are rightly or wrongly sentenced to long prison terms or even death primarily on the basis of eyewitness testimony. In short, reliability has important real-world application. (There is an opportunity here for the instructor to elaborate on other contexts, for example, by discussing how vital it is to ensure that measures of psychological disorders are reliable; otherwise, incorrect diagnoses could be made, with potentially life-changing consequences.)
Part 2: First practice with inter-rater reliability
Next the instructor informs the class that they are going to engage in an exercise designed to find out if they would be reliable eyewitnesses. The students watch a 35-s video of a simulated bank robbery (as developed by and reported in Tuckey & Brewer, 2003). At the completion of the video, the students are simply instructed to “describe the offender” on a piece of paper or their devices. Note that for learning purposes, this instruction is deliberately vague. They then work in pairs to calculate inter-rater reliability using the following formula:
Percentage agreement = (# agreements × 100) / (# agreements + # disagreements)
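The percentage-agreement formula above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the exercise materials; the function name `percent_agreement` and the example counts are assumptions chosen for demonstration.

```python
def percent_agreement(agreements: int, disagreements: int) -> float:
    """Percentage agreement: agreements x 100 / (agreements + disagreements)."""
    total = agreements + disagreements
    if total == 0:
        raise ValueError("At least one agreement or disagreement is required.")
    return agreements * 100 / total


# Hypothetical pair: 3 details in their descriptions matched, 7 did not.
print(percent_agreement(3, 7))  # 30.0
```

A result such as this would fall well short of the 70% rule of thumb, which is the pattern the open-ended instruction is expected to produce.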
After all pairs have completed the task, they share their inter-rater agreement percentage with the class. In most cases the results will be poor, that is, (usually much) less than the rule of thumb of .70 as an acceptable internal reliability coefficient (or in this case, 70% agreement). The instructor should encourage students to think about the real-world implications when eyewitnesses agree or disagree about what they saw in relation to a serious crime. Students should then discuss why the inter-rater agreements were generally so poor. Students will arrive at several inter-related conclusions: (1) the instructions provided for recall were too vague or general; therefore (2) because they were individuals, they each interpreted the instructions in their own way; and (3) in the absence of any instructions prior to watching the video, individual differences dictated that they paid attention to different aspects of the scene, including not necessarily focusing on the offender (just as in real life).
Occasionally, pairs will report good inter-rater agreement. Usually this will be because both individuals in those dyads happened to write very little in their descriptions or wrote broad statements, thus making it easier to agree. Alternatively, higher inter-rater agreements can occur because pairs made calculation errors or were not stringent about what counted as an agreement. Finally, sometimes the tutorial group will have an odd number of students. In this case there will be a group of three (and sometimes groups of three may occur naturally), which will most likely yield even poorer inter-rater reliabilities.
Part 3: Second practice with inter-rater reliability
Once the class has shared and discussed their insights, they should be encouraged to think of what could be done to improve inter-rater reliability. The short answer is that this can be achieved by improving the instructions by making them more specific. The instructor shows the same video again. Immediately after, each student is provided with a checklist of 14 items relating to the main offender, e.g., “What was the colour of their shirt?” The members of each dyad repeat the same process as per the first viewing and they discuss as a class their levels of inter-rater agreement. This time agreement percentages will be much improved. The instructor should prompt the class to discuss why this was the case. Students will quickly realize that having a checklist—in other words, more specific instructions—made their jobs easier.
Part 4: Practice with test-retest reliability
The students watch the same video a third time and complete the checklist again (the instructor needs to ensure that two checklists are provided, and that students do not look at their first checklist when completing the second one). The students now work individually. They use the same formula as for the first two viewings, but this time they calculate test-retest reliability on their own agreements and disagreements between Viewing #2 and Viewing #3. They share their percentages, which are now usually 100% or thereabouts, with the rest of the class. The instructor encourages the class to think about why test-retest agreements were so high and why they tend to be higher than inter-rater agreements. While students will comment on the influence of practice effects, they should also recognize that variability is reduced when the same person employs the same checklist across two time points compared to when two different people are required to interpret the checklist.
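With checklist data, agreements can be counted item by item instead of being judged from free text. The sketch below is a hedged illustration only: the function name `checklist_agreement`, the item labels other than shirt colour, and all answers are hypothetical, and the case-insensitive matching rule is an assumption about what should count as an agreement.

```python
def checklist_agreement(first: dict[str, str], second: dict[str, str]) -> float:
    """Percentage of checklist items answered identically in both checklists."""
    if first.keys() != second.keys():
        raise ValueError("Both checklists must contain the same items.")
    if not first:
        raise ValueError("Checklists must contain at least one item.")
    # Assumption: answers agree if they match after trimming and lower-casing.
    agreements = sum(
        first[item].strip().lower() == second[item].strip().lower()
        for item in first
    )
    return agreements * 100 / len(first)


# Hypothetical answers from one student (4 of the 14 items shown for brevity).
viewing_2 = {"shirt colour": "blue", "hat": "none", "height": "tall", "bag": "black"}
viewing_3 = {"shirt colour": "Blue", "hat": "none", "height": "medium", "bag": "black"}
print(checklist_agreement(viewing_2, viewing_3))  # 75.0
```

Passing two different students' checklists from the second viewing to the same function would give the inter-rater figure from Part 3, which makes the contrast between the two types of reliability easy to show side by side.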
Part 5: General discussion points
The final part of the exercise provides the instructor with an opportunity to help students see links between reliability and other related (and important) concepts in research methods, in particular operationalization and validity. For example, educators could pose the following set of questions for students: [a] What construct do you think was being operationalized in this activity? [b] What does this activity teach us about the related concepts of operationalization and reliability? [c] How did the quality of instructions affect inter-rater reliability? [d] In this exercise, we assessed test-retest reliability in the space of 5–10 minutes. In the real-world context of eyewitness recall, what might the spacing between testing sessions be? What would be most appropriate? What implications do you think the spacing might have for reliability coefficients and, therefore, criminal convictions? [e] What is the difference between reliability and validity? How can you illustrate it based on the tasks we have just completed? [f] Regardless of how specific or vague the instructions were, were they valid? [g] What are the real-world implications in terms of police procedure for encouraging accurate eyewitness memory recall?
Empirical Evidence for the Efficacy of the Exercise
Data were collected from 191 students enrolled in an Introduction to Research Methods in Psychology undergraduate course. The students participated in the exercise as part of their tutorial program. Prior to attending the tutorial students were expected to have engaged with a pre-recorded online presentation that introduced them to fundamental concepts relating to reliability. A within-participants design was employed. At the start of the tutorial students completed a four-item multiple choice quiz. Each item contained four responses but only one was correct. Students were instructed to choose the response that best answered the question. The items were concerned with (1) a broad definition of reliability; (2) a definition of inter-rater reliability; (3) a definition of test-retest reliability; and (4) the relation between reliability and operationalization. Participants also responded to a fifth item, “How confident are you that you understand the concept of reliability?” on a scale increasing in 10% increments from 0% to 100%.
Students were not given feedback on their answers. At the end of the tutorial, they responded to the same items again, after which feedback was given. All multiple-choice responses were coded so that a correct answer was coded as 1 and an incorrect answer (i.e., any of the other three responses) was coded as 0. Paired-samples t-tests were conducted for each of the four pre- and post-exercise multiple-choice items and the single confidence item. Descriptive and inferential statistics are summarized in Table 1.
Table 1. Descriptive and Inferential Statistics for Testing Differences Between Pre- and Post-Exercise Knowledge and Confidence.
Note. N = 191.
As shown in Table 1, there were statistically significant improvements in quiz performance on three out of the four questions, specifically, those relating to reliability in general, inter-rater reliability, and the implications of good operationalization for reliability. There was no significant difference on the item concerned with test-retest reliability, although means were in the anticipated direction. Effect sizes were weak, according to Cohen’s (1992) rules of thumb (i.e., a d value around .20 is interpreted as weak; a d value around .50 is interpreted as medium; a d value around .80 is interpreted as large). There was also a significant difference in student confidence ratings. That is, students were significantly more confident about their understanding of reliability after the exercise (mean ratings were around 75%) compared with before the exercise (mean ratings were around 55%). The effect size for confidence was strong. In short, there is evidence that engaging in the exercise helped students to better understand concepts of reliability, and it helped them to feel more confident.
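For readers who wish to run a comparable analysis on their own class data, a paired-samples t statistic and an effect size can be computed with the Python standard library alone. The sketch below rests on stated assumptions: the article does not report which software or which variant of Cohen's d was used, so the difference-score version (d_z, mean difference divided by the SD of the differences) is shown here as one common choice, and the scores are hypothetical.

```python
import math
import statistics


def paired_t_and_d(pre: list[float], post: list[float]) -> tuple[float, float]:
    """Return (t statistic, Cohen's d_z) for paired pre/post scores."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("Need two equally long lists with at least two pairs.")
    diffs = [after - before for before, after in zip(pre, post)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the difference scores
    if sd_diff == 0:
        raise ValueError("Difference scores have zero variance.")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    d_z = mean_diff / sd_diff  # assumption: d_z variant, not confirmed by the article
    return t_stat, d_z


# Hypothetical coded responses (1 = correct, 0 = incorrect) for eight students.
pre_scores = [0, 1, 0, 1, 1, 0, 1, 0]
post_scores = [1, 1, 0, 1, 1, 1, 1, 0]
t_stat, d_z = paired_t_and_d(pre_scores, post_scores)
print(f"t({len(pre_scores) - 1}) = {t_stat:.2f}, d_z = {d_z:.2f}")
```

The function deliberately omits a p-value, which would require the t distribution from a statistics package; the t statistic and degrees of freedom can be looked up or passed to such a package as needed.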
Next, a crosstab analysis was used to establish the percentage of students who improved after being initially incorrect. These analyses are summarized in Table 2. Table 2 indicates the percentage of students who were incorrect at both pre- and post-tests (e.g., for the first question, “Reliability refers to . . . ,” 9% of students were wrong on both occasions; these students did not improve); incorrect at the pre-test but correct on the post-test (e.g., for the first question, this was 17% of students; these are the students who improved); correct on the pre-test but incorrect on the post-test (for the first question, this was 7% of students; these students performed worse after engaging in the exercise); and correct on both the pre- and post-tests (for the first question, this was 67% of students).
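The four-cell breakdown described above can be reproduced for any item by tallying each student's pre/post pair. The following is a minimal sketch under stated assumptions: the function name `pre_post_crosstab`, the cell labels, and the ten coded responses are all hypothetical and are not the study data.

```python
from collections import Counter

# Cell labels keyed by (pre, post), where 1 = correct and 0 = incorrect.
LABELS = {
    (0, 0): "incorrect at both",
    (0, 1): "improved",
    (1, 0): "worsened",
    (1, 1): "correct at both",
}


def pre_post_crosstab(pre: list[int], post: list[int]) -> dict[str, float]:
    """Percentage of students falling in each pre/post cell."""
    if len(pre) != len(post) or not pre:
        raise ValueError("Need two equally long, non-empty lists.")
    counts = Counter(zip(pre, post))
    return {label: counts[cell] * 100 / len(pre) for cell, label in LABELS.items()}


# Hypothetical coded responses for ten students on one item.
pre_item = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
post_item = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(pre_post_crosstab(pre_item, post_item))
```

Reporting all four cells, rather than only the change in the mean, shows how much of the overall gain is offset by students who moved from a correct to an incorrect answer.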
Considering the first three items together (i.e., concerned with reliability in general, inter-rater reliability, and test-retest reliability), 43%–67% of students (depending on the item) were correct across both pre- and post-tests, and 9%–23% were incorrect across both pre- and post-tests. While 7%–13% of students did worse on the post-test, it is notable that 17%–22% of students improved as a result of engaging with the exercise.
Table 2. Percentage of Correct and Incorrect Responses Across Pre- and Post-Exercise Responses.
Note. Percentages are rounded up; therefore, totals will not always equal 100.
For the fourth item (referring to the relation between operationalization and reliability), 54% of students were incorrect at both pre- and post-test and 8% performed worse post-test, compared with 22% who improved. While the percentage of students who improved on this item was consistent with the other three items, it is notable that only 16% of students were correct at both pre- and post-test (compared with 43%–67% on the other items). Clearly, this was a challenging item for students, possibly because the alternative response choices referred to “accuracy” and “validity” of measurement and “all of the above.” It is possible that the results for this item reflect the difficulty students often have in disentangling reliability from validity.
Conclusion and Discussion
Taken together, these data indicate that The Bank Robbery exercise helps to improve students’ understanding of reliability and increases their confidence about their understanding of the concept. Also, the exercise is easy for instructors to implement. While instructors can obtain a copy of the video by contacting the author, they could use any recording of behavior with permission.
The exercise could also be used to impress upon students the fundamental importance of good operationalization in the measurement process. While no data were collected to demonstrate the efficacy of the exercise for teaching operationalization, the students’ experiences during this exercise help them to see from their very own data how inter-rater reliability improves once instructions are made clear and specific, that is, better operationalized.
There may also be an opportunity to demonstrate the vagaries of test-retest reliability by running the activity across two sessions, for example, by asking students in a subsequent tutorial to complete the checklist again. Doing so would more closely reflect the process of confirming eyewitness recall in real-world policing contexts and provide educators and students with an opportunity to discuss additional effects on test-retest reliability (e.g., time, historical events).
Finally, the exercise deliberately separates reliability from validity. As responses to one of the multiple-choice items suggest, and as other educators have no doubt found through their own experiences, students often struggle to distinguish between reliability and validity. Of course, the two are inextricably entwined. However, students are arguably better able to understand their differences and their relationship when the concepts are introduced as separate entities in separate teaching exercises. Nonetheless, there is scope within the present exercise for instructors to insert discussion of validity. For example, teachers could ask students to decide if the initial instructions to simply describe the offender are valid (they are, but they are not reliable); ask students if the second, more specific set of instructions is valid (it is, and it is also reliable); and ask them if describing the color of the walls in the tutorial room is a valid way to measure eyewitness recall (it is not, but it would still produce reliable results). Such questions would help students to start seeing how the two concepts are related.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
