The Bank Robbery: A Behavioral Observation Exercise for Enhancing Understanding of Reliability

Abstract

Background:

The concept of reliability is central to conducting, and understanding, research in psychology. Students’ understanding of concepts is strengthened when they learn by applying those concepts.

Objective:

This article describes initial evidence of an activity for teaching reliability.

Method:

Students watched a short video of a staged bank robbery. They then tested the reliability of two different forms of police instructions for eyewitness recall. In so doing, they gained practice at calculating and interpreting inter-rater reliability and test-retest reliability.

Results:

Data collected from N = 191 students indicate that the exercise has a statistically significant positive effect on student understanding of, and confidence about, reliability concepts, and contributes to a roughly 20% increase in performance when comparing responses on pre- and post-exercise multiple-choice questions.

Conclusion:

The activity gives students practice with the concept of reliability in a way that is engaging and memorable insofar as it demonstrates the implications of reliability for the real world.

Teaching Implications:

The activity is straightforward to implement and encourages students to learn by “doing.”

Keywords

reliability, test-retest reliability, inter-rater reliability, validity, video

Reliability is a fundamental concept in research methodology. Broadly speaking, reliability refers to the extent to which something predictably repeats across time, across situations, and across and within individuals. Reliability is especially important in psychology. We deal with humans, and humans vary in the way they think and feel and behave. Accordingly, when we attempt to measure or induce thoughts, feelings, and behaviors, there will always be natural variability and random error. Thus, a central question for researchers and students in psychology revolves around reliability: Are the methods and measures we are using capable of producing similar outcomes in different contexts, at different time points, between different people, and within the same person? The more confident we can be about a method or measure, the more confident we can be that our findings reflect a ‘truth’ about humanity.

Given the centrality of reliability in psychological research, educators have developed many and varied exercises for teaching the concept to undergraduate students. For example, students have learnt about reliability by engaging in behavioral observation of a mice colony (Herzog, 1988), developing personality scales (e.g., Camac & Camac, 1993; Miserandino, 2006), examining effects of position preference set on test responses (Buck, 1991), measuring vertical lines (Moore, 1981), unobtrusively observing strangers in public places (e.g., Wilson & Joye, 2017), and randomly generating sample sizes to examine effects on study outcomes (Strube, 1991).

This article describes a new exercise for teaching different aspects of reliability in a way that is highly engaging for students and easy for educators to adapt. The exercise draws upon several well-established pedagogical approaches. First—and as with the other exercises noted above—it is based upon the principles of active learning (e.g., Prince, 2004), which refers to learning by doing but also thinking about what one is doing when engaging in a task. Active learning has been shown to be more effective than passive learning, which is simply reading or hearing about a concept (for a meta-analysis of the effects of active over passive learning, see Freeman et al., 2014). Second, the exercise uses film as the teaching medium, and film has been shown to be an effective resource (e.g., Strelan, 2018) for its ability to bring the world into the classroom and its accessibility (e.g., Waller et al., 2013). Third, the exercise is rooted in a context which has salient real-world implications. There is evidence to suggest that deeper understanding of key concepts is more likely to eventuate when practical applications are highlighted (e.g., Brundiers et al., 2010).

The Exercise: Eyewitness Recall of a Bank Robbery

The exercise is designed to be used in tutorials but could easily be adapted to a lecture theater setting. The exercise presumes that students have already been introduced to fundamental concepts relating to reliability. Nonetheless, it would be useful for the instructor to remind students of definitions of different types of reliability at the beginning of the session and throughout where appropriate. Because reliability and validity are closely entwined, it would be prudent to remind students of the differences between these two concepts but emphasize that the focus in the present exercise is on reliability. Because the exercise features a video of a simulated bank robbery, the instructor should also provide a trigger warning prior to the class.

The exercise involves students recalling what they witnessed during the bank robbery. Students watch the video on three different occasions in order to provide data that enables them to get practice with inter-rater reliability and test-retest reliability. It is important to note that while reliable eyewitness memory is used as the entry point to engage students, the exercise is actually about the reliability of instructions that police may give to eyewitnesses when asking them to recall a crime. The exercise has five components. Materials are available from the author upon request.

Part 1: Provide context

First, the instructor explains the important role that eyewitness memory plays in the justice system. After acknowledging that issues of validity are also vital, the instructor points out that police often rely on witnesses reliably recalling what they saw, across multiple witnesses (inter-rater reliability) and/or within witnesses across time (test-retest reliability). The extent to which judges and juries are able to rely upon eyewitness testimony can affect whether a person is found guilty or not of committing a crime. The consequences can be profound. On one hand, if a perpetrator is indeed found guilty, justice has been served. On the other hand, lives can be ruined when innocent people are wrongfully convicted (for a review and discussion of eyewitness memory research and its implications, see Albright, 2017). While eyewitness memory is obviously an important issue in and of itself, the aim in providing such context is to make it clear to students that reliability is not an abstract concept. Reliability can literally have life or death implications, as exemplified when some people found guilty are rightly or wrongly sentenced to long prison terms or even death primarily on the basis of eyewitness testimony. In short, reliability has important real-world applications. (There is an opportunity here for the instructor to elaborate on other contexts, for example, by discussing how it is vital to ensure that measures of psychological disorders are reliable; otherwise, incorrect diagnoses could be made with potentially life-changing consequences.)

Part 2: First practice with inter-rater reliability

Next the instructor informs the class that they are going to engage in an exercise designed to find out if they would be reliable eyewitnesses. The students watch a 35-s video of a simulated bank robbery (as developed by and reported in Tuckey & Brewer, 2003). At the completion of the video, the students are simply instructed to “describe the offender” on a piece of paper or their devices. Note that for learning purposes, this instruction is deliberately vague. They then work in pairs to calculate inter-rater reliability using the following formula.

(# agreements × 100) / (# agreements + # disagreements)

For example, if there were five agreements and three disagreements, inter-rater reliability would be 62.5%. Because eyewitness accuracy is imperative in the real world of the justice system, the instructor should encourage students to be as stringent as possible in making decisions about agreements and disagreements. There is also a pedagogical imperative to ensure that agreement rules are strictly applied; apart from providing students with hands-on practice with inter-rater reliability, this part of the exercise also aims to demonstrate to students the importance of ensuring that constructs are appropriately operationalized. Thus, instructors should provide exemplars (e.g., “Two raters might agree that the offender wore a cap, but if one rater observes that the cap was white and the other says it was grey, that is a disagreement”). In addition, while it may be valid in a real-world context to include agreements about what one did not see, accumulated experience with this activity suggests that this unnecessarily confuses students. Thus, it is recommended that students be instructed that agreements about what they did not see do not count as agreements (e.g., both raters agreeing that they did not recall the offender wearing a watch).
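The calculation described above can be sketched in a few lines of Python. This is only an illustration, not part of the exercise materials: the function names and the example descriptions are hypothetical, and the rule that a feature mentioned by only one rater counts as a disagreement is an assumption made here for the sake of a concrete example.

```python
from __future__ import annotations


def percent_agreement(agreements: int, disagreements: int) -> float:
    """Inter-rater agreement: (agreements * 100) / (agreements + disagreements)."""
    total = agreements + disagreements
    if total == 0:
        raise ValueError("need at least one scored observation")
    return agreements * 100 / total


def score_pair(rater_a: dict[str, str], rater_b: dict[str, str]) -> tuple[int, int]:
    """Count (agreements, disagreements) between two raters' descriptions.

    Each dict maps a feature (e.g. "cap") to what that rater reported.
    Matching is strict: "white" versus "grey" is a disagreement.
    Features that neither rater mentioned never appear, so "agreements
    about what was not seen" are not counted.
    Assumption: a feature mentioned by only one rater is a disagreement.
    """
    features = set(rater_a) | set(rater_b)
    agreements = sum(
        1
        for f in features
        if f in rater_a
        and f in rater_b
        and rater_a[f].strip().lower() == rater_b[f].strip().lower()
    )
    return agreements, len(features) - agreements


# The worked example from the text: five agreements, three disagreements.
print(percent_agreement(5, 3))  # 62.5

# Hypothetical descriptions written by one pair of students.
rater_a = {"cap": "white", "shirt": "blue", "gender": "male"}
rater_b = {"cap": "grey", "shirt": "blue", "height": "tall"}
agree, disagree = score_pair(rater_a, rater_b)
print(agree, disagree, percent_agreement(agree, disagree))
```

Writing the rule down this explicitly mirrors the pedagogical point of this part of the exercise: the percentage depends heavily on how "agreement" is operationalized.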

After all pairs have completed the task, they share their inter-rater agreement percentage with the class. In most cases the results will be poor, that is, (usually much) less than the rule of thumb of .70 as an acceptable internal reliability coefficient (or in this case, 70% agreement). The instructor should encourage students to think about the real-world implications when eyewitnesses agree or disagree about what they saw in relation to a serious crime. Students should then discuss why the inter-rater agreements were generally so poor. Students will arrive at several inter-related conclusions: (1) the instructions provided for recall were too vague or general; therefore (2) because they were individuals, they each interpreted the instructions in their own way; and (3) in the absence of any instructions prior to watching the video, individual differences dictated that they paid attention to different aspects of the scene, including not necessarily focusing on the offender (just as in real life).

Occasionally, pairs will report good inter-rater agreement. Usually this will be because both individuals in those dyads happened to write very little in their descriptions or wrote broad statements, thus making it easier to agree. Alternatively, higher inter-rater agreements can occur because pairs made calculation errors or were not stringent about agreements. Finally, sometimes the tutorial group will have an odd number of students. In this case there will be a group of three (and sometimes groups of three may occur naturally), which will most likely produce even poorer inter-rater reliabilities.

Part 3: Second practice with inter-rater reliability

Once the class has shared and discussed their insights, they should be encouraged to think of what could be done to improve inter-rater reliability. The short answer is that this can be achieved by improving the instructions by making them more specific. The instructor shows the same video again. Immediately after, each student is provided with a checklist of 14 items relating to the main offender, e.g., “What was the colour of their shirt?” The members of each dyad repeat the same process as per the first viewing and they discuss as a class their levels of inter-rater agreement. This time agreement percentages will be much improved. The instructor should prompt the class to discuss why this was the case. Students will quickly realize that having a checklist—in other words, more specific instructions—made their jobs easier.

Part 4: Practice with test-retest reliability

The students watch the same video a third time and complete the checklist again (the instructor needs to ensure that two checklists are provided, and that students do not look at their first checklist when completing the second one). The students now work individually. They use the same formula as in the first two viewings, but this time they calculate test-retest reliability on their own agreements and disagreements between Viewing #2 and Viewing #3. They share their percentages, which are now usually 100% or thereabouts, with the rest of the class. The instructor encourages the class to think about why test-retest agreements were so high and why they tend to be higher than inter-rater agreements. While students will comment on the influence of practice effects, they should also recognize that variability is reduced when the same person employs the same checklist across two time points compared to when two different people are required to interpret the checklist.
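With a fixed checklist, the same agreement formula reduces to an item-by-item comparison of two completed checklists. The sketch below illustrates this for one student's Viewing #2 and Viewing #3 responses. The function name and the 14 answers are hypothetical (the actual checklist items are not reproduced here), and matching answers after trimming whitespace and ignoring case is an assumption.

```python
from __future__ import annotations


def checklist_agreement(first: list[str], second: list[str]) -> float:
    """Percentage of checklist items answered identically on two occasions."""
    if not first or len(first) != len(second):
        raise ValueError("checklists must be non-empty and of equal length")
    agreements = sum(
        a.strip().lower() == b.strip().lower() for a, b in zip(first, second)
    )
    return agreements * 100 / len(first)


# Hypothetical answers from one student to a 14-item checklist.
viewing_2 = ["blue", "black", "white", "yes", "no", "short", "brown",
             "none", "left", "bag", "medium", "no", "yes", "male"]
viewing_3 = ["blue", "black", "grey", "yes", "no", "short", "brown",
             "none", "left", "bag", "medium", "no", "yes", "male"]

# One of the 14 answers changed between viewings, so agreement is 13 out of 14.
print(round(checklist_agreement(viewing_2, viewing_3), 1))
```

The same function could be applied to two different students' checklists from the second viewing, in which case the result is an inter-rater rather than a test-retest percentage.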

Part 5: General discussion points

The final part of the exercise provides the instructor with an opportunity to help students see links between reliability and other related (and important) concepts in research methods, in particular operationalization and validity. For example, educators could pose the following set of questions for students:

[a] What construct do you think was being operationalized in this activity?

[b] What does this activity teach us about the related concepts of operationalization and reliability?

[c] How did the quality of instructions affect inter-rater reliability?

[d] In this exercise, we assessed test-retest reliability in the space of 5-10 minutes. In the real-world context of eyewitness recall, what might the spacing between testing sessions be? What would be most appropriate? What implications do you think the spacing might have for reliability coefficients and, therefore, criminal convictions?

[e] What is the difference between reliability and validity? How can you illustrate it based on the tasks we have just completed?

[f] Regardless of how specific or vague the instructions, were they valid?

[g] What are the real-world implications in terms of police procedure for encouraging accurate eyewitness memory recall?

Empirical Evidence for the Efficacy of the Exercise

Data were collected from 191 students enrolled in an Introduction to Research Methods in Psychology undergraduate course. The students participated in the exercise as part of their tutorial program. Prior to attending the tutorial students were expected to have engaged with a pre-recorded online presentation that introduced them to fundamental concepts relating to reliability. A within-participants design was employed. At the start of the tutorial students completed a four-item multiple choice quiz. Each item contained four responses but only one was correct. Students were instructed to choose the response that best answered the question. The items were concerned with (1) a broad definition of reliability; (2) a definition of inter-rater reliability; (3) a definition of test-retest reliability; and (4) the relation between reliability and operationalization. Participants also responded to a fifth item, “How confident are you that you understand the concept of reliability?” on a scale increasing in 10% increments from 0% to 100%.

Students were not given feedback on their answers. At the end of the tutorial, they responded to the same items again, after which feedback was given. All multiple-choice responses were coded so that a correct answer was coded as 1 and an incorrect answer (i.e., any of the other three responses) was coded as 0. Paired-samples t-tests were conducted for each of the four pre- and post-exercise multiple-choice items and the single confidence item. Descriptive and inferential statistics are summarized in Table 1.

Table 1.

Descriptive and Inferential Statistics for Testing Differences Between Pre- and Post-Exercise Knowledge and Confidence.

Item	Preexercise M (SD)	Postexercise M (SD)	t	p	d
Reliability refers to…	0.74 (0.44)	0.84 (0.37)	2.89	.004	0.209
Inter-rater reliability refers to…	0.54 (0.50)	0.65 (0.48)	2.80	.006	0.202
Test-retest reliability refers to…	0.68 (0.47)	0.74 (0.44)	1.55	.122	0.112
A well-operationalized construct has implications for reliability because…	0.24 (0.43)	0.38 (0.49)	3.69	<.001	0.267
How confident are you that you understand the concept of reliability	55.26 (20.45)	74.71 (17.87)	13.78	<.001	1.00

Note. N = 191.

As shown in Table 1, there were statistically significant improvements in quiz performance on three out of the four questions, specifically, those relating to reliability in general, inter-rater reliability, and the implications of good operationalization for reliability. There was no significant difference on the item concerned with test-retest reliability, although means were in the anticipated direction. Effect sizes were weak, according to Cohen’s (1992) rules of thumb (i.e., a d value around .20 is interpreted as weak; a d value around .50 is interpreted as medium; a d value around .80 is interpreted as large). There was also a significant difference in student confidence ratings. That is, students were significantly more confident about their understanding of reliability after the exercise (mean ratings were around 75%) compared with before the exercise (mean ratings were around 55%). The effect size for confidence was strong. In short, there is evidence that engaging in the exercise helped students to better understand concepts of reliability, and it helped them to feel more confident.
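The article does not state which formula was used to compute d. As an illustrative check only, the sketch below assumes the common paired-samples conversion d = t / √N and applies it to the t values reported in Table 1. The function name and the shortened item labels are hypothetical.

```python
import math

N = 191  # sample size from the note to Table 1

# (shortened item label, reported t, reported d), copied from Table 1
table_1 = [
    ("Reliability refers to", 2.89, 0.209),
    ("Inter-rater reliability refers to", 2.80, 0.202),
    ("Test-retest reliability refers to", 1.55, 0.112),
    ("Well-operationalized construct", 3.69, 0.267),
    ("Confidence", 13.78, 1.00),
]


def d_from_paired_t(t: float, n: int) -> float:
    """Standardized mean difference for paired data, assuming d = t / sqrt(n)."""
    return t / math.sqrt(n)


for label, t, d_reported in table_1:
    d = d_from_paired_t(t, N)
    print(f"{label}: computed d = {d:.3f}, reported d = {d_reported}")
```

Under this assumption the computed values agree with the reported ones to within the rounding of the published t statistics, which suggests (but does not confirm) that the reported d values are of this paired-samples type.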

Next, a crosstab analysis was used to establish the percentage of students who improved after being initially incorrect. These analyses are summarized in Table 2. Table 2 indicates the percentage of students who were incorrect at both pre- and post-tests (e.g., for the first question, “Reliability refers to…,” 9% of students were wrong on both occasions; these students did not improve); incorrect at the pre-test but correct on the post-test (e.g., for the first question, this was 17% of students; these are the students who improved); correct on the pre-test but incorrect on the post-test (for the first question, this was 7% of students; these students performed worse after engaging in the exercise); and correct on both the pre- and post-tests (for the first question, this was 67% of students).

Considering the first three items together (i.e., concerned with reliability in general, inter-rater reliability, and test-retest reliability), 43%–67% of students (depending on the item) were correct across both pre- and post-tests, and 9%–23% were incorrect across both pre- and post-tests. While 7%–13% of students did worse on the post-test, it is notable that 17%–22% of students improved as a result of engaging with the exercise.

Table 2.

Percentage of Correct and Incorrect Responses Across Pre- and Postexercise Responses.

			Post
			Incorrect	Correct
Reliability refers to…	Pre	Incorrect	17 (9%)	32 (17%)
Reliability refers to…	Pre	Correct	13 (7%)	129 (67%)
Inter-rater reliability refers to…	Pre	Incorrect	45 (23%)	43 (22%)
Inter-rater reliability refers to…	Pre	Correct	21 (11%)	82 (43%)
Test-retest reliability refers to…	Pre	Incorrect	25 (13%)	36 (19%)
Test-retest reliability refers to…	Pre	Correct	24 (13%)	106 (55%)
A well-operationalized construct has implications for reliability because…	Pre	Incorrect	103 (54%)	42 (22%)
A well-operationalized construct has implications for reliability because…	Pre	Correct	15 (8%)	31 (16%)

Note. Percentages are rounded up therefore totals will not always equal 100.
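Because each quiz item was coded 0 or 1, the cell counts in Table 2 are enough to recover the item means in Table 1 and to recompute a paired-samples t statistic. The following sketch does this as a consistency check. The function name and shortened labels are hypothetical, and the t computed here is the standard paired t on difference scores of +1 (improved), −1 (worsened), or 0 (unchanged), which is assumed to be what was reported.

```python
import math

# Cell counts copied from Table 2, in the order:
# (incorrect -> incorrect, incorrect -> correct,
#  correct -> incorrect,   correct -> correct)
table_2 = {
    "Reliability refers to": (17, 32, 13, 129),
    "Inter-rater reliability refers to": (45, 43, 21, 82),
    "Test-retest reliability refers to": (25, 36, 24, 106),
    "Well-operationalized construct": (103, 42, 15, 31),
}


def summarize(cells):
    """Return pre/post proportions correct and a paired t for one 2x2 table."""
    both_wrong, improved, worsened, both_right = cells
    n = both_wrong + improved + worsened + both_right
    pre_mean = (worsened + both_right) / n   # proportion correct at pre-test
    post_mean = (improved + both_right) / n  # proportion correct at post-test
    # Difference score per student: +1 if improved, -1 if worsened, else 0.
    mean_diff = (improved - worsened) / n
    sum_sq_diff = improved + worsened        # each nonzero difference squares to 1
    var_diff = (sum_sq_diff - n * mean_diff ** 2) / (n - 1)
    t = mean_diff / math.sqrt(var_diff / n)
    return {"n": n, "pre_mean": pre_mean, "post_mean": post_mean, "t": t}


for item, cells in table_2.items():
    s = summarize(cells)
    print(
        f"{item}: pre = {s['pre_mean']:.2f}, post = {s['post_mean']:.2f}, "
        f"t({s['n'] - 1}) = {s['t']:.2f}"
    )
```

Each row of counts sums to 191, and the recomputed means and t values should come out close to those reported in Table 1, so the two tables appear to be mutually consistent.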

For the fourth item (referring to the relation between operationalization and reliability), 54% of students were incorrect at both pre- and post-test and 8% performed worse at post-test, compared with 22% who improved. While the percentage of students who improved on this item was consistent with the other three items, it is notable that only 16% of students were correct at both pre- and post-test (compared with 43%–67% on the other items). Clearly, this was a challenging item for students, possibly because the alternative response choices referred to “accuracy” and “validity” of measurement and “all of the above.” It is possible that the results for this item reflect the difficulty students often have in disentangling reliability from validity.

Conclusion and Discussion

Taken together, these data indicate that The Bank Robbery exercise helps to improve students’ understanding of reliability and increases their confidence about their understanding of the concept. Also, the exercise is easy for instructors to implement. While instructors can obtain a copy of the video by contacting the author, they could use any recording of behavior with permission.

The exercise could also be used to impress upon students the fundamental importance of good operationalization in the measurement process. While no data were collected to demonstrate the efficacy of the exercise for teaching operationalization, the students’ experiences during this exercise help them to see from their very own data how inter-rater reliability improves once instructions are made clear and specific (in other words, better operationalized).

There may also be opportunity to demonstrate the vagaries of test-retest reliability by running the activity across two sessions, for example, by asking students in a subsequent tutorial to complete the checklist again. Doing so would more closely reflect the process of confirming eyewitness recall in real world policing contexts and provide educators and students with an opportunity to discuss additional effects on test-retest reliability (e.g., time, historical events).

Finally, the exercise deliberately separates reliability from validity. As responses to one of the multiple-choice items suggest, and as other educators have no doubt found through their own experiences, students often struggle to distinguish between reliability and validity. Of course, the two are inextricably entwined. However, students are arguably better able to understand their differences and their relationship when the concepts are introduced as separate entities in separate teaching exercises. Nonetheless, there is scope within the present exercise for instructors to insert discussion of validity. For example, asking students to decide whether the initial instructions to simply describe the offender are valid (they are, but they are not reliable), whether the second, more specific set of instructions is valid (it is, and it is also reliable), and whether describing the color of the walls in the tutorial room is a valid way to measure eyewitness recall (it is not, but it would still produce reliable results) would help them to start seeing how the two concepts are related.

Footnotes

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Peter Strelan

References

Albright, T. D. (2017). Why eyewitnesses fail. Proceedings of the National Academy of Sciences of the United States of America, 114(30), 7758–7764. https://doi.org/10.1073/pnas.1706891114

Brundiers, Wiek, & Redman, C. L. (2010). Real-world learning opportunities in sustainability: From classroom into the real world. International Journal of Sustainability in Higher Education, 11(4), 308–324. https://doi.org/10.1108/14676371011077540

Buck, J. L. (1991). A demonstration of measurement error and reliability. Teaching of Psychology, 18(1), 46–47. https://doi.org/10.1207/s15328023top1801_16

Camac, C. R., & Camac, M. K. (1993). A laboratory project in scale design: Teaching reliability and validity. Teaching of Psychology, 20(2), 102–104. https://doi.org/10.1207/s15328023top2002_8

Cohen (1992). A power primer. Psychological Bulletin, 112(1), 155–159. https://doi.org/10.1037/0033-2909.112.1.155

Freeman, Eddy, S. L., McDonough, Smith, M. K., Okoroafor, Jordt, & Wenderoth, M. P. (2014). Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences of the United States of America, 111(23), 8410–8415. https://doi.org/10.1073/pnas.1319030111

Herzog, H. A. (1988). Naturalistic observation of behaviour: A model system using mice in a colony. Teaching of Psychology, 15(4), 200–202. https://doi.org/10.1207/s15328023top1504_6

Miserandino (2006). I scream, you scream: Teaching validity and reliability via the ice cream personality test. Teaching of Psychology, 33(4), 265–268.

Moore (1981). An empirical investigation and a classroom demonstration of reliability concepts. Teaching of Psychology, 8(3), 163–164. https://doi.org/10.1207/s15328023top0803_10

Prince (2004). Does active learning work? A review of the research. Journal of Engineering Education, 93(3), 223–231. https://doi.org/10.1002/j.2168-9830.2004.tb00809.x

Strelan (2018). Using the movies to illustrate the principles of experimental design. Teaching of Psychology, 45(2), 179–182. https://doi.org/10.1177/0098628318762908

Strube, M. J. (1991). Demonstrating the influence of sample size and reliability on study outcome. Teaching of Psychology, 18(2), 113–115. https://doi.org/10.1207/s15328023top1802_15

Tuckey, M. R., & Brewer (2003). How schemas affect eyewitness memory over repeated retrieval attempts. Applied Cognitive Psychology, 17(7), 785–800. https://doi.org/10.1002/acp.906

Waller, M. J., Sohrab, B. W. (2013). Beyond 12 angry men: Thin-slicing film to illustrate group dynamics. Small Group Research, 44(4), 446–465. https://doi.org/10.1177/1046496413487409

Wilson, J. H., & Joye, S. W. (2017). Demonstrating interobserver reliability in naturalistic settings. In J. E. Stowell & W. E. Addison (Eds.), Activities for teaching statistics and research methods: A guide for psychology instructors (pp. 110–113). American Psychological Association.