Abstract
Recognition of unfamiliar faces is difficult in part due to variations in expressions, angles, and image quality. Studies suggest shape and surface properties play varied roles in face learning, and identification of unfamiliar faces uses diagnostic pigmentation/surface reflectance relative to shape information. Here, participants sorted photo-cards of unfamiliar faces by identity, which were shown in their original, stretched, and contrast-negated forms, to examine the utility of diagnostic shape and surface properties in sorting unfamiliar faces by identity. In four experiments, we varied the presentation order of conditions (contrast-negated first or original first with stretched second across experiments) and whether the same or different photo-cards were seen across conditions. Stretching the images did not impair performance in any measures relative to other conditions. Contrast negation generally exacerbated poor sorting by identity compared with the other conditions. However, seeing the contrast-negated photo-cards last mitigated some of the effects of contrast negation. Together, results suggest an important role for surface properties such as pigmentation and reflectance for sorting by identity and add to literatures on informational content and appearance variability in discrimination of facial identity.
It is well documented that recognition of familiar faces is qualitatively different compared with recognition of unfamiliar faces (e.g., Burton, 2013; Burton & Jenkins, 2011; Burton et al., 2011; Jenkins & Burton, 2011; Johnston & Edmonds, 2009; Kramer, Young, et al., 2018). Identification of familiar faces is largely unharmed by common variations across viewpoint, lighting, and facial expression (for a review, see Johnston & Edmonds, 2009) and by less ecologically valid distortions such as linear stretching (Hole et al., 2002; Sandford & Rego, 2019; Sandford et al., 2018). In contrast, identification of unfamiliar faces is significantly more error prone (Burton & Jenkins, 2011; Johnston & Edmonds, 2009). For example, studies have found identification errors with subtle changes in facial appearance between different images from different capture devices used on the same day or with short durations between image capture (e.g., Bruce et al., 2001; Burton et al., 2010; Henderson et al., 2001; Megreya & Burton, 2006, 2007, 2008). The importance of understanding how to reduce these errors is highlighted by documented errors in realistic contexts (e.g., by cashiers, Kemp et al., 1997; by passport officers, White, Kemp, et al., 2014; by notaries, Papesh, 2018). Potentially, we can reduce errors found in tasks of face identification by isolating informational content in images to explore which content supports identification. Here, we combined two disparate literatures on within-person variability (Jenkins et al., 2011) and informational content provided by shape and surface (pigmentation/reflectance/texture) properties to explicitly explore the extent to which sorting unfamiliar faces by identity is supported by diagnostic shape and surface properties. To this end, we presented photo-cards of two unfamiliar identities used in previous research in their original, stretched, and contrast-negated forms.
Separating shape and surface properties has been employed to investigate which set of properties are relatively diagnostic of facial identity and has been done in a variety of ways. For example, some studies have found relative dominance of texture over shape properties in the recognition of familiar faces by using principal components analysis (e.g., Calder et al., 2001; Hancock et al., 1996). Using an alternative method, Itz et al. (2017) morphed two familiar or unfamiliar faces according to separable shape and texture information placed on continua where the ratio of shape-to-texture information between the two faces was manipulated (e.g., 20% information about shape or texture for Person A and 80% for Person B). Participants completed a delayed matching task with these morphed images and subsequent tests of familiar face recognition, unfamiliar face matching, and unfamiliar face memory. Across experiments, Itz et al. (2017) reported diagnosticity of texture was stronger than shape for all faces (but particularly for familiar faces), and observers with strong face recognition abilities (i.e., of familiar faces) showed reduced reliance on shape and increased reliance on texture. In contrast, below-average performance on unfamiliar face memory was strongly associated with greater tolerance to changes in shape properties. The relative diagnosticity of surface (pigmentation) properties has also been documented with personally familiar faces where observers identified faces to a greater extent with available idiosyncratic pigmentation information compared with shape information (Russell & Sinha, 2007). Taken together, this evidence suggests distortion of surface properties could make an identification task more difficult.
Previous work has manipulated informational content in images of faces that appear to align with shape and surface properties. For example, changes to shape, as defined in terms of configuration, by linear stretching in the height dimension do not impair recognition of familiar faces (Hole et al., 2002; Sandford & Rego, 2019; Sandford et al., 2018). Together with studies that have found a greater reliance on texture over shape (e.g., Kaufmann & Schweinberger, 2008; Itz et al., 2016, 2017), the role of shape appears reduced in recognition of familiar faces. However, the role of shape in identification of unfamiliar faces (e.g., Taschereau-Dumouchel et al., 2010) is dependent on definition of shape. Specifically, Taschereau-Dumouchel et al. (2010) found participants did not rely on interfeature metric distances (second-order configuration; Maurer et al., 2002) but suggested identification likely relies on attribute shape (e.g., of facial features) and attribute skin properties. Given that linear stretching alters not only the absolute metric distances between features in the height dimension and ratio of interfeature distances that cross one dimension but also shape of facial features and overall form of the face, we expected linear stretching to impair performance in our study (if impairment generalizes across task demands).
The utility of surface properties has been examined by comparing performance between face images presented in contrast-positive and contrast-negative. For example, Russell et al. (2006) warped together a set of 16 (half female) unfamiliar faces (within the same sex) to create 2 sets of images that retained diagnostic pigmentation or shape while averaging the other dimension across identities and used the original images of each identity for a shape + pigmentation condition. Participants completed a delayed match-to-sample task where a similar-looking same-sex distractor face was paired with the sample face. Contrast negation resulted in poorer identification accuracy in conditions where pigmentation information was diagnostic of facial identity (i.e., pigmentation and shape + pigmentation). Importantly, contrast negation did not impair identification of faces where shape was diagnostic of facial identity, suggesting shape (broadly defined, see earlier) is not impaired by a reversal of surface information (e.g., brightness, color, contrast of dark-to-light regions on the face). Contrast negation also impairs familiarity categorization of faces (relative to unfamiliar faces and relative to linear stretching; Sandford & Rego, 2019; Sandford et al., 2018), a task that requires observers to identify the faces by identity, as well as learning of unfamiliar faces after experimental familiarization via watching episodes of a TV show (Kramer et al., 2017). Therefore, we used contrast negation in our study to examine utility of surface properties when sorting faces by identity.
Within-person variability, a concept that captures appearance variations evident across different images of the same person, has recently been shown to be useful for the learning of new identities (Ritchie & Burton, 2017) and benefit identification of unfamiliar faces from multiple different images (e.g., Dowsett et al., 2016; Matthews & Mondloch, 2018). Images in card-sorting tasks typically present observers with diverse variations along unspecified dimensions within normal expectations of high-quality images obtained from the internet. That is, images are not typically selected on a particular dimension (e.g., lighting direction, facial expression). This is easily demonstrated if one searches for a reasonably well-known individual (e.g., celebrity) on the internet where multiple different images likely will be returned. Typically, researchers download 20 images of a number of identities (two or four) and ask participants to sort the images into piles or groups where each pile/group represents one identity (e.g., Balas et al., 2019; Balas & Pearson, 2017; Jenkins et al., 2011; Kramer, Manesi, et al., 2018; Zhou & Mondloch, 2016). This task has successfully replicated the superiority of discriminating known identities over unknown identities (Jenkins et al., 2011; Kramer, Manesi, et al., 2018; Zhou & Mondloch, 2016) and has shown certain informational content that is either useful (i.e., internal features) or relatively unhelpful (i.e., external features) for the identification of familiar and unfamiliar faces (Kramer, Manesi, et al., 2018).
While imposition of image blur (unqualified by degree of blur imposed) and picture-plane inversion has been shown to generalize from other face identification tasks to the card-sorting task (Balas et al., 2019), it is currently unknown whether previous evidence suggesting relative roles for shape and surface properties also generalizes to card-sorting tasks. We opted for card-sorting tasks in this study because the recently introduced fine-grained analyses employed in card-sorting tasks (more in the following sections) allow us to examine the relative utility of these two broad types of properties for discriminating facial identity where observers sort photo-cards that collectively represent instances of within- and between-person variability. Whereas a pair of photo-cards could show different images of the same person (instance of within-person variability), a pair of photo-cards could instead show different people (instance of between-person variability). One benefit of using a card-sorting task is the employment of a common analytical framework to examine errors made during sorting of faces by identity (Balas & Pearson, 2017). Here, errors made while sorting faces deformed by linear stretching and contrast negation would provide insight into certain diagnostic characteristics of shape and surface properties for discriminating facial identity. However, whether or not results of card-sorting tasks inform us about face learning is unclear. While some evidence suggests face learning and verification is enhanced by exposure to different images of the same person (e.g., Menon et al., 2015; Ritchie & Burton, 2017; White, Burton, et al., 2014), the evidence is often limited by only considering the influence of exposure to within-person variability to task performance. 
For example, participants are not exposed to different learning sets or arrays of variability that contain all different photographs of the same person’s face, all photographs of different people, or some combination of different photographs of the same person’s face and photographs of different people. The last condition in this example, to some extent, resembles the card-sorting task whereby participants sort through different photographs of the same two (or four) identities while simultaneously sorting through photographs of different people within the same set of photographs. Therefore, though we think the card-sorting task serves as a useful means of examining relative utility of diagnostic shape and surface properties in discriminating facial identity, we do not know whether results will generalize to other dependent measures of face learning or identity verification.
If the card-sorting task does use the same processes used in face identification (e.g., Russell et al., 2006) and learning tasks (e.g., Itz et al., 2016; Kaufmann & Schweinberger, 2008), then we expect to see performance impairments in the card-sorting task when photo-cards are presented in their stretched and contrast-negated forms relative to their original form. Given some evidence suggests identification of unfamiliar faces relies on surface properties to a greater extent than shape properties (evidenced between contrast-negative and contrast-positive images; Russell et al., 2006), we might expect contrast-negated images to be the most difficult to sort and so we began our study by presenting this condition before the stretched and original conditions.
Experiment 1A
In this experiment, participants were asked to complete a card-sorting task where images were sorted into groups, one for each perceived identity. Three different types of photo-cards were used: original (unchanged images as retrieved from the internet except for cropping around the head of each image), stretched (double the height of original images), and contrast-negated (inverting the color and brightness values of each original image). We investigated participant behavior in this task to explore the utility of informational content in sets of naturally occurring images of unfamiliar identities. Based on previous research, we expected to observe differences in sorting behavior across the types of cards due to the manipulation of shape properties and surface properties. Participants were presented with the same photo-cards across conditions in the following order: contrast-negated, stretched, and original. We chose this order on the assumption that pigmentation/reflectance properties have an important role in other face tasks (e.g., discrimination, identification; Kemp et al., 1990, 1996; Russell et al., 2006) relative to shape (Russell et al., 2006).
Method
Participants
Thirty-six participants (M = 24.5 years, SD = 9.4; 29 female) were recruited for Experiment 1A. This sample comprised students or staff of the University of Guelph-Humber and Humber Institute of Technology and Advanced Learning. Each participant was compensated with a gift card.
Stimuli
We obtained images for our study by searching for images of two Dutch celebrities (Chantal Janzen and Bridget Maasland). These celebrities are known to be familiar in the Netherlands but not in other locations (Jenkins et al., 2011). We checked familiarity of celebrities with each participant in this study, reported later. In total, we downloaded 60 images per celebrity that presented faces in frontal or roughly frontal aspect, with varying gazes and facial expressions. All images were free from occlusions and were mostly large-sized, high-quality images as determined by the first author with the aid of Google Image tools. After downloading the 120 images, we cropped each image around the head while retaining extraneous background. The minimum size of the cropped images around the head prior to testing was 300 × 420 pixels. Each of these images was labelled original and saved at 72 pixels per inch. All original images were printed and laminated at approximately 4.15 × 6 centimetres. Duplicates of these images were transformed to create two additional stimulus conditions: stretched and contrast-negated. The original images were stretched to twice their original height (i.e., 300 × 840 pixels or approximately 4.15 × 12 centimetres) to create the stretched images. The original images were contrast-negated by inverting the color and brightness values to create contrast-negated images. As with the original images, the stretched and contrast-negated images were printed and laminated. All conditions of images were created using Adobe Photoshop CC 2017 and resulted in 360 images in total (60 images per celebrity × 2 celebrities × 3 conditions of stimuli). We cannot show the stimuli used in this study due to copyright. See Figure 1 for representative examples of stimuli.

Figure 1. Representative examples of the types of images, referred to throughout this study as naturally occurring photographs, that were shown to participants in all reported experiments. Note the stimuli in the figure are just examples (second author of this study with their consent) and are not Chantal Janzen or Bridget Maasland. Also note that because within-person variability is unique to each person’s face (Burton et al., 2016), the figure stimuli do not represent the variability unique to the faces we used in our study.
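The two image manipulations described under Stimuli can be illustrated with a minimal sketch. This is not the Photoshop procedure used in the study: it assumes an image is a list of pixel rows holding 8-bit (R, G, B) tuples, that stretching can be approximated by repeating each row (nearest-neighbour resampling; the resampling method actually used is not reported), and that contrast negation replaces each channel value v with 255 − v. The function names are hypothetical.

```python
def stretch_vertically(image, factor=2):
    """Return a copy of `image` whose height is multiplied by `factor`.

    Assumption: nearest-neighbour resampling, i.e. each pixel row is
    simply repeated `factor` times. Width is left unchanged.
    """
    return [list(row) for row in image for _ in range(factor)]


def contrast_negate(image, max_value=255):
    """Return a copy of `image` with every channel value inverted.

    Assumption: 8-bit channels, so each value v becomes max_value - v.
    """
    return [[tuple(max_value - c for c in pixel) for pixel in row]
            for row in image]


# A tiny 2 x 2 "image" standing in for a 300 x 420 pixel photograph.
tiny = [
    [(0, 0, 0), (255, 128, 10)],
    [(10, 20, 30), (200, 200, 200)],
]

stretched = stretch_vertically(tiny)   # 2 pixels wide, 4 pixels high
negated = contrast_negate(tiny)        # same size, inverted values
```

Under these assumptions, stretching changes only the vertical geometry of an image and leaves pixel values intact, whereas negation changes only pixel values and leaves geometry intact. Applying the negation twice returns the original image.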
Procedure
In all experiments reported here, we divided the 360 images into three sets. In Experiments 1A and 1B, each set contained the same images across the three conditions. In Experiment 1A, each participant was given one of the sets (counterbalanced across participants) of 40 photo-cards to sort. Each participant sorted the contrast-negated, stretched, and then original images. Participants were instructed to sort the photo-cards into groups so that each group contained photographs of the same identity; the number of groups could be anywhere between 1 (i.e., all 40 photographs show the same person) and 40 (i.e., each photograph shows a unique person). When participants had finished grouping the photo-cards of one stimulus condition, the number of groups and composition of groups (i.e., how many photo-cards of celebrity 1 vs. 2) were recorded, and the next condition of photo-cards was presented to the participant. Photo-cards were shuffled before handing them to each participant. None of the participants were informed that the photo-cards showed the same images despite deformations across conditions, and no feedback on performance was provided between conditions. After all photo-cards were sorted, each participant was asked if they were familiar with the identities to check for prior familiarity with the faces. None of the participants reported being familiar with the identities.
Results
Card-sorting data have typically been analyzed on the basis of the number of groups created by participants, each representative of one identity from each participant’s perspective (e.g., Jenkins et al., 2011; Zhou & Mondloch, 2016). In our study, we refer to these data as numerosity. Recently, more fine-grained analyses have been formalized in the context of signal detection theory (SDT) measures of sensitivity (d’) and response bias (criterion or C; Balas & Pearson, 2017). These analyses begin by calculating the proportions of “different person/same group” errors (i.e., grouping images of different people in the same group of photo-cards) and “same person/different group” errors (i.e., separating images of the same person into different groups of photo-cards). These two types of errors correspond to miscategorizing different identities as the same person and the same identity as different people, respectively. From these errors, it is possible to calculate d’ and criterion using the conventional formulas (for d’, z(Hits) − z(False alarms)), where Hits = 1 − “different person/same group” error rate and False alarms = “same person/different group” error rate (Balas & Pearson, 2017). With these calculations, the “different person/same group” error rate refers to proportion of errors in processing between-person variability (instead processing these pairs as instances of within-person variability), and the “same person/different group” error rate refers to proportion of errors in processing within-person variability (instead processing these pairs as instances of between-person variability).
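The measures described above can be sketched in code for a single participant's sort. The sketch rests on several assumptions that are not stated in the text: error rates are computed over all pairs of photo-cards, the criterion uses the standard SDT form C = −0.5 × [z(Hits) + z(False alarms)], and rates of exactly 0 or 1 are clipped by half a pair so that z stays finite. The function name, input format, and clipping rule are hypothetical, and the sort must contain at least one same-person pair and one different-person pair.

```python
from itertools import combinations
from statistics import NormalDist


def sorting_measures(groups):
    """Compute numerosity, error rates, d' and C for one card sort.

    `groups` is a list of piles; each pile is a list of the TRUE identity
    labels of the cards the participant placed in it.
    """
    cards = [(pile_idx, identity)
             for pile_idx, pile in enumerate(groups)
             for identity in pile]

    same_pairs = diff_pairs = sp_dg = dp_sg = 0
    for (pile1, id1), (pile2, id2) in combinations(cards, 2):
        if id1 == id2:                 # instance of within-person variability
            same_pairs += 1
            sp_dg += pile1 != pile2    # same person / different group error
        else:                          # instance of between-person variability
            diff_pairs += 1
            dp_sg += pile1 == pile2    # different person / same group error

    sp_dg_rate = sp_dg / same_pairs
    dp_sg_rate = dp_sg / diff_pairs

    def clip(rate, n):
        # Assumption: keep rates away from 0 and 1 so z() is finite.
        return min(max(rate, 0.5 / n), 1 - 0.5 / n)

    z = NormalDist().inv_cdf
    z_hits = z(clip(1 - dp_sg_rate, diff_pairs))   # Hits = 1 - DP/SG rate
    z_fa = z(clip(sp_dg_rate, same_pairs))         # False alarms = SP/DG rate

    return {
        "numerosity": len(groups),
        "sp_dg_rate": sp_dg_rate,
        "dp_sg_rate": dp_sg_rate,
        "d_prime": z_hits - z_fa,
        "criterion": -0.5 * (z_hits + z_fa),       # standard SDT form (assumed)
    }


# Hypothetical sorts of 40 cards showing two identities, A and B.
perfect = sorting_measures([["A"] * 20, ["B"] * 20])
fragmented = sorting_measures([["A"] * 10, ["A"] * 10, ["B"] * 20])
```

In the fragmented example, one identity is split across two piles, so the “same person/different group” rate rises while the “different person/same group” rate stays at zero. Under the assumed formulas this lowers d’ and yields a negative criterion, the direction reported in this study as a bias to label pairs as instances of between-person variability.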
Data for numerosity and SDT measures were analyzed separately using one-way repeated measures analyses of variance (ANOVAs). All pairwise comparisons were Bonferroni-adjusted (α = .05/3 = .0167). Data for Experiment 1A are summarized in Figure 2.

Figure 2. Results of Experiment 1A. Top row shows numerosity (i.e., number of groups made by participants), middle row shows sensitivity (d’), and bottom row shows response bias (C) data for original (left column), stretched (middle column), and contrast-negated conditions (right column). The bolded black line connected to the y axis represents the average, and each circle represents one participant’s data on each measure in each condition. Note the circles were randomly placed along the x axis such that the same placement of a circle on each graph does not necessarily show the same participant.
We applied a Greenhouse–Geisser correction for violations of the sphericity assumption (Mauchly’s test for numerosity: χ2(2) = 35.998, p < .001; Mauchly’s test for response bias: χ2(2) = 9.255, p = .010; nonsignificant Mauchly’s test for sensitivity: χ2(2) = 3.575, p = .167). We found significant main effects of condition for numerosity, F(1.120, 42.344) = 18.31, p < .001, ηp2 = 0.34, and sensitivity, F(2, 70) = 27.67, p < .001, ηp2 = 0.44. Participants created more groups when photographs were in their contrast-negated form (M = 13.25, 95% confidence interval [CI] [10.90, 15.60]) than in their original (M = 7.44, 95% CI [5.54, 9.35]), t(35) = 4.478, p < .001, d = 0.746, and stretched forms (M = 7.53, 95% CI [5.41, 9.64]), t(35) = 4.367, p < .001, d = 0.728. The difference between original and stretched conditions was nonsignificant, t(35) = 0.173, p = .864, d = 0.029.
Participants were less sensitive to between-person variability for contrast-negated images (M = 0.927, 95% CI [0.720, 1.134]) than original (M = 2.872, 95% CI [2.373, 3.371]), t(35) = 6.878, p < .001, d = 1.146, and stretched images (M = 2.522, 95% CI [2.032, 3.012]), t(35) = 6.778, p < .001, d = 1.130. The difference in sensitivity between original and stretched conditions was nonsignificant, t(35) = 1.119, p = .135, d = 0.187.
Participants were biased to label pairs as instances of between-person variability (one-sample t tests for original: t(35) = 5.209, p < .001, d = 0.868; stretched: t(35) = 4.375, p < .001, d = 0.729; contrast-negated: t(35) = 8.953, p < .001, d = 1.492). However, the differences in response bias between conditions were nonsignificant, F(1.615, 56.529) = 2.87, p = .076, ηp2 = 0.08, which suggests participants’ greater fragmentation of contrast-negated photo-cards was not attributable to different strength of response bias between conditions.
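As a reading aid, the effect sizes reported for the t tests in this Results section appear to be consistent with the common conversion d = t/√n for paired and one-sample tests, with n = 36. The text does not state how d was computed, so this is an inference from the reported values and not a description of the authors' procedure. The function name below is hypothetical.

```python
from math import sqrt


def cohens_d_from_t(t, n):
    """Convert a paired or one-sample t statistic to Cohen's d (d = t / sqrt(n)).

    Assumption: this conversion is inferred from the reported values;
    the text does not state the formula that was used.
    """
    return abs(t) / sqrt(n)


# (t, reported d) pairs copied from the Experiment 1A results, n = 36.
reported = [
    (4.478, 0.746), (4.367, 0.728), (0.173, 0.029),   # numerosity
    (6.878, 1.146), (6.778, 1.130), (1.119, 0.187),   # sensitivity
    (5.209, 0.868), (4.375, 0.729), (8.953, 1.492),   # response bias
]

recomputed = [(t, d, cohens_d_from_t(t, 36)) for t, d in reported]
for t, d, d_hat in recomputed:
    print(f"t = {t:6.3f}  reported d = {d:.3f}  t/sqrt(n) = {d_hat:.4f}")
```

Each recomputed value agrees with the reported d to within rounding at the third decimal place.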
Discussion
Replicating previous findings, we found sorting photo-cards of unfamiliar faces by identity is a difficult task (Balas et al., 2019; Balas & Pearson, 2017; Jenkins et al., 2011; Kramer, Manesi, et al., 2018; Zhou & Mondloch, 2016). For example, most participants fragmented the cards into more groups than necessary for the two identities in all conditions. Extending previous findings, we found no difference in the number of groups, sensitivity to between-person variability, or response bias between photo-cards of unfamiliar faces presented in original and stretched conditions. This suggests sorting photo-cards by identity does not appear to rely on configurational information impaired by linear stretching (e.g., interfeature distances in the manipulated dimension, ratio of distances that cross the manipulated dimension; see Hole et al., 2002; Sandford & Rego, 2019; Sandford et al., 2018). However, sorting performance on the numerosity and sensitivity measures was better in both of these conditions than with contrast-negated versions of the same photo-cards. The relatively poorer performance with contrast-negated versions of the photo-cards suggests utility for information disrupted by contrast negation (e.g., pigmentation/reflectance; Russell et al., 2006). However, by presenting the contrast-negated photo-cards first, our participants might have been at a disadvantage because they could not use surface properties impaired by contrast negation. This might have inflated the impairment observed in this condition. When images contained the full spectrum of shape and surface properties (unmanipulated by the researchers), the number of groups created by participants was smaller (median: 7.5; Jenkins et al., 2011). We discuss this in more detail in the General Discussion section.
To address whether contrast negation does impair all dependent measures compared with the other two conditions, we repeated Experiment 1A but with order of conditions reversed.
Experiment 1B
In reversing the order of conditions of Experiment 1A, we now presented the photo-cards first in their original form then in their stretched and contrast-negated forms. If utility of surface properties is particularly important for sorting photo-cards by identity, then we would expect poorer performance with contrast-negated versions of the photo-cards compared with the other two conditions as in Experiment 1A. However, if presenting the full spectrum of shape and surface properties before isolating this informational content mitigates the deleterious effects of contrast negation, then we might expect either similar performance across conditions or, as in Experiment 1A, the worst performance in the condition presented first, which here is the original version of the photo-cards.
Method
Participants
A new group of 36 participants (M = 28.6 years, SD = 14.6; 30 female) was recruited for Experiment 1B. This sample comprised students and staff of the University of Guelph-Humber and Humber Institute of Technology and Advanced Learning. Each participant was compensated with a gift card.
Stimuli and Procedure
The same stimuli as in Experiment 1A were used in this experiment, except the order of stimulus conditions was changed: original, stretched, and then contrast-negated. As before, participants were asked whether they were familiar with the celebrities prior to participating in the experiment. None of the participants reported being familiar with the identities.
Results
Data for numerosity and SDT measures were analyzed separately using one-way repeated measures ANOVAs. All pairwise comparisons were Bonferroni-adjusted (α = .05/3 = .0167). Data for Experiment 1B are summarized in Figure 3.

Figure 3. Results of Experiment 1B. Top row shows numerosity (i.e., number of groups made by participants), middle row shows sensitivity (d’), and bottom row shows response bias (C) data for original (left column), stretched (middle column), and contrast-negated conditions (right column). The bolded black line connected to the y axis represents the average, and each circle represents one participant’s data on each measure in each condition. Note the circles were randomly placed along the x axis such that the same placement of a circle on each graph does not necessarily show the same participant.
We applied a Greenhouse–Geisser correction for a violation of the sphericity assumption for response bias (Mauchly’s test: χ2(2) = 16.403, p < .001). We found a significant main effect of condition for our measure of sensitivity, F(2, 70) = 19.28, p < .001, ηp2 = 0.36, due to participants’ lower sensitivity to between-person variability for contrast-negated images (M = 2.058, 95% CI [1.536, 2.580]) compared with original (M = 3.042, 95% CI [2.528, 3.556]), t(35) = 4.604, p < .001, d = 0.767, and stretched images (M = 3.219, 95% CI [2.673, 3.765]), t(35) = 5.978, p < .001, d = 0.996. The difference in sensitivity between original and stretched conditions was nonsignificant, t(35) = 0.902, p = .187, d = 0.150.
Participants were biased to label pairs as instances of between-person variability (one-sample t tests for original: t(35) = 10.675, p < .001, d = 1.779; stretched: t(35) = 5.492, p < .001, d = 0.990; contrast-negated: t(35) = 8.907, p < .001, d = 1.485). However, the differences in response bias between conditions were nonsignificant, F(1.446, 50.625) = 2.28, p = .127, ηp2 = 0.061, which aligns with the numerosity data. The main effect of condition for numerosity was also nonsignificant, F(2, 70) = 2.39, p = .099, ηp2 = 0.06.
We ran a 2 (Experiment: 1A, 1B) × 3 (condition: original, stretched, contrast-negated) mixed ANOVA with experiment varied between-subjects and condition varied within-subjects on each dependent variable reported earlier. A Greenhouse–Geisser correction was applied on the numerosity data (Mauchly’s test: χ2(2) = 50.569, p < .001). The ANOVAs returned a significant main effect of condition in numerosity, F(1.316, 92.137) = 17.77, p < .001, ηp2 = 0.20, and sensitivity data, F(2, 140) = 45.68, p < .001, ηp2 = 0.40, but not in response bias data, F(2, 140) = 1.95, p = .146, ηp2 = 0.03. Interactions between factors were found for each dependent variable—numerosity: F(1.316, 92.137) = 15.07, p < .001, ηp2 = 0.18; sensitivity: F(2, 140) = 3.91, p = .022, ηp2 = 0.05; response bias: F(2, 140) = 3.51, p = .033, ηp2 = 0.05.
Simple main effects analyses compared performance on each dependent variable between experiments. With the contrast-negated versions of the photo-cards, participants of Experiment 1A created more groups on average than participants of Experiment 1B, F(1, 70) = 13.97, p < .001, ηp2 = 0.06 (Experiment 1A: M = 13.25, 95% CI [10.904, 15.596] vs. Experiment 1B: M = 7.972, 95% CI [6.113, 9.831]). They also exhibited poorer sensitivity to between-person variability than participants of Experiment 1B with the contrast-negated versions of the photo-cards, F(1, 70) = 10.78, p = .002, ηp2 = 0.05 (Experiment 1A: M = 0.927, 95% CI [0.720, 1.134] vs. Experiment 1B: M = 2.058, 95% CI [1.536, 2.580]). However, strength of response bias to label pairs of images as instances of between-person variability did not statistically differ between the two experiments, F(1, 70) = 2.59, p = .112, ηp2 = 0.01 (Experiment 1A: M = –1.404, 95% CI [–1.711, –1.096] vs. Experiment 1B: M = –1.025, 95% CI [–1.250, –0.799]). Simple main effects analyses also showed poorer sensitivity for stretched versions of the photo-cards in Experiment 1A compared with Experiment 1B, F(1, 70) = 4.09, p = .047, ηp2 = 0.02 (Experiment 1A: M = 2.522, 95% CI [2.032, 3.012] vs. Experiment 1B: M = 3.219, 95% CI [2.673, 3.765]). However, sensitivity did not differ between experiments for original versions of the photo-cards, F(1, 70) = 0.24, p = .626, ηp2 < 0.01, and did not differ between original and stretched versions of the photo-cards in either experiment. Therefore, this result does not reveal much beyond some participants exhibiting better sensitivity for stretched versions of photo-cards. Simple main effects analyses for original and stretched photo-cards returned nonsignificant results in numerosity and response bias (Fs ≤ 0.86, ps ≥ .357, ηp2 < 0.01).
Discussion
In this experiment, we presented photo-cards in their original form first and contrast-negated form last, which resulted in some similar, and different, patterns to those observed in Experiment 1A. Linear stretching, again, did not enhance or impair sorting photo-cards by identity. Biases to label two photo-cards as an instance of between-person variability, again, did not differ between conditions. Last, participants continued to exhibit poorer sensitivity for facial identity between persons when sorting contrast-negated photo-cards compared with original and stretched versions of the same photo-cards. However, this sensitivity was also relatively stronger in Experiment 1B compared with Experiment 1A, a finding probably due to the presentation of the contrast-negated photo-cards as the last rather than the first condition. In addition, we did not observe greater fragmentation of contrast-negated photo-cards into multiple identity groups in Experiment 1B as we did in Experiment 1A compared with the original and stretched versions of the same photo-cards. Collectively, these results suggest exposure to full spectrum information about shape (as disrupted by linear stretching) and surface properties (as disrupted by contrast negation) is important to cohering multiple images into an identity, and disrupting utility of surface properties impairs sensitivity to facial identity between persons. However, aside from sorting faces by identity perhaps being an unusual task in realistic contexts (e.g., border control), we did not display different images of these identities between conditions in our first two experiments. 
While our intention was to focus on the extent to which shape and surface properties contribute to discriminating facial identity using card-sorting methodology, which required removing the confound of different quality of shape and surface properties inherent to different images (of the same person), our design might also permit observers to use whichever information they could retain from previously seen photo-cards. We addressed this in the next two experiments.
Experiment 2A
Card-sorting tasks are typically used to examine the influence of variability inherent to naturally occurring (or uncontrolled) photographs on dependent variables (e.g., Balas & Pearson, 2017; Jenkins et al., 2011; Kramer, Manesi, et al., 2018; Zhou & Mondloch, 2016) or explore differences on performance measures between different stimulus conditions (e.g., various levels of blur) that vary as between-subjects factors (Balas et al., 2019; see also Kramer, Manesi, et al., 2018 for a report of card-sorting performance with full faces vs. internal features only vs. external features only). In Experiment 2A, we presented all three sets of images viewed across participants of Experiments 1A and 1B with each stimulus set allocated to a stimulus condition. With this design, we intended to explore the influence of variability of facial appearance in addition to the stimulus conditions. Previous research has reliably shown sorting photo-cards of unfamiliar faces results in poor discrimination of identity, which is mostly attributed to difficulties with processing of within-person variability (i.e., large same person/different group errors, e.g., Balas & Pearson, 2017). Therefore, we predicted the inability to rely on viewing the same images throughout the task would result in poorer performance across measures between conditions relative to that observed in Experiments 1A and 1B. However, if this does not further impair performance, then we predicted a similar pattern of results between Experiments 1A and 2A and between Experiments 1B and 2B: poorer performance with contrast-negated photo-cards across some or all of the dependent variables reported earlier.
Method
Participants
A new group of 36 participants (M = 23.5 years, SD = 8.3; 31 females) were recruited for Experiment 2A. This sample comprised students and staff of the University of Guelph-Humber and Humber Institute of Technology and Advanced Learning. Each participant was compensated with a gift card.
Stimuli and Procedure
In Experiments 2A and 2B, we counterbalanced the three sets of 40 images (20 of Chantal Janzen and 20 of Bridget Maasland) across conditions and participants. In these experiments, participants could not rely on specific images they had observed in preceding stimulus conditions because one set of 40 images would be observed in one stimulus condition, followed by a different set of 40 images in another stimulus condition, and then the last set of 40 images in the final stimulus condition. This information was not provided to the participants, and, as before, participants were not provided with any feedback on performance between conditions. In Experiment 2A, participants sorted each set of 40 images in the same order as participants in Experiment 1A: contrast-negated, stretched, and original. None of the participants reported being familiar with the identities.
Results
Data for numerosity and SDT measures were analyzed separately using one-way repeated measures ANOVAs. All pairwise comparisons were Bonferroni-adjusted (α = .05/3 = .0167). Data for Experiment 2A are summarized in Figure 4.

Figure 4. Results of Experiment 2A. Top row shows numerosity (i.e., number of groups made by participants), middle row shows sensitivity (d’), and bottom row shows response bias (C) data for original (left column), stretched (middle column), and contrast conditions (right column). The bolded black line connected to the y axis represents the average, and each circle represents one participant’s data on each measure in each condition. Note the circles were randomly placed along the x axis such that the same placement of a circle on each graph does not necessarily show the same participant.
For each outcome measure, we found significant main effects of condition (numerosity: F(1.479, 51.752) = 35.05, p < .001, ηp2 = 0.50; sensitivity: F(2, 70) = 39.94, p < .001, ηp2 = 0.53; response bias: F(2, 70) = 13.67, p < .001, ηp2 = 0.28). Note a Greenhouse–Geisser correction was used on numerosity data where a sphericity violation was found (Mauchly’s test: χ2(2) = 14.783, p < .001). Participants created more groups when photographs were contrast-negated (M = 12.56, 95% CI [10.25, 14.86]) than in their stretched (M = 6.47, 95% CI [4.93, 8.02]), t(35) = 6.165, p < .001, d = 1.028, and original forms (M = 5.61, 95% CI [4.43, 6.79]), t(35) = 6.510, p < .001, d = 1.085. The difference in numerosity for original and stretched images was nonsignificant, t(35) = 1.474, p = .150, d = 0.246.
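As a rough illustration of how the fractional degrees of freedom for numerosity arise, the sketch below scales the uncorrected degrees of freedom of a one-way repeated measures ANOVA (three conditions, 36 participants) by a Greenhouse–Geisser epsilon. The epsilon value itself is not reported in the text; here it is back-calculated from the reported numerator degrees of freedom and should be read as an assumption. The function name is hypothetical.

```python
def greenhouse_geisser_df(epsilon, n_conditions, n_participants):
    """Scale one-way repeated measures ANOVA degrees of freedom by epsilon."""
    df_effect = n_conditions - 1
    df_error = (n_conditions - 1) * (n_participants - 1)
    return epsilon * df_effect, epsilon * df_error


# Epsilon is not reported in the text; this value is back-calculated from the
# reported numerator df (1.479 / 2) and is therefore only approximate.
epsilon_estimate = 1.479 / 2
df_effect, df_error = greenhouse_geisser_df(
    epsilon_estimate, n_conditions=3, n_participants=36
)
print(f"corrected df: ({df_effect:.3f}, {df_error:.3f})")
```

Under this assumption, the corrected error degrees of freedom land close to the reported 51.752; the small remaining gap is what one would expect from rounding the back-calculated epsilon.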
Participants were less sensitive to between-person variability for contrast-negated images (M = 1.281, 95% CI [0.986, 1.577]) than original (M = 3.615, 95% CI [3.105, 4.126]), t(35) = 7.776, p < .001, d = 1.772, and stretched images (M = 3.06, 95% CI [2.520, 3.600]), t(35) = 6.597, p < .001, d = 1.100. The difference in sensitivity between original and stretched conditions was nonsignificant at the Bonferroni-adjusted α of .0167, t(35) = 2.156, p = .019, d = 0.359.
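The decision rule behind these pairwise comparisons can be sketched as follows: the familywise α of .05 is divided by the three comparisons, and each p value is compared against the adjusted threshold. The p values below are taken from the sensitivity comparisons reported above; because two of them are reported only as p < .001, the value 0.001 is used as a placeholder upper bound, which is an assumption made for illustration.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Return the per-comparison alpha under a Bonferroni adjustment."""
    return family_alpha / n_comparisons


adjusted_alpha = bonferroni_alpha(0.05, 3)

# Sensitivity comparisons in Experiment 2A. The 0.001 entries are placeholder
# upper bounds for results reported only as "p < .001".
pairwise_p = {
    "contrast-negated vs. original": 0.001,
    "contrast-negated vs. stretched": 0.001,
    "original vs. stretched": 0.019,
}

significant = {label: p < adjusted_alpha for label, p in pairwise_p.items()}

for label, is_significant in significant.items():
    print(f"{label}: p = {pairwise_p[label]:.3f}, significant = {is_significant}")
```

This makes explicit why a p value of .019, which would pass an unadjusted threshold of .05, is treated as nonsignificant here.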
Last, participants were biased to label pairs as instances of between-person variability (one-sample t tests for original: t(35) = 9.069, p < .001, d = 1.551; stretched: t(35) = 9.056, p < .001, d = 1.509; contrast-negated: t(35) = 16.536, p < .001, d = 2.756). Moreover, their bias to label pairs as instances of between-person variability was stronger for contrast-negated images (M = –1.506, 95% CI [–1.684, –1.327]) compared with original (M = –0.998, 95% CI [–1.214, –0.782]), t(35) = 4.355, p < .001, d = 0.726, and stretched images (M = –1.004, 95% CI [–1.221, –0.787]), t(35) = 4.800, p < .001, d = 0.800. This suggests the greater fragmentation of contrast-negated photo-cards compared with the other conditions is, in part, due to a stronger response bias. The difference between the original and stretched conditions was nonsignificant, t(35) = 0.53, p = .479, d = 0.009.
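For readers less familiar with the signal detection measures, the following is a minimal sketch of the standard equal-variance formulas for sensitivity (d’) and criterion (C). The mapping of card-sorting outcomes onto hits and false alarms is an assumption made here for illustration: a hit is a different-person pair placed in different groups, and a false alarm is a same-person pair placed in different groups. Under that assumed mapping, a negative C corresponds to a bias toward labelling pairs as between-person variability. The rates used in the example are hypothetical and are not data from these experiments.

```python
from statistics import NormalDist


def sdt_measures(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    Rates must lie strictly between 0 and 1; rates of exactly 0 or 1 need a
    correction before use because the inverse normal CDF is undefined there.
    """
    z = NormalDist().inv_cdf
    z_hit = z(hit_rate)
    z_fa = z(false_alarm_rate)
    d_prime = z_hit - z_fa
    criterion = -0.5 * (z_hit + z_fa)
    return d_prime, criterion


# Hypothetical rates for one participant (not taken from the reported data).
d_prime, criterion = sdt_measures(hit_rate=0.97, false_alarm_rate=0.55)
print(f"d' = {d_prime:.3f}, C = {criterion:.3f}")
```

With a high hit rate and a false alarm rate above .5, the criterion comes out negative, which is the pattern the one-sample t tests above evaluate against zero.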
We ran a 2 (Experiment: 1A, 2A) × 3 (condition: original, stretched, contrast-negated) mixed ANOVA with experiment varied between-subjects and condition varied within-subjects on each dependent variable reported earlier. A Greenhouse–Geisser correction was applied on numerosity (Mauchly’s test: χ2(2) = 49.986, p < .001) and response bias data (Mauchly’s test: χ2(2) = 11.258, p = .004). This returned a significant main effect of condition for each dependent variable—numerosity: F(1.320, 92.385) = 49.78, p < .001, ηp2 = 0.42; sensitivity: F(2, 140) = 66.02, p < .001, ηp2 = 0.49; response bias: F(1.738, 121.682) = 10.43, p < .001, ηp2 = 0.13, confirming our initial findings in Experiments 1A and 2A. We also found participants of Experiment 2A exhibited greater sensitivity to between-person variability irrespective of condition, F(1, 70) = 5.900, p = .018, ηp2 = 0.08, but nonsignificant differences for numerosity and response bias (main effects of experiment were Fs ≤ 1.23, ps ≥ .271, ηp2 ≤ 0.02). There were non-significant interactions in each mixed ANOVA, Fs ≤ 0.39, ps ≥ .678, ηp2 ≤ 0.01.
Discussion
Replicating the findings of Experiment 1A, sorting unfamiliar faces by identity was considerably more difficult for contrast-negated photo-cards compared with both original and stretched photo-cards. As before, there were no differences in sorting performance between original and stretched photo-cards. Unlike Experiment 1A, participants could not rely on previously seen images, given different images of the same identities were sorted between conditions. Despite this, participants’ sorting of photo-cards did not differ (between conditions) between the experiments for numerosity and sensitivity measures. However, we note two differences. Overall sensitivity was better in Experiment 2A, a finding which might be due to participant factors such as better discrimination of facial identity (see the General Discussion section). Response bias was stronger in Experiment 2A for contrast-negated images relative to original and stretched images, which was not observed in Experiment 1A (though response bias for contrast-negated images in Experiment 2A was not stronger than contrast-negated images in Experiment 1A, a result we think is important to highlight, given these images were presented first to both sets of participants). Together, these results suggest discrimination of facial identity using card-sorting methodology is especially supported by surface properties such as pigmentation and reflectance. In the next experiment, we employed the condition order of Experiment 1B but different sets of images between conditions as in Experiment 2A.
Experiment 2B
In this experiment, we asked participants to sort different images between original, stretched, and contrast-negated photo-cards. If performance is influenced by variability in facial appearance across photo-cards across conditions, then we would expect to observe differences between performance in Experiments 1B and 2B. We base this prediction on previous evidence that has consistently found the card-sorting task to be difficult (with original images) and more difficult than sorting familiar faces (e.g., Jenkins et al., 2011; Kramer, Manesi, et al., 2018; Zhou & Mondloch, 2016). In this case, sorting behavior could be attributed to condition of image (if performance differs between conditions) and variability in image sets between conditions. However, if performance is not influenced by variability in facial appearance between conditions, then we could attribute differences in card-sorting performance to the order of conditions and certain manipulations of informational content (e.g., pigmentation and/or reflectance by contrast negation).
Method
Participants
A new group of 36 participants (M = 28.7 years, SD = 11.6; 25 female) was recruited for Experiment 2B. This sample comprised students and staff of the University of Guelph-Humber and Humber Institute of Technology and Advanced Learning. Each participant was compensated with a gift card.
Stimuli and Procedure
The same stimuli as in Experiment 2A were used in this experiment, except the order of stimulus conditions was changed: original, stretched, and then contrast-negated. None of the participants reported being familiar with the identities.
Results
Data for numerosity and SDT measures were analyzed separately using one-way repeated measures ANOVAs. All pairwise comparisons were Bonferroni-adjusted (α = .05/3 = .0167). Data for Experiment 2B are summarized in Figure 5.

Figure 5. Results of Experiment 2B. Top row shows numerosity (i.e., number of groups made by participants), middle row shows sensitivity (d’), and bottom row shows response bias (C) data for original (left column), stretched (middle column), and contrast conditions (right column). The bolded black line connected to the y axis represents the average, and each circle represents one participant’s data on each measure in each condition. Note the circles were randomly placed along the x axis such that the same placement of a circle on each graph does not necessarily show the same participant.
We found significant main effects of condition for our measures of numerosity and sensitivity, but not response bias—numerosity: F(2, 70) = 5.81, p < .005, ηp2 = 0.14; sensitivity: F(2, 70) = 38.28, p < .001, ηp2 = 0.52; response bias: F(2, 70) = 1.98, p = .146, ηp2 = 0.05. Participants created more groups with contrast-negated photographs (M = 9.86, 95% CI [7.05, 12.67]) compared with stretched photographs (M = 7.25, 95% CI [5.51, 8.99]), t(35) = 2.972, p = .005, d = 0.495. There were no differences in the number of groups between original (M = 8.75, 95% CI [6.797, 10.703]) and stretched photo-cards, t(35) = 2.398, p = .022, d = 0.400, and original and contrast-negated photo-cards, t(35) = 1.424, p = .163, d = 0.237.
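The effect sizes reported with these F ratios can be cross-checked from the F values and their degrees of freedom. The sketch below uses the standard identity for partial eta squared in terms of F; the source does not state how the effect sizes were computed, so treating this identity as the route taken is an assumption. The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effects of condition in Experiment 2B, each tested on (2, 70) df.
reported_f = {"numerosity": 5.81, "sensitivity": 38.28, "response bias": 1.98}

effect_sizes = {
    measure: partial_eta_squared(f_value, df_effect=2, df_error=70)
    for measure, f_value in reported_f.items()
}

for measure, eta in effect_sizes.items():
    print(f"{measure}: partial eta squared = {eta:.2f}")
```

To two decimal places, the recovered values agree with those reported for the three main effects of condition (0.14, 0.52, and 0.05).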
Participants were less sensitive to between-person variability for contrast-negated images (M = 1.371, 95% CI [1.010, 1.733]) than original (M = 3.157, 95% CI [2.680, 3.635]), t(35) = 7.105, p < .001, d = 1.184, and stretched images (M = 3.16, 95% CI [2.671, 3.650]), t(35) = 7.179, p < .001, d = 1.197. The difference in sensitivity between original and stretched conditions was nonsignificant, t(35) = 0.017, p = .493, d = 0.003.
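The effect sizes for these paired comparisons appear consistent with dividing each t value by the square root of the sample size (n = 36), a common way to express Cohen's d for within-subjects contrasts. The source does not state which formula was used, so this is offered only as a plausible reconstruction; the function name is hypothetical.

```python
import math


def cohens_d_from_paired_t(t_value, n_participants):
    """Cohen's d for a paired comparison, computed as t / sqrt(n)."""
    return t_value / math.sqrt(n_participants)


# t values for the sensitivity comparisons in Experiment 2B (n = 36).
reported_t = {
    "contrast-negated vs. original": 7.105,
    "contrast-negated vs. stretched": 7.179,
    "original vs. stretched": 0.017,
}

d_values = {
    label: cohens_d_from_paired_t(t_value, 36)
    for label, t_value in reported_t.items()
}

for label, d in d_values.items():
    print(f"{label}: d = {d:.4f}")
```

Each recovered value falls within rounding distance of the corresponding d reported above.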
As in previous experiments reported here, participants’ responses were biased toward labelling pairs of images as instances of between-person variability—one-sample t tests for original: t(35) = 11.635, p < .001, d = 1.939; stretched: t(35) = 10.227, p < .001, d = 1.705; contrast-negated: t(35) = 8.649, p < .001, d = 1.442. The nonsignificant main effect of condition for response bias suggests the strength of these biases was not different between conditions, which aligns with the numerosity data.
We ran two mixed ANOVAs to compare card-sorting performance between Experiments 1B and 2B and between Experiments 2A and 2B. We conducted the first mixed ANOVA to explore differences between participants who saw the same photo-cards between conditions and different photo-cards (but of the same two celebrities) between conditions. We conducted the second mixed ANOVA to explore differences between participants who saw different photo-cards between conditions where contrast-negated cards were sorted first (Experiment 2A) and original cards were sorted first (Experiment 2B).
In the first case, we conducted a 2 (Experiment: 1B, 2B) × 3 (condition: original, stretched, contrast-negated) ANOVA with experiment varied between-subjects and condition varied within-subjects on each dependent variable reported earlier. A Greenhouse–Geisser correction was applied on response bias (Mauchly’s test: χ2(2) = 8.410, p = .015). This returned a significant main effect of condition for all dependent measures (numerosity: F(2, 140) = 7.55, p < .001, ηp2 = 0.10; sensitivity: F(2, 140) = 56.84, p < .001, ηp2 = 0.49; response bias: F(1.794, 125.588) = 4.23, p = .020, ηp2 = 0.06), confirming our initial findings in Experiments 1A and 2A. There were no significant main effects of experiment (Fs ≤ 0.48, ps ≥ .491, ηp2 ≤ 0.07) or interaction of factors for numerosity and response bias (Fs ≤ 2.59, ps ≥ .079, ηp2 ≤ 0.04), but there was a significant interaction of factors in sensitivity to between-person variability, F(2, 140) = 3.70, p = .027, ηp2 = 0.05. A simple main effects analysis showed the difference between experiments for contrast-negated photo-cards was marginally nonsignificant, F(1, 70) = 3.79, p = .056, ηp2 = 0.02 (Experiment 1B: M = 2.058, 95% CI [1.536, 2.580] vs. Experiment 2B: M = 1.371, 95% CI [1.010, 1.733]). Simple main effects analyses did not reveal differences between Experiments 1B and 2B for original and stretched photo-cards, F(1, 70) = 0.11, p = .741, ηp2 < 0.01 and F(1, 70) = 0.03, p = .863, ηp2 < 0.01.
In the second case, we ran a 2 (Experiment: 2A, 2B) × 3 (condition: original, stretched, contrast-negated) mixed ANOVA with experiment varied between-subjects and condition varied within-subjects on each dependent variable reported earlier. A Greenhouse–Geisser correction was applied on numerosity (Mauchly’s test: χ2(2) = 18.047, p < .001). Each ANOVA returned a significant main effect of condition (numerosity: F(1.626, 113.808) = 33.34, p < .001, ηp2 = 0.32; sensitivity: F(2, 140) = 75.77, p < .001, ηp2 = 0.52; response bias: F(2, 140) = 5.08, p = .007, ηp2 = 0.07). While no interaction was found in the sensitivity data, F(2, 140) = 1.55, p = .217, ηp2 = 0.02, there were significant interactions in both numerosity, F(1.626, 113.808) = 12.23, p < .001, ηp2 = 0.15, and response bias data, F(2, 140) = 11.73, p < .001, ηp2 = 0.14. Simple main effects analysis showed participants of Experiment 2A created fewer groups with photo-cards in their original form (M = 5.61, 95% CI [4.43, 6.79]), on average, compared with participants of Experiment 2B (M = 8.75, 95% CI [6.797, 10.703]), F(1, 70) = 5.10, p = .027, ηp2 = 0.02. Simple main effects analysis also showed participants of Experiment 2A exhibited a stronger bias to label a pair of contrast-negated photo-cards as an instance of between-person variability (M = –1.506, 95% CI [–1.684, –1.327]) compared with participants of Experiment 2B (M = –1.115, 95% CI [–1.368, –0.862]), F(1, 70) = 6.15, p = .016, ηp2 = 0.03. No other simple main effects were significant (Fs ≤ 3.74, ps ≥ .055, ηp2 ≤ 0.02). Main effects of experiment were also nonsignificant (Fs ≤ 0.13, ps ≥ .720, ηp2 < 0.01).
Discussion
Consistent with the other experiments reported here, participants’ sensitivity to facial identity was impaired by contrast negation compared with the original and stretched conditions, and participants demonstrated reliable response biases to label pairs of images as instances of between-person variability. These biases were no stronger or weaker between conditions, as also observed in Experiments 1A and 1B. Our first mixed ANOVA did not reveal any significant differences between the experiments for our reported dependent measures. The only point worth some note is the marginally nonsignificant result in sensitivity. Poorer sensitivity to between-person variability for contrast-negated photo-cards in Experiment 2B might be due to the preceding original and stretched conditions showing different photo-cards, or to participant factors. More data are needed to explore this further.
The results of our second mixed ANOVA are unsurprising in themselves. Participants of this experiment created more groups for original photo-cards (seen in the first condition) compared with participants in Experiment 2A (photo-cards seen in the last condition). They also demonstrated a relatively weaker bias to label a pair of contrast-negated photo-cards (seen in the last condition) as an instance of between-person variability compared with participants of Experiment 2A where these photo-cards were seen in the first condition. Indeed, the variability in performance between participants (see Figures 2 to 5) might help explain the original results of this mixed ANOVA, by simply having some better performers in Experiment 2A (but note that performance did not specifically differ in this condition in any dependent measure between Experiments 1A and 2A). Moreover, in contrast with finding significant differences in numerosity and sensitivity measures for contrast-negated photo-cards between Experiments 1A and 1B, we found only response bias to be different for these photo-cards between Experiments 2A and 2B. We interpret these results as suggesting some further difficulty imposed on the sorting of contrast-negated photographs by identity when different photographs are presented between conditions. This suggests the utility of informational content disrupted by contrast negation (i.e., certain surface properties) could have a role in discriminating photographs for facial identity.
General Discussion
In four experiments, we find two general results: (a) no differences in sorting behavior between original and stretched photo-cards in any dependent variable, and (b) contrast negation impairs all card-sorting performance measures that we reported. Consistent with previous reports, numerosity data, on average, indicated participants could not cohere multiple different photographs of unfamiliar people into unique identities (e.g., Balas et al., 2019; Balas & Pearson, 2017; Kramer, Manesi, et al., 2018), which was not attributable to response biases. Sensitivity to facial identity between persons was reliably impaired by contrast negation whether contrast-negated photo-cards were presented first or last, and number of groups was higher when presented first (Experiments 1A and 2A) compared with original and stretched images (not attributed to stronger response bias in three of the four experiments). Consistent with previous reports (e.g., Balas et al., 2019; Balas & Pearson, 2017), participants in all experiments exhibited bias to label pairs as instances of between-person variability (aligning with numerosity data). Together, our results suggest that even when the full spectrum of shape and surface properties were on display at the beginning of our procedure, participants’ processing of facial identity was impaired by disruption to full utility of surface properties.
Given some evidence of impaired performance in the number of groups created and in sensitivity to facial identity, do surface properties (as disrupted by contrast negation) contribute to cohering a single facial identity from multiple images? This is potentially an interesting question that could contribute to our theoretical understanding of facial identity processing. We are not the first to find contrast negation impairs discrimination of facial identity (e.g., Kemp et al., 1990); and contrast negation has been shown to disrupt utility of diagnostic pigmentation (Russell et al., 2006). Research has reliably demonstrated a stronger relative role for surface properties than shape in facial identity processing (e.g., Calder et al., 2001; Hancock et al., 1996; Itz et al., 2017). However, whereas previous studies typically used lab-based/controlled images, participants of card-sorting tasks must simultaneously process differences in facial appearances due to changes across different images of the same person and between two different identities. To the best of our knowledge, only Kramer et al. (2017) have specifically explored effects of unnatural variability (planar inversion or contrast negation) on face learning. Subsequent to watching a TV show in either normal upright or unnatural display, participants’ recognition of seven actors was significantly poorer under unnatural compared with natural test display of images. Learning was significantly poorer under unnatural viewing conditions with no differences between testing conditions, suggesting exposure to natural variability was key to learning (cf. Dowsett et al., 2016; Matthews & Mondloch, 2018; Ritchie & Burton, 2017). However, Kramer et al.’s (2017) study did not directly compare relative contributions of diagnostic shape and surface properties to learning from within-person variability.
Here, we have taken the first steps toward investigating important sources of information in processing facial identity from within-person variability. As we suggested earlier, there is limited evidence as to whether face learning occurs during the card-sorting task, though other studies suggest learning might occur from comparing photo-cards under different task demands (see Dowsett et al., 2016; Matthews & Mondloch, 2018). We expect the extent to which diagnostic shape and surface properties are important for successful face matching and learning from sources of variability will contribute to our understanding of theory and practical applications.
Might a less difficult task help illuminate whether contrast negation continues to impair card-sorting performance? One way we could have investigated this is by using a two-sort task where participants are informed of the correct number of groups (Andrews et al., 2015). The two-sort task dramatically improves sorting performance even with unfamiliar faces (Andrews et al., 2015). Kramer, Manesi, et al. (2018) used this task in their third experiment and found photo-cards containing only the external features contain less diagnostic information about facial identity (i.e., performance was still poorer even when knowing there were only two familiar faces in the photo-cards). In our study, we intended to examine sorting behavior where the number of identities was ambiguous to our participants, a situation that is likely to occur in many realistic contexts of face identification when attempting to discriminate between images of unfamiliar people. Therefore, it might be reasonable to expect a two-sort task to improve sorting of unfamiliar faces but with reduced sensitivity for contrast-negated photo-cards (cf. Kramer, Manesi, et al., 2018). Furthermore, our study was concerned with processes related to identification of unfamiliar faces, but what would emerge with sorting of familiar faces? While performance in sorting familiar faces by identity has been shown to be almost perfect in unconstrained card-sorting tasks (Jenkins et al., 2011; Kramer, Manesi, et al., 2018; Zhou & Mondloch, 2016), contrast negation has been shown to significantly impair recognition of familiar faces in categorization tasks (Sandford & Rego, 2019; Sandford et al., 2018; see also Gilad et al., 2009 and Sormaz et al., 2013, for different tasks). However, contrast negation does not harm categorization of unfamiliar faces by familiarity (Sandford et al., 2018).
In a card-sorting task, we might expect the usual superiority of familiarity to disappear (as reported by Sandford & Burton, 2014, in a study of configuration), at the least, if not result in even poorer performance compared with unfamiliar faces.
In all experiments, stretching the photo-cards did not impair sorting performance in any dependent variable. Was this due to showing the stretched photo-cards in the second condition in all experiments? We did not show the stretched photo-cards as the first (or last) condition in our study because previous research has shown shape plays a lesser role for recognition of familiarized and familiar faces (e.g., Itz et al., 2016, 2017; Kaufmann & Schweinberger, 2008). We consistently found stretched photo-cards were sorted without impairment compared with original photo-cards in all experiments whether they were preceded by contrast-negated or original photo-cards. Given that linear stretching also does not impair categorization of unfamiliar faces as unfamiliar (see Sandford et al., 2018), we expect presenting the stretched photo-cards before the original or contrast-negated photo-cards would have made little difference to our results. This is not to say that shape information does not contribute to identification of faces because caricaturing shape appears to help learning of unfamiliar faces (e.g., Itz et al., 2016, 2017; Kaufmann & Schweinberger, 2008). The difference between these results and our findings is twofold: Caricaturing exaggerates shape in qualitatively and quantitatively different ways than linear stretching and participants in our studies sorted photo-cards by identity rather than memorized faces. Taken together, we suggest close inspection of the ways in which shape is distorted and task demands give rise to different findings.
Our study also provides some interesting insights into the card-sorting task itself. First, our data suggest participants did not learn identities by the end of each experiment. Average numerosity and response bias data suggest participants were unaware they were sorting only two identities (as found with original photo-cards elsewhere, e.g., Balas et al., 2019; Balas & Pearson, 2017; Kramer, Manesi, et al., 2018). Response biases remained consistent between the stretched and third (original or contrast-negated) condition in all experiments reported here. Results also showed continued (no greater or lesser) fragmentation of contrast-negated photo-cards when presented in the last condition (Experiments 1B and 2B) and reduced sensitivity to facial identity between persons relative to the other conditions. Whether this means participants cannot learn identities from card-sorting tasks requires more data beyond that which is currently available (e.g., Andrews et al., 2015). Second, in our study, and in others (e.g., Balas & Pearson, 2017; Kramer, Manesi, et al., 2018), participants made far fewer different-person/same group errors relative to same-person/different group errors, suggesting a particular difficulty with within-person variability. We are not the first to indicate the difficulty of card-sorting tasks rests with this type of variability (Jenkins et al., 2011). Range of errors across conditions (averaged across our experiments) was 43.2% to 69.2% for same person/different group errors and 2.0% to 6.5% for different person/same group errors. These averaged different person/same group errors were above zero (supported by one-sample t tests, p ≤ .051), which suggests our chosen identities were difficult enough for participants in our study. 
Third, the median number of groups made by the first 20 participants of Experiments 1B and 2B (where original photo-cards were sorted first) was different compared with two other reports that used the same two identities: 5.0 in our study, 7.5 in Jenkins et al. (2011), and 6.0 in Andrews et al. (2015). We do not expect these differences are due to the use of color images compared with the grayscale images reported in these two previous studies (see Kramer, Manesi, et al., 2018 for discussion). Instead, we highlight the variability in performance across our experiments (see Figures 2 to 5), where median numerosity was 7.0 for sorting original photo-cards in Experiments 1B and 2B (N participants = 72), because we expect sorting behavior differs between participants in our experiments and participants in previous studies (in identical card-sorting conditions). We suggest more data are required to better understand individual differences in card-sorting tasks and their influence, if any, on related facial identity tasks (cf. Johnson et al., 2020; Stacchi et al., 2020).
Why would variability in performance emerge between different studies? Where methodology is identical, we suggest that this might be due to the specific images used in card-sorting tasks. While our criteria for downloading images of the same identities were largely similar, Jenkins et al. (2011) also reported downloading the first 20 images that met their criteria (i.e., images exceeded 150px in height, showed the face in roughly frontal aspect, and were free from occlusions). We presume that images on the internet are constantly updated due to passage of time (e.g., new photographs), or are otherwise edited (e.g., removed from image searches), and so it would be reasonable to assume we did not use exactly the same 20 images among our 60 images per identity. Even if we did use the same 20 images, it is unlikely (though not impossible) that all of these images were allocated together to the same condition. Together, this matters given facial identity processing of unfamiliar faces is bound to specifically seen images (Hancock et al., 2000). In addition, using different identities is likely another source of variability in results between studies. As we had suggested in our description of Figure 1, we acknowledge that each person’s variability in facial appearance is unique (Burton et al., 2016). This might make discussions about generalizability somewhat problematic; and we do not suggest our results will generalize precisely to another set of identities. There are some data that support our assertion. For example, Kramer, Manesi, et al. (2018) reported different average number of groups for two different sets of two identities with each set showing different sexes (the first set showed the same identities we used with M numerosity = 7.58 and the second set M numerosity = 4.65). Generally, patterns of results have been similar across card-sorting tasks; indeed, the results in our original condition generally replicate previous results.
Therefore, we suggest our results likely would replicate in patterns of results in future card-sorting tasks but probably would not produce precisely the same results (even with the same set of images as we used due to variability in participants’ card-sorting behavior). We expect the precise sources of variability in participant performance and facial appearance to be continued areas of research.
In conclusion, we show across four experiments that card-sorting behavior with original photo-cards is no different from sorting stretched photo-cards, which suggests altered perception of feature shape, interfeature metric distances, and overall shape of the face due to linear stretching in the height dimension do not contribute to discrimination of facial identity in the card-sorting task. This contrasts with the relatively poorer sorting of contrast-negated photo-cards, which consistently showed impaired sensitivity to facial identity between persons and, in some cases, greater fragmentation of photo-cards suggesting inability to cohere images into a single identity (or, here, two identities). We emphasize interpreting these results in the context of the mechanisms engaged during a card-sorting task (i.e., simultaneous processing of sources of within- and between-person variability). To the extent that our results might generalize to other face tasks, we suggest future research considers the relative role of shape and surface properties in face learning, recognition, and verification by using naturally occurring (ambient) images (cf. Balas et al., 2019) or emulating realistic contexts (cf. Kramer et al., 2017).
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a University of Guelph-Humber Research Grant awarded to Sandford.
