Abstract
According to some theories, the context in which a spoken word is heard has no impact on the earliest stages of word identification. This view has been challenged by recent studies indicating an interactive effect of context and acoustic similarity on language-mediated eye movements. However, an alternative explanation for these results is that participants looked less at acoustically similar objects in constraining contexts simply because they were looking more at other objects that were cued by the context. The current study addressed this concern whilst providing a much finer grained analysis of the temporal evolution of context effects. Thirty-two adults listened to sentences while viewing a computer display showing four objects. As expected, shortly after the onset of a target word (e.g., “button”) in a neutral context, participants saccaded preferentially towards a cohort competitor of the word (e.g., butter). This effect was significantly reduced when the preceding verb made the competitor an unlikely referent (e.g., “Sam fastened the button”), even though there were no other contextually congruent objects in the display. Moreover, the time-course of these two effects was identical to within approximately 30 ms, indicating that certain forms of contextual information can have a near-immediate effect on word identification.
Common to most models of spoken word recognition is the supposition that multiple lexical entries are activated in parallel according to their perceived match to the unfolding acoustic input (Gaskell & Marslen-Wilson, 2002; Luce & Pisoni, 1998; McClelland & Elman, 1986; Norris, 1994). For example, Marslen-Wilson and Welsh (1978) proposed that, at the onset of a single spoken word, a cohort of potential lexical candidates is initially activated and is then progressively whittled down as more acoustic information becomes available. Subsequent models also allow for partial activation of words that share acoustic features later on, even if they do not match at onset (see Dahan & Magnuson, 2006).
These models are based primarily on data from studies in which words have been presented out of context, such that listeners have no prior expectations about the word identity, other than the word's frequency of occurrence. However, in everyday listening situations (as opposed to psycholinguistic experiments), words are seldom heard in complete isolation, and there are various sources of contextual information that could, at least in principle, place constraints upon lexical activation. While there is little doubt that context can affect the identification of words, it remains unclear precisely when in the process of speech perception such context effects take place. A long-standing view is that lexical candidates are initially accessed entirely on the basis of bottom-up sensory information, with contextual information only playing a role during the subsequent selection of a single best fitting candidate (Marslen-Wilson, 1984; Norris, 1986) or integration of the word meaning into the higher level sentence representation (Forster, 1976; Tanenhaus, Carlson, & Seidenberg, 1985). According to such models, hearing the sentence “Joe fastened the button” will momentarily activate the lexical representation of “butter”, even though butter is clearly an incongruent completion to the sentence. In contrast, other researchers have argued in favour of earlier effects of context (e.g., Dahan & Tanenhaus, 2004; Lucas, 1999), although, as discussed below, current evidence for immediate effects is problematic.
Until fairly recently, most of the evidence relating to context effects on word identification came from the cross-modal semantic priming paradigm. In one such experiment, Onifer and Swinney (1981) presented spoken sentences including a homophone—a semantically ambiguous word such as “organ”—followed by a written target word or nonword to which the subject made a lexical decision response. Reaction times for associates of one meaning of the homophone (e.g., KIDNEY) were reduced, even when the context was biased towards the alternative meaning of the homophone. The authors argued, therefore, that context effects were delayed, such that both (all) meanings of the homophone were initially accessed. Nevertheless, their data showed a numerical effect of context, with response times being relatively slower for associates of the contextually inappropriate meaning. This trend was confirmed by Lucas (1999) in a meta-analysis of 17 similar priming studies. Lucas interpreted this as evidence for an early effect of context. However, the problem for all such priming studies using homophone stimuli is that behavioural responses are made some time after the offset of the homophone (even when the target word is presented prior to homophone offset). Thus, it is difficult if not impossible to determine the point in processing at which the effect of context kicks in.
To address this issue of timing, Zwitserlood (1989) presented Dutch-speaking participants with auditory sentences that ended in an incomplete fragment of a prime word. Lexical decision times to associates of cohort competitors of the prime word were significantly reduced. For example, the initial fragment of “kapitein” (captain) primed the word “geld” (money), an associate of the cohort competitor “kapitaal” (capital). This was true even when the context was biased towards the “kapitein” interpretation, leading Zwitserlood to conclude that context does not constrain the initial activation of cohort competitors. However, the contextual constraints were not particularly strong: disambiguating information was presented in the sentence prior to that carrying the prime, and the cohort competitor often remained a plausible if less likely completion to the sentence. Moreover, Janse and Quené (2004) criticized this study for employing an incomplete within-subjects design, meaning that the reported effects were confounded with between-subjects differences in effect size. Indeed, simulations indicated an extremely high (62%) probability of showing a priming effect when none was present in the data.
A number of more recent studies have explored similar issues by measuring event-related potentials (ERPs) to contextually congruent and incongruent words (van den Brink, Brown, & Hagoort, 2006; Van Petten, Coulson, Rubin, Plante, & Parks, 1999). Of particular note, Hagoort and colleagues (Hagoort & Brown, 2000; van den Brink, Brown, & Hagoort, 2001) identified an early negative component of the ERP, peaking at around 250 ms after the onset of the word. This component was significantly reduced if the word was congruent with the preceding sentence context or was a cohort competitor of a congruent word. These findings were interpreted as evidence for an early but not immediate effect of context on lexical processing (cf. Marslen-Wilson, 1984). However, early ERP components are typically much shorter in duration than later components, making them highly susceptible to stimulus variation that results in temporal variability in the brain response (Penolazzi, Hauk, & Pulvermuller, 2007). Thus, one cannot rule out the possibility of even earlier effects on brain responses.
Further evidence comes from studies using the so-called “visual world” paradigm, in which participants' eye movements are recorded as they listen to spoken sentences (Cooper, 1974; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). Two such studies have investigated the effects of sentence context on homophone processing, providing results broadly comparable to the corresponding semantic priming studies described earlier. Huettig and Altmann (2007) found that, on hearing the homophone “pen” in a sentence biased towards the “animal enclosure” interpretation, participants looked more at a writing pen (or a visually similar object such as a needle) than at an unrelated object such as a bicycle. However, the contextual constraints were not particularly strong, and both meanings remained plausible. Chen and Boland (2008) reported similar findings but noted that looks towards the object corresponding to the inappropriate homophone were modulated by context. While these authors interpreted their findings in terms of context effects on lexical access, it is important to note that the context effect did not achieve statistical significance until some 600 ms after the onset of the homophone.
Other studies have looked at eye movements directed at cohort competitors of target words. In a study with Dutch-speaking participants, Dahan and Tanenhaus (2004) reported that, on hearing the word “kanon” (cannon) in a neutral context, participants looked more at a cohort competitor, camel (“kameel”), than at other phonologically unrelated objects in the display. However, biasing sentence contexts abolished this cohort effect. For instance, if the word “kanon” was preceded by the verb “roest” (“rust”) for which camel is an implausible subject, looks towards the camel were significantly reduced and were not in fact significantly different to looks towards a phonologically unrelated object. Similar findings have since been reported by Barr (2008) using English stimuli, by Weber and Crocker (2012) in German, and by Revill, Tanenhaus, and Aslin (2008) using an artificial language. In each case, a constraining verb reduced fixations on pictures that were implausible completions of the sentence. However, in all four of these studies, the reduction in cohort competitor fixations in the biased condition was accompanied by an increase in fixations on the target object, which was a thematic fit for the verb. This represents a potentially serious confound because participants can only physically look at one object at a time. The effect of context on target fixations was apparent well before the onset of the cohort effect and, in some cases, even before the onset of the target word, suggesting that the effect is not mediated by lexical entries of the corresponding objects (Altmann & Kamide, 1999). Early effects of context on competitor fixations may simply be a by-product of the (even earlier) increase in target fixations and do not, therefore, constitute evidence for early effects on lexical processing.
Finally, Magnuson, Tanenhaus, and Aslin (2008) investigated the effects of form-class constraints on cohort competitor effects. Participants were trained to associate novel words with different textures and shapes, treated as adjectives and nouns, respectively. When pragmatic considerations led participants to expect texture information, they looked less at objects whose shape was a cohort competitor of the texture word (and vice versa). Unlike in the studies using verb-mediated context, participants were not able to anticipate which specific object would be referred to. Nevertheless, the pragmatic context primed them to attend to one perceptual dimension of the stimuli (shape or texture), and it could be argued that this effect on visual attention was responsible for the context effects. Moreover, as the authors acknowledged, it is as yet unclear whether their findings generalize to natural language.
A further limitation of all the above-mentioned eye-tracking studies is that, while the authors have routinely claimed evidence for immediate effects of context on eye movements, the analyses conducted have not actually allowed the testing of such a claim. Dahan and Tanenhaus (2004) considered the probability of gazing at the cohort competitor averaged across a window between 200 and 500 ms after the onset of the target word. Even allowing for a 200-ms lag between the cognitive event and the corresponding eye movement (Matin, Shao, & Boff, 1993; but see Altmann, 2011), this still means that context effects could have arisen at any point during the acoustic lifetime of the word. Similarly, Magnuson et al. (2008) and Weber and Crocker (2012) considered gaze probability averaged across a 200–700-ms window. Chen and Boland (2008) conducted analyses at 100-ms intervals but, as already noted, did not find a significant context effect prior to 600 ms. Barr (2008), in contrast, adopted a curve-fitting approach, which considered the temporal evolution of gaze likelihood, but assumed a priori that the context effect began at 200 ms.
In sum, despite strong claims in the literature, there is still a lack of compelling evidence that context effects are immediate. As such, accounts that assume an initial period of purely bottom-up lexical processing are yet to be convincingly refuted. Here, we present an eye-tracking study that, we argue, provides such evidence. Our design was based on the study by Dahan and Tanenhaus (2004) described earlier. However, rather than presenting targets and competitors together in the same display, on each trial we presented a target, a competitor, or an unrelated object, in each case alongside three unrelated distractors. Thus any effect of context on competitor fixations could not be explained away in terms of confounding effects on target gaze.
This design also allowed us to take full advantage of the exquisite temporal precision afforded by eye-tracking technology and investigate the time-course of context effects on saccades towards the competitor. Specifically, we asked, for each moment in time following the onset of the target word, whether or not there was evidence for a cohort effect or context effect in the eye-movement record (see McMurray, Clayards, Tanenhaus, & Aslin, 2008, for a similar approach). This allowed precise investigation of the time-course of different linguistic effects, whilst making no assumptions about either the shape of the curve or the lag between the linguistic event and the corresponding eye movement. If context effects are immediate then the time-course of these two effects should be near identical. If, however, there is an initial period of context-free lexical access then there should be an appreciable lag between the cohort effect and the context effect, and participants should show a cohort effect (however brief), even when sentence context is biased against the competitor.
Method
Participants
Participants were 32 undergraduate students from Macquarie University, aged 18 to 23 years, all with normal or corrected-to-normal vision. Informed consent was obtained prior to the commencement of testing, and participants were rewarded with course credits.
Stimuli
Sentences were recorded by a female native English speaker using natural intonation but with small pauses between words. One token of each word was used in the experiment, with sentences constructed by concatenating the individual words. Thus, intonation patterns were relatively consistent across sentences, and the critical word tokens were identical across conditions. Each sentence described an agent acting upon an object and was composed of four words—a gender ambiguous name (e.g., Sam); a past tense verb; the definite article; and then the object noun, which was the target word (see Brock, Norbury, Einav, & Nation, 2008, for a full list of stimuli). In constrained sentences, the verb was chosen such that it was strongly associated with the object of the sentence (e.g., “Joe fastened the button”). In neutral sentences, the verb was always “chose” (e.g., “Sam chose the button”). Agents were pseudorandomly chosen from four gender-neutral names.
Visual stimuli were photo-quality pictures on a black background, taken primarily from the Hemera Photo-Objects Collection. Stimulus displays each contained four objects, located in the centre of each quadrant of the screen (see Figure 1). One of these was the critical object—which was the target object mentioned in the sentence, a cohort competitor of the target, or an unrelated object. The remaining three were distractors, semantically unrelated to the verb and phonologically unrelated to the critical object.

Figure 1. Example stimulus display with interest areas overlain. The scan path shows fixations and saccades for a representative subject in the 1,500 ms immediately following the onset of the target word “button” in the neutral sentence “Joe chose the button”.
Design
A fully within-subjects design was employed (see Table 1). The three conditions involving the competitor and unrelated objects allowed evaluation of the cohort effect and the effect of sentence context. The target-neutral condition provided an index of when there was sufficient acoustic information to allow reliable identification of the target word. In each of these four conditions, the location of the critical object was fully counterbalanced, and the same subset of 16 objects played the role of critical object across the 16 trials, thus controlling for the intrinsic salience of the different objects (Kamide, Altmann, & Haywood, 2003). A fifth condition, target biased, was included to ensure that the probability of the target being present was the same across neutral and biased contexts. Objects used as targets in these filler trials were not critical objects in the experimental trials.
Table 1. Conditions and example stimuli
Procedure
Eye movements were recorded using an EyeLink 1000 tower-mounted eye-tracking system sampling at 500 Hz, which required that the head was maintained in a fixed position on a chin rest, with a viewing distance of 40 cm. Stimuli were presented using Experiment Builder software.
At the start of the experiment, the visual stimuli were presented, one by one in a random order. Participants were required to name each object. If they did not provide the required name, they were informed by the experimenter of the correct response, and the trial was repeated at the end of the sequence.
Next, four practice trials were completed, each of which involved four stimuli that were not critical items in the main experiment. Participants viewed the displays for 4 seconds and then heard a spoken sentence. Their instructions were to use the mouse to click on any object that was mentioned in the sentence as quickly as possible. If none of the objects was mentioned, they were to simply wait for the next trial. Trials concluded 1 second after the mouse click or, if there was no mouse response, 4 seconds after the onset of the final word of the sentence.
Test trials were completed in three blocks of 24 trials. Prior to each block, the eye-tracker was calibrated using a standard 9-point calibration routine. Each participant received the same experimental items but in a different fully randomized order.
Analysis
Areas of interest were rectangles, 25% of the screen height and width, centred on the critical object in the display (see Figure 1). Interest area reports were generated using DataViewer software for an interest period beginning at the onset of the target word and ending an arbitrary 1,000 ms later (we anticipated that effects of interest would occur much earlier than this).
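The interest-area geometry described above can be sketched as follows. This is a minimal illustration, not the DataViewer implementation: the 1024 x 768 screen size, the quadrant numbering, and the names `InterestArea`, `quadrant_centre`, and `interest_area` are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterestArea:
    """Axis-aligned rectangle in screen pixels (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        # True when the gaze sample (x, y) falls inside the rectangle.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def quadrant_centre(quadrant, screen_w, screen_h):
    """Centre of a screen quadrant.

    Assumed numbering: 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right.
    """
    col, row = quadrant % 2, quadrant // 2
    return screen_w * (0.25 + 0.5 * col), screen_h * (0.25 + 0.5 * row)


def interest_area(quadrant, screen_w=1024, screen_h=768, fraction=0.25):
    """Rectangle spanning `fraction` of screen width and height,
    centred on the object in the given quadrant."""
    cx, cy = quadrant_centre(quadrant, screen_w, screen_h)
    half_w = screen_w * fraction / 2
    half_h = screen_h * fraction / 2
    return InterestArea(cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

For example, `interest_area(0).contains(256, 192)` tests whether a gaze sample at the centre of the top-left quadrant falls inside that quadrant's interest area.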
Conventionally, language-mediated eye-movement data are analysed by conducting a by-subjects t test or analysis of variance (ANOVA) on the probability of fixation on a particular type of object averaged across a prespecified time window. However, this is problematic because it falsely assumes that the proportion data have a normal distribution and that consecutive observations are independent (Barr, 2008). Moreover, interpretation of the gaze probability measure is complicated by the fact that it conflates the effect of context on saccades landing after the onset of the target word with context effects apparent at target word onset, and with saccades away from the object in that period. A common solution to this problem is to exclude trials on which the participant is already gazing at the target at onset. However, this approach can itself introduce biases because one is preferentially excluding those trials on which there might be a saccade away from the target (Barr, Gann, & Pierce, 2011).
Our analyses, therefore, focused on saccades towards the critical object as a function of time since the onset of the critical word (see Altmann & Kamide, 1999, for a related approach). For each 10-ms segment of each trial, we coded as a binary variable whether or not the subject had looked at the critical object at any point in time since the onset of the target word. Trials on which the participant was already fixating on the critical object at target word onset were excluded from further analyses. Crucially, because we were interested only in saccades towards the critical object, this exclusion approach does not bias the results as it does for gaze plots (cf. Barr et al., 2011). Averaging across subjects and items produced a cumulative fixation probability plot (Figure 3A).
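The coding scheme described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' script: fixations are assumed to be tuples of (start_ms, end_ms, x, y) timed relative to target word onset, the onset of the first fixation inside the interest area stands in for the saccade towards the object, and the function names are hypothetical.

```python
def cumulative_look_series(fixations, in_aoi, bin_ms=10, epoch_ms=1000):
    """Code one trial as a 0/1 value per time bin.

    A bin is 1 if the critical object has been looked at at any point
    between target word onset and the end of that bin.
    Returns None when the trial must be excluded because the participant
    was already fixating the critical object at word onset.
    """
    first_look = None
    for start, end, x, y in fixations:
        if not in_aoi(x, y):
            continue
        if start <= 0 <= end:
            return None  # fixation on the critical object spans word onset
        if 0 < start < epoch_ms and (first_look is None or start < first_look):
            first_look = start
    n_bins = epoch_ms // bin_ms
    if first_look is None:
        return [0] * n_bins
    return [1 if first_look < (k + 1) * bin_ms else 0 for k in range(n_bins)]


def cumulative_probability(series_list):
    """Average the per-trial series, skipping excluded (None) trials."""
    kept = [s for s in series_list if s is not None]
    if not kept:
        raise ValueError("no usable trials")
    n_bins = len(kept[0])
    return [sum(s[k] for s in kept) / len(kept) for k in range(n_bins)]
```

Because each series can only step from 0 to 1, the averaged curve is monotonically non-decreasing, which is what distinguishes a cumulative plot from a conventional gaze probability plot.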

Figure 2. Gaze probability for critical objects.
These data were then subjected to mixed effects analyses at each time point using the lme4 package in R (Bates, 2005). This was essentially a model-building exercise. The initial model included condition as a fixed effect and subject as a random effect. Subsequently, it was found that adding the location of the critical object (fixed effect) and the identity of the target picture (random effect) significantly improved the fit of the model to the data at the majority of time points. These factors were, therefore, included in the final model. The identity of the target word (random effect) did not enhance the model so was omitted. More complex models involving interactions between the various factors were also considered but did not provide a significantly better fit to the data at any time point.
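The text does not state which criterion was used to decide that an added term significantly improved the fit. One common choice for nested models is a likelihood-ratio test, and the sketch below assumes that choice. It covers only the simplest case of two models differing by a single parameter, and the log-likelihood values in the usage lines are invented for illustration; they are not taken from the study.

```python
import math


def lrt_one_parameter(loglik_simple, loglik_complex):
    """Likelihood-ratio test for nested models differing by one parameter.

    Returns (statistic, p_value). The statistic is referred to a
    chi-square distribution with 1 degree of freedom, whose upper tail
    is erfc(sqrt(statistic / 2)).
    """
    stat = max(2.0 * (loglik_complex - loglik_simple), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods at a single time point.
stat, p = lrt_one_parameter(loglik_simple=-512.4, loglik_complex=-508.1)
improves_fit = p < 0.05
```

In the analysis described above, a comparison of this kind would be repeated at every time point, with a term retained when it improved the fit at the majority of them.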
Results
To allow comparison with previous studies, Figure 2 shows the eye-tracking data plotted conventionally in terms of gaze probability as a function of time since the onset of the target word. Qualitatively, the results are very similar to those reported by Dahan and Tanenhaus (2004) and Barr (2008), despite the fact that the target was absent from the display in the critical conditions.
Figure 3A shows the cumulative probability of saccading towards the critical object. As expected, in neutral sentences, hearing the target word resulted in increased saccades towards the cohort competitor compared with unrelated objects. Importantly, this effect was markedly reduced in the constraining context.

Figure 3. (A) Cumulative probability of saccading towards the critical object. (B) Z-statistics for the three effects of interest. Cohort: competitor neutral vs. unrelated neutral; context: competitor neutral vs. competitor constraining; context-free: competitor constraining vs. unrelated neutral. Dotted horizontal line indicates the estimated threshold for significance (α = .05).
Figure 3B shows the z-statistics for the mixed effects analyses comparing these three conditions at each time point. The cohort effect first achieved statistical significance (α = .05) at 310 ms. The context effect closely followed: It was marginally significant (p < .10) at 310 ms, became significant at 340 ms, z = 2.17, p ≈ .030, and remained significant thereafter.
More importantly, there was no appreciable lag between the cohort effect and the context effect. If there had been such a lag, one would expect a significant difference between fixating a contextually incongruent competitor and fixating an unrelated distractor. However, this difference remained nonsignificant throughout the epoch, peaking at 350 ms, z = 1.07, p ≈ .285.
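The p values quoted alongside the z-statistics can be checked with a short calculation. The sketch assumes that the reported values are two-tailed probabilities under a standard normal distribution; the function name `two_tailed_p` is introduced here for illustration.

```python
import math


def two_tailed_p(z):
    """Two-tailed p value for a z-statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


p_context = two_tailed_p(2.17)       # context effect at 340 ms
p_context_free = two_tailed_p(1.07)  # context-free effect at its 350 ms peak
```

Under the same assumption, the α = .05 threshold corresponds to |z| of roughly 1.96.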
Discussion
The current study provides the most compelling evidence to date for immediate effects of sentence context on word identification. When participants heard a target word in a neutral context, they tended to look towards objects that shared the same onset, indicating that these competitors were considered as potential referents. However, this effect was significantly reduced when the same words were heard in a constraining sentence context that made the competitor objects unlikely referents.
Our results are consistent with those of a number of previous eye-tracking studies, which have been taken as evidence for immediate context effects (Barr, 2008; Dahan & Tanenhaus, 2004; Weber & Crocker, 2012). However, none of these previous studies actually determined when the effect of context became apparent, with authors either averaging across an extended time window or modelling gaze probability with onset time assumed rather than tested. In contrast, we were able to track the emergence of linguistic influences on eye movements across time. The cohort effect became statistically significant 310 ms after the onset of the target word. Allowing for a 100–200-ms lag between the initiation and completion of a saccade (Matin et al., 1993), this accords well with the suggestion that the time to access the mental lexicon corresponds to the first 100 to 200 ms of a word (Marslen-Wilson, 1984; Salasoo & Pisoni, 1985). Throughout the epoch, the effect of context on saccades towards the cohort competitor closely tracked the cohort effect, becoming statistically significant a mere 30 ms later and remaining significant thereafter.
A potential criticism of our analytical approach is that, by conducting analyses at 100 different time points, we were capitalizing on chance. However, it is important to note that we were not “fishing” for a context effect anywhere in the epoch. Indeed, of the 70 time points at which there was a significant cohort effect (and thus a reduction in the cohort effect was possible), the context effect was significant in 67 and marginally significant in the remaining three. In contrast, at no point was there even a remotely significant tendency to fixate on contextually inappropriate competitors above baseline levels, as would have been expected if access was initially context free.
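The counts in this paragraph follow from the 10-ms grid if one assumes that time points are labelled 10, 20, ..., 1,000 ms and that each effect, once significant, stayed significant to the end of the epoch. The source states this explicitly only for the context effect. A minimal check under those assumptions:

```python
def time_points(first_ms, last_ms=1000, step_ms=10):
    """Time points from first_ms to last_ms inclusive, on a fixed grid."""
    return list(range(first_ms, last_ms + step_ms, step_ms))


all_points = time_points(10)       # every analysed time point in the epoch
cohort_points = time_points(310)   # cohort effect significant from 310 ms
context_points = time_points(340)  # context effect significant from 340 ms
marginal_only = len(cohort_points) - len(context_points)
```

Here `len(all_points)` is 100, `len(cohort_points)` is 70, `len(context_points)` is 67, and `marginal_only` is 3, matching the figures given above.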
A further important innovation concerned the removal of the target object from the stimulus display for critical trials. In previous studies, the target and competitor were always both on screen. Thus, effects of sentence context on gaze at the cohort competitor were impossible to disentangle from context effects on target-directed gaze. Our assumption was that, by removing the target from the display, gaze on the critical object at target onset would be equated across conditions. In fact, there was a small initial reduction in gaze at the competitor in constraining contexts. Although the competitor and distractors were all inconsistent with the constraining verb, it is possible that the distractors were on average somewhat more likely completions to the constraining sentences than the competitors. This bias at onset slightly exaggerated the context effect in the gaze plot (Figure 2). Crucially, however, our analyses were based on saccades towards the competitor after onset, excluding trials on which the relevant object was fixated at onset. As the cumulative fixation plot (Figure 3A) shows, the three critical conditions were almost perfectly equated throughout the “baseline” period before the onset of eye movements triggered by the target word.
Taken in isolation, the current results could be construed as support for a sequential model of speech comprehension, whereby the set of possible words is progressively reduced as more input is received. On this view, preceding context eliminates semantically implausible words in much the same way that phonologically incompatible words are progressively weeded out in Marslen-Wilson and Welsh's (1978) cohort model. However, other recent eye-tracking findings indicate that context effects can be overridden to some extent by later occurring articulatory or lexical cues. For example, Dahan and Tanenhaus (2004) reported that cohort effects re-emerged if coarticulation cues favoured the (contextually inappropriate) cohort competitor over the target. Similarly, Weber and Crocker (2012) reported that cohort effects were not abolished if the semantically inconsistent cohort competitor was of higher frequency than the semantically preferred target.
Together with the current findings, such results point towards a more dynamic view of speech perception whereby the probability attached to a lexical candidate at any moment is a joint function of its lexical frequency, its compatibility with preceding contextual information, and the acoustic match up to that point in time (Dahan, 2010). However, the precise mechanisms involved remain underdetermined. Our findings addressed the enduring question of when context effects become apparent, but this is orthogonal to the question of how different sources of information are combined (Twilley & Dixon, 2000). Immediate context effects could reflect direct influences on bottom-up processing, as in the TRACE model of speech perception (cf. McClelland & Elman, 1986), but could equally arise at any point up to the decision that guides ocular or manual responses, with no direct interaction between top-down and bottom-up processes (Norris & McQueen, 2008; Twilley & Dixon, 2000). Adjudicating between these alternative accounts is beyond the scope of the current study. Nonetheless, our results place an important constraint on future model development: semantic context can and, at least in some circumstances, does have an immediate effect on spoken word recognition. Candidate models that do not allow for this possibility can, we suggest, finally be eliminated.
Acknowledgements
The study was supported by Australian Research Council Discovery Project DP098466 and a Macquarie University Research Development Grant. It was based on an earlier unpublished study, conducted at Oxford University, funded by the Medical Research Council, and presented at the 2007 Experimental Psychology Society meeting in London. We thank Lucy Cragg for assistance creating the stimuli, Samantha Bzishvili for data collection, and Sachiko Kinoshita for helpful comments on an earlier draft of this paper.
