Abstract
The current study compared the effectiveness of computer-delivered task-essential practice coupled with feedback consisting of (1) negative evidence with metalinguistic information (NE+MI) or (2) negative evidence without metalinguistic information (NE–MI) in promoting absolute beginners’ (n = 58) initial learning of aspects of Latin morphosyntax. This study measured language development on a variety of dependent measures (three comprehension-based tests and one production test), assessing changes in both accuracy and reaction time as well as examining effects on trained (old) vs. untrained (new) items. Although participants under both conditions improved in accuracy and reaction time on all measures, on immediate post-tests, participants receiving metalinguistic information outperformed those who did not. However, this advantage had largely dissipated by the time of the delayed tests. Performance on untrained items also suggests an advantage for metalinguistic feedback on system learning and on transfer of skills from comprehension-based practice to production. Furthermore, we argue, based on findings in cognitive neuroscience, that greater maintenance of gains in accuracy as well as evidence of some faster processing by participants not exposed to metalinguistic information may reflect qualitatively different learning processes at work: more explicit learning in the [NE+MI] group and more implicit learning in the [NE–MI] group (Li, 2010).
I Introduction
The role of feedback in second language acquisition (SLA) is a topic of both theoretical and practical importance. Many SLA studies have investigated the effects of feedback in the context of face-to-face interaction (e.g. Carroll & Swain, 1993; Leeman, 2003; Mackey, 1999; Mackey, Gass, & McDonough, 2000; Mackey & Philp, 1998; McDonough & Mackey, 2006) and in computer-mediated communication (CMC) (Sachs & Suh, 2007; Sagarra, 2007). Results of recent meta-analyses indicate an overall positive effect for feedback in both types of interaction (Keck, Iberri-Shea, Tracy-Ventura, & Wa-Mbaleka, 2006; Mackey & Goo, 2007), and specifically for corrective feedback provided during interaction (Russell & Spada, 2006) on second language (L2) development. Li’s (2010) meta-analysis showed that implicit feedback produces larger long-term and therefore more reliable effects than explicit feedback. 1
Recent research has also incorporated the use of computer-assisted language learning (CALL) applications (e.g. Rosa & Leow, 2004; Sanz, 2004) to examine the effects of feedback. As opposed to CMC, where the computer acts as a tool to facilitate interaction that may lead learners toward negotiation and ultimately language development (Sanz & Lado, 2008), in CALL the computer is a tutor, and feedback is usually ‘immediate, provided only when needed, individualized, and focused on the key form’ (Sanz, 2004, p. 12). In this way, as stated by Nagata and Swisher (1995), ‘the learners’ attention (is drawn) to weaknesses in their mastery of grammatical points that might not become apparent in the classroom’ (p. 339).
While feedback has often been the focus of research in interaction studies, whether classroom or CMC, fewer studies have investigated the effects of different types of feedback in more controlled (laboratory) settings, and especially in CALL. Of the handful of dissertations and published studies, some have shown that learners exposed to feedback consisting of negative evidence with metalinguistic information outperform those exposed to feedback including negative evidence alone, on both immediate and delayed post-tests (Bowles, 2005; Nagata, 1993; Nagata & Swisher, 1995; Rosa & Leow, 2004). However, non-significant differences in performance on immediate tests have also been found (Camblor, 2006; Hsieh, 2007; Moreno, 2007; Sanz, 2004; Sanz & Morgan-Short, 2004). In addition, some results on delayed tests suggest that there may in fact be longer-term benefits for exposure to negative evidence alone (Moreno, 2007; Stafford, Sanz, & Bowden, 2012), in line with Li’s (2010) conclusions on feedback in interaction research.
Divergent results in many of these studies, together with the importance attached to feedback by language teaching practitioners, suggest the need to take a closer look at the differential effects, across time, of feedback with or without metalinguistic information on the development of non-primary languages. In addition, a closer examination of the effects of feedback as assessed by different but complementary measures (accuracy and reaction time) will provide a more detailed picture of the effects of such feedback on SLA. Comparing these measures of performance across a variety of comprehension- and production-based tasks, which include both trained and untrained items (i.e. items that are or are not part of the treatment), will also help to refine our knowledge of the role of different types of feedback in linguistic development. Importantly, such insights will inform language pedagogy, especially with respect to lesson planning and curriculum design with the goal of optimizing long-term effects on language development.
II Previous research on computer-delivered feedback
A study reported in Nagata (1993) and Nagata and Swisher (1995) compared the effects of what the authors called intelligent feedback (I-CALI) (with metalinguistic information) and traditional feedback (T-CALI) (without metalinguistic information) on the learning of three types of Japanese passive constructions. Participants (n = 32) read grammar explanations about the target form and then practiced producing the target form (90 instances). The feedback in the T-CALI group informed participants of a missing or expected word in their response. The I-CALI group additionally received metalinguistic information. The results showed that the metalinguistic group significantly outperformed the non-metalinguistic group on both the post-test and the delayed (three weeks after the treatment) production measures. The results of Rosa and Leow (2004) partially converge with those of Nagata (1993) and Nagata and Swisher (1995).
Rosa and Leow (2004) had 5th-semester L2 Spanish college students (n = 100) complete a multiple-choice jigsaw puzzle task to learn contrary-to-fact past conditional constructions. The study compared 6 conditions (number of items = 18 per condition) that varied in degree of explicitness; we focus here on their (IFE) group, whose participants were given negative evidence, and the (EFE) group, which was also provided with metalinguistic information. Results were mixed and task-dependent: the negative evidence/metalinguistic (EFE) group outperformed the negative evidence alone (IFE) group on both immediate and delayed tests for production of old items and interpretation of new items, but no difference between groups was found over time for the recognition of old items or the production of new items. The limited exposure to the target may not have been enough for learners in the negative evidence (IFE) condition to show evidence of development, since learning under such a condition may require more instances of the target form (N. Ellis, 1993; 2005).
Sanz (2004) and Sanz and Morgan-Short (2004) addressed this limitation and investigated whether exposure to structured-input practice items (oral and written, total of 56 items), coupled with feedback that offered negative evidence with metalinguistic information (+EF) or negative evidence without metalinguistic information (–EF), had a differential effect on L2 acquisition of Spanish preverbal direct object pronouns (n = 33). Results suggested that the amount of interaction with the target form may be critical, as no statistically significant difference between the two groups on interpretation or production post-tests was identified. However, and importantly for the present study, the design did not control the amount of individualized feedback each participant received, and retention of knowledge gained was not tracked.
Four additional unpublished dissertation studies (Bowles, 2005; Camblor, 2006; Hsieh, 2007; Moreno, 2007) have also addressed the issue of type of feedback provided to learners in a CALL context and have incorporated conditions similar to those included in the aforementioned studies (i.e. negative evidence with or without metalinguistic information). Taken together, their results suggest that while provision of feedback is beneficial, +/– metalinguistic information comparisons do not always favor the metalinguistic feedback condition (Camblor, 2006; Hsieh, 2007; Moreno, 2007). Specifically, in Bowles’s (2005) study, providing metalinguistic information gave learners some immediate advantage over learners who did not receive any grammar explanation as part of feedback, but in the long run this advantage disappeared. By contrast, in Moreno’s (2007) study, delayed advantages were observed for learners who received only information about the correctness of their answers to practice tasks, i.e. negative evidence alone. Limitations in these studies include potential problems with the scoring procedure, which may have accepted ungrammatical targeted items as correct (e.g. Bowles, 2005), and lack of statistical power, which may explain the lack of differences between groups (e.g. Camblor, 2006).
Stafford and colleagues (2012) compared the effects of oral and written task-essential practice combined with negative evidence with and without metalinguistic information provided to Spanish–English bilinguals (n = 65) learning to assign semantic functions via noun case morphology in third-language (L3) Latin. Results from a grammaticality judgment test indicated that, unlike their counterparts, participants who received negative evidence alone not only learned but also retained what they learned over a period of at least three weeks. However, where transfer of skills from input practice to output assessment was involved, only the group exposed to negative evidence with metalinguistic information showed significant improvement. 2
In sum, studies on the differential effects of computer-delivered feedback have led to inconclusive results due in part to the limited number of studies that have investigated such effects. Additionally, research designs have almost always been biased against more implicit treatment conditions (in our case, without metalinguistic information) by providing not only a different type of input or feedback (in the language vs. about the language) but less input, as in Rosa and Leow (2004). Combining this situation with a lack of control of time on task often results in shorter implicit conditions (e.g. Sanz & Morgan-Short, 2004). Thus, research is needed that more equitably compares the effects of different degrees of explicitness of feedback, specifically negative evidence with and without metalinguistic information, in pedagogical conditions. This includes examining the possibility that feedback that consists of negative evidence without metalinguistic information leads to more stable knowledge, as suggested in studies on interaction (e.g. Li, 2010) and computerized feedback (e.g. Bowles, 2005; Moreno, 2007; Stafford et al., 2012).
Furthermore, the nature of the measurements themselves (R. Ellis, 2005) has tended to favor planning and explicit processing (e.g. untimed translation task in Bowles, 2005), which may benefit learners who have undergone treatments providing feedback with metalinguistic information. Along these lines, additional insight may be gained by examining the effects of different types of feedback on both speed of processing (reaction time; RT) and accuracy, given that accuracy alone cannot inform us about potential processing differences underlying what may be apparently indistinguishable gains. For example, longer RTs in one group as compared to another (where accuracy is equivalent) might suggest that the underlying behavior is different, with the slower group relying on controlled processes (Newell, 1990). Thus, examining RT can be an effective way to measure automaticity, i.e. ‘the ability to perform without conscious awareness or while utilizing minimum attentional resources’ (Jiang, 2007, p. 2). As Segalowitz (2003) claims, speed of processing is the characteristic most frequently associated with automaticity. Nevertheless, Segalowitz (2003) also argues that, more than just a synonym for ‘fast processing’, automaticity should be used for situations where the change is of significant consequence, such as restructuring of underlying processes’ (p. 387). In the present study, we combine both accuracy and reaction time as a way to look into learners’ efficiency in restructuring their underlying processes.
The present study aims to address several of the limitations in previous studies and to cast additional light on the role of feedback in L2 development. In addition to using computers to tightly control the amount of input, practice, and feedback provided, we also collected data through both receptive (input or comprehension-based) and productive measures, and gathered both accuracy and RT data. The tests also included both old and new items in order to provide information on item-learning and system-learning. Specifically, performance on trained items may be indicative of learners’ ability to remember chunks of language, whereas performance on untrained items indexes degree of success at system learning. By considering all these measures, we aim to provide a broader and deeper picture of language learning and retention.
In this study, we specifically examine how absolute beginners learn to interpret and produce the semantic functions of noun phrases (i.e. to decide who does what to whom) under instructional conditions that combine comprehension-based, task-essential practice with feedback that provides negative evidence but differs in the provision of metalinguistic information.
To find equilibrium between ecological validity, on the one hand, and control of external input and prior knowledge, on the other, we chose Latin, a natural language that is no longer spoken in its classical form, as the target language, and noun case morphology as the target form. We were guided by the following research questions: (1) Does the presence or absence of metalinguistic information (MI) in combination with negative evidence (NE) in computer-delivered, task-essential practice differentially affect absolute beginners’ ability to assign semantic functions in Latin? (2) Do these effects differ across different tasks or trained vs. untrained items?
III Methodology
1 Participants
Participants in the present study were 58 college students, all native speakers of English, randomly assigned to one of two treatment conditions: NE+MI (n = 33) and NE–MI (n = 25). Participants’ ages ranged from 18 to 22 years. To control for previous language experience, we recruited participants who had no knowledge of Latin or any other case-marking language and who were in a second-year Spanish program. We accepted participants with one or two semesters in a non-case language. Therefore, in some cases, Latin was the fourth language (L4) rather than the L3. Participants scoring 67% 3 or higher on the pre-tests were not included in the final sample. All participants were compensated for their participation with extra credit. 4
2 Target structure
The linguistic target of the study was the assignment of thematic agent/patient roles to nouns in Latin via case morphology. The theoretical framework that guided the design of our materials was the Competition Model (CM), developed by Bates and MacWhinney (1989). In CM terms, language learning is defined as ‘a process of acquiring coalitions of form–function mappings, and adjusting the weight of each mapping until it provides an optimal fit to the processing environments’ (MacWhinney, 2001, p. 59). It is argued that when processing language, the assignment of functional meanings to grammatical forms in the input involves competition, which is governed by the ‘cue validity’ of the linguistic input. Cue validity refers to the availability (frequency of appearance) and reliability (degree to which a cue leads to the correct interpretation) of a particular cue in the input. In the present study, the targeted linguistic forms were noun and verb morphology that indicate thematic agent and patient roles in Latin (i.e. ‘who does what to whom’). In Latin, the strongest cue (i.e. the most available and reliable) is case morphology, followed by subject–verb agreement, and finally, word order. This state of affairs is reversed in the participants’ first language (L1), English, in which word order is the strongest cue. In contrast, in Spanish, the participants’ L2, verb agreement is the strongest cue, followed by word order (Bates & MacWhinney, 1989). System learning in CM terms is understood as the application of a new cue hierarchy to novel input; in the current study, system learning is investigated through the inclusion of novel test items.
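As a rough illustration of the two components of cue validity described above, the following sketch computes availability and reliability for a single cue over a toy set of sentences. The data, the function name `cue_validity`, and the decision to summarize overall validity as the product of availability and reliability (a common operationalization in the CM literature, not one stated in this section) are assumptions made for illustration only.

```python
def cue_validity(observations):
    """Return (availability, reliability, validity) for one cue.

    `observations` is a list of (cue_present, cue_correct) pairs, one per
    sentence. This toy data format is hypothetical, not the study's.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    # Keep the correctness flag only for sentences where the cue occurs.
    present = [correct for is_present, correct in observations if is_present]
    # Availability: share of sentences in which the cue appears at all.
    availability = len(present) / len(observations)
    # Reliability: share of those sentences where the cue points to the
    # correct interpretation.
    reliability = sum(present) / len(present) if present else 0.0
    # Assumption: overall validity summarized as availability * reliability.
    return availability, reliability, availability * reliability


# Toy example: the cue appears in 8 of 10 sentences and is correct in 6 of those 8.
toy = [(True, True)] * 6 + [(True, False)] * 2 + [(False, False)] * 2
availability, reliability, validity = cue_validity(toy)
```

Under this sketch, a cue that is frequent but often misleading and a cue that is rare but always correct can both end up with low overall validity, which is one way to think about why cue hierarchies differ across languages.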
Following task-essentialness principles (Loschky & Bley-Vroman, 1993) that have been shown to lead to reliable linguistic gains in Processing Instruction research (VanPatten, 2005), we manipulated the input so that participants would be encouraged to rely on noun and verb morphology to interpret sentences. Practice sentences were manipulated so that neither word order nor subject–verb agreement was a consistently reliable cue. Given the cue hierarchy of English, which has a strict subject–verb–object (SVO) word order, we predicted that when participants first read a sentence such as
(1) Parvul-um specta-t angel-us.
    boy-masc.sing.obj. looks at-sing. angel-masc.sing.subj.
    ‘The angel looks at the boy.’
and were asked to choose between two possible English translations
(2) a. The boy looks at the angel.
    b. The angel looks at the boy.
they would by default respond assuming SVO word order (their L1 cue), which would lead to an incorrect answer. Provision of immediate negative feedback might then lead them to restructure their system and shift their reliance to a more reliable cue, in this case, noun case morphology. 5 Given these manipulations and the fact that practice could not be performed successfully by relying solely on the lexical meaning of the nouns and verbs, on word order, or on verbal morphology, participants were led to process both noun and verb morphology as a means of successful task completion.
3 Experimental design
The experiment consisted of three sessions over four weeks; all took place in an Apple laboratory, where participants interacted with an application that combined ColdFusion and Flash programming to deliver audiovisual treatments and capture participants’ responses.
During the first session, participants completed a consent form and a background questionnaire followed by a computer-delivered vocabulary lesson and quiz. Next, they completed four pre-tests (written and aural interpretation, grammaticality judgment and sentence production). During the second session, approximately one week later, participants completed the computer-delivered treatment and four immediate post-tests. At the final session, two weeks later, participants completed four delayed post-tests and an online debriefing questionnaire.
a Vocabulary lesson. 6
Presentation of vocabulary was timed (12 minutes) as a means to control exposure. Each Latin noun (n = 35) was presented onscreen as follows: two pictures representing the noun in singular and plural appeared first, and then the singular and plural, masculine and feminine subject (nominative) and object (accusative) case forms (8 forms total) were presented aurally and in written form, followed by a written English translation. There was no explanation of what the noun morphology indicated, though written singular forms were presented under singular pictures and plural forms under plural pictures. Each Latin verb (n = 11) 7 was presented beginning with two pictures representing the action, followed by the infinitive verb form (written and aural) and the written English translation.
Immediately following the vocabulary lesson, participants were quizzed on the word meanings via a multiple-choice quiz. In order to ensure that vocabulary knowledge was sufficient for comprehension of word meanings in the practice session, participants with a score of 60% or higher on this quiz reviewed the vocabulary items they had missed until they reached 100% accuracy. Participants with a score below 60% repeated the entire vocabulary lesson and then the quiz until they reached 100%. Right after the vocabulary lesson and quiz, participants completed the pre-tests.
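The remediation rule applied after each vocabulary quiz attempt can be sketched as a small decision function. This is only an illustrative sketch, not the study's software (which, as described above, was built with ColdFusion and Flash); the function name `next_vocabulary_step` and the returned labels are hypothetical.

```python
def next_vocabulary_step(quiz_score_pct: float) -> str:
    """Return the (hypothetical) next step after one vocabulary quiz attempt.

    Thresholds follow the text: 100% moves on to the pre-tests, 60% or
    higher reviews only the missed items, and anything below 60% repeats
    the entire lesson and quiz.
    """
    if not 0 <= quiz_score_pct <= 100:
        raise ValueError("score must be a percentage between 0 and 100")
    if quiz_score_pct == 100:
        return "proceed_to_pretests"
    if quiz_score_pct >= 60:
        return "review_missed_items"
    return "repeat_full_lesson"
```

Because both remediation paths loop until the participant reaches 100%, every participant enters the pre-tests with the same criterion level of vocabulary knowledge.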
b Treatment: Task-essential practice and feedback
During the treatment, participants interacted with a computer-delivered task-essential practice session involving interpretation of written and aural Latin sentences. Practice consisted of six different tasks with nine or ten items per task. All practice items included two answer choices, and in order to make the practice task-essential, the two answer choices had reversed roles for subject and object (e.g. ‘the queens help the king’ and ‘the king helps the queens’); thus, participants had to make a choice that hinged on interpretation of the critical form (noun case morphology indicating subject/object) when interpreting sentences.
Participants responded via key press and received immediate feedback that remained onscreen for five seconds before the program advanced automatically to the next practice item. Both groups completed the practice session twice. Therefore, both groups were exposed to the same number of Latin exemplars, but the NE+MI group additionally got exposure to metalinguistic information. Time on task was balanced, however, in that both types of feedback stayed onscreen for the same amount of time.
Following guidelines for developing structured input activities (Lee & VanPatten, 2003), both aural and written comprehension-based tasks were included in the treatment. Task 1 presented a written Latin sentence and two English translation choices. Task 2 presented a written sentence and two picture choices. Task 3 presented a picture and two written Latin sentence choices. Task 4 presented an aural sentence and two English translation choices. Task 5 presented an aural sentence and two picture choices. Task 6 presented a picture and an aural sentence, and participants had to decide whether or not the picture shown matched the sentence heard. Participants responded via key presses. Although the order in which the tasks were presented was fixed, item order was randomized within each task.
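The presentation scheme described above (fixed task order, randomized item order within each task) can be sketched as follows. The function name `build_practice_sequence`, the item identifiers, and the optional `seed` parameter are hypothetical and serve only to illustrate the design; they do not describe the original application.

```python
import random


def build_practice_sequence(items_by_task, seed=None):
    """Return a list of (task_number, item) pairs for one practice session.

    Tasks are always presented in ascending task number (fixed order),
    while the items inside each task are shuffled independently.
    """
    rng = random.Random(seed)  # a seed makes a sequence reproducible
    sequence = []
    for task in sorted(items_by_task):
        items = list(items_by_task[task])  # copy so the input is not mutated
        rng.shuffle(items)
        sequence.extend((task, item) for item in items)
    return sequence


# Hypothetical item labels: six tasks with nine items each.
practice_items = {
    task: [f"task{task}_item{i}" for i in range(1, 10)] for task in range(1, 7)
}
session = build_practice_sequence(practice_items, seed=1)
```

Randomizing only within tasks keeps the progression of task types identical for all participants while reducing the chance that item position effects are confounded with particular sentences.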
All participants received feedback on both correct and incorrect responses during the practice session. Thus, the amount of feedback provided was controlled across participants. 8 As outlined above, the two treatments differed in the provision (or not) of metalinguistic information in the feedback. Feedback in the NE+MI condition confirmed or rejected the response and included metalinguistic information about the target form. Feedback in the NE–MI condition confirmed or rejected responses, but did not provide any metalinguistic information. 9 Examples of both types of feedback are provided in Figures 1 and 2. 10

Figure 1. Example of Negative Evidence + Metalinguistic Information (NE+MI) feedback for a correct response.

Figure 2. Example of Negative Evidence – Metalinguistic Information (NE–MI) feedback for a correct response.
c Language tests
Four language tests were administered: (1) a written interpretation test, (2) an aural interpretation test, (3) a written grammaticality judgment test (GJT), and (4) a sentence production test. 11 Three versions of each test, with equivalent but different items, were created, and these were administered as pre-, post-, and delayed tests. The order of test version presentation was counterbalanced across participants and test sessions, with the exception of the sentence production task, which was always completed last in order to minimize test effects from production on comprehension/input-based (i.e. receptive) tests. All language tests included trained (previously seen) and untrained (new) items, and items within each test were presented in randomized order. Participants were asked to respond as quickly and accurately as possible on the tests. A third response choice (‘I don’t know’), not included in practice, was included in the tests to minimize the artificial inflation of scores that guessing on a 2-choice test would produce.
The written and aural interpretation tests followed the same design as Tasks 2 and 5 in the practice session; participants listened to or read a Latin sentence and were instructed to select the corresponding picture (from two choices), or the additional ‘I don’t know’ response. Each test consisted of 20 items (i.e. sentences): 12 critical (6 trained and 6 untrained) and 8 distractors. Whereas the pictures in critical items represented reversed subject/object roles, as in the practice tasks, the pictures in distractors depicted entirely different scenes (different subjects, actions and objects), so that items could be answered using only vocabulary knowledge, without attention to form and meaning of the target structures.
On the GJT, participants read a sentence and indicated whether it was grammatical or not (or ‘I don’t know’) via key press. Like the interpretation tests, this test included 20 items, 12 critical (4 trained and 8 untrained) and 8 distractors. Of the 12 critical items, 6 were grammatical and 6 were ungrammatical. Of the 6 ungrammatical items, 2 had incorrect case endings, 2 had incorrect subject–verb agreement, and 2 contained both of these errors. The distractor sentences contained one noun and a verb rather than 2 nouns and a verb.
On the sentence production test, participants saw a picture on the screen and were asked to form a sentence that correctly described the picture by dragging and dropping the provided noun and verb stems as well as appropriate morphological endings (which they had to select from the complete set of endings provided) in order to form a Latin sentence. To avoid biasing a particular word order, noun and verb stems appeared onscreen in random order. For each production item, two nouns (subject and object) and a verb were required to describe the action in the picture. Of the 15 items on the production test, 10 were critical (5 trained and 5 untrained). As with the GJT, distractor sentences contained one noun and a verb rather than 2 nouns and a verb.
The scoring procedure was straightforward: one point was awarded for each correct answer to the 12 critical items on the two interpretation tests and the GJT, making 12 the maximum score on each of these 3 tests. Each sentence production item was awarded up to three points depending on the number of accurate morpheme choices: one point for correct verb morphology (i.e. correct subject–verb agreement) and one point for each correct noun ending (to score a point for a noun ending, both number and case had to be accurate; half points were not awarded). Thus, the maximum score possible for the sentence production test was 30. According to Cronbach’s alpha values (minimum = .671 to maximum = .870), test reliability was medium to high.
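The production scoring rule can be made concrete with a short sketch. The class and function names (`NounEnding`, `ProductionItem`, `score_production_item`) and the string labels for case and number are hypothetical; only the point scheme itself is taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NounEnding:
    case: str    # e.g. "nominative" or "accusative"
    number: str  # "singular" or "plural"


@dataclass(frozen=True)
class ProductionItem:
    verb_number: str          # number marked on the verb ending
    subject_noun: NounEnding
    object_noun: NounEnding


N_CRITICAL_PRODUCTION_ITEMS = 10
MAX_PRODUCTION_SCORE = 3 * N_CRITICAL_PRODUCTION_ITEMS  # 30, as in the text


def score_production_item(response: ProductionItem, target: ProductionItem) -> int:
    """Score one critical production item on a 0-3 scale."""
    points = 0
    # One point for correct verb morphology (subject-verb agreement).
    if response.verb_number == target.verb_number:
        points += 1
    # One point per noun ending; case AND number must both match,
    # so there is no partial credit for a half-correct ending.
    for given, expected in (
        (response.subject_noun, target.subject_noun),
        (response.object_noun, target.object_noun),
    ):
        if given == expected:
            points += 1
    return points
```

For instance, a response with the correct verb ending and subject ending but a plural accusative where a singular accusative was required would receive two of the three available points.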
IV Analysis and results
All analyses were run with alpha set at .05. Independent analyses were conducted for each of the four language tests, for both accuracy and RT. 12 In addition, we conducted separate analyses on trained and untrained items. Results for accuracy are presented first, followed by analyses of RT data.
1 Accuracy across all items (trained + untrained)
Table 1 summarizes descriptive statistics for accuracy on each test by treatment group (NE+MI, NE–MI). Visual inspection suggests that, overall, both groups improved from pre- to post-test, and that, though the NE–MI group’s immediate gains were more modest, they maintained those gains better than the NE+MI group from the immediate to the delayed post-test.
Table 1. Descriptive statistics: Overall accuracy.
Notes. WI = written interpretation, AI = aural interpretation, GJ = grammaticality judgment, SP = sentence production. The number of participants was different for each test due to technical problems (some computers froze during post or delayed tests).
A series of 3 × 2 (Time × Condition) mixed repeated-measures ANOVAs was performed on accuracy scores for each language test, with Condition (NE+MI, NE–MI) entered as the between-participants factor, and Time (pre-test, post-test, delayed test) entered as the within-participants factor. In addition, when a significant Time × Condition interaction was present, independent samples t-tests were conducted to compare groups’ performance on post- and delayed tests. When no statistically significant interaction was found, post-hoc contrasts were analyzed as a means to explore differences in groups’ immediate learning (performance from pre-test to post-test) and retention (performance from post-test to delayed test).
Independent samples t-tests performed on the pre-test scores yielded no differences between the groups for any of the tests: written interpretation, t(55) = .478, p = .635; aural interpretation, t(54) = .334, p = .740; GJT, t(53) = .376, p = .709; and sentence production, t(48) = .093, p = .926. Therefore, any differences between groups across time can be attributed to the treatment.
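For readers less familiar with the statistic reported here, the sketch below computes an independent samples t value and its degrees of freedom. It assumes the classic pooled-variance (Student) form, which is consistent with the whole-number degrees of freedom reported but is not stated explicitly in the text; the function name `independent_t` and the scores are hypothetical and are not the study's data.

```python
import math
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Return (t, df) for a pooled-variance independent samples t-test."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its own df.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_value = (mean(group_a) - mean(group_b)) / standard_error
    return t_value, df


# Hypothetical pre-test scores for two small groups.
t_value, df = independent_t([2, 4, 6], [1, 3, 5])
```

Because df equals the total number of participants minus two under this form, a value such as t(55) corresponds to 57 participants contributing data to that test. Obtaining the p value additionally requires the t distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.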
ANOVA results for overall accuracy are summarized in Table 2. As the table shows, results on all four Latin tests followed an almost identical pattern: significant main effects for Time and significant Time × Condition interactions on all four tests, and significant main effects for Condition on all tests except the GJT.
Table 2. Summary of results for ANOVA on overall accuracy scores.
Notes. WI = written interpretation, AI = aural interpretation, GJ = grammaticality judgment, SP = sentence production. * Indicates statistical significance at p < .05.
Results of independent samples t-tests showed that the NE+MI group outperformed the NE–MI group at post-test on all four test types (written interpretation, t(55) = 5.19, p < .001; aural interpretation, t(54) = 4.32, p < .001; GJT, t(53) = 2.92, p < .05; sentence production, t(48) = 3.005, p < .05), but this advantage remained at the delayed test only for sentence production (written interpretation, t(55) = 1.92, p = .059; aural interpretation, t(54) = 1.90, p = .063; GJT, t(53) = .11, p = .913; sentence production, t(48) = 2.37, p < .05).
2 Accuracy: Trained vs. untrained items
Next, we conducted separate analyses for trained and untrained items. Descriptive statistics for accuracy on trained and untrained items are presented in Table 3. Again, independent samples t-tests performed on the pre-test scores yielded no differences between groups for any test with either trained (written interpretation, t(55) = .187, p = .852, aural interpretation, t(54) = 1.67, p = .101, GJT, t(53) = .37, p = .713, and sentence production, t(48) = .20, p = .845) or untrained items (written interpretation, t(55) = .65, p = .516, aural interpretation, t(54) = 1.03, p = .305, GJT, t(53) = .69, p = .492, and sentence production, t(48) = .38, p = .704). Therefore, any differences between groups across time can be attributed to the treatment.
Table 3. Descriptive statistics: Accuracy for trained and untrained items.
Notes. WI = written interpretation, AI = aural interpretation, GJ = grammaticality judgment, SP = sentence production, T = trained, U = untrained.
a Trained items
Results for trained items mirrored those for overall accuracy as reported above; i.e. the NE+MI group outperformed the NE–MI group at the post-test, but only on sentence production did it maintain its advantage at delayed testing. As results for performance on untrained items revealed different patterns, we report these in more detail below.
b Untrained items
For accuracy on written interpretation, results for untrained items yielded main effects for Time, F(2, 110) = 45.06, p < .001, partial η2 = .45, and Condition, F(1, 55) = 11.68, p < .05, partial η2 = .17, and a significant Time × Condition interaction, F(2, 110) = 11.230, p < .001, partial η2 = .17. As the means in Table 3 suggest, results of the independent samples t-tests conducted on post- and delayed test scores showed that the NE+MI group outperformed the NE–MI group and maintained its advantage two weeks after treatment (post-test, t(55) = 5.14, p < .001; delayed, t(55) = 2.11, p < .05).
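The partial η2 values reported with each F ratio can be recovered, approximately, from the F value and its degrees of freedom, since partial η2 = SS_effect / (SS_effect + SS_error) = (F × df_effect) / (F × df_effect + df_error). The sketch below applies this standard identity to two of the values above; the function name `partial_eta_squared` is hypothetical, and because the published F values are themselves rounded, recomputed effect sizes can differ from the reported ones in the last decimal place.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df values positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Written interpretation, untrained items (values taken from the text above).
time_effect = partial_eta_squared(45.06, 2, 110)    # reported as .45
interaction = partial_eta_squared(11.230, 2, 110)   # reported as .17
```

Read this way, the main effect for Time accounts for roughly 45% of the variance not attributable to other effects in the design, whereas the Time × Condition interaction accounts for roughly 17%.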
For accuracy on untrained items in the aural interpretation test, significant main effects for Time, F(2, 108) = 30.34, p < .001, partial η2 = .36, and Condition, F(1, 54) = 5.33, p = .025, partial η2 = .09 were identified, and are attributed to the NE+MI group’s overall superior performance. There was no significant Time × Condition interaction, F(2, 108) = 1.06, p = .348, partial η2 = .02, power = .23. Statistical contrasts yielded no significant results for any of the test sessions: pre- to post-test, F(1, 54) = 2.17, p = .147, partial η2 = .04, power = .30; post- to delayed test, F(1, 54) = .06, p = .811, partial η2 = .001, power = .06.
For accuracy on untrained items in the GJT, the ANOVA yielded a main effect for Time, F(2, 106) = 13.62, p < .001, partial η2 = .20, but neither a main effect for Condition, F(1, 53) = 2.15, p = .148, partial η2 = .04, power = .30, nor an interaction, F(2, 106) = 2.51, p = .086, partial η2 = .04, power = .30. However, results from contrast analyses for post- to delayed tests, F(1, 53) = 5.67, p < .05, partial η2 = .10, as illustrated in Figure 3, show that the NE–MI group gained from post- to delayed test to the point where it performed similarly to the NE+MI group, whose accuracy actually declined in the two-week interval between the post- and delayed test sessions. These results are confirmed with paired-samples t-tests: Whereas the NE+MI group improved from pre- to post-test t(30) = −4.21, p < .001, and lost significantly from post- to delayed test, t(30) = 2.06, p < .05, the NE–MI group improved significantly from pre- to post-test, t(23) = −3.50, p < .001, and maintained those gains between post- and delayed test, t(23) = −1.37, p = .185.

Figure 3. Grammaticality judgment accuracy means for untrained items by treatment group.
Finally, for accuracy on untrained items in the sentence production test, there was no main effect for either Time, F(2, 96) = 3.02, p = .053, partial η2 = .06, power = .79, or Condition, F(1, 48) = 3.92, p = .053, partial η2 = .08, power = .50, but there was a significant Time × Condition interaction, F(2, 96) = 4.17, p < .05, partial η2 = .08. Results of independent samples t-tests conducted on post-, t(48) = 2.85, p < .01, and delayed tests, t(48) = 1.28, p = .205, indicated that the NE+MI group outperformed the NE–MI group on the immediate post-test, but again this between-group difference disappeared after two weeks. Nevertheless, as revealed by paired samples t-tests, whereas the NE+MI group showed performance gains from pre-test to post-test, t(25) = −3.38, p < .001, and maintained these gains on the delayed test, t(25) = 1.49, p = .150, the NE–MI group did not have any immediate gains, t(23) = .39, p = .698, nor did they improve significantly between pre- and delayed tests, t(23) = −.51, p = .618.
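The within-group comparisons above rely on paired-samples t-tests, which can be sketched as follows. The sketch assumes differences are computed as earlier score minus later score, so that improvement yields a negative t, matching the sign convention of the values reported here. The function name and scores are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(earlier, later):
    """Return (t, df) for a paired-samples t-test on earlier-minus-later differences."""
    diffs = [e - l for e, l in zip(earlier, later)]
    n = len(diffs)
    # The standard error of the mean difference uses the sample SD of the differences.
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical pre-test and post-test scores for four learners (illustration only).
pre_test = [1, 2, 3, 4]
post_test = [2, 4, 5, 5]
t_value, df = paired_t(pre_test, post_test)
print(f"t({df}) = {t_value:.2f}")
```

Under this convention, a negative t from pre- to post-test indicates a gain, and a positive t from post- to delayed test indicates a loss.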
3 Speed/RT across all items (trained + untrained)
RT, measured in milliseconds, was calculated as the time interval between the presentation of a test item and a participant’s response via key press. Only RTs for correct responses were entered into the analyses. Table 4 shows the descriptive statistics by treatment group. Overall, the means seem to indicate that both groups responded faster as a result of the treatment. Moreover, both groups appear to have performed similarly on all tests except the GJT, in which the NE–MI group appears to have responded faster on both post- and delayed tests.
Table 4. Descriptive statistics: Overall RT in seconds.
Notes. WI = written interpretation, AI = aural interpretation, GJ = grammaticality judgment.
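The RT scoring rule described above, in which only correct responses enter the analyses, can be sketched as below. The trial representation (a list of pairs holding an RT in milliseconds and a correctness flag) and the function name are assumptions made for illustration; the source does not describe its data format.

```python
from statistics import mean


def mean_correct_rt(trials):
    """Mean RT in ms over correct trials only; None if there are no correct trials."""
    correct_rts = [rt_ms for rt_ms, is_correct in trials if is_correct]
    return mean(correct_rts) if correct_rts else None


# Hypothetical trials for one participant on one test: (RT in ms, response correct?).
trials = [(1200, True), (3000, False), (1000, True)]
rt_ms = mean_correct_rt(trials)
rt_seconds = rt_ms / 1000  # the descriptive tables report RTs in seconds
print(rt_ms, rt_seconds)  # prints 1100 1.1
```

Excluding incorrect trials means that a participant's mean RT is based on fewer observations when accuracy is low, which is one reason to read RT and accuracy results together.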
As with accuracy data, we performed a series of 3 × 2 (Time × Condition) mixed repeated-measures ANOVAs for RTs on the comprehension-based tests, with separate analyses for all items, trained items only and untrained items only. The results for all items are summarized in Table 5.
Table 5. Summary of results for ANOVA on overall RT.
Notes. WI = written interpretation, AI = aural interpretation, GJ = grammaticality judgment. * Indicates statistical significance at p < .05.
As Table 5 shows, two of the ANOVAs across all items – for written and aural interpretation RTs – revealed main effects for Time, but no main effect for Condition, and no significant interaction. The main effects for Time indicate that participants responded differently across time. Follow-up contrast analyses (together with the mean values) revealed faster RTs on the immediate post-tests as compared to the pre-tests for these two assessments (written interpretation, F(1, 53) = 18.50, p < .001, partial η2 = .26, and aural interpretation, F(1, 50) = 22.71, p < .001, partial η2 = .31). The pattern that emerges from RT data on the delayed tests is more complex. First, both groups maintained their initial gains in speed on the written interpretation test, as indicated by a non-significant contrast from post-test to delayed test, F(1, 53) = 1.05, p = .310, partial η2 = .02, power = .17. However, they slowed down significantly from post- to delayed test on the aural interpretation test, F(1, 50) = 8.97, p < .001, partial η2 = .15.
Somewhat different results were obtained for the GJT RTs. As shown in Table 5, the analyses again yielded a significant main effect for Time and no main effect for Condition. Contrast analyses (between pre- and post-tests and between post- and delayed tests) revealed that all participants responded faster at post-test, F(1, 49) = 7.40, p < .01, partial η2 = .13, and maintained these RTs at delayed test, F(1, 49) = 1.15, p = .288, partial η2 = .02, power = .18. Note, however, that unlike the results for the other tasks, there was a significant Time × Condition interaction. Results of independent samples t-tests conducted on post-test, t(49) = 2.46, p < .05, and delayed test scores, t(49) = 2.23, p < .05, revealed a difference in RT between the two groups. This difference was due to reliably faster performance by the NE–MI group, as revealed in the only significant difference found in the paired-samples t-tests conducted between pre- and post-tests (NE+MI, t(28) = .55, p = .585; NE–MI, t(23) = 3.74, p < .01) (see also Figure 4).

Figure 4. Grammaticality judgment response time means by treatment group.
4 Speed/RT: Trained vs. untrained items
Descriptive statistics for RTs on trained and untrained items are presented in Table 6. ANOVA results generally mirrored those for overall accuracy reported above, i.e. significant main effects for Time, Time × Condition interactions, and main effects for Condition on all tests except the GJT, with independent samples t-tests revealing that NE+MI performed faster than NE–MI on post-tests. The only exception to this pattern in RTs was for untrained items on the aural interpretation test, with a non-significant main effect for Time, F(2, 102) = 1.82, p = .168, partial η2 = .03, power = .37. Also, there was no significant Time × Condition interaction for RTs from untrained and trained items on the GJT (trained, F(2, 78) = .70, p = .500, partial η2 = .02, power = .16; untrained, F(2, 88) = 1.36, p = .263, partial η2 = .03, power = .28).
Table 6. Descriptive statistics: RTs for trained and untrained items in seconds.
Notes. WI = written interpretation, AI = aural interpretation, GJ = grammaticality judgment, T = trained, U = untrained.
To summarize, input-based practice enhanced both accuracy and speed of response irrespective of type of feedback. This was true across all items as well as for trained items on all tests. Results showed that participants who received negative evidence with metalinguistic information (NE+MI) consistently outperformed their NE–MI counterparts on accuracy of interpretation, sentence production, and grammaticality judgment immediately following treatment. After two weeks, the between-group differences in accuracy had largely dissipated, with the NE–MI group performing similarly to the NE+MI group on all measures except the sentence production test. Analyses of RTs revealed that on the GJT the NE–MI group quickened their RTs over time more than the NE+MI group.
Analyses of performance on untrained items revealed a complex view of the effects of feedback conditions on the ability to extend newly gained knowledge to items never seen before. First, the main effects of Time show that irrespective of condition, the treatment was beneficial, as reflected in all participants’ improved performance on the interpretation tests and the GJT. In addition, in the case of the interpretation tests, the NE+MI group generally outperformed the NE–MI group in accuracy (at both post- and delayed tests for written interpretation, and overall for aural interpretation). On the GJT, the NE–MI and NE+MI groups’ performance was not significantly different across time. For sentence production, the NE+MI group showed an advantage over the NE–MI group at the post-test, but this difference had dissipated by the time of the delayed test. Results show a lack of overall trade-off effects, as speed increased with accuracy on all tests except for aural interpretation, for which neither condition seemed to alter participants’ speed of response.
V Discussion
This study investigated the differential effects of negative evidence with or without metalinguistic information on initial language development as reflected in learners’ ability – in terms of accuracy and reaction time – to assign semantic functions to noun phrases. Specifically, we looked at learners’ ability to accurately and efficiently interpret, judge and produce sentences in Latin, a language that relies on noun case (and verb agreement) morphology to convey ‘who does what to whom’. Mindful of the limitations identified in the relatively small number of studies similar to ours, we implemented a design that gave a fair chance to a less explicit treatment group (negative evidence without metalinguistic information) by controlling time on task (amount of practice) and by avoiding testing materials that favor planning and explicit processing. Recall that these factors may have unduly favored more explicit treatments in previous research. Moreover, by conducting separate analyses on accuracy and RT, and on trained and untrained items, we were uniquely able to compare groups’ initial language development in terms of both accuracy and efficiency, as well as their item and system learning. Several interesting conclusions emerge from the results reported above.
In line with DeKeyser’s (2003), Li’s (2010), Norris and Ortega’s (2000), and Spada and Tomita’s (2010) reviews and meta-analyses, as well as studies by Bowles (2005), Nagata (1993), Nagata and Swisher (1995), and Rosa and Leow (2004), our results suggest that participants who received a combination of negative evidence and metalinguistic explanations outperformed those who received only negative evidence on immediate post-tests. However, these results diverge from those of Camblor (2006), Hsieh (2007), Moreno (2007), Sanz (2004), and Sanz and Morgan-Short (2004), where there was no difference between conditions with and without metalinguistic explanation in feedback. An explanation for this difference may be found in the processing demands made by the target forms, in learner readiness, and in learner proficiency level.
With regard to the target forms, our participants were required to (1) rely on case morphology over word order and verbal agreement, (2) process eight noun endings (case morphemes), (3) attach them to the right noun (in the case of sentence production), and (4) establish verbal number agreement. This process involves far more than the four forms required in Moreno (2007), Sanz (2004), and Sanz and Morgan-Short (2004), where the choice of target form (Spanish object clitics) was also guided by Competition Model principles (Bates & MacWhinney, 1989). Furthermore, the case morphemes are not perceptually salient, as they are monosyllabic and appear at the end of the word. Moreover, there are multiple forms – feminine, masculine, singular and plural – to encode the same function, accusative or nominative, which contributes to form complexity. This all makes assigning semantic functions one of the most difficult aspects of Latin, as any teacher or learner will attest. The provision of metalinguistic information along with negative evidence may have benefited learners in creating form/meaning connections with such cognitively demanding target forms, as suggested by results in all four measures, at least at the time of the post-test.
It has also been suggested that learner readiness and level of proficiency may influence feedback effects (e.g. Iwashita, 2003; Mackey & Philp, 1998). Nagata (1993), Nagata and Swisher (1995) and Rosa and Leow (2004) examined second-year language learners; other studies (Hsieh, 2007; Moreno, 2007; Sanz & Morgan-Short, 2004) included participants with a basic level of proficiency, while our study looked at absolute beginners. A recent study by Morgan-Short, Sanz, Steinhauer, and Ullman (2010) similarly revealed that naive learners had an initial advantage in performance under a more explicit condition with metalinguistic information, as compared to a less explicit condition without metalinguistic information, in learning an artificial language. Thus, at least with respect to existing laboratory-based research on the effects of more and less explicit instruction on SLA, it appears that providing metalinguistic rules gives at least an initial advantage to naive learners when the processing of linguistic forms is cognitively demanding.
The separate analyses we conducted of untrained and trained items showed an interesting pattern of maintenance and loss of gains depending on type of test and item. First, trained items analysed alone patterned with all items analysed together. In both analyses, on all tasks, the NE+MI group had an initial advantage over the NE–MI; however, this advantage was lost on delayed tests in all cases except for sentence production, due in part to the modest but stable gains made by the group receiving negative evidence only (NE–MI). Note that the sentence production test required transfer of skills from comprehension/input-based practice to the ability to produce the target form. Thus, similar to Stafford et al. (2012), while comprehension-based practice with negative evidence alone was effective, exposure to metalinguistic information in feedback provided an advantage in retention when transfer from input- to output-based skills was involved. Note that, as suggested by an anonymous reviewer, the production test may have favored the use of metalinguistic knowledge, which may explain the maintenance of gains on this test for the group exposed to metalinguistic information.
In order to consider the differential effects of feedback on item learning vs. system learning, we must examine the pattern of results for untrained items only, which would indicate degree of success at system learning. An advantage for metalinguistic information was observed in increased accuracy for untrained items on the interpretation tests (at post- and delayed tests in written interpretation, and overall in aural interpretation), which were precisely the tests that most closely resembled the way practice was delivered in the study, i.e. matching one of two pictures with a Latin sentence where the key information is who does what to whom. We interpret these results as evidence that metalinguistic information has a significant, positive effect on the realignment of processing cues and on retention (as evidenced for written interpretation) of the new hierarchy, or ‘learning’ in Competition Model terms. In contrast, both types of feedback positively affected the learners’ ability to assign semantic functions to nouns in untrained items on the GJT. However, in this case, metalinguistic information did not contribute over and above what negative evidence alone contributed to retention of knowledge gained. Finally, metalinguistic information conferred an initial advantage for performance on untrained items in sentence production, but this advantage was not retained over the two-week interval between the immediate and delayed post-tests. We suspect that learners’ cue hierarchy restructuring – towards reliance on verbal agreement and/or on noun case morphology rather than on word order – was not stable enough to allow for transfer to new skills or modalities.
Going back to the main research question guiding the study – i.e. whether negative evidence plus metalinguistic information vs. negative evidence alone included in computer-delivered task-essential practice differentially affects absolute beginners’ ability to assign semantic functions in Latin – the answer is affirmative. Practice with feedback that includes grammatical explanation provides an advantage for system learning defined as cue realignment used to process new items. This advantage was longer-lasting for tests similar to the learning task than for tests that require transfer of skills (i.e. from comprehension/input-based practice to sentence production).
From the more accurate and faster performance observed overall in both groups on interpretation, judgment and sentence production measures, we conclude that task-essential practice and either type of feedback (i.e. negative evidence with or without metalinguistic information) positively affected learners’ short-term ability to assign semantic functions to nouns. These results obtained irrespective of measurement and generally run parallel to those identified in previous literature on pedagogical conditions in SLA (Norris & Ortega, 2000; Spada & Tomita, 2010).
Until recently, the loss of effects in more explicit treatments and maintenance of (more modest) effects in less explicit treatments (e.g. Li, 2010; Norris & Ortega, 2000, but see Spada & Tomita, 2010) have not received much attention, for a number of reasons. Early research was limited to immediate effects of explicit instruction, with implicit or less explicit groups sometimes acting as controls (e.g. Herron, 1991). Also, the ‘surprising’ progress in less explicit groups has sometimes been attributed to a lack of control of external exposure, especially in early classroom research, a suspicion that has no place in our study given its controlled laboratory design. To what, then, can we attribute the apparently more stable gains of the NE–MI group, compared with the losses on some measures by the NE+MI group between immediate and delayed tests? This difference in stability of accuracy gains over time may reflect qualitatively different learning processes at work: more explicit processes in the NE+MI group and more implicit processes in the NE–MI group (Li, 2010). The problem is that immediate post-tests cannot reflect the full extent of implicit learning because implicit learning takes more time than explicit learning and includes a latent phase of experience-triggered memory consolidation following practice (see, for example, Ari-Even Roth, Kishon-Rabin, Hildesheimer, & Karni, 2005). Such consolidation processes would occur subsequent to, and thus not be captured by, performance on immediate post-tests.
Faster performance by the NE–MI group on the GJT also suggests that the two groups may have been engaged in qualitatively different processing that led to quantitatively similar accuracy outcomes, providing further evidence that different types of instruction may have led to different types of processing in this sample. The faster processing of the NE–MI group on the GJT is reminiscent of R. Ellis’ (2005) claim that a timed GJT may be a good measure of implicit knowledge. Faster RTs are usually interpreted as a sign of increased automaticity, whereas slower RTs are taken to index reliance on slower, controlled processes, including monitoring. For example, Sanz, Lin, Lado, Bowden and Stafford (2009) found that participants who were required to verbalize their thought processes while interacting with a grammar lesson were slower at the post-test, which the authors interpreted as showing that verbalizations had affected the quality of the cognitive processes involved in learning for participants in the more explicit condition.
Irrespective of whether faster reaction times index increased automatic processing or decreased monitoring, they are considered a sign of efficiency when accuracy is maintained. Thus, the NE–MI group appears to have been engaged in more efficient, potentially more automatic and less monitored processing of the L3 on the GJT even when accuracy was similar.
Evidence for different cognitive processes underlying similar performance was also found by Morgan-Short et al. (2010), a study that employed an artificial language paradigm to examine whether explicit and implicit training differentially affect neural (electrophysiological) and behavioral (performance) measures of syntactic processing (noun–article and noun–adjective gender agreement). Explicit training conditions in this study included metalinguistic explanations and meaningful examples of the target language, and implicit training conditions provided only meaningful examples. Results showed that at high proficiency (i.e. when participants had completed a certain number of practice blocks), accuracy for the explicitly and implicitly trained groups did not differ. In contrast, electrophysiological (event-related potential, ERPs) measures revealed striking differences between the groups’ neural activity: While explicit training (with metalinguistic information) resulted in some aspects of brain processing found in native speakers, only implicit training (without metalinguistic information) led to a fully native-like neurocognitive pattern.
Overall, our results suggest that exposure to both task-essential practice and either type of feedback (negative evidence with or without metalinguistic information) is beneficial for L2 learning. This is true both in terms of accuracy and speed of processing. Even without metalinguistic information, learning occurs, although it appears to require more exposure to the target language (cf. N. Ellis, 1993) and more time for consolidation (Ari-Even Roth et al., 2005). We speculate that with more practice and/or more time for consolidation, participants exposed to negative evidence only might achieve levels of performance at least comparable and perhaps superior to those obtained by their counterparts, and their gains in achievement would eventually include productive skills.
VI Conclusions and future research
In line with previous studies investigating the effectiveness of negative evidence with or without metalinguistic information, we conclude that both types of feedback appear to lead to more accurate and faster ability to interpret, judge, and produce target sentences for naive learners of Latin learning how to interpret and assign semantic functions to nouns. Providing metalinguistic information gives an initial advantage to naive learners on accuracy when processing the target form is cognitively demanding or when transfer from input- to output-based skills is involved. Two weeks are enough, however, to see most of that advantage disappear, as participants apparently lose much of what they had learned from exposure to metalinguistic information. This is especially evident when processing items that were part of the treatment and could therefore be remembered (and, consequently, be susceptible to forgetting). Importantly, evidence of more stable (though more modest in the short run) gains by learners receiving negative evidence alone suggests that simple right/wrong feedback combined with task-essential, written and aural comprehension-based practice that focuses on connecting form with meaning leads to sustained gains. It may be that receiving negative evidence alone allows learners to engage in more implicit processing that is evidenced in quicker reaction times and that may, in the long run, foster more stable learning than explicit processing.
Limitations to take into account when considering the interpretation of results presented above include the number of items included in each test (12 in comprehension-based and 10 in production) and the size of the partial η2 found for some of the interactions (see Tables 2 and 5). Future research should address these limitations by implementing tests with more items and, whenever possible, by including more than three options per item.
In addition, future research should investigate the differential effects of feedback given other linguistic structures, languages, tasks, and instructional contexts, but we would like to underscore the importance of looking at treatment length/amount of exposure and time elapsed between tests to see if, after a longer period of time and/or additional exposure, learners who are exposed to less explicit feedback continue to improve to eventually outperform those exposed to feedback with explicit, metalinguistic information.
Acknowledgements
We would like to acknowledge the two anonymous reviewers for their constructive comments. In addition, we thank Alison Mackey for her helpful insights and Rusan Chen for his statistical expertise. Any remaining errors are ours.
Funding
This study is part of The Latin Project, developed to investigate the relationship between individual differences and pedagogical variables in language acquisition with support from Georgetown’s GSAS and Spencer Foundation grants to Sanz, as well as assistance from Bill Garr and RuSan Chen of Georgetown’s UIS/CNDLS. Lado conducted the experiment with materials developed by Sanz, Bowden, and Stafford, analyzed the data, and wrote the manuscript with Sanz, with subsequent extensive review by Bowden and Stafford.
