Abstract
Most second language researchers agree that there is a role for corrective feedback in second language writing classes. However, many unanswered questions remain concerning the linguistic features to target and the type and amount of feedback to offer. This study examined essays by 151 learners of English as a second language (ESL), in order to investigate the effect of either direct or metalinguistic written feedback on errors with the simple past tense and the present perfect tense. This inquiry also considered the extent to which learner differences in language-analytic ability (LAA), as measured by the LLAMA F, mediated the effects of these two types of explicit written corrective feedback. Learners in both feedback groups were provided with corrective feedback on two essays whereas the control group received general comments on content. Learners in all three groups then completed two additional writing tasks to determine whether or not the provision of corrective feedback led to greater gains in accuracy compared to no feedback. Both treatment groups performed better than the comparison group on new pieces of writing immediately following the treatment sessions, yet direct feedback was more durable than metalinguistic feedback for one structure, the simple past tense. Participants with greater LAA proved more likely to achieve gains in the direct feedback group than in the metalinguistic group, whereas learners with lower LAA benefited more from metalinguistic feedback.
I Introduction
From an acquisition perspective, a flaw found in many early written corrective feedback (CF) studies was that they did not consider the effects of feedback on new pieces of writing, but only whether or not feedback helped students to achieve greater accuracy on a second draft (e.g. Ashwell, 2000; Ferris & Roberts, 2001). Later investigations, however, have attempted to answer the question of whether or not error correction will ultimately lead to learning and greater accuracy in second language (L2) development by following research designs similar to those used to investigate the various forms of oral error correction. The results of these more recent studies (e.g. Bitchener, 2008; Bitchener & Knoch, 2010a, 2010b; Ellis, Sheen, Murakami, & Takashima, 2008; Sheen, 2007, 2010; Sheen, Wright, & Moldawa, 2009; Shintani, Ellis, & Suzuki, 2014; Stefanou & Révész, 2015; Van Beuningen, de Jong, & Kuiken, 2012) provide consistent evidence that written CF can help learners acquire grammatical features and thus present a serious challenge to Truscott’s (1996) claim that error correction is ineffective and undesirable. Most L2 researchers agree, then, that there is indeed a role for corrective feedback in L2 writing classes. However, many unanswered questions remain concerning the linguistic features to target and the type and amount of feedback to offer.
The present study aimed to contribute to this research agenda by examining the differential effects of direct and metalinguistic written feedback on errors with the simple past tense and the present perfect tense. Since several of the written CF studies conducted since the 1990s have focused on the same linguistic target, English articles, the current study was also designed to determine whether or not focused direct and metalinguistic feedback would be effective for treating errors with different structures. 1 One hundred and sixty-five adult learners of English as a second language (ESL) were matched in sets based on first language (L1) and scores from a test of grammatical inferencing, and then randomly assigned to one of three groups: direct feedback, metalinguistic feedback, or control. Learners in the feedback groups were provided with CF on two essays, after which learners in all three groups completed two additional writing tasks to determine whether or not the provision of CF led to greater gains in accuracy compared to no feedback. By including a test of grammatical inferencing, this investigation also considered the extent to which learner differences in language-analytic ability (LAA) mediated the effects of these two types of explicit written corrective feedback.
1 Literature background
Written CF is explicit in the sense that it informs the learner that he or she has made an error. In some cases, the correct form is given to the learner, which is generally termed ‘direct feedback’; in other cases, the error is simply pointed out without correction, which is considered ‘indirect feedback’. In yet other cases, the correct form is given directly, along with metalinguistic information or grammar rules. In addition, decisions must be made along a continuum between narrowly focused CF (intensive correction of one or a limited number of errors) and unfocused (comprehensive) CF.
A number of recent studies have investigated the effectiveness of different types of focused written CF and have obtained mixed results. Bitchener, Young, and Cameron (2005) and Sheen (2007) found an advantage for direct error correction that included metalinguistic information; Shintani and Ellis (2013) found an advantage for metalinguistic explanation; and Shintani et al. (2014) showed an advantage for direct error correction; other studies, however, such as Bitchener (2008), Bitchener and Knoch (2008; 2010a), and Stefanou and Révész (2015), found little or no difference between direct error correction only and direct error correction that included metalinguistic information.
Arguments in favor of direct forms of feedback suggest that it reduces confusion and provides information to sort out more complex (e.g. syntactic) errors (Bitchener & Ferris, 2012). The state of a learner’s knowledge, however, may be a more critical factor in determining the effectiveness of different forms of feedback. If a learner doesn’t have clear declarative knowledge, then direct examples of form in conjunction with metalinguistic explanation might be required. On the other hand, if the learner has solid declarative knowledge, direct feedback that only supplies the correct form or even indirect feedback might suffice.
The complexity of the structure also plays a role in the effectiveness of different forms of written feedback. Ferris suggests that errors with verb tense and form are treatable since they occur in a ‘patterned, rule-governed way’ (1999, p. 6), and research on oral CF as well as studies on written CF lend support to this claim (e.g. Bitchener et al., 2005; Yang & Lyster, 2010). For instance, Frear (2012) found that written CF improved learners’ use of the regular past tense, but not the irregular past tense. This finding supports the idea that structures for which there are patterned rules (regular past tense) are more treatable than those for which there are no clear patterns. Other factors, however, such as the salience of a feature and its relevance for meaning, also play a role in error treatability. As put forth by DeKeyser (2005, 2016), the complexity of an L2 structure and the difficulty it presents for learners can stem from a combination of factors including abstractness or novelty in meaning, the number of choices involved in choosing the correct morphemes for each form, and lack of transparency in form-meaning mapping. Thus, errors with verb tense choice (e.g. simple present vs. simple past) may be more treatable than errors which involve an understanding of aspect, which requires more novel distinctions for most learners (e.g. Ayoun, 2001, 2004; Ishida, 2004).
2 Individual differences
The increase in accuracy that appears in new pieces of writing after the provision of written CF suggests that learning has taken place, but great individual variation often exists among learners. Individual difference variables have been identified but have not been sufficiently explored for their potential in determining the effectiveness of various types of written CF. In an ESL setting, differences in proficiency level, L1, motivation, and aptitude likely play a role in the effectiveness of different types of CF and may explain some of the contradictions and inconsistencies in the literature.
Aptitude is an important construct when considering CF in second language acquisition (SLA). It is generally agreed that LAA may predict the ‘rate of progress’ for individual learners (Dörnyei & Skehan, 2003; Dörnyei, 2005). Evidence furthermore indicates that learners with higher LAA will have greater success when making the cognitive comparisons that are required for CF to result in learning (e.g. DeKeyser, 1993; Sheen, 2007; Shintani & Ellis, 2015). Overall, studies have found a strong correlation between more implicit or inductive instructional conditions and higher language aptitude (e.g. Erlam, 2005; Robinson, 1997).
Dörnyei and Skehan (2003) suggest that there are also individual differences in ‘noticing abilities’ and ‘pattern-extraction capacities’ that allow some individuals to analyse and make generalizations more effectively than others (p. 599). A memory-oriented, less analytic learner, for example, might do better with corrective feedback chunks where only the correct form is provided in lieu of a more analytic form of CF where the rule is provided.
Concerning written CF, only three studies to date have considered LAA. Sheen (2007) found that a higher level of LAA correlated with the extent to which students benefited from both types of written CF (direct correction and direct with metalinguistic) tested in her study, which targeted one linguistic form, articles. She also found that learners with lower LAA were less likely to benefit from metalinguistic feedback.
Shintani and Ellis (2015) examined the role of LAA with different kinds of written feedback for two different structures, the past hypothetical conditional and indefinite articles. Like Sheen, Shintani and Ellis found that learners with stronger LAA benefited more from both types of feedback than learners with weaker ability. Their results do not lend clear support to the claim that LAA will play a stronger role when the feedback is metalinguistic in nature.
Finally, Stefanou and Révész (2015) explored two aspects of English article use, generic and specific plural reference; this study also investigated individual differences in LAA and further included a measure of knowledge of metalanguage. In contrast to Sheen (2007), participants with greater grammatical sensitivity and knowledge of metalanguage were more likely to achieve gains in the direct feedback-only group. There was no link found between individual differences and the gains made by the group who received direct feedback with metalinguistic comments.
Thus, the role that LAA plays in written CF is unclear. Recently, SLA research has begun to consider individual subcomponents of language aptitude (e.g. DeKeyser & Koeth, 2011; Dörnyei, 2005; Kormos, 2013) and how these language aptitude components can be matched to instruction to optimize language learning (e.g. Doughty, 2013; Vatz, Tare, Jackson, & Doughty, 2013). Yet, carefully controlled experiments on written CF that systematically explore various individual difference factors are relatively few. The research reported here has attempted to offer some additional clues to the written CF puzzle by investigating the effects of written CF for learners of different levels of proficiency and by determining the effects of one individual difference, language aptitude.
II Method
1 Research questions
In addition to investigating different types of written CF and different categories of linguistic error, this study considered the interaction of LAA with different forms of CF. The research was designed to address the following research questions:
Research question 1: Does direct corrective feedback (DF) lead to more accurate use of the targeted structures than writing with no feedback?
Research question 2: Does metalinguistic feedback (MF) lead to more accurate use of the targeted structures than writing with no feedback?
Research question 3: Is there any difference in the effect of DF and MF on learners’ accurate use of the simple past tense and the present perfect tense?
Research question 4: To what extent does the learners’ language-analytic ability determine the effectiveness of DF or MF?
2 Setting and participants
The participants for this study were recruited from 14 English for academic purposes (EAP) classes (low-intermediate to advanced levels 2 ) at a large state college in the USA. One hundred and sixty-five participants consented to participate in the study, and one hundred and fifty-one participants completed all parts of the data collection (administered in six sessions). There were 27 different L1s, with substantial groups of native Spanish speakers (n = 59), Vietnamese speakers (n = 28), and Arabic speakers (n = 23).
3 Target structures
To date, the majority of the focused feedback studies (e.g. Bitchener, 2008; Bitchener & Knoch, 2008, 2010a, 2010b; Ellis et al., 2008; Sheen, 2007, 2010; Sheen et al., 2009) have investigated specific functions of the English article system and found that groups who received corrective feedback outperformed the control group for specific uses of English articles. Very few written CF studies have targeted other linguistic error domains and categories. One exception is the study by Bitchener et al. (2005), which investigated not only the use of the English article system, but also the simple past tense and prepositions. These researchers found that written CF helped learners to improve their accuracy with specific uses of articles and verb tense, but not with the more idiosyncratic uses of prepositions. Another exception is Shintani et al. (2014), who found that written CF was effective for improving accuracy with a syntactic structure, the hypothetical conditional. There remain, however, many structures that have received little or no focused research attention.
The current research, therefore, aimed to examine the effectiveness of different forms of corrective feedback on errors with two verb tenses, simple past and present perfect. Such errors are often persistent regardless of the learner’s L1 and proficiency level. Comprehending the semantics of where the past tense ends and the present perfect begins is particularly tricky for ESL students (for discussion, see Celce-Murcia & Larsen-Freeman, 1999, pp. 113–116).
4 Treatment
CF was provided electronically for errors that occurred during two in-class writing tasks with simple past and present perfect. Course instructors were asked to refrain from explicit teaching of the target structures during the course of the study. Learners were corrected when they made an error with simple past and present perfect verb forms (e.g. The man has sleeped in the street) and when they made an error with choice of verb tense (e.g. I saw a lot of movies lately), but only in cases where either simple past or present perfect was required.
The type of CF given differed for the two treatment groups. The DF group received the traditional strategy that consists of marking the error on the student’s essay and indicating the correct form. The MF group likewise had their errors marked; this group received consistently worded metalinguistic comments (e.g. brief grammar rules provided via comments on the students’ Microsoft Word documents) but did not receive the correct form (for examples, see Appendix 1). The control group received general comments on content and organization, but did not receive any error correction about grammar.
5 Procedures
The current study employed a pretest-treatment-posttest-delayed posttest experimental design. One week prior to the first writing task, the students were given the LLAMA F, a sub-set of the LLAMA computer-based aptitude test (Meara, 2005), as a measure of LAA. The LLAMA F specifically measures grammatical inferencing or the ability to induce the rules of an unknown language (e.g. explicit inductive learning). At this time, participants were also given a form-focused pretest to measure prior knowledge of the target structures (see Appendix 2). Participants were matched in sets based on LLAMA F scores and L1, after which members of each set were randomly assigned to either one of the two treatment groups (DF or MF) or the control group.
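The matched random assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not describe the exact matching algorithm, so the function name `assign_matched_sets`, the use of sets of three, and the rule of ranking learners by LLAMA F score within each L1 are all hypothetical choices.

```python
import random
from collections import defaultdict

GROUPS = ("DF", "MF", "control")

def assign_matched_sets(participants, seed=0):
    """Assign learners to groups within matched sets.

    participants: list of (participant_id, l1, llama_f_score) tuples.
    Assumption (not from the article): a matched set is three learners
    who share an L1 and have neighbouring LLAMA F scores.
    """
    rng = random.Random(seed)
    by_l1 = defaultdict(list)
    for pid, l1, score in participants:
        by_l1[l1].append((score, pid))

    assignment = {}
    for l1 in sorted(by_l1):
        ranked = sorted(by_l1[l1])  # neighbours now have similar scores
        for start in range(0, len(ranked), len(GROUPS)):
            matched_set = ranked[start:start + len(GROUPS)]
            order = list(GROUPS)
            rng.shuffle(order)  # random assignment inside the set
            for (_score, pid), group in zip(matched_set, order):
                assignment[pid] = group
    return assignment
```

Because each complete set contributes exactly one learner to each group, the three groups end up with similar L1 composition and similar LLAMA F distributions before any feedback is given.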
The following week, during one hour of class time, all three groups were asked to write an essay (Essay 1). The writing prompts used for the study were comparable and based on the rhetorical modes taught within the EAP program. The order of the prompts (topics) was counterbalanced over individuals to control for differences in writing task difficulty. The students used a word processor for the writing task and were instructed to write at least two pages using Times New Roman 12-point text. The directions were the same for all three groups, and all participants were instructed by the researcher to pay attention to accuracy (for writing prompts, see Appendix 3). The students submitted their essays to a drop box within the learning management system (LMS) used by the college. The researcher provided all of the feedback via the LMS, which allowed for systematic and clear mark-up. The essays produced by the students were corrected within one week. The researcher then returned to each classroom the following week (Week 3) and returned the essays with feedback to the students.
The students were given 30 minutes in class to revise their essays using the feedback. The revised essays were then collected. Learners were not given feedback on the quality of their revisions. The following week (Week 4), students wrote a second essay in class (Essay 2). The second essay for all groups was likewise corrected and returned one week later (Week 5), at which time the students were again given 30 minutes to revise. After a break, but immediately following this revision session, learners were asked to write on a new topic (Essay 3, which served as the immediate posttest). An unannounced delayed written posttest (Essay 4) was conducted four weeks later (Week 9), at which time the students were asked once again to write on a new topic; see Table 1.
Table 1. Experimental procedure.
6 Operationalization of variables
All written output produced by the participants was coded for correct use of each targeted linguistic feature. Specifically, for each essay, all obligatory instances of the simple past tense were tallied. If the simple past tense was not correctly provided in one of these obligatory contexts, this was counted as an error in use of the simple past tense. The same procedure was followed for the present perfect tense.
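The scoring rule above amounts to a proportion of correct suppliance in obligatory contexts. The following Python sketch shows that arithmetic; the function name `percent_correct` and the representation of an essay as a list of booleans are assumptions made for illustration, not the coding instrument used in the study.

```python
def percent_correct(contexts):
    """Percentage of obligatory contexts in which the target form was correct.

    contexts: list of booleans, one per obligatory context in an essay
    (True = target tense supplied correctly, False = counted as an error).
    """
    if not contexts:
        return None  # no obligatory contexts, so accuracy is undefined
    return 100.0 * sum(contexts) / len(contexts)

# A hypothetical essay with 8 obligatory contexts, 6 of them correct.
essay = [True] * 6 + [False] * 2
print(percent_correct(essay))  # 75.0
```

Returning `None` for an essay with no obligatory contexts keeps such essays from being mistaken for 0% accuracy.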
The coding process was blind, in that the researcher coded without knowledge of which group each text belonged to. To examine the reliability of the scoring, 76 texts (over 12 percent of the 604 texts) were rescored by a second trained teacher/researcher. The overall agreement rate between the two scorers (both native speakers of English) was high for both structures as determined by the intra-class correlation coefficient of .95 for the simple past tense and .89 for the present perfect tense.
III Results
1 Descriptive statistics for LLAMA F
LLAMA F scores ranged from 0 to 90 for each group. LLAMA F means for the three groups were compared using a one-way ANOVA. The mean for the DF group was 32.5 (SD = 25.35); the mean for the MF group was 31.6 (SD = 24.77), and the mean for the control group was 29.8 (SD = 25.21). As expected, since participants were matched in sets based on LLAMA F scores and L1 before being randomly assigned, no significant difference was found between the three groups (F(2, 148) = .151, p > .05).
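For readers who want to see where the degrees of freedom come from, here is a minimal Python sketch of a one-way ANOVA F statistic. The function name `one_way_anova` and the toy data are illustrative assumptions; the individual LLAMA F scores are not reported in the article, so the sketch does not reproduce the reported F value.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data, not the study's scores.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # (3.0, 2, 6)
```

With three groups and 151 participants, the same formulas give 3 − 1 = 2 and 151 − 3 = 148 degrees of freedom, matching the F(2, 148) reported in the text.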
2 Form-focused grammar test
A reliability estimate for the form-focused pretest was calculated using the data from the 151 participants who completed all data collection sessions. 3 Cronbach’s alpha for the pretest was .831. A series of one-way ANOVAs were conducted to compare group means for each section of the test. For the verb tense section of the pretest, the mean for the DF group was 7.67 (SD = 2.07); the mean for the MF group was 7.90 (SD = 1.89), and the mean for the control group was 7.65 (SD = 1.80). No significant differences were found between groups (F(2, 148) = .253, p > .05).
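Cronbach’s alpha can be computed from item-level scores with the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The Python sketch below illustrates that formula only; the function name `cronbach_alpha` and the toy item scores are assumptions, since the item-level pretest data are not given in the article.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a test.

    item_scores: one list per test item, each holding every
    participant's score on that item (same participant order throughout).
    """
    k = len(item_scores)
    # Each participant's total score across all items.
    totals = [sum(person) for person in zip(*item_scores)]
    item_var_sum = sum(pvariance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Two toy items answered identically by four learners: perfectly consistent.
print(cronbach_alpha([[1, 0, 1, 0], [1, 0, 1, 0]]))  # 1.0
```

Population variance is used for both the items and the totals; using sample variance for both would give the same alpha because the correction factor cancels in the ratio.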
3 Simple past tense in the writing tasks
Simple past tense scores were analysed to examine the effect of feedback type on accurate use of this verb tense in subsequent writing tasks. A total of 8,106 observations from 139 participants 4 were included in the analysis (for a breakdown by group and essay time, see Table 2). The percentage correct for simple past tense for each group at each essay time is given in Table 3, and the numbers of correct and incorrect trials with the simple past tense at each essay time for each group are visually presented in Figure 1.
Table 2. Total number of obligatory contexts for simple past tense at each essay time.
Table 3. Percentage correct for simple past tense at each essay time.

Figure 1. Correct and incorrect trials with simple past tense for each group at each essay time.
Written corrective feedback studies have reported (e.g. Ferris, Liu, Sinha, & Senna, 2013) that there is great variation between individuals in their response to feedback, and certainly in the present study the participants, who came from different levels of EAP courses, varied as to their fluency, accuracy, and ability to improve over time. Moreover, the number of trials or attempts to use the simple past tense also varied not only between participants but from essay to essay for each learner. Thus, the percentage correct for each structure is based on a different number of trials for each participant. All data were analysed using mixed-effects logit models in R (R Development Core Team, 2011) with the R package lme4 (Bates, Maechler, & Bolker, 2011), as mixed-effects logit models allow every observation to be modeled while accounting for the fact that there are many observations for each participant (Jaeger, 2008).
To examine the effect of group on accurate use of the simple past tense, a mixed-effects logit model was fit with group, essay time, and the interaction between group and essay time as fixed effects, and participant as a random effect. In order for the model to converge with the above interaction terms, it was necessary to eliminate one of the essay times from the model. For both treatment groups there were sizable gains in accuracy between essay time one and time two, after one round of feedback but, in fact, little change between time two and time three. Since the purpose of the study was to determine if two rounds of feedback would lead to improvements on a third writing task, the reported model includes only the pretest (Essay 1), the immediate posttest (Essay 3), and the delayed posttest (Essay 4).
Additional models, which included standardized LLAMA F scores and standardized form-focused pretest scores and interactions with these effects, were fit and compared. The best model, presented in Table 4, includes form-focused pretest scores, LLAMA F scores, group, essay time, and the interaction between group and essay time, as well as interactions between LLAMA F and group, LLAMA F and essay time, and LLAMA F, group, and essay time, as fixed effects. The model was likewise fit with random effects for participant, including the random intercept and random slopes for essay time since the results of a comparison between models showed that random slopes significantly improved model fit, χ²(6) = 84.091, p < .001.
Table 4. Result summary of coefficient estimates β, standard errors SE, associated Wald’s z-score, and significance level p for the fixed and random effects in the simple past tense analysis.
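A model comparison of this kind is a likelihood ratio test: the statistic is referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The Python sketch below checks the p-value implied by a reported statistic. It is not the authors' R code; the function name `chi2_sf_even_df` is hypothetical, and the closed form it uses holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)**i / i!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)**i / i! incrementally
        total += term
    return math.exp(-half) * total

# Statistic reported for adding random slopes in the simple past model.
print(chi2_sf_even_df(84.091, 6) < 0.001)  # True
```

The resulting probability is far below .001, which is consistent with the significance level reported for the random-slopes comparison.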
Note. * p < .05; ** p < .01; *** p < .001.
The model reported in Table 4 was used to assess several paired comparisons. At Essay Time 1, which served as the pretest (Essay 1), the DF group (69%) was significantly less accurate than the MF group (80%, β = −0.59, SE = 0.27, p = .026), and marginally lower than the control group (78%, β = −0.47, SE = 0.26, p = .075). The MF and control groups, however, were not significantly different from each other (p = .639). Since the groups began at different levels of predicted accuracy, the most important results concern the gains made by each group.
The results indicate that there was a significant interaction between essay time and group. The DF group showed a significant increase in probability of a correct response from the time of Essay 1 (69%) to the time of Essay 3 (84%, β = .87, SE = 0.20, p < .001). As shown in Table 5 and Figure 2, these gains were maintained at the time of the delayed posttest (82%, β = .77, SE = 0.19, p < .001). The model also indicates that these gains in accuracy achieved by the DF group were significantly different compared to changes seen at Time 3 (β = −0.79, SE = 0.28, p < .005) and Time 4 (β = −0.88, SE = 0.27, p < .001) in the control group.
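Because the model coefficients are on the log-odds scale, it can help to see how a coefficient maps onto the percentages quoted in the text. The Python sketch below is a plain illustration of the logit and inverse-logit transformations; the helper name `apply_change` is an assumption, and because the starting percentages in the text are rounded, the results are approximate.

```python
import math

def logit(p):
    """Convert a probability (0 < p < 1) to log odds."""
    return math.log(p / (1.0 - p))

def inv_logit(log_odds):
    """Convert log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def apply_change(start_pct, beta):
    """Percentage correct after adding beta (a log-odds change) to a starting percentage."""
    return 100.0 * inv_logit(logit(start_pct / 100.0) + beta)

# DF group: 69% at Essay 1 plus the reported change of 0.87 log odds.
print(round(apply_change(69, 0.87)))  # 84
```

The same change in log odds corresponds to a smaller change in percentage points when the starting accuracy is already high, which is one reason gains are compared on the log-odds scale rather than as raw percentages.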
Table 5. Predicted probabilities (log odds with percentage correct in parentheses) of accurate use of simple past tense by group at each essay time.

Figure 2. Plot of predicted probabilities (log odds) of accurate use of simple past tense by group at each essay time.
The MF group likewise made significant gains in accuracy from the time of Essay 1 (80%) to the time of Essay 3 (88%, β = .63, SE = 0.22, p = .005). The model further indicates, however, that this change in accuracy was only marginally different from changes seen in the control group (β = −0.55, SE = 0.30, p < .066). Furthermore, the MF group lost these gains by the time of the delayed posttest (81%, β = .07, SE = 0.20, p = .718).
The control group, however, exhibited no significant changes in accuracy between the time of Essay 1 (78%), Essay 3 (79%, β = .08, SE = 0.21, p = .708), and Essay 4 (78%, β = −0.12, SE = 0.19, p = .546). For the DF group, there was also a significant two-way interaction between LLAMA F and Time, in particular a positive change in trial-level accuracy from Time 1 to Time 3 (β = .42, SE = .20, p = .035). Likewise, for the MF group, there was a significant two-way interaction between LLAMA F and Time, which reflects a reduction in accuracy at Time 3 (β = −0.60, SE = .23, p = .009). The two-way interaction for the control group between LLAMA F and Time (at Time 3) was not significant (β = −0.18, SE = .21, p = .387).
Significant three-way interactions between essay time, LLAMA, and group confirm that the effect of LLAMA seen by the direct group at Time 3 is significantly different from the effect of LLAMA seen in the metalinguistic group (β = −1.02, SE = .31, p = .001), and significantly different from the effect of LLAMA seen in the control group (β = −.60, SE = .29, p = .038). There was no three-way interaction, however, between Time, LLAMA, and metalinguistic vs. control groups (β = .42, SE = .31, p = .179) at Time 3, meaning that the effect of LLAMA seen in the metalinguistic group may not be different from what is seen in the control group. There were no significant three-way interactions at Time 4 (delayed posttest).
These results suggest a possible aptitude-treatment interaction (ATI). In order to test the differential effect of treatment depending on aptitude, additional models were fit with standardized LLAMA F scores that were shifted two standard deviations above and below the mean to create new variables to allow for comparisons between the effects of these two types of feedback on learners with higher and lower levels of language aptitude.
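One common way to build such shifted variables is to re-centre the standardized score so that zero corresponds to a learner two standard deviations above (or below) the mean; the remaining coefficients in a model with interactions then describe that learner. The Python sketch below illustrates this re-centring only. The function names are hypothetical, and the article does not confirm that the shift was implemented in exactly this way.

```python
from statistics import mean, stdev

def standardize(scores):
    """Convert raw scores to z-scores (mean 0, SD 1)."""
    m, sd = mean(scores), stdev(scores)
    return [(s - m) / sd for s in scores]

def recenter(z_scores, shift_sd):
    """Move the zero point to shift_sd standard deviations from the mean.

    shift_sd = +2 makes zero represent a high-aptitude learner;
    shift_sd = -2 makes zero represent a low-aptitude learner.
    """
    return [z - shift_sd for z in z_scores]

z = standardize([10, 20, 30])      # toy scores, not study data
print(z)                           # [-1.0, 0.0, 1.0]
print(recenter(z, 2))              # [-3.0, -2.0, -1.0]
```

Refitting the same model with the re-centred predictor leaves its overall fit unchanged but shifts the reference point at which the group and essay-time effects are evaluated.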
Learners with higher LLAMA F scores began with 57% predicted accuracy in the DF group, and learners in the MF group began with 88% accuracy. This difference in accuracy between groups at the start of the study was significant (β = 1.73, SE = .61, p = .005). Learners with higher LLAMA F scores in the DF group significantly increased their rate of accuracy at Time 3 (88%, β = 1.708, SE = .45, p < .001), and although somewhat lower, these learners’ gains were still significant at Time 4 (78%, β = .973, SE = .41, p = .018). In contrast, learners with higher LLAMA F scores in the MF group began with 88% accuracy, yet these learners’ accuracy decreased to 81% at Time 3 (β = −0.580, SE = .51, p = .254) and to 85% at Time 4 (β = −0.252, SE = .46, p = .584). Neither decrease, however, was significant. For learners with higher LLAMA F scores, the difference in gains between the DF and MF groups at Time 3 (β = −2.288, SE = .68, p = .001) and Time 4 (β = −1.225, SE = .62, p = .048) was significant. These results suggest that learners with higher LAA would benefit more from direct feedback than from metalinguistic feedback.
On the other hand, learners with lower LLAMA F scores were predicted by the model to begin with 78% accuracy in the DF group and 68% accuracy in the MF group. This initial difference in accuracy was not significant (β = −0.543, SE = .61, p = .374). Learners with lower LLAMA scores in the DF group made slight gains in accuracy from Time 1 (78%) to Time 3 (79%, β = .031, SE = 0.45, p = .945). These gains were not significant. This group increased again slightly at Time 4, but the gain was still not significant (86%, β = .558, SE = 0.419, p = .182). In contrast, in the MF group these learners were predicted to benefit from metalinguistic feedback. They significantly improved from Time 1 (68%) to Time 3 (93%, β = 1.83, SE = .52, p < .001); however, their accuracy rate decreased to 76% at Time 4 (β = .393, SE = .45, p = .380). The difference in accuracy rate between Time 1 and Time 4 was not significant.
The difference in gains at Time 3 between the MF and DF groups was significant (β = 1.799, SE = .68, p = .009). However, the difference in gains between these two groups at Time 4 was not significant (β = −0.165, SE = .61, p = .787). Thus, at the time of the immediate written posttest (Time 3), this model suggests that learners with lower LAA benefitted more from metalinguistic feedback than direct feedback.
In summary, these comparisons imply that learners with lower LAA will be more successful with metalinguistic feedback, whereas learners with higher LAA will benefit more from direct feedback, at least for the simple past.
4 Present perfect tense in the writing tasks
All 151 participants made at least a few errors with the present perfect tense. However, seven participants, who deviated from the writing prompts and avoided using the perfect tense on two or more essays, were deleted from the data set since they did not receive two rounds of feedback on the target structures.
A total of 2,480 observations from 144 participants were included in the analysis for the present perfect tense. A breakdown of the number of observations at each essay time for each group is presented in Table 6. The percentage correct for the present perfect tense for each essay and each group is given in Table 7, and the numbers of correct and incorrect uses of present perfect tense are visually presented in Figure 3.
Table 6. Total number of obligatory contexts for present perfect at each essay time.
Table 7. Percentage correct for present perfect for each essay time for each group.

Figure 3. Correct and incorrect trials with present perfect tense for each group at each essay time.
To determine whether or not there was a main effect for group on accurate use of the present perfect tense, mixed-effects logit models which included standardized LLAMA F scores, standardized form-focused pretest scores, and interactions with these effects, were fit and compared. For this structure, no significant interaction with LLAMA F score was found, and LLAMA F score alone did not improve model fit. The best model, presented in Table 8, includes form-focused pretest scores, group, essay time, and the interaction between group and essay time as fixed effects. The model was also fit with random effects of participant, including the random intercept and random slopes for the effects of essay time. The results of a comparison between models showed that random slopes significantly improved model fit, χ²(6) = 29.61, p < .001.
Table 8. Result summary of coefficient estimates β, standard errors SE, associated Wald’s z-score, and significance level p for fixed and random effects in the present perfect tense analysis.
Note. * p < .05; ** p < .01; *** p < .001.
There were no significant differences between groups at the start of the study. As shown in Table 9 and Figure 4, learners who received direct feedback made significant gains from Essay 1 (30%) to the time of the immediate posttest, Essay 3 (51%, β = .88, SE = 0.31, p = .004). These gains, however, were not maintained at the time of the delayed posttest, Essay 4 (28%, β = −0.09, SE = 0.34, p = .779).
Predicted probabilities (log odds with percentage correct in parentheses) of accurate use of the present perfect tense.

Plot of predicted probabilities of accurate use of present perfect tense by group at each essay time.
The MF group also made significant gains in accuracy from Essay 1 (35%) to the time of the immediate posttest, Essay 3 (55%, β = .83, SE = 0.30, p = .006), but was not able to maintain these gains by the time of the delayed posttest, Essay 4, either (36%, β = 0.05, SE = 0.32, p = .888). This model further confirms that the gains seen at Time 3 by the DF group are significantly different from those of the control group (β = 1.39, SE = .44, p = .002), and that likewise the gains achieved by the MF group at Time 3 were significantly different from those of the control group (β = 1.33, SE = .44, p = .002).
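For readers who wish to see how the coefficients, standard errors, and significance levels relate, the sketch below recomputes Wald z-scores and two-sided p-values from the rounded values reported so far. It assumes a standard normal reference distribution for the Wald statistic; the helper name `wald_test` is hypothetical. Because β and SE are rounded in the text, the recomputed p-values should only be expected to approximate the reported ones.

```python
import math

def wald_test(beta: float, se: float):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = beta / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# (label, beta, SE) taken from the text; all values are rounded.
reported = [
    ("DF, Essay 1 to Essay 3", 0.88, 0.31),
    ("MF, Essay 1 to Essay 3", 0.83, 0.30),
    ("DF vs. control at Time 3", 1.39, 0.44),
    ("MF vs. control at Time 3", 1.33, 0.44),
]

results = {label: wald_test(beta, se) for label, beta, se in reported}
for label, (z, p) in results.items():
    print(f"{label}: z = {z:.2f}, p = {p:.4f}")
```

Each recomputed p-value lies within about .001 of the value given in the text, a gap that is plausibly due to rounding of the inputs.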
The control group showed no significant changes in accuracy between the time of the pretest, Essay 1 (42%), the posttest, Essay 3 (31%, β = −.50, SE = 0.32, p = .118), and the delayed posttest, Essay 4 (35%, β = −.30, SE = 0.32, p = .342).
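The link between the percentages and the coefficients above can be illustrated with a short calculation. This sketch assumes that each reported β is the change in log odds of accurate use relative to Essay 1, which is an interpretation rather than something the text states outright, and it works from the rounded percentages, so agreement can only be approximate.

```python
import math

def logit(p: float) -> float:
    """Convert a probability to log odds."""
    return math.log(p / (1.0 - p))

# Predicted accuracy at Essay 1, Essay 3, and Essay 4, as reported in the text.
accuracy = {
    "DF": (0.30, 0.51, 0.28),
    "MF": (0.35, 0.55, 0.36),
    "Control": (0.42, 0.31, 0.35),
}

# Change in log odds relative to Essay 1 for each later essay.
changes = {}
for group, (essay1, essay3, essay4) in accuracy.items():
    changes[group] = (
        logit(essay3) - logit(essay1),
        logit(essay4) - logit(essay1),
    )
    print(f"{group}: Essay 3 change = {changes[group][0]:+.2f}, "
          f"Essay 4 change = {changes[group][1]:+.2f}")
```

Under this assumption, the computed changes track the reported coefficients closely: a move from 30% to 51% accuracy, for example, corresponds to a log-odds gain close to the β of .88 reported for the DF group.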
IV Discussion
1 The effects of feedback on accuracy in subsequent writing tasks
The first two research questions concerned whether or not the provision of direct or metalinguistic feedback would lead to more accuracy on a new text than no feedback. At the time of the immediate posttest, both treatment groups showed progress with the two verb tenses compared to the control group, who did not show improvement with either structure. Increases in predicted accuracy were significant for both treatment groups for the simple past tense and the present perfect tense.
As for long-term gains, the results are mixed. Only learners who received DF on the simple past maintained significant gains compared to the control group at the time of the delayed posttest. In contrast, learners who received MF did not retain predicted improvements on the simple past tense by the time of the delayed posttest. For the present perfect tense, neither treatment group maintained the gains shown on the immediate posttest when asked to write again four weeks later.
It may be the case that long-term gains were not attainable after only two rounds of feedback since learners had to give their attention to several errors simultaneously, depending on their L1 and acquisition stage. Learners who are instructed can, however, progress at a faster rate and can progress further along developmental sequences than naturalistic learners (Long, 1983). For example, without instruction many naturalistic learners never produce ‘-ed’ or produce it with low levels of accuracy in comparison to those who receive instruction (e.g. Bardovi-Harlig, 1995). Bitchener and Knoch (2010a) conclude that determining which errors should be treated is not an impossible task for individual or small group contexts, such as those often found within second-language writing courses offered at colleges or universities, since learners and teachers can work together to identify and target reoccurring and problematic forms.
On the whole, the present study, in contrast to Truscott’s claims (e.g. 1996, 1999, 2001, 2007), corroborates the findings of other written corrective feedback studies (e.g. Bitchener, 2008; Bitchener & Knoch, 2008; Sheen, 2007, 2010; Stefanou & Révész, 2015; Van Beuningen et al., 2012) that feedback on grammatical errors can lead to learning and improved accuracy on subsequent writing tasks.
The third research question focused on whether or not there were any differences in the effect of DF and MF on learners’ accurate use of the simple past tense and the present perfect tense. Very few studies have considered DF and MF as individual predictors to see what merits each form of feedback has on its own. One of the goals of the current study, therefore, was to examine these two forms of feedback separately. Of recent controlled studies, only Bitchener and Knoch (2010b), Diab (2015), Shintani and Ellis (2013), and Shintani et al. (2014) have considered the benefits of metalinguistic information on its own without the provision of the correct form. The results of these studies have suggested that MF by itself can be as beneficial as DF at least in the short term. Shintani and Ellis (2013) found short-term benefits for metalinguistic information on a new piece of writing shortly after the provision of feedback on English articles. Similarly, Shintani et al. (2014) found no difference between direct and metalinguistic information on a new writing task that occurred immediately following feedback for treating the hypothetical conditional. However, only Bitchener and Knoch (2010b) found long-term benefits for metalinguistic feedback. In contrast, several studies have found long-term benefits for direct feedback (e.g. Bitchener, 2008; Bitchener & Knoch, 2008; Bitchener & Knoch, 2010a; Shintani et al., 2014).
Therefore, both types of feedback were expected to result in gains in accuracy for all of the target structures, at least in the short-term, but overall, direct feedback was expected to produce more durable gains. In the present study, MF was not as durable as DF for the simple past tense. Learners who received metalinguistic comments made errors with the simple past tense 20% of the time at the start of the study. By the time of the immediate written posttest, this group had significantly reduced its error rate to 12%. Yet, by the time of the delayed posttest, the metalinguistic group was essentially at the same rate of error predicted at the start of the semester. This result agrees with the findings of Diab (2015) for pronoun agreement errors, Shintani and Ellis (2013) for errors with indefinite articles, and Shintani et al. (2014) for errors with the hypothetical conditional, all of whom found that MF on its own resulted in significant gains in accuracy at the time of the immediate posttest, but that these improvements were lost by the time of the delayed posttest. In the present study, for the simple past tense, DF was more durable than MF and therefore considered more effective. Shintani et al. (2014) propose that DF offers learners positive evidence that allows them to confirm or reject their own attempts to use a structure, whereas metalinguistic information requires learners to apply an abstract set of rules to whatever errors they have produced. Learners in the present study likely had at least some knowledge of the rules for the simple past tense which potentially made the DF more beneficial than the MF for this structure.
For the present perfect tense, DF was not superior to MF. Both the DF group and the MF group had significantly improved by the third writing task, which occurred after two rounds of feedback and revision. As expected, for the present perfect tense, the error rate was much higher at the start of the study for all learners compared to the error rate seen with the simple past. Learners in the DF group made errors with the perfect tense 70% of the time. By the time of the immediate posttest, this group was able to reduce its predicted error rate to 49%. By the time of the delayed posttest, however, the DF group’s error rate was 72% (see Table 9). Similarly, learners who received MF made errors with the present perfect tense 65% of the time at the start of the study. This treatment group significantly reduced its error rate to 45% at the time of the immediate posttest. Like those who received DF, however, this group made errors 64% of the time four weeks later. In summary, for the present perfect tense, both types of feedback enabled learners to significantly improve their accuracy on a new piece of writing, but neither group maintained this improvement long-term.
The fact that direct feedback was neither superior to MF nor effective for long-term gains with the perfect tense might be due to a lack of declarative knowledge. For the simple past, the largest gains occurred between Essay 1 and Essay 2, and there was little change between Essay 2 and Essay 3. For the present perfect tense, in contrast, the largest gains appeared after two rounds of feedback. Many learners may not have been ready to learn the perfect tense structure, whereas learners at all levels had likely already encountered the rules for the simple past and therefore had declarative knowledge, which benefitted from production practice and explicit direct feedback, in the sense that these allowed them to confirm or reject their hypotheses concerning this structure.
2 Interactions between feedback type and language-analytic ability
The last research question considered the differential effect of LAA with different forms of feedback. Based on Sheen (2007), who found a differential effect of aptitude for different forms of feedback, learners with lower LAA were not expected to benefit from metalinguistic feedback. Surprisingly, however, in the current study, learners with a higher LAA in the metalinguistic group did not have greater overall gains in accuracy with the simple past tense. At the time of the immediate posttest, learners with LLAMA F scores one standard deviation above the mean were predicted to receive essentially no benefit from the MF compared to learners with an average LLAMA F score, who significantly benefitted from rule provision. That there was no link found between individual differences in LAA and the gains made by the group who received MF agrees with the findings of Stefanou and Révész (2015), who only found a link between grammatical sensitivity and DF, but contradicts the findings of Sheen (2007), who found a positive correlation between the provision of MF and LAA. In their study, which compared DF to metalinguistic explanation on its own, Shintani and Ellis (2015) found contrasting results with two different structures, the past hypothetical conditional and indefinite articles. As in Sheen (2007), learners with greater LAA benefited more from both types of feedback than learners with weaker ability. However, in contrast to Sheen (2007), and as was found in the current study, the mediating effect of LAA in their study was only evident in new writing produced shortly after the feedback. Shintani and Ellis suggest that the extent to which LAA plays a role involves a complex interaction between type of feedback, opportunity to revise, and the target structure.
Understanding the interactions between feedback type and individual differences, nevertheless, is tricky. Exact comparisons among the studies that have examined the role of LAA in feedback effectiveness cannot be made. First, different measures of language aptitude were used in each study. Sheen’s (2007) measure of aptitude was based on a language analysis test developed by Ottó and used by Schmitt, Dörnyei, Adolphs, and Durow (2003). Shintani and Ellis (2015) used the LAA section of the Language Aptitude Battery for Japanese (Sasaki, 1996), a Japanese translation of Pimsleur’s Language Aptitude Battery. Stefanou and Révész (2015) employed a Words in Sentences Test (adapted from the MLAT) and a test of metalanguage to measure grammatical sensitivity and knowledge of metalanguage respectively, and the current study used the LLAMA F. The aptitude measures used by Sheen (2007) and Shintani and Ellis (2015) required learners to induce the rules of an artificial language, but in Sheen’s study learners had to determine the rules using English, their L2, whereas learners in the latter study could contemplate the rules using their L1. Moreover, the aptitude measure used by Stefanou and Révész required learners to understand the functions of words in sentences. Thus, the aptitudes being measured in these studies in relation to the treatments are not exactly the same, which makes aptitude-treatment interactions (ATIs) difficult to compare across studies.
Second, predictors were not operationalized in the same way across studies. Sheen (2007) and Stefanou and Révész (2015) both examined direct feedback versus direct feedback that included metalinguistic comments, whereas Shintani and Ellis (2015) and the current study compared direct feedback to metalinguistic feedback on its own. Moreover, Shintani and Ellis (2015) did not indicate errors and only provided a metalinguistic handout in the L1, leaving learners to identify errors on their own, whereas the current study highlighted errors for students and provided metalinguistic rules in the margins using the L2. Finally, as previously discussed, the numbers and types of structures varied across studies, making the results difficult to compare.
While there can be no definitive answers at this time, some emerging trends shed light on what may be happening inside the black box. The most recent results from Shintani and Ellis (2015), Stefanou and Révész (2015), and the current study do not offer clear support for the view that LAA will play a stronger role when the feedback is metalinguistic in nature. The results do, however, lend some backing to the position that LAA is beneficial when learners have to work out the grammar rules for themselves. The current finding that learners with a higher LAA received more benefit than learners with an average LAA in the direct feedback condition agrees with other studies that have found a strong correlation between more inductive instructional conditions and higher language aptitude (e.g. Erlam, 2005; Robinson, 1997).
Likewise, other inquiries have found that learners with lower LAA perform better with more explicit or deductive instruction. Hwu, Pan, and Sun (2014) investigated interactions between LAA and two types of explicit instruction and found that overall two explicit instructional approaches (deductive vs. explicit-inductive) had no differential effect on learning. However, learners with a low level of language aptitude performed significantly better with deductive instruction. In the current study, the metalinguistic feedback, which can be considered more explicit and deductive, proved helpful for individuals with lower LAA. Thus, the results of the current study agree with the overall findings of the aptitude-treatment interaction literature in the sense that as the treatment puts more of a burden on the learner, aptitude becomes more important.
Future studies comparing feedback types should also include some measure of LAA to further test this interpretation. If DF works as a leveler for some structures and is good for both high and low aptitude types, then MF on its own may not be the best choice. However, the combination of metalinguistic and direct feedback, provided within the text, as shown by Sheen (2007) and Diab (2015), may be beneficial and necessary for some learners and some structures.
V Conclusions and directions for future research
The findings of the present study confirm the results of prior studies that have found written CF to be beneficial for different types of grammatical errors. Overall, learners in both feedback groups showed improvements in accuracy with both structures in comparison to the control group, which showed no gains.
With respect to LAA, the results indicate a clear relationship between LAA and DF, yet contrary to previous findings by Sheen (2007), MF was not beneficial for learners with higher LAA. It may be that MF that does not include direct correction has been less successful overall compared to DF on its own because for some learners with higher LAA, requiring them to figure out what rule was violated is more beneficial than rule provision.
Contrary to Truscott’s (1996) claim that it is impossible to determine where a learner is developmentally, and therefore not practical to offer feedback, this study has clear implications for instructed SLA. Direct feedback that requires revision may provide the production practice necessary to proceduralize declarative knowledge. Furthermore, when form-focused pretest scores were included in the model to account for variance in declarative knowledge, the model suggests that the provision of written corrective feedback was effective for a mixed-level population. Teachers often have intuition about each student and know what errors are frequent, so it may be then that writing instructors can target a few errors at a time regardless of proficiency level. Although some have suggested that instruction (e.g. Ortega, 2009) and feedback (e.g. Truscott, 1996, 1999, 2001, 2007) may not alter the developmental route and could potentially be counterproductive or harmful by ignoring readiness, Ortega asserts that ‘it would be mistaken to conclude that instruction does not matter, just because it cannot override development’ (p. 100). Learners who receive instruction learn faster and progress further and generally have greater accuracy compared to uninstructed learners (Long, 1983; Long, 1988; Ortega, 2009). Direct feedback, for instance, can provide input and allow learners to confirm or reject their own attempts with a structure, and metalinguistic feedback can provide rules for items that are less salient and often overlooked and thus harder to learn without focused attention. As with oral forms of correction, written feedback on error allows learners an opportunity to notice the gap between their interlanguage and the target language, and to focus on form while revising a meaning-focused writing task.
The current study offers important clues about the effectiveness of written CF for two linguistic targets. Individuals, even those in the same level ESL class, will vary in their starting level, in how they react to feedback, and in their ability to improve over time. Considering the large standard deviations often reported in written corrective feedback studies (e.g. Ferris, 2006), further research which includes a measure of LAA and other measures of individual differences is needed to investigate the effects of different types of written feedback. Moreover, in order to better understand the complex relationship between LAA and written corrective feedback, future studies should explore interactions with other influential variables such as L1, salience, and proficiency level. Although many questions still remain, the findings of the present study confirm a positive role for written corrective feedback in instructed second language acquisition, at least for some learners and some structures.
Acknowledgements
The authors would like to thank Michael Long, Scott Jackson, and Kira Gor for their constructive comments and valuable feedback on earlier versions of this article. We are also grateful for the Language Teaching Research reviewers’ insightful comments and for the guidance of the editors. Last but certainly not least, we would like to thank the participants in the study and the instructors who graciously granted access to their classes.
Declaration of conflicting interest
The authors declare that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
