The effect of orthography in Mandarin speakers’ production of English voiceless stops

Abstract

This article provides a review of previous studies that have examined the effects of orthography on establishing contrastive phonological representations in second language acquisition and presents results from an original study on Mandarin speakers’ production of English stops investigating how the presence of orthography affects the production of phonological categories that involve allophony. English voiceless stops are canonically represented as aspirated [p^h, t^h, k^h] in word-initial/stressed onset positions but are realized as unaspirated [p, t, k] following /s/ and in unstressed, non-initial onset positions. The results of our imitation experiment showed that Mandarin speakers failed to correctly imitate the unaspirated allophones when presented with written input, and this orthographic effect was stronger with nonwords than with real words. These results are best explained by an orthography effect mediated by phonological awareness and prior linguistic experience.

Keywords

allophony, contrast, English voiceless stops, Mandarin, orthography, second language learning, voice onset time

I Introduction

An important aspect of learning a second language (L2) is to learn its phonological system (Broselow and Kang, 2013; Edwards and Zampini, 2008). This includes, but is not limited to, learning contrastive phonemes (Best and Tyler, 2007; Flege, 1987), phonotactic restrictions (Dupoux et al., 1999; Durvasula et al., 2018), stress (Chrabaszcz et al., 2014; Dupoux et al., 2008; Ou, 2020), tonal contrasts (Hao, 2018; Schaefer and Darcy, 2014; Tsukada and Han, 2019), and intonation (Jun and Oh, 2000; Lee et al., 2019; Lu and Kim, 2016; Mennen, 2007). L2 speech characteristics related to these different aspects of learning have been argued to stem from native language (L1) transfer (Gass, 1996; Major, 2008), cross-linguistic markedness (Eckman, 2008), systematic phonetic or phonological differences between the L1 and L2 (Brannen, 2002; Qin and Tremblay, 2014; Young-Scholten, 2004), and perceptual or production mechanisms constrained by the L1 (Boersma, 2009; Boersma et al., 2003). The effect of orthography, however, on L2 speech characteristics has gained attention only recently (Barrios and Hayes-Harb, 2020; Escudero et al., 2014; Hayes-Harb et al., 2010; Lee-Kim, 2021). This article provides a review of previous studies that have examined the effects of orthography on establishing contrastive phonological representations in L2 acquisition and presents results from an original study on Mandarin speakers’ production of English stops investigating how the presence of orthography affects the production of phonological categories that involve allophony.

1 The effect of orthography on L2 acquisition

Previous studies on the influence of orthography on L2 acquisition have reported conflicting results. While some studies have found the effect to be facilitatory (Erdener and Burnham, 2005; Escudero et al., 2008; Hayes-Harb et al., 2010; Showalter and Hayes-Harb, 2013) or inhibitory (Bassetti, 2006, 2007; Showalter, 2018; Young-Scholten and Langer, 2015), others have found a null effect (Showalter and Hayes-Harb, 2015; Simon et al., 2010) or an effect dependent on target contrasts (Escudero and Wanrooij, 2010; Hayes-Harb and Cheng, 2016).

On the one hand, orthography has been shown to assist category formation when learning confusable L2 sounds. For example, in an eye-tracking study, Escudero et al. (2008) trained Dutch speakers to learn English nonwords that contained /ɛ/ and /æ/; one group learned the words by matching their auditory forms to pictures, while a second group was additionally provided with the spelled forms of the words. The pictures were line drawings from the non-object database used in Shatzman and McQueen (2006). Their results showed that the participants who were only presented with auditory input looked equally at the pictures of the words containing both /ɛ/ and /æ/ upon hearing /ɛ/ and /æ/ words (Escudero et al., 2008). Those who were assisted by the presentation of orthography, however, were more accurate in selecting words containing /ɛ/ but were confused upon hearing words containing /æ/. Based on the a priori finding that the /ɛ/–/æ/ pair is ‘perceptually neutralized’ by Dutch listeners (Weber and Cutler, 2004), Escudero et al. (2008) concluded that the written forms provided explicit information for these perceptually confusable sounds and helped lexically encode the nonnative contrast.

On the other hand, orthography may be inhibitory when there is an incongruence between the native and L2 writing systems despite clear phonetic distinction. For example, the same word-initial <s> in orthography corresponds to [z] in German but to [s] in English. The results of a longitudinal study tracking three English speakers learning German showed that, despite rich aural input from their environment and the distinctive nature of /s/ and /z/ in their native language, these speakers still produced German word-initial <s> as [s], suggesting an inhibitory effect from the orthography (Young-Scholten and Langer, 2015). This study indicates that the direct link between L1 and L2 orthography may impede the learning of an existing native contrast (/s/ vs. /z/) in second language acquisition.

Some studies, however, have found a null effect of orthography. For example, using a word-learning paradigm, Showalter and Hayes-Harb (2015) examined English speakers’ learning of the Arabic velar–uvular /k/–/q/ contrast, with or without exposure to the Arabic script or Romanized Arabic. The results showed that the learners who were given written input had no apparent learning advantage over those who were only shown pictures. Similarly, Simon et al. (2010) used a variety of paradigms to test English speakers learning the French high rounded vowel contrast, /u/ and /y/. Their results also showed no effect of orthography on learning the target contrast.

The mixed results from the previous studies have been attributed to several factors. First, an incongruent grapheme–phoneme correspondence between L1 and L2 tends to negatively impact the learning of an L2 contrast, as in Young-Scholten and Langer’s (2015) study on English speakers’ learning of German word-initial <s>. Similarly, Escudero et al. (2014) tested Spanish speakers in learning different Dutch vowel contrasts, some with congruent grapheme–phoneme correspondences between the L1 and L2 and others with incongruent correspondences. The results showed that orthographic input only aided the learning of congruent contrasts and not the incongruent ones. Furthermore, using a familiar L2 writing system that uses incongruent grapheme–phoneme correspondences with the L1 writing system seems to pose even more challenges to the learner than using an unfamiliar L2 writing system (Hayes-Harb and Cheng, 2016; Showalter, 2018). For example, Pinyin (or Hanyu Pinyin), commonly used in China and Singapore, is a Romanization system that shares the same alphabet as English, while Zhuyin (or Bopomofo), widely used in Taiwan, is a semi-syllabary system derived from components of Chinese characters. Hayes-Harb and Cheng (2016) trained English speakers to learn Mandarin words using Pinyin and Zhuyin. These Mandarin words were either grapheme–phoneme congruent (e.g. <nai> for [nai]) or incongruent (e.g. <xiu> for [ɕiou]) between Pinyin and English. Their results revealed that the group trained with Zhuyin outperformed the one trained with Pinyin in learning the incongruent words, while both groups performed comparably in learning the congruent words. The authors concluded that while English native speakers may benefit from the familiar writing system of Pinyin, learning an entirely new set of symbols in Zhuyin for novel correspondences is more advantageous in that native grapheme–phoneme correspondences can be suppressed.

Second, the effect of orthography may be modulated by the type of contrast. For example, Escudero and Wanrooij (2010) examined native Spanish speakers’ perception of different vowel pairs (/a–ɑ/, /i–ɪ/, /ɪ–ʏ/, /y–ʏ/, and /i–y/) using auditory and orthography tasks. The auditory task employed an XAB paradigm in which the participants were asked to classify the X vowel as either A or B. In the orthography task, the participants were asked to label each vowel using the Dutch orthographic representation (<aa>, <a>, <ie>, <i>, <uu>, <u>). Based on the accuracy rates from the auditory task, the following difficulty ranking (from most to least difficult) was established: /a–ɑ/ >> /i–ɪ/ >> /ɪ–ʏ/, /y–ʏ/ >> /i–y/. However, the difficulty ranking from the orthography task was reversed: the auditorily most confusable sounds (/a/ and /ɑ/) were more successfully classified in the orthography task than the auditorily most distinct sounds (/y, ʏ, i, ɪ/). The positive orthography effect in the case of the Dutch speakers’ learning the English /ɛ/–/æ/ contrast could also be explained similarly: the auditorily confusable /ɛ/–/æ/ contrast was assisted by the distinct written forms. Showalter and Hayes-Harb (2015: 33) gave a similar account for the lack of an orthography effect in learning the Arabic /k/–/q/ contrast, suggesting that ‘the auditory contrast is so difficult that even when written forms are available, they may simply be unusable.’

Third, the effect of orthography also seems to be moderated by the individual’s level of proficiency in the target L2. While some previous studies used naive learners in their experiments (e.g. Showalter and Hayes-Harb, 2015; Simon et al., 2010), others tested L2 learners of different proficiency levels. For example, Mok et al. (2018) examined Cantonese speakers’ learning of Mandarin tones and showed that the speakers with lower Mandarin proficiency were more affected by orthography than those with higher proficiency. Escudero et al. (2014), however, did not find any apparent differences between naive listeners and learners of Dutch.

Taking all these previous studies into account, the effect of orthography is complex and can be modulated by the familiarity of the mapping (i.e. congruent, incongruent, or entirely unfamiliar), types of contrast (i.e. auditorily confusable or distinct) and L2 proficiency.

2 The effect of orthography on learning phonological alternations

Relatively less research has been devoted to the effect of orthography in relation to the learning of phonological alternations. In a longitudinal study tracking three English speakers learning German, Young-Scholten (2002) observed their acquisition of final devoicing whereby the underlying voicing contrast for obstruents is phonetically neutralized in the word-final position (e.g. the contrast /ʁɑd/ ‘wheel’ vs. /ʁɑt/ ‘advice’ is reflected in spelling (<Rad> vs. <Rat>), but both are realized as [ʁɑt]). Given the fact that English contrasts voicing in all prosodic positions, one would expect English speakers to have no difficulty learning German final devoicing and faithfully produce devoiced final obstruents and intervocalic underlying voiced obstruents (e.g. produce [k], not [ɡ], word-finally in [tɑːk] /taːɡ/ <Tag> ‘day’, as compared to word-medial [ɡ] in [tɑɡə] /tɑɡə/ <Tage> ‘days’). However, the results showed that these learners failed to produce the voicing alternation, maintaining the voicing specification signaled in the spelling. Young-Scholten (2002) also observed a negative correlation between the speakers’ improvement in faithfully producing the voicing alternation and how much they engaged in reading and writing activities, suggesting a strong orthographic interference in learning the phonological alternation.

Following up on this line of research, Hayes-Harb et al. (2018) conducted a study employing a picture naming task in which a group of English native speakers were asked to learn German-like nonwords with final voiced or voiceless written forms (e.g. <Trob> vs. <Trop>). Half of the participants learned the words through auditory voiceless final input with matching pictures while the other half were additionally provided with the written forms. Their results showed that the voicing contrast in the written forms severely interfered with their production, guiding them to produce voiced finals when presented with <b, d, g> endings and voiceless finals when presented with <p, t, k> endings, despite the fact that all the auditory input was voiceless.

The aforementioned studies examined whether written forms interfered with the faithful learning of voicing alternations in the word-final position by English speakers whose language also has a voicing contrast. Barrios and Hayes-Harb (2020) further investigated if English speakers could successfully learn the final devoicing process under the mediation of orthography by exposing participants to both suffixed (e.g. /kʁɑkən/ [kʁɑkən] ‘penguins’, /tʁobən/ [tʁobən] ‘forks’) and unsuffixed forms (e.g. /kʁɑk/ [kʁɑk] ‘penguin’, /tʁob/ [tʁop] ‘fork’). Note that the stimuli included both non-alternating (e.g. /kʁɑk/ [kʁɑk] vs. /kʁɑkən/ [kʁɑkən]) and alternating forms (e.g. /tʁob/ [tʁop] vs. /tʁobən/ [tʁobən]). In a picture naming task, participants were either presented with only auditory suffixed and unsuffixed forms (No Orthography group) or auditory and written forms (Orthography group). The participants in the No Orthography group were observed to produce the underlyingly voiced obstruents as voiceless significantly more often in the final than in the nonfinal position, suggesting that the final devoicing process had been successfully learned. The production of those in the Orthography group, however, was still strongly guided by the written voicing contrast.

Along the same line of research, Shea (2017) investigated the effect of orthography in the processing of L2 allophonic alternations, specifically the Spanish voiced stops /b, d, ɡ/ that are realized as approximants [β, ð, ɣ] intervocalically (e.g. [naða] /nada/ ‘nothing’). In a cross-modal priming experiment, English learners of Spanish were asked to make lexical decisions on auditory targets after being presented with written primes. The written primes and audio targets were either identical (e.g. Prime <cabeza> /kabesa/ ‘head’ – Target [kabesa]) or minimal pairs with the auditory target spirantized (e.g. Prime <nada> /nada/ ‘nothing’ – Target [naða]). Compared with a within-modal condition (auditory–auditory) and a control group of native Spanish speakers, the English learners’ lexical decisions were faster in the identical condition than in the minimal pair condition, suggesting a direct connection between the written forms <b, d, ɡ> and the canonical representations [b, d, ɡ], but not the variants [β, ð, ɣ].

Using the same priming paradigm to test Mandarin learners of Korean on obstruent–nasal alternations, however, Han and Kim (2022) reported different findings. Korean has a regular phonological process whereby a syllable-final obstruent is nasalized when immediately followed by a nasal (e.g. /kak.mok/ – [kaŋ.mok] ‘stick’). That is, an obstruent and its corresponding nasal are neutralized in the syllable-final position (e.g. neutralization of /k/ and /ŋ/ in /paŋ.mun/ → [paŋ.mun] ‘visit’ and in /pak.mjʌl/ → [paŋ.mjʌl] ‘eradication’). They conducted a similar priming experiment in which the written prime and audio target were either identical (e.g. Prime <식당> /sik.taŋ/ – Target [sik.taŋ] ‘restaurant’) or minimal pairs with the auditory target undergoing nasalization (e.g. Prime <식물> /sik.mul/ – Target [siŋ.mul] ‘plants’). This cross-modal priming condition was compared to a within-modal condition (auditory–auditory), and native Korean speakers were used as a control group. The results showed that lexical decisions were facilitated in both the identical condition and the minimal pair condition, suggesting that there was no interference from the dissociation between the written forms (i.e. the obstruents) and the audio forms (i.e. the nasals). This finding is contrary to those from the previous studies which found evidence of orthography interfering with the learning of phonological alternations.

To summarize so far, orthography has been generally found to impede the correct learning of phonologically alternating sounds. The difference in orthographic representations (e.g. <b> and <p> in German) may create an illusory contrast in the auditory form, interfering with an existing phonological contrast. The same orthographic representations (e.g. <b, d, ɡ> in Spanish) may also impede the learning of allophonic variants. A priming study has alternatively shown no apparent interference from orthography in learning phonological alternations involving Korean obstruents and nasals.

3 English stop allophony

This study also examines a case of phonological alternation. Unlike the studies on German final devoicing and Korean syllable-final obstruent–nasal alternations whereby two graphemes are realized/neutralized as one phonetic form in the word-final position and before a nasal, this study focuses on alternations where different phonetic forms are represented by one single grapheme, a case more similar to the Spanish <b, d, ɡ> where the same graphemes are produced as [β, ð, ɣ] intervocalically and [b, d, ɡ] elsewhere.

There are multiple allophones of English voiceless stops. When /p, t, k/, commonly written as <p, t, k/ch/c/ck>, occur word-initially or in a stressed onset, they are produced as aspirated [p^h, t^h, k^h] (e.g. purple, turtle, Kansas). We use <p, t, k> as a shorthand hereafter. Since these aspirated variants occur in salient positions (word-initial/stressed onsets), they are taken as the canonical representations of voiceless stops (Whalen et al., 1997). When these voiceless stops immediately follow /s/, they are produced as unaspirated (e.g. [p, t, k] in space, star, sky), similar to the surface forms for English voiced stops /b, d, ɡ/ in the onset position (Flege, 1982; Keating, 1984). In some American dialects, unaspirated allophones also occur in the onset of a non-initial unstressed syllable (e.g. [p, k] in open, focus) (Rogers, 2014). Alveolar stops in the same position, however, are weakened into flaps (e.g. [ɾ] in writer) in some American dialects (Clark et al., 2007). In other words, the same graphemes correspond to different phonetic forms, as summarized in Table 1.

Table 1.

English voiceless stops allophony.

	<p>	<t>	<k>
Word-initial/stressed onset	[p^h]	[t^h]	[k^h]
Following /s/	[p]	[t]	[k]
Non-initial unstressed	[p]	[ɾ]	[k]

Apart from the canonical aspirated contexts, voiceless stops are unaspirated when they follow /s/, a contextual rule that is oftentimes explicitly taught in classroom settings (Ho, 1987). The non-initial unstressed context, however, depends on a higher-order, prosodic level of knowledge relevant to stress that is rarely noticed or explicitly taught.

To characterize the English learning experience of Taiwan Mandarin speakers, we asked 82 individuals (61 women, 21 men; aged 19–60 years; M = 33.2) to fill out an online questionnaire; 90.2% of the participants reported that they learned American English while only 7.3% reported that they learned British English. The remaining 2.4% responded that they had learned a mixture of dialects.

The majority of the respondents (62.2%) had been taught in a classroom setting to produce <p, t, k> immediately following <s> as <b, d, ɡ>. Far fewer (19.5%) had not learned this generalization. A small percentage (15.8%) had realized this production difference themselves without being taught; one of these 13 individuals even explicitly responded that <p, t, k> in this context are produced without aspiration. Only two respondents (2.4%) did not remember if they had been taught this generalization. In the non-initial unstressed context, however, only 14 respondents (17%) had learned the generalization before while fewer (8.5%) had realized it themselves. The majority (71.9%) was not aware of this rule. About half of the respondents (51.2%) had not learned to produce non-initial unstressed /t/ differently, while 37.8% had. A smaller percentage (9.7%) were not taught but realized that /t/ in this context is produced differently. Table 2 shows the distribution of the responses.

Table 2.

English learning experience based on questionnaire responses from 82 Taiwan Mandarin speakers (percentages).

Allophone contexts	Learned in classroom	Untaught but known	Not aware	Others
Following /s/	62.2	15.8	19.5	2.4
Non-initial unstressed /p, k/	17	8.5	71.9	2.4
Non-initial unstressed /t/	37.8	9.7	51.2	1.2

The English learning experience described above was reflected in the self-reported productions of <p, t, k> in different contexts. In the same questionnaire, we asked six open-ended questions (3 places × 2 contexts) about how <p, t, k> was produced immediately following /s/ and in the non-initial unstressed onset position by giving example words (e.g. How do you produce the ‘t’ in ‘store’? How do you produce the ‘c’ in ‘local’?). The participants generally described how they produced these sounds by making reference to a similar sound in other English words (e.g. the ‘t’ in ‘store’ is similar to the ‘d’ in ‘door’) or Mandarin equivalents (e.g. the ‘t’ in ‘store’ is similar to ‘ㄉ’ [t] in Bopomofo). The responses are summarized in Table 3. The majority of the respondents (87.3%) produced <p, t, k> following <s> as unaspirated. The pattern was reversed for <p, k> in the non-initial unstressed position: 91.4% produced these stops as aspirated while only 8.5% produced them as unaspirated. For non-initial unstressed <t>, 75.6% produced it as aspirated, and 20.7% as unaspirated; only 2.4% reported that they produced it as a flap.

Table 3.

Self-reported productions of <p, t, k> in different contexts based on questionnaire responses from 82 Taiwan Mandarin speakers (percentages).

Allophone contexts	Aspirated	Unaspirated	Flap	Others
Following /s/	12.1	87.3	n/a	0.4
Non-initial unstressed /p, k/	91.4	8.5	n/a	0
Non-initial unstressed /t/	75.6	20.7	2.4	1.2

The responses from the questionnaire show that the unaspirated context following /s/ is often explicitly taught in classroom settings and is more easily detected by Mandarin speakers. In contrast, the non-initial unstressed context, which depends on a higher-order, prosodic level of knowledge relevant to stress, was rarely noticed or explicitly taught.

Beyond their L2 learning experience, Mandarin speakers are well acquainted with aspiration contrasts in their own L1. Voice onset time (VOT), the interval between the release of closure and the start of voicing, is often used as a measure for voicing and aspiration (Ladefoged and Johnson, 2010). A positive VOT value indicates voicelessness, whereas a negative value indicates that the segment is voiced. Aspiration is evidenced by a longer positive interval, while shorter positive intervals are observed when there is no aspiration. The mean VOTs generally fall within 11–24 ms for English voiceless unaspirated stops and 62–86 ms for aspirated stops in the word-initial position (e.g. Chao and Chen, 2008; Yang, 2021).
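The VOT definition above can be made concrete with a minimal sketch. The function names and the 40 ms cut-off are illustrative assumptions, not values from this study; the cut-off is simply placed between the reported English ranges for unaspirated (11–24 ms) and aspirated (62–86 ms) stops.

```python
# Hypothetical cut-off between short-lag and long-lag stops (an assumption,
# chosen to lie between the 11-24 ms and 62-86 ms ranges cited in the text).
LONG_LAG_THRESHOLD_MS = 40.0


def voice_onset_time(release_ms: float, voicing_onset_ms: float) -> float:
    """Return VOT in ms: time of voicing onset minus time of stop release."""
    return voicing_onset_ms - release_ms


def classify_vot(vot_ms: float, threshold_ms: float = LONG_LAG_THRESHOLD_MS) -> str:
    """Label a VOT value as prevoiced, short-lag, or long-lag."""
    if vot_ms < 0:
        return "prevoiced"
    if vot_ms < threshold_ms:
        return "short-lag (unaspirated)"
    return "long-lag (aspirated)"


# Example: release at 100 ms, voicing starts at 118 ms, so VOT is 18 ms.
example_vot = voice_onset_time(release_ms=100.0, voicing_onset_ms=118.0)
example_label = classify_vot(example_vot)
```

A single threshold is a simplification; in practice the boundary between short-lag and long-lag stops varies with place of articulation and speaking rate.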

Mandarin Chinese contrasts aspiration (e.g. [pa] ‘father’ vs. [p^ha] ‘afraid’; [ta] ‘hit’ vs. [t^ha] ‘tower’; [ka] ‘scratching sound’ vs. [k^ha] ‘stop’), and the VOT values for unaspirated and aspirated stops fall well within the range of English unaspirated and aspirated stops, as shown in Table 4 (e.g. Chao and Chen, 2008). Yet Mandarin speakers often aspirate English voiceless stops in unaspirated contexts (e.g. Lin, 2007).

Table 4.

English and Mandarin voiceless stop voice onset time (VOT) means in milliseconds (Chao and Chen, 2008: 223).

	Voiceless unaspirated			Voiceless aspirated
	[p]	[t]	[k]	[p^h]	[t^h]	[k^h]
Mandarin	14	16	27	82	81	92
English	11	22	24	62	73	86
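As a rough illustration of how comparable the two languages are, the values in Table 4 can be averaged over place of articulation. This is only a sketch over the six means reported in the table; the dictionary layout and function name are our own choices.

```python
from statistics import mean

# Mean VOTs in ms copied from Table 4, ordered labial, alveolar, velar.
VOT_MS = {
    "Mandarin": {"unaspirated": [14, 16, 27], "aspirated": [82, 81, 92]},
    "English": {"unaspirated": [11, 22, 24], "aspirated": [62, 73, 86]},
}


def category_means(table):
    """Average the per-place means within each language and stop category."""
    return {
        language: {category: mean(values) for category, values in categories.items()}
        for language, categories in table.items()
    }


means = category_means(VOT_MS)
```

Averaged this way, the unaspirated stops come out at 19 ms in both languages, while the aspirated stops average 85 ms in Mandarin and about 73.7 ms in English.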

Using an imitation production experiment, this study examines if Mandarin speakers’ failure to observe English voiceless allophones stems from the fact that these different phonetic forms share the same orthographic representations. We further investigate if the effect of orthography can be mediated by different phonological contexts and lexicality.

II Imitation experiment

Previous studies have shown that when imitating auditory prompts, not all phonetic cues are imitated equally (e.g. Mitterer and Ernestus, 2008; Nielsen, 2011). Phonetic cues are imitated differently according to their relevance in one’s native phonological system (Kim and Clayards, 2019; Kwon, 2019; Lu and Lee-Kim, 2021). Since Mandarin Chinese contrasts aspiration, and the average VOTs of aspirated and unaspirated stops fall well within the range of those in English (Table 4), we expect that VOT cues related to unaspirated and aspirated stops will be successfully imitated by Mandarin speakers. Failing to imitate the aspiration difference when orthography is provided and producing voiceless stops in unaspirated contexts as canonical aspirated stops, however, could be attributed to the inhibitory effect of orthography.

In the imitation experiment, a group of Taiwan Mandarin speakers was asked to imitate English stimuli presented aurally with or without English orthography. We obtained their baseline productions of voiceless stops in different contexts and compared them with the imitated productions. The different contexts included the explicitly taught, segmentally adjacent /s/ context (following /s/) and the higher-order, prosodically driven context (non-initial unstressed). We also included real words as well as nonwords to account for possible frequency effects from existing words since the participants would not have prior exposure to these nonwords. The difference between the baseline productions and imitated productions mediated by the presence or absence of written forms was used to measure the effects of orthography.

If orthography impedes the accurate imitation of English voiceless stops, we expect the following results in the Mandarin speakers’ imitated productions:

1. Canonical aspirated stops will appear in unaspirated contexts, more so for the participants who are exposed to the written forms than those who are not.

2. The orthography effect will be stronger in the higher-order, prosodically driven context compared to the explicitly taught, segmentally adjacent context due to the stronger awareness of the latter context.

3. The orthography effect will be stronger for nonwords than for real words since participants may have prior exposure to the real words and should be less affected by their written forms.

Finally, we made the following prediction for alveolar stops in the unstressed onset context based on the additional flap allophone:

4. Canonical aspirated stops will be observed more frequently when written forms are presented, while other allophones (flaps or unaspirated stops) will be observed more frequently when no written input is given.

1 Participants

Forty participants were recruited (21 women, 18 men, 1 nonbinary; aged 20–36 years; M = 23.65) at National Yang Ming Chiao Tung University to participate in this experiment. All participants reported Mandarin as their first language, and their mean self-rated English speaking ability was 4.38 (SD = 1.11) on a 7-point scale, with 1 indicating poor and 7 proficient. Their mean onset age of English learning was 6.15 years (SD = 2.26), and their mean length of English learning was 16.73 years (SD = 4.04). In Taiwan, English exposure could start as early as Grade 3 in elementary school. None of the participants had lived abroad for more than six months, so all the participants had learned English in the school setting, similar to the learning experience described in the questionnaire responses reported in Section I.3. The majority of the participants (36 out of the 40) reported that they only used English in English-related classes; 3 also used English at work; only 1 used English more broadly with friends. None reported any hearing deficiencies. All participants were compensated monetarily for their time.

2 Design and materials

A total of 288 monosyllabic/disyllabic stimuli were compiled, half being real words and the other half nonwords (see Appendix 1 in supplemental material). The real words were commonly used English words taken from the ‘senior high school English word-list’ created by the College Entrance Examination Center in Taiwan (https://www.ceec.edu.tw) with the exception of four proper nouns; the nonwords were selected or modified from the ARC Nonword Database (Rastle et al., 2002). Both the real word and nonword lists included 24 items with voiceless stops following /s/ (sT Context), 24 items with these stops as onsets of an unstressed, non-initial syllable (UnstressedT Context), and 24 items each with initial voiceless and initial voiced stops as controls (T and D Contexts, respectively). These items were counterbalanced in terms of place of articulation (labial, alveolar, velar), yielding eight items per place in each context. Crucially, each stimulus contained only one oral stop in order to avoid any potential assimilation or dissimilation effects (Kenstowicz and Suchato, 2006). Forty-eight monosyllabic and disyllabic fillers (24 items each) were also included in each list to serve as distractors. The onset of the monosyllabic fillers was either a fricative or an affricate, as was the onset of the unstressed second syllable of the disyllabic fillers. Sample stimuli are provided in Table 5.

Table 5.

Sample stimuli of the imitation experiment.

Lexicality		Word			Nonword
Items	Context	Labial	Alveolar	Velar	Labial	Alveolar	Velar
Control	T	peace	tough	kiss	piv	tace	kelve
Control	D	bean	dance	goose	bazz	doash	gurn
Experimental	sT	spoil	staff	skin	spave	stush	skeel
Experimental	UnstressedT	happy	photo	circus	seppin	zutter	halcon
Total stimuli: (8 items × 3 places × 4 contexts + 48 fillers) × 2 lexicality conditions = 288.
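The stimulus count under Table 5 can be reproduced with a short sketch. The constant and function names are illustrative; the numbers come from the design described above.

```python
# Design constants taken from the description of the materials.
ITEMS_PER_CELL = 8  # items per place of articulation within one context
PLACES = ("labial", "alveolar", "velar")
CONTEXTS = ("T", "D", "sT", "UnstressedT")
FILLERS_PER_LIST = 48
LEXICALITY = ("word", "nonword")


def items_per_context() -> int:
    """Number of items in one context of one list."""
    return ITEMS_PER_CELL * len(PLACES)


def stimuli_per_list() -> int:
    """Control and experimental items plus fillers in one list."""
    return items_per_context() * len(CONTEXTS) + FILLERS_PER_LIST


def total_stimuli() -> int:
    """Stimuli across the real word and nonword lists."""
    return stimuli_per_list() * len(LEXICALITY)
```

Each list therefore holds 96 control and experimental items plus 48 fillers, that is, 144 stimuli, and the two lists together give 288.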

The stimuli were recorded in isolation by a male native speaker of a variety of North American English in which the unstressed, non-initial alveolar stop is produced as a flap (e.g. [ˈfoʊɾoʊ] photo) and those in other places are produced as unaspirated (e.g. [ˈhæpi] happy). He was a monolingual English speaker with a good command of Mandarin after studying in Taiwan for three years. Setting the stimuli produced as flaps aside (the 16 alveolar tokens in the UnstressedT context), we manually labeled and measured the stop VOTs using Praat (Boersma and Weenink, 2017). Figure 1 shows the acoustics of the stimuli visualized using the ggplot2 package (Wickham, 2009) in R (R Core Team, 2017).

Figure 1.

Voice onset times (VOTs) of the recorded stimuli.

The results of a two-way ANOVA using the aov() function and corresponding post-hoc tests using the TukeyHSD() function in R showed significant Context (F(3, 168) = 274.05, p < .0001) and Lexicality (F(1, 168) = 7.68, p = .006) main effects as well as their interaction (F(3, 168) = 11.5, p < .001). The Context main effect was driven by the difference found in the following pairs: sT-D, T-D, UnstressedT-D, T-sT and UnstressedT-T contexts (all p < .0001). These significant comparisons indicated a phonetic difference across the voiced (D), voiceless unaspirated (sT, UnstressedT) and voiceless aspirated (T) contexts. The majority (33 out of 48) of the stimuli in the D context had negative VOTs. Word-initial voiced stops (e.g. dance) in English are usually produced with a short-lag VOT, similar to the voiceless stops after /s/ (Section I.3). We attributed the voicing of the stimuli in the D context to the fact that they were produced carefully with a clear speech mode in a lab setting. Both unaspirated contexts were produced as short-lag. The average VOT in the UnstressedT context (M = 13.81 ms, SD = 6.14 ms) was slightly shorter than that in the sT context (M = 18.17 ms, SD = 5.56 ms), but the post-hoc test revealed that the difference was not significant (p > .05). We attributed this numerical difference to polysyllabic shortening whereby syllabic duration decreases as the word length increases (Gibson and Summers, 2018) since the stimuli in the sT context were monosyllabic while those in the UnstressedT context were disyllabic. The VOTs of the nonwords (black) in the D context were significantly different from those of the words (gray) (p < .001); the VOT values of the nonwords in this context varied more than those of the words (SD_nonword = 71.71, SD_word = 44.18), a phenomenon that is commonly observed for nonwords (Coltheart and Ulicheva, 2018; Mousikou et al., 2017). No other simple effects of Lexicality were found.
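The degrees of freedom reported for the ANOVA can be checked against the design with a small sketch. It assumes that the measured tokens are the 96 non-filler items in each list minus the flapped alveolar items in the UnstressedT context (8 per list); the function name is hypothetical and the formulas are the standard ones for a fully crossed two-way design.

```python
def two_way_anova_df(n_obs: int, levels_a: int, levels_b: int) -> dict:
    """Degrees of freedom for a two-way ANOVA with an interaction term."""
    return {
        "A": levels_a - 1,
        "B": levels_b - 1,
        "A:B": (levels_a - 1) * (levels_b - 1),
        "residual": n_obs - levels_a * levels_b,
    }


# Assumed token counts: 8 items x 3 places x 4 contexts x 2 lists,
# minus the 8 flapped alveolar UnstressedT items in each of the 2 lists.
non_filler_tokens = 8 * 3 * 4 * 2
flapped_tokens = 8 * 2
measured_tokens = non_filler_tokens - flapped_tokens

# Context has 4 levels (T, D, sT, UnstressedT); Lexicality has 2.
df = two_way_anova_df(measured_tokens, levels_a=4, levels_b=2)
```

Under these assumptions, 176 tokens were measured, giving 3, 1, and 3 degrees of freedom for Context, Lexicality, and their interaction, with 168 residual degrees of freedom, in line with the F statistics reported above.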

3 Procedure

The 40 participants were randomly assigned to the audio group (11 women, 8 men, 1 nonbinary; aged 20–36 years; M = 23.7) or the mixed group (10 women, 10 men; aged 20–31 years; M = 23.6). They were recorded individually in a sound-attenuated booth using a Marantz PMD661A recorder at a sampling rate of 44.1 kHz, 16 bits. An AKG P220 large-diaphragm condenser microphone was connected to the recorder and placed on a stand facing the participant’s mouth approximately 20 cm away. There were three sessions (baseline, nonword and real word), with the order of the latter two randomized for each participant. In the baseline session, the 144 items on the real word list were randomized and presented visually in writing to all the participants regardless of group. They were instructed verbally and with written instructions on the computer screen to read the words at a comfortable speaking rate. Since baseline productions with written forms were not included in the previous studies on the orthography effect reviewed above, we explain the motivation for this method here. One research question of interest is whether the orthography effect will be stronger for nonwords than for real words, since participants may have prior exposure to the real words and should be less affected by the presence of written forms. To ensure the activation of the target words in the participants’ stored exemplars, we chose to use written instead of aural prompts to elicit the baseline productions.

For the nonword and real word sessions, the participants were instructed to listen to the 288 items, presented in random order, and to repeat them as closely as possible. For the participants assigned to the audio group, each stimulus was played 700 ms after a fixation cross appeared on the screen; for those assigned to the mixed group, each aural stimulus was played 700 ms after its written form appeared on the screen (Vendelin and Peperkamp, 2006). For each session, there were six practice trials before the experiment to familiarize the participants with the task. Participants were given 2,500 ms to respond before the next trial started. During this time, the prompt word (in 18-point Courier New font) or the fixation cross remained centered on the screen. An 800 ms inter-trial interval with no visual input was given before the next trial started. All sessions were conducted in E-Prime (Schneider et al., 2012). The total duration of the experiment was 40 minutes. All productions of the experimental items were manually labeled in Praat (Boersma and Weenink, 2017). Two phoneticians checked and agreed on the labeling of each stop. Stops produced with a negative VOT, flapping, or aspiration were also labeled. Except for the stops labeled as flaps, all VOTs were measured.

III Results

1 Baseline production

We first examined the results from the baseline productions to compare with those of the imitated speech of the participants in the audio and the mixed groups. Out of the 3,840 total tokens (8 items × 3 places × 4 contexts × 20 participants × 2 groups), 22 alveolar tokens in the UnstressedT context that were produced as flaps were set aside to be analysed separately (see Table 7). We excluded 44 tokens that were missing or mispronounced, and 42 tokens with VOT values more than 2.5 standard deviations away from the mean (exclusion rate: 2.2%). Tokens were considered mispronounced when there was a change in manner or place or when the whole segment was substituted with another; tokens that were imperceptible or were unclearly produced were considered missing. The baseline VOT productions are shown in Figure 2. All data are available at https://osf.io/ns4tk/?view_only=eb3b0646afc040f8a5b34e26bb540609.
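The 2.5 standard deviation criterion can be sketched as follows. This is a minimal Python illustration rather than the authors' R code; the function name `trim_outliers` and the toy VOT values are hypothetical, and the text does not state whether the mean and SD were computed over all tokens or within each context, or whether the sample or population SD was used (the sketch assumes one pooled list and the sample SD).

```python
from statistics import mean, stdev

def trim_outliers(vots_ms, k=2.5):
    """Split VOT values into kept and excluded lists using a mean ± k·SD rule."""
    m = mean(vots_ms)
    sd = stdev(vots_ms)  # sample SD (assumption; the source does not specify)
    lower, upper = m - k * sd, m + k * sd
    kept = [v for v in vots_ms if lower <= v <= upper]
    excluded = [v for v in vots_ms if v < lower or v > upper]
    return kept, excluded

# Hypothetical VOT values in ms (not taken from the study's data).
toy_vots = [20.0, 22.0, 25.0, 24.0, 21.0, 23.0, 26.0, 19.0, 22.0, 24.0, 150.0]
kept, excluded = trim_outliers(toy_vots)
print(excluded)

# Exclusion rate from the counts in the text:
# (44 missing or mispronounced + 42 outliers) out of 3,840 tokens.
exclusion_rate = (44 + 42) / 3840
print(f"{exclusion_rate:.1%}")
```

The last two lines only restate the arithmetic behind the reported 2.2% exclusion rate.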

Figure 2.

Voice onset times (VOTs) of the participants’ baseline productions as a function of group (audio vs. mixed) and context (T, D, sT, and UnstressedT).

The y-axis represents the participants’ VOT values, and the x-axis represents the four different contexts. The results from the audio group are shown in black, while those from the mixed group are in gray. From visual inspection, the baseline productions were comparable between the two groups. The VOTs in the UnstressedT context (M_audio = 74.23 ms; M_mixed = 69.06 ms) were relatively closer to those in the T context (M_audio = 93.49 ms; M_mixed = 90.93 ms), suggesting canonical aspirated stops were produced in this unaspirated context, whereas the VOTs in the sT context (M_audio = 24.22 ms; M_mixed = 24.15 ms) were closer to those in the D context (M_audio = 22.05 ms; M_mixed = 22.31 ms), suggesting more unaspirated stops were produced in this context. That being said, we still observed many outliers with long VOT values in the sT context: 45 out of 948 tokens were produced as aspirated in the sT context. These aspirated tokens were produced by only 11 out of the 40 participants, evenly distributed across the three places of articulation (alveolar: 15; dorsal: 17; labial: 13) and observed on random words. These findings reflect the learning experience described in the questionnaire responses: the majority of Taiwan Mandarin learners of English are aware of the unaspirated variant in the sT context but not in the UnstressedT context (Tables 2 and 3).

To confirm that the baseline productions were comparable between the audio and mixed groups, a mixed-effects linear regression analysis was conducted in R using the lme4 package (Bates et al., 2015), and associated p-values were obtained using the lmerTest package (Kuznetsova et al., 2016). The dependent variable was the raw VOTs (in ms) from the baseline productions. The model included fixed effects for group (audio vs. mixed) and context (T, D, sT, and UnstressedT) and their interactions. The model also included the random intercept for participant as well as the by-participant slope for context. The results of the statistical model are summarized in Table 6.

Table 6.

Summary of fixed effects for the baseline production.

	B	SE	t	p
(Intercept)	74.27	2.88	25.83	< .001***
Mixed	−5.31	4.07	−1.31	.199
T	19.52	2.98	6.56	< .001***
D	−52.28	2.94	−17.80	< .001***
sT	−50.07	2.81	−17.82	< .001***
Mixed:T	3.20	4.21	0.76	.451
Mixed:D	5.64	4.16	1.36	.183
Mixed:sT	5.31	3.98	1.34	.189

Notes. Model: lmer(VOT (in ms) ~ group (ref = audio) * context (ref = UnstressedT) + (1 + context | participant)). *** p < .001.

The statistical model showed no difference between the audio and mixed groups in the reference UnstressedT context, as indicated by the lack of a group effect (β = −5.31, p = .2). The lack of group-context interactions (all p > .1) suggests no group difference in the other contexts as well. The VOTs from the reference UnstressedT context were significantly longer than those in the D context (β = −52.28, p < .001) and sT context (β = −50.07, p < .001), but shorter than those in the T context (β = 19.52, p < .001). Although the behaviors were comparable between the two groups, we observed that the participants produced tokens in the UnstressedT context as aspirated stops most frequently (897 of 912 valid tokens were aspirated), compared with the other unaspirated context, sT (only 45 of 948 valid tokens were aspirated). These categorical labels were reflected in the VOT measurements, resulting in very close VOT averages for the UnstressedT and T contexts (M_UnstressedT = 71.7 ms, M_T = 92.22 ms), compared with the other unaspirated context, sT (M_sT = 24.19 ms), the VOT averages of which were closer to those for the D context (M_D = 22.18 ms).
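Because the model uses treatment coding, the fixed effects in Table 6 can be summed to recover the model-implied mean VOT for each group and context. The following Python sketch (not the authors' R code; the function name `predicted_vot` is hypothetical) copies the estimates from Table 6 and adds the relevant terms for each cell.

```python
# Fixed-effect estimates copied from Table 6 (ms).
# Reference levels: audio group, UnstressedT context.
coef = {
    "(Intercept)": 74.27, "Mixed": -5.31,
    "T": 19.52, "D": -52.28, "sT": -50.07,
    "Mixed:T": 3.20, "Mixed:D": 5.64, "Mixed:sT": 5.31,
}

def predicted_vot(group, context):
    """Model-implied mean VOT (ms) for one group × context cell."""
    value = coef["(Intercept)"]
    if group == "mixed":
        value += coef["Mixed"]
    if context != "UnstressedT":
        value += coef[context]
        if group == "mixed":
            value += coef[f"Mixed:{context}"]
    return round(value, 2)

cells = {(g, c): predicted_vot(g, c)
         for g in ("audio", "mixed")
         for c in ("UnstressedT", "T", "D", "sT")}
for cell, vot in cells.items():
    print(cell, vot)
```

The resulting values fall close to the raw group means reported for Figure 2 (for example, about 74 ms and 69 ms for the audio and mixed groups in the UnstressedT context), which is what one would expect if the two groups behaved alike at baseline.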

For the production of voiceless alveolar stops in the UnstressedT context (e.g. photo) (320 tokens [8 real words × 20 participants × 2 groups]), 12 tokens were excluded (mispronounced or missing; 3 from the audio group, 9 from the mixed group; exclusion rate: 3.75%). To examine if there were any patterns in the distribution of the productions, a two-way chi-square test was conducted on the numbers of flap, aspirated and unaspirated tokens in the two groups. The results showed that the numbers were comparable between the two groups (χ²(2) = 3.79, p = .15). The predicted vs. actual counts from both groups are listed in Table 7. The predicted counts (obtained using chisq.test()$expected) are the values each cell of the table would have if there were no association between the two variables. We also observed a high number of <t> produced as aspirated [t^h] (148 of 157 tokens in the audio group; 133 of 151 in the mixed group).

Table 7.

Actual and predicted counts of flap, aspirated and unaspirated tokens in producing alveolar stops in the UnstressedT context by the participants in the audio and mixed groups.

		Flap	Aspirated	Unaspirated
Audio	Actual	7	148	2
Audio	Predicted	11.21	143.24	2.55
Mixed	Actual	15	133	3
Mixed	Predicted	10.79	137.76	2.45
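The predicted counts and the test statistic can be recomputed from the actual counts in Table 7. The sketch below uses plain Python in place of R's chisq.test(); the function names are hypothetical, and the p-value shortcut relies on the fact that the upper-tail probability of a chi-square distribution with 2 degrees of freedom is exp(−χ²/2), so it does not generalize to other table sizes.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square test of independence for a list-of-rows count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total × column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

def p_value_df2(stat):
    """Upper-tail p-value of a chi-square statistic; valid only for df = 2."""
    return math.exp(-stat / 2)

# Actual counts from Table 7: rows = audio, mixed; columns = flap, aspirated, unaspirated.
table7 = [[7, 148, 2], [15, 133, 3]]
stat, df, expected = chi_square_independence(table7)
p = p_value_df2(stat)
print(round(stat, 2), df, round(p, 2))
print([[round(e, 2) for e in row] for row in expected])
```

Note that the expected counts in the unaspirated column are below 5, so the chi-square approximation is rough for this table.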

Taken together, the results from the baseline productions revealed that aspirated stops were frequently observed in the UnstressedT context. Though not as common, aspirated stops were also observed in the sT context. The fact that aspirated stops were produced in these two unaspirated contexts suggests that the written forms <p, t, k> are associated with the canonical aspirated [p^h, t^h, k^h], more so for the higher-order, prosodic context (UnstressedT) than for the segmentally adjacent context (sT). This baseline observation reflects the learning experience of Taiwan Mandarin speakers in which the segmentally adjacent context is more likely to be taught in the classroom setting or independently realized.

2 Imitated production

Out of the 7,680 tokens (8 items × 3 places × 4 contexts × 2 lexicality × 2 groups × 20 participants), 352 tokens that were produced as flaps were set aside for further analysis. We also excluded 105 tokens that were missing or mispronounced, and 172 tokens with VOTs that were 2.5 standard deviations away from the mean (exclusion rate: 3.6%). We provide an analysis on flaps later in this section. The VOT values of the remaining tokens in the imitated productions are shown in Figure 3.

Figure 3.

The participants’ voice onset times (VOTs) (ms) of the imitated productions as a function of context (T, D, sT, UnstressedT) and group (audio vs. mixed) paneled by lexicality (nonword vs. word).

The VOTs of the UnstressedT stimuli in the imitated productions were much shorter than those in the baseline productions (M_baseline = 71.65 ms; M_imitation = 45.01 ms). The number of aspirated stops in the UnstressedT context was also much lower in the imitated productions, a strong indication that the participants were following the imitation task. As previously mentioned, 98% of the valid tokens (897 of 912) in the baseline productions were labeled as aspirated compared to 61% (937 of 1,535 tokens) in the imitated productions. The difference between the baseline and imitated productions in the sT context was limited due to the already lower number of aspirated productions in the baseline. However, we also observed a higher percentage of aspirated tokens in the sT baseline productions than in the sT imitation productions: 4.74% of the tokens (45 of 948) were labeled as aspirated in the baseline productions, while only 0.9% of the tokens (17 of 1,900) were labeled as aspirated in the imitated productions. The participants were still imitating the audio input, but less room was available for imitation in the sT context.

We then compared the degree of imitation mediated by the presence or absence of orthography. We assessed the degree of imitation for each trial using the following formula (Babel, 2012; Kim and Clayards, 2019; Walker and Campbell-Kibler, 2015):

$|X_{\text{target}} - X_{\text{baseline}}| - |X_{\text{target}} - X_{\text{imitation}}|$

The resulting values reflect the number of units (i.e. milliseconds) that the participants moved their production towards the target and away from their own baseline. A number close to 0 indicates limited imitation while a positive or a negative value shows convergence towards or divergence from the target speech, respectively. The baseline measures for each participant’s voiceless stop productions in the T, D, sT, and UnstressedT contexts were the averages of the eight baseline real word productions for each place of articulation. The target measures were the VOTs of the audio stimuli, whereas the imitation measures were the participants’ VOT productions in the nonword and real word sessions upon hearing the aural stimuli. Figure 4 shows the degrees of imitation measured in terms of VOT.
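A minimal Python sketch of this measure is given below. The function name `degree_of_imitation` is hypothetical. The first example plugs in group-level means reported in this article (the UnstressedT stimulus mean of 13.81 ms and the baseline and imitation means of 71.65 ms and 45.01 ms) as if they were a single trial, purely for illustration; in the study the measure was computed per trial.

```python
def degree_of_imitation(target_ms, baseline_ms, imitation_ms):
    """Return |target - baseline| - |target - imitation| in ms.

    Positive: the imitation moved toward the target (convergence).
    Negative: the imitation moved away from the target (divergence).
    Near zero: little change relative to the speaker's own baseline.
    """
    return abs(target_ms - baseline_ms) - abs(target_ms - imitation_ms)

# Illustration with means reported in the article, treated as one trial.
convergence = degree_of_imitation(target_ms=13.81, baseline_ms=71.65, imitation_ms=45.01)
print(round(convergence, 2))

# Hypothetical values showing divergence: the imitation is further from the
# target than the baseline was.
divergence = degree_of_imitation(target_ms=10.0, baseline_ms=50.0, imitation_ms=70.0)
print(divergence)
```

Because both terms are absolute distances from the target, the measure does not distinguish overshooting from undershooting the target VOT.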

Figure 4.

The participants’ degree of imitation in voice onset time (VOT) (ms) as a function of context (T, D, sT, UnstressedT) and group (audio vs. mixed) paneled by lexicality (nonword vs. word).

The y-axis represents the degree of imitation (the difference in ms between the baseline and imitated productions); the x-axis shows the four contexts. The results of the audio group (black) and mixed group (gray) are paneled by lexicality (nonword on the left, real word on the right). From visual inspection, limited imitation was observed for the T, D and sT contexts, while substantial imitation was observed for the UnstressedT context. This is due to the aforementioned fact that the number of aspirated stops in the UnstressedT context was much lower in the imitated productions. Furthermore, in the UnstressedT context, the degree of imitation seemed to differ between the two groups for nonwords but not for real words. We conducted a mixed-effects linear regression analysis with the UnstressedT context, nonword and audio group set as the reference levels. The imitated VOTs were the dependent variable, and group, context and lexicality were the fixed variables. The model also included a random intercept for participant as well as by-participant random slopes for context and lexicality. The results of the statistical model are summarized in Table 8.

Table 8.

Summary of fixed effects for the imitated production.

	B	SE	t	p
(Intercept)	37.77	2.70	13.98	< .001***
Mixed	−13.50	3.82	−3.54	.001***
T	−40.68	2.94	−13.84	< .001***
D	−41.03	2.80	−14.66	< .001***
sT	−37.81	2.86	−13.22	< .001***
Word	−11.30	1.16	−9.77	< .001***
Mixed:T	14.35	4.15	3.46	.001**
Mixed:D	14.92	3.96	3.77	< .001***
Mixed:sT	14.41	4.04	3.57	.001***
Mixed:word	6.98	1.62	4.31	< .001***
T:word	12.33	1.46	8.47	< .001***
D:word	12.96	1.47	8.84	< .001***
sT:word	10.60	1.45	7.32	< .001***
Mixed:T:word	−5.59	2.05	−2.73	.006**
Mixed:D:word	−6.85	2.07	−3.31	.001***
Mixed:sT:word	−6.96	2.03	−3.42	.001***

Notes. Model: lmer(imitated VOT (in ms) ~ group (ref = audio) * context (ref = UnstressedT) * lexicality (ref = nonword) + (1 + context + lexicality | participant)). ** p < .01 *** p < .001.

The fitted model showed that, in the reference UnstressedT/nonword condition, imitation was significantly weaker in the mixed group than in the audio group (β = −13.5, p = .001); that is, the participants imitated better when they were not given orthographic input. This orthographic effect, however, was curtailed in the imitation of the real words, as indicated by the group-lexicality interaction (β = 6.98, p < .001). Post-hoc tests adjusted by the Tukey method using the emmeans package (Lenth, 2021) in R confirmed that the presence or absence of written forms did not have a significant effect on the production of real words in the UnstressedT context (p = .94); the group effect was observed only for nonwords. Post-hoc comparisons also revealed no group effect in the other contexts.
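The pattern of group effects can be read off Table 8 by summing the terms that involve the mixed group, since all other terms cancel when the two groups are compared within the same context and lexicality. The Python sketch below (not the authors' R or emmeans code; the function name `group_difference` is hypothetical) yields point estimates only. Whether a difference is significant depends on standard errors from the fitted model, which are not reproduced here.

```python
# Estimates copied from Table 8 (ms); only the terms involving the mixed group.
# Reference levels: audio group, UnstressedT context, nonword.
coef = {
    "Mixed": -13.50,
    "Mixed:T": 14.35, "Mixed:D": 14.92, "Mixed:sT": 14.41,
    "Mixed:word": 6.98,
    "Mixed:T:word": -5.59, "Mixed:D:word": -6.85, "Mixed:sT:word": -6.96,
}

def group_difference(context, lexicality):
    """Model-implied mixed-minus-audio difference (ms) in one context × lexicality cell."""
    diff = coef["Mixed"]
    if context != "UnstressedT":
        diff += coef[f"Mixed:{context}"]
    if lexicality == "word":
        diff += coef["Mixed:word"]
        if context != "UnstressedT":
            diff += coef[f"Mixed:{context}:word"]
    return round(diff, 2)

differences = {(c, l): group_difference(c, l)
               for c in ("UnstressedT", "T", "D", "sT")
               for l in ("nonword", "word")}
for cell, diff in differences.items():
    print(cell, diff)
```

Under these estimates the group difference is about −13.5 ms for UnstressedT nonwords, roughly half that for UnstressedT real words, and within about 2.5 ms of zero in every other cell.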

We then analysed the imitated patterns for the non-initial alveolar tokens in the UnstressedT context which were produced as flaps in the aural stimuli presented to the participants. We compared the imitated results with the baseline productions in which the canonical aspirated tokens dominated, and flap and unaspirated tokens were only sporadically observed (see Table 7). Crucially, no distributional difference was observed between the two groups in the baseline production. Of the 640 tokens in imitated productions (8 items × 2 lexicality × 2 groups × 20 participants), 7 were excluded because they were either missing or mispronounced. A two-way chi-square test of the observed distributions of flap, aspirated and unaspirated productions revealed significant results (χ²(2) = 24.038, p < .001), suggesting that the aspirated tokens were over-represented in the mixed group but under-represented in the audio group. The reverse, however, was found for the unaspirated tokens: they were over-represented in the audio group but under-represented in the mixed group, as illustrated in Figure 5. The number of flap tokens, though slightly greater for audio than for mixed group, was comparable between the two groups. The predicted vs. actual counts of the two groups’ flap, aspirated and unaspirated tokens are listed in Table 9. We found a sharp contrast between the imitated and baseline productions: flap tokens dominated in the imitated speech, while aspirated tokens dominated in the baseline speech. Furthermore, more canonical aspirated tokens were observed in the productions of the mixed group than in those of the audio group. This confirmed that the participants were indeed imitating the auditory input, and the imitations were better when the written forms were not provided than when they were.

Figure 5.

A mosaic plot for the distribution of flaps, aspirated and unaspirated tokens in imitating alveolar stops in the UnstressedT context in the audio and mixed groups.

Table 9.

Actual and predicted counts of flap, aspirated and unaspirated tokens in the imitation of alveolar stops in the UnstressedT context by the participants in the audio and mixed groups.

		Flap	Aspirated	Unaspirated
Audio	Actual	184	60	70
Audio	Predicted	175.6	84.33	54.07
Mixed	Actual	170	110	39
Mixed	Predicted	178.4	85.67	54.93
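One way to see which cells drive the significant chi-square result is to compute Pearson residuals, (actual − predicted) / √predicted, from the actual counts in Table 9. This is an illustrative Python sketch, not an analysis reported by the authors; the function name `pearson_residuals` is hypothetical. The squared residuals sum to the chi-square statistic.

```python
import math

def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for each cell of a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[(o - r * c / n) / math.sqrt(r * c / n)
             for o, c in zip(row, col_totals)]
            for row, r in zip(table, row_totals)]

# Actual counts from Table 9: rows = audio, mixed; columns = flap, aspirated, unaspirated.
table9 = [[184, 60, 70], [170, 110, 39]]
residuals = pearson_residuals(table9)
chi_square = sum(res ** 2 for row in residuals for res in row)

labels = ("flap", "aspirated", "unaspirated")
for group, row in zip(("audio", "mixed"), residuals):
    print(group, {label: round(res, 2) for label, res in zip(labels, row)})
print(round(chi_square, 3))
```

The aspirated and unaspirated cells have residuals larger than 2 in absolute value with opposite signs in the two groups, whereas the flap cells stay below 1, in line with the description of the mosaic plot.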

For each group, we further divided the tokens according to lexicality (see Table 10). The results of two additional chi-square tests showed that the numbers of flaps, aspirated stops and unaspirated stops were not evenly distributed in the nonwords (χ²(2) = 29.99, p < .001). Specifically, aspirated stops were over-represented in the mixed group whereas unaspirated stops were under-represented, and the opposite was found for the audio group. That is, the canonical aspirated tokens were observed more in the productions of the mixed group than in those of the audio group. This distributional difference was not observed for real words (χ²(2) = 3.23, p = .2).

Table 10.

Actual and predicted counts of flap, aspirated and unaspirated tokens in imitating alveolar stops in the UnstressedT context according to lexicality by the participants in the audio and mixed groups.

		Nonword			Word
		Flap	Aspirated	Unaspirated	Flap	Aspirated	Unaspirated
Audio	Actual	90	22	45	94	38	25
Audio	Predicted	87.39	39.5	30.12	87.94	45.21	23.85
Mixed	Actual	87	57	16	83	53	23
Mixed	Predicted	89.61	40.5	30.88	89.06	45.79	24.15

To summarize, compared with the baseline productions in which no group difference was observed, the different degrees of imitation in the imitated productions based on whether the written forms were provided or not indicated an orthography effect. In the imitated productions, canonical aspirated stops were less frequently observed in the UnstressedT context compared with the baseline productions, especially in the audio group as indicated by a stronger degree of imitation from the audio group than from the mixed group. This suggests that accurate imitation of unaspirated stops was impeded by the presence of orthography. Furthermore, this group effect was stronger for nonwords than for real words. This suggests that prior exposure to these real words diminished the effect of written forms: the association with the stored lexical exemplars suppressed any possible orthography effects. The same group and lexicality effects were found for the alveolar tokens in the UnstressedT context. Canonical aspirated stops were more frequently observed when written forms were provided, while flap/unaspirated allophones dominated when only aural input was given. This group effect, again, was only found for nonwords, but not for real words.

IV General discussion

This study examined an L2 learning scenario related to phonological alternations mediated by orthography. English voiceless stops, usually represented with <p, t, k> in writing, are realized as various allophones in different contexts: (1) as unaspirated [p, t, k] when they occur after /s/, or (2) as unaspirated [p, k] or flap [ɾ] when they occur in an unstressed non-initial onset. Mandarin speakers, however, often produce the English voiceless stops in these contexts as the canonical aspirated stops [p^h, t^h, k^h] that are associated with the stressed or word-initial onset position. We thus hypothesized that the failure to produce voiceless stop allophones, especially the unaspirated stops that exist as contrastive phonemes in Mandarin, stems from the incongruent correspondence of one grapheme to multiple phonetic forms (Table 1). We designed an imitation experiment in which one group of Mandarin participants was exposed to only aural input while the other was presented with both aural and written forms. The main findings from this experiment are as follows:

Canonical aspirated stops were more frequently found in the UnstressedT context when the participants were exposed to the written forms than when they were not, as indicated by the stronger degree of imitation in the audio group than in the mixed group.

Canonical aspirated stops were more frequently found in the higher-order, prosodically driven unaspirated context than the explicitly taught, segmentally adjacent context, as suggested by the stronger degree of imitation in the UnstressedT context than in the sT context.

Different effects of orthography on nonwords and real words only appear in the UnstressedT context.

For unstressed, non-initial alveolar stimuli, similar group and lexicality effects were found; namely, the canonical aspirated stop [t^h] was observed more frequently when written forms were provided than when they were not, and this pattern was only observed for nonwords but not real words.

These results contribute to the previous discussions on the effect of orthography by showing that incongruent correspondences between the written forms and phonetic forms indeed impede the accurate learning of an L2, despite there being a clear phonetic contrast in the L1 (here, aspirated vs. unaspirated stops). However, unlike the previous studies that showed L2 learners may perceive an illusory contrast in the auditory input according to the contrast represented by written forms (e.g. the same phonetic form [p] corresponding to two written forms <b> and <p> in German), the results of the present study further demonstrate a failure to notice a phonetic contrast when the written forms do not present a corresponding one. The written forms of <p, t, k> in English are strongly connected with the canonical productions of aspirated [p^h, t^h, k^h] that are found in the word-initial and stressed positions (Whalen et al., 1997), and this canonicity seems to level the acoustic differences of other allophones that could be easily noticed by Mandarin L2 learners.

Our findings echo those reported by Shea (2017) who observed that English learners of Spanish failed to connect allophonic variants [β, ð, ɣ] to the written forms <b, d, ɡ>, and only the canonical forms [b, d, ɡ] enjoyed a facilitatory effect. Using imitation instead of Shea’s priming paradigm, this study additionally showed that the written forms obscured the correct imitation of variants, and the effect was mediated by lexicality and the awareness of the variant contexts. Note that, with the exception of [ð], the English speakers did not have the allophonic variants in their native sound inventory; however, /p^h, t^h, k^h/ and /p, t, k/ exist as contrastive phonemes in Mandarin. Our results showed that the orthographic effect is strong enough to dampen a native contrast, especially for nonwords.

Our results, however, were inconsistent with those reported by Han and Kim (2022), who showed that Korean obstruent/nasal phonologically conditioned variants primed equally well with or without written forms. Since our study and Han and Kim’s (2022) employed different experimental paradigms and types of contrasts, direct comparisons are difficult, and any comparison should be interpreted with caution. We speculate that the Korean alternation is conditioned by segmental adjacency (i.e. a syllable-final obstruent is nasalized when immediately followed by a nasal; e.g. /kak.mok/ – [kaŋ.mok] ‘stick’) and that the target contrast is salient (obstruents vs. nasals). This case may resemble our sT context, in which learners can easily detect and generalize the pattern and are thus less affected by orthography.

Curious readers may notice that /k/ can be written more variably than /p, t/ and may wonder if the orthography effect might have been of a different magnitude with the velar than with non-velar stimuli and when the written form was <k> than when it was less canonically represented as <ch/c/ck>. Since our stimuli did include these different written forms, we were able to examine this question in the UnstressedT context in which the strongest degree of imitation was observed. We subset the data in the UnstressedT context and excluded those realized as flaps. We first examined the degree of imitation between velar and non-velar stimuli. The average degrees of imitation according to different places of articulation are shown in Table 11.

Table 11.

Average degrees of imitation and standard deviations (SD) as a function of place of articulation in the UnstressedT context.

	Labial	Alveolar	Velar
Counts	626	279	630
Degree of imitation (ms)	35.89	25.72	28.32
SD	21.39	20.79	26.82

Based on the results, the degree of imitation for the velar stimuli was between that for the labial and alveolar stimuli. We then examined if different degrees of imitation could be observed across different written forms that represent /k/, as shown in Table 12.

Table 12.

Average degrees of imitation for the dorsal stimuli according to different written forms in the UnstressedT context.

	<k>	<c>	<ch>	<ck>
Counts	236	237	40	117
Degree of imitation (ms)	32.18	28.43	26.76	20.82
SD	27.6	26.87	20.22	25.74

We indeed observed a higher average degree of imitation associated with <k> than with the other three written forms. However, due to the large standard deviation and the unbalanced counts in the stimuli, further research is needed to establish if different written forms indeed elicit orthography effects of varying magnitudes.
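As a consistency check, the per-spelling means in Table 12 can be pooled with their counts as weights; the result should match the velar column of Table 11. The Python sketch below only restates values copied from Table 12.

```python
# Counts and mean degrees of imitation (ms) copied from Table 12.
by_spelling = {
    "<k>": (236, 32.18),
    "<c>": (237, 28.43),
    "<ch>": (40, 26.76),
    "<ck>": (117, 20.82),
}

total_n = sum(n for n, _ in by_spelling.values())
# Count-weighted mean across the four written forms.
pooled_mean = sum(n * m for n, m in by_spelling.values()) / total_n
print(total_n, round(pooled_mean, 2))

# Share of tokens contributed by each written form, to show the imbalance.
shares = {form: n / total_n for form, (n, _) in by_spelling.items()}
print({form: f"{share:.1%}" for form, share in shares.items()})
```

The pooled count (630) and mean (about 28.32 ms) agree with Table 11, and the shares make the imbalance concrete: <ch> contributes only 40 of the 630 tokens.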

Our results also established differential mediation of orthography in various unaspirated contexts. The stops in the segmentally adjacent sT context, which is more easily observed by L2 learners and is usually explicitly taught (see Ho (1987) and our report on the English learning experience questionnaire in Section I.3) were less affected by orthography, presumably due to the phonological awareness of unaspirated stops in this context. On the contrary, stops in the other unaspirated UnstressedT context, which relies on a higher-order, prosodic knowledge and is rarely if ever taught, were severely affected by orthography. This finding has pedagogical implications; specifically, promoting phonological awareness of the unnoticed allophonic distribution may help in accurately learning the phonological alternations. Further research in this direction is called for to investigate if L2 learners who are exposed to different conditions with or without explicit training on the phonological alternations would be subject to the orthography effect to different degrees.

Furthermore, the effect of orthography was observed more strongly with nonwords than real words, presumably due to the well-established connections between the lexical representations and phonetic forms for the real words from existing exemplars. This implies that with more exposure to L2 vocabulary, the effect of orthography should decrease. To examine this possible effect, we obtained the word frequencies of the real word stimuli from Leech and Rayson (2014) (see Appendix 1 in supplemental material) and correlated them with the degrees of imitation. Though slight, we did find a significant negative correlation between the log-transformed word frequency and the degree of imitation (p < .001, r = −0.09). This direction of research is also left for the future, perhaps for a study that only includes real words with various word frequencies to investigate if L2 learners’ experience mediates the effect of orthography.
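The correlation described above can be sketched in Python as follows. The four data pairs are made up for illustration and are not the study's data, the function name `pearson_r` is hypothetical, and base-10 logarithms are an assumption, since the base of the log transformation is not stated (Pearson's r is unaffected by the choice of base).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (word frequency, degree of imitation in ms) pairs, for illustration only.
toy_items = [(5, 40.0), (50, 33.0), (500, 30.0), (5000, 21.0)]
log_freq = [math.log10(freq) for freq, _ in toy_items]
imitation = [degree for _, degree in toy_items]

r = pearson_r(log_freq, imitation)
print(round(r, 2))
```

The toy values give a far stronger negative correlation than the one reported here; with r = −0.09, the squared correlation is below 0.01, so word frequency would account for less than 1% of the variance in the degree of imitation.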

One limitation of this study concerning the different orthographic effects in real words and nonwords, however, needs to be pointed out. We included the baseline session, in which written forms of the real word stimuli, but not the nonword stimuli, were presented to the participants; one may thus wonder if this could have reduced the degree of imitation for the real words in the audio group. As previous studies have shown a lasting effect from exposure to spelling (e.g. Bürki et al., 2012), the baseline session may have created a ‘mixed’ condition for the real words by exposing the participants to the written forms. If the real words in the baseline session had been presented in another form (e.g. picture naming to bypass any possible mediation of orthography), we might have expected a similar degree of imitation in real words and nonwords in the audio group. This, however, did not obscure the observed effects of group (i.e. orthographic) and context (i.e. phonological awareness), which were the most important findings of this study. We leave the combination of picture naming with the imitation paradigm in examining the L2 orthography effect to future studies.

Finally, the participants we recruited were Taiwan Mandarin speakers who use a non-alphabetic system (see Section I.1) and traditional Chinese characters to write their L1. Apart from the studies that showed L2 orthography affecting the phonological categorization of nonnative sounds, different L1 orthographic systems (e.g. non-alphabetical Zhuyin vs. alphabetical Pinyin systems) have been shown to also affect how sounds are processed (e.g. Lin and Lin, 2010). The orthography effect in learning English stops could thus be stronger for Mandarin speakers who use Pinyin, an alphabetic system like English. A future study could compare the degrees of orthographic mediation in L2 learning by L1 speakers using different L1 orthographies.

Supplemental Material

Supplemental material, sj-docx-1-slr-10.1177_02676583231169127 for The effect of orthography in Mandarin speakers’ production of English voiceless stops by Yu-An Lu and Cheng-Huan Lee in Second Language Research

Footnotes

Acknowledgements

We extend our gratitude to Sang-Im Lee-Kim and Tsung-Ying Chen, as well as the attendees of the Seoul International Conference on Speech Sciences 2019 and the reviewers and editors of Second Language Research, for providing us with valuable comments and insights. We would also like to thank Chih-Chao Chang, Yung-Hsin Hsiao, Shao-Jie Jin, Meng-Hsuan Lin and Shih-Ching Yu for collecting and processing the data. Any remaining errors are ours.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Ministry of Science and Technology of Taiwan Grant (MOST110-2628-H-A49-001-MY2) to Yu-An Lu.

Supplemental material

Supplemental material for this article is available online.

References

Babel (2012) Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics 40: 177–89.

Barrios and Hayes-Harb (2020) Second language learning of phonological alternations with and without orthographic input: Evidence from the acquisition of a German-like voicing alternation. Applied Psycholinguistics 41: 517–45.

Bassetti (2006) Orthographic input and phonological representations in learners of Chinese as a foreign language. Written Language and Literacy 9: 95–114.

Bassetti (2007) Effects of hanyu pinyin on pronunciation in learners of Chinese as a foreign language. In: Guder, Jiang and Wan (eds) The cognition, learning and teaching of Chinese characters. Beijing, China: Beijing Language and Culture University Press, pp. 156–79.

Bates, Maechler, Bolker and Walker (2015) Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67: 1–48.

Best and Tyler (2007) Nonnative and second-language speech perception: Commonalities and complementarities. In: Munro and Bohn O-S (eds) Language experience in second language speech learning: In honor of James Emil Flege. Amsterdam: John Benjamins, pp. 1–47.

Boersma (2009) Cue constraints and their interactions in phonological perception and production. In: Boersma and Hamann (eds) Phonology in perception. Berlin: Mouton de Gruyter, pp. 55–110.

Boersma, Escudero and Hayes (2003) Learning abstract phonological from auditory phonetic categories: An integrated model for the acquisition of language-specific sound categories. In: Proceedings of the 15th International Congress of Phonetic Sciences. Barcelona: Universitat Autònoma de Barcelona, pp. 1013–16.

Boersma and Weenink (2019) Praat: Doing phonetics by computer (Version 6.0.26) [computer program]. Available at: http://www.praat.org (accessed April 2023).

Brannen (2002) The role of perception in differential substitution. Canadian Journal of Linguistics 47: 1–46.

Broselow and Kang (2013) Phonology and speech. In: Herschensohn and Young-Scholten (eds) The Cambridge handbook of second language acquisition. Cambridge: Cambridge University Press, pp. 529–54.

Bürki, Spinelli and Gaskell (2012) A written word is worth a thousand spoken words: The influence of spelling on spoken-word production. Journal of Memory and Language 67: 449–67.

Chao K-Y and Chen L-m (2008) A cross-linguistic study of voice onset time in stop consonant productions. International Journal of Computational Linguistics and Chinese Language Processing 13: 215–32.

Chrabaszcz, Winn, Lin and Idsardi (2014) Acoustic cues to perception of word stress by English, Mandarin and Russian speakers. Journal of Speech, Language, and Hearing Research 57: 1468–79.

Clark, Yallop and Fletcher (2007) An introduction to phonetics and phonology. Oxford: Wiley-Blackwell.

Coltheart and Ulicheva (2018) Why is nonword reading so variable in adult skilled readers? PeerJ 6: e4879.

Dupoux, Kakehi, Hirose, Pallier and Mehler (1999) Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance 25: 1568–78.

Dupoux, Sebastián-Gallés, Navarrete and Peperkamp (2008) Persistent stress ‘deafness’: The case of French learners of Spanish. Cognition 106: 682–706.

Durvasula, Huang H-H, Uehara, Luo and Lin Y-H (2018) Phonology modulates the illusory vowels in perceptual illusions: Evidence from Mandarin and English. Laboratory Phonology: Journal of the Association for Laboratory Phonology 9: 7.

Eckman (2008) Typological markedness and second language phonology. In: Edwards JGH and Zampini (eds) Phonology and second language acquisition. Amsterdam: John Benjamins, pp. 95–116.

Edwards JGH and Zampini (eds) (2008) Phonology and second language acquisition. Tempe, AZ: Arizona State University.

Erdener and Burnham (2005) The role of audiovisual speech and orthographic information in nonnative speech production. Language Learning 55: 191–228.

Escudero, Hayes-Harb and Mitterer (2008) Novel second-language words and asymmetric lexical access. Journal of Phonetics 36: 345–60.

Escudero, Simon and Mulak (2014) Learning words in a new language: Orthography doesn’t always help. Bilingualism: Language and Cognition 17: 384–95.

Escudero and Wanrooij (2010) The effect of L1 orthography on non-native vowel perception. Language and Speech 53: 343–65.

Flege (1982) Laryngeal timing and phonation onset in utterance-initial English stops. Journal of Phonetics 10: 177–92.

Flege (1987) The production of ‘new’ and ‘similar’ phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics 15: 47–65.

Gass (1996) Second language acquisition and linguistic theory: The role of language transfer. In: Flynn and O’Neil (eds) Linguistic theory in second language acquisition. Dordrecht: Kluwer Academic, pp. 384–403.

Gibson and Summers (2018) Polysyllabic shortening in speakers exposed to two languages. Bilingualism: Language and Cognition 21: 471–78.

Han J-I and Kim (2022) Orthographic activation in spoken word recognition of L2 phonological variants. Second Language Research 38: 671–88.

Hao Y-C (2018) Contextual effect in second language perception and production of Mandarin tones. Speech Communication 97: 32–42.

Hayes-Harb, Brown and Smith (2018) Orthographic input and the acquisition of German final devoicing by native speakers of English. Language and Speech 61: 547–64.

Hayes-Harb and Cheng H-W (2016) The influence of the Pinyin and Zhuyin writing systems on the acquisition of Mandarin word forms by native English speakers. Frontiers in Psychology 7: 785.

Hayes-Harb, Nicol and Barker (2010) Learning the phonological forms of new words: Effects of orthographic and auditory input. Language and Speech 53: 367–81.

D-a (1987) Shēngyùn xué zhōng de guānniàn hé fāngfǎ [Concepts and methods in phonology]. Taipei: Da-an.

Jun S-A and Oh (2000) Acquisition of second language intonation. In: Proceedings of the Sixth International Conference on Spoken Language Processing (ICSLP 2000). Volume 4, pp. 73–76. Available at: http://doi.org/10.21437/ICSLP.2000-754 (accessed April 2023).

Keating (1984) Phonetic and phonological representation of stop consonant voicing. Language 60: 286–319.

Kenstowicz and Suchato (2006) Issues in loanword adaptations: A case study from Thai. Lingua 116: 921–49.

Kim and Clayards (2019) Individual differences in the link between perception and production and the mechanisms of phonetic imitation. Language, Cognition and Neuroscience 34: 769–86.

Kuznetsova, Brockhoff and Christensen RHB (2016) lmerTest: Test in linear mixed effects model. R package version 2.0-33. Available at: http://CRAN.R-project.org/package=lmerTest (accessed April 2023).

Kwon (2019) The role of native phonology in spontaneous imitation: Evidence from Seoul Korean. Laboratory Phonology: Journal of the Association for Laboratory Phonology 10: 10.

Ladefoged and Johnson (2010) A course in phonetics. 6th edition. Boston, MA: Cengage Learning.

Lee, Shin D-J and Garcia MTM (2019) Perception of lexical stress and sentence focus by Korean-speaking and Spanish-speaking L2 learners of English. Language Sciences 72: 36–49.

Lee-Kim S-I (2021) Development of Mandarin tones and segments by Korean learners: From naïve listeners to novice learners. Journal of Phonetics 86: 101036.

Leech and Rayson (2014) Word frequencies in written and spoken English: Based on the British National Corpus. Abingdon: Routledge.

Lenth (2021) emmeans: Estimated Marginal Means, aka Least-Squares Means. R package version 1.6.2-1. Available at: https://CRAN.R-project.org/package=emmeans (accessed April 2023).

Lin H-N and Lin C-JC (2010) Perceiving vowels and tones in Mandarin: The effect of literary phonetic systems on phonological awareness. In: Proceedings of the 22nd North American Conference on Chinese Linguistics (NACCL-22) and the 18th International Conference on Chinese Linguistics (ICCL-18), Harvard University, Cambridge, MA.

Lin Y-H (2007) The sounds of Chinese. Cambridge: Cambridge University Press.

Lu Y-A and Kim (2016) Prosody transfer in second language acquisition: Tonal alignment in the production of English pitch accent by Mandarin native speakers. Tsing Hua Journal of Chinese Studies 46: 785–816.

Lu Y-A and Lee-Kim S-I (2021) The effect of linguistic experience on perceived vowel duration: Evidence from Taiwan Mandarin speakers. Journal of Phonetics 86: 101049.

Major (2008) Transfer in second language phonology: A review. In: Edwards JGH and Zampini (eds) Phonology and second language acquisition. Amsterdam: John Benjamins, pp. 63–94.

Mennen (2007) Phonological and phonetic influences in non-native intonation. Trends in linguistics studies and monographs, Volume 186. Berlin: Mouton de Gruyter.

Mitterer and Ernestus (2008) The link between speech perception and production is phonological and abstract: Evidence from the shadowing task. Cognition 109: 168–73.

Mok PPK, Lee, JJ and RB (2018) Orthographic effects on the perception and production of L2 Mandarin tones. Speech Communication 101: 1–10.

Mousikou, Sadat, Lucas and Rastle (2017) Moving beyond the monosyllable in models of skilled reading: Mega-study of disyllabic nonword reading. Journal of Memory and Language 93: 169–92.

Nielsen (2011) Specificity and abstractness of VOT imitation. Journal of Phonetics 39: 132–42.

Ou S-c (2020) Perceptual training on lexical stress contrasts: A study with Taiwanese learners of English as a foreign language. Cham: Springer Nature.

Qin and Tremblay (2014) Effects of native dialect on Mandarin listeners’ use of prosodic cues to English stress. In: Proceedings of the 7th International Conference on Speech Prosody, Dublin, Ireland.

R Core Team (2017) R: A language and environment for statistical computing [software]. Vienna: R Foundation for Statistical Computing. Available at: http://www.R-project.org (accessed April 2023).

Rastle, Harrington and Coltheart (2002) 358,534 nonwords: The ARC nonword database. The Quarterly Journal of Experimental Psychology 55: 1339–62.

Rogers (2014) The sounds of language: An introduction to phonetics. Abingdon: Routledge.

Schaefer and Darcy (2014) Linguistic prominence of pitch within the native language determines accuracy of tone processing. In: Miller, Martin, Eddington, et al. (eds) Selected Proceedings of the 2012 Second Language Research Forum: Building Bridges between Disciplines. Somerville, MA: Cascadilla Proceedings Project, pp. 1–14.

Schneider, Eschman and Zuccolotto (2012) E-prime user’s guide. Pittsburgh, PA: Psychology Software Tools.

Shatzman and McQueen (2006) Prosodic knowledge affects the recognition of newly acquired words. Psychological Science 17: 372–77.

Shea (2017) L1 English/L2 Spanish: Orthography–phonology activation without contrasts. Second Language Research 33: 207–32.

Showalter (2018) Impact of Cyrillic on native English speakers’ phono-lexical acquisition of Russian. Language and Speech 61: 565–76.

Showalter and Hayes-Harb (2013) Unfamiliar orthographic information and second language word learning: A novel lexicon study. Second Language Research 29: 185–200.

Showalter and Hayes-Harb (2015) Native English speakers learning Arabic: The influence of novel orthographic information on second language phonological acquisition. Applied Psycholinguistics 36: 23–42.

Simon, Chambless and Alves (2010) Understanding the role of orthography in the acquisition of a non-native vowel contrast. Language Sciences 32: 380–94.

Tsukada and Han J-I (2019) The perception of Mandarin lexical tones by native Korean speakers differing in their experience with Mandarin. Second Language Research 35: 305–18.

Vendelin and Peperkamp (2006) The influence of orthography on loanword adaptations. Lingua 116: 996–1007.

Walker and Campbell-Kibler (2015) Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task. Frontiers in Psychology 6: 546.

Weber and Cutler (2004) Lexical competition in non-native spoken-word recognition. Journal of Memory and Language 50: 1–25.

Whalen, Best and Irwin (1997) Lexical effects in the perception and production of American English /p/ allophones. Journal of Phonetics 25: 501–28.

Wickham (2009) ggplot2: Elegant graphics for data analysis. New York: Springer.

Yang (2021) Comparison of VOTs in Mandarin–English bilingual children and corresponding monolingual children and adults. Second Language Research 37: 3–26.

Young-Scholten (2002) Orthographic input in L2 phonological development. In: Burmeister, Piske and Rohde (eds) An integrated view of language development: Papers in honor of Henning Wode. Trier: Wissenschaftlicher Verlag, pp. 263–79.

Young-Scholten (2004) Prosodic constraints on allophonic distribution in adult L2 acquisition. International Journal of Bilingualism 8: 67–77.

Young-Scholten and Langer (2015) The role of orthographic input in second language German: Evidence from naturalistic adult learners’ production. Applied Psycholinguistics 36: 93.
