Removing bias towards World Englishes: The development of a Rater Attitude Instrument using Indian English as a stimulus

Abstract

This study explores the attitudes of raters of English speaking tests towards the global spread of English and the challenges in rating speakers of Indian English in descriptive speaking tasks. The claims put forward by language attitude studies indicate a validity issue in English speaking tests: listeners tend to hold negative attitudes towards speakers of non-standard English, and judge them unfavorably. As there are no adequate measures of listener/rater attitude towards emerging varieties of English in language assessment research, a Rater Attitude Instrument comprising a three-phase self-measure was developed. It comprises 11 semantic differential scale items and 31 Likert scale items representing three attitude dimensions of feeling, cognition, and behavior tendency as claimed by psychologists. Confirmatory factor analysis supported a two-factor structure with acceptable model fit indices. This measure represents a new initiative to examine raters’ psychological traits as a source of validity evidence in English speaking tests to strengthen arguments about test-takers’ English language proficiency in response to the change of sociolinguistic landscape. The implications for norm selection in English oral tests are also discussed.

Keywords

Rater attitude, rater judgement, scale development, validation, World Englishes

Theoretical background

World Englishes

New lines of sociolinguistic research, such as world Englishes (WE) and English as a lingua franca (ELF), have acknowledged the pluricentricity of English. The Kachru-led line of WE documents the function, status, linguistic maturity, and legitimacy of the emerging varieties, especially in countries formerly colonized by the UK or the US, such as India, which are labeled “outer-circle” contexts (Kachru, 1985). Despite a debate on whether it is a part of WE (Berns, 2008), ELF looks into the nature of English produced by non-native speakers, particularly without the involvement of inner circle speakers. The main research agenda of ELF focuses on phonology, pragmatics, and lexico-grammar (Jenkins, 2006; Seidlhofer, 2004), and reveals systematic and regular forms of English used by speakers of varieties of English (House, 1999). Research efforts put forth by WE and ELF have challenged traditional understandings of the ownership of English and the status of the native speaker norm (Davies, 2003; Widdowson, 1994). The features of users of English in different circle contexts have been thoroughly described and analyzed in The Handbook of World Englishes (Kachru, Kachru, & Nelson, 2006), Handbook of Varieties of English (Kortmann & Schneider, 2005), and The Oxford Guide to World Englishes (McArthur, 2003).

The spread of English and its variation across contexts raise important concerns that speakers in different circles, particularly in international situations, may become mutually unintelligible. The research agenda in intelligibility studies has expanded from a focus on speakers’ linguistic performance (Derwing & Munro, 2005; Field, 2005; Friederici, Kotz, Scott, & Obleser, 2010) to listeners’ shared responsibility for intelligibility (Levis, 2005), the accommodation skills of speakers and listeners (Seidlhofer, 2004), and contexts without the involvement of inner circle speakers, revealing strategies such as convergence and negotiation that help achieve mutual intelligibility (Deterding & Kirkpatrick, 2006; House, 2002, 2003; Meierkord, 2000; Firth & Wagner, 1997). These findings not only complicate the construct of intelligibility, but also advance claims for assessing English language learners’ ability to interact with speakers of different varieties of English rather than judging them against an idealized native speaker model (Canagarajah, 2006; Jenkins, 2006). How language learners’ proficiency should be evaluated has also provoked debate in the field of language testing.

Such debates generally concern two different ideologies on the norms against which test-takers’ performance in large-scale tests should be assessed: the standard English perspective, adopted for the sake of fair measurement results (Elder & Davies, 2006; Elder & Harding, 2008), and the WE paradigm, which holds that a single norm ignores the linguistic richness of the current global spread of English and is biased against test-takers brought up on different norms (Canagarajah, 2006; Davidson, 2006; Lowenberg, 2002; Spolsky, 1993). Relevant to the latter is the reorientation of test constructs with reference to the target language use (TLU) domain. Construct-wise, as Abeywickrama (2013) argues, a single variety as test input misrepresents the TLU domain in a global context. Therefore, a more accurate representation of the TLU domain for tests used in a wider context is, as Taylor (2006) claims, to include the English varieties that test-takers may encounter in the TLU domain, and to assess test-takers’ ability on “how language diversity is actively negotiated in acts of communication under changing contextual conditions” (Canagarajah, 2006, p. 234).

Language attitude study

The emerging multiple forms of English have drawn the concern of language testing professionals as potential biasing factors in language tests, because raters’ acknowledgment and perception of the varieties may vary, which may in turn affect their scoring judgments (Davies, Hamp-Lyons, & Kemp, 2003). This assumption of an attitude–behavior relationship could be substantiated by research findings in the fields of psychology and language attitude, which have important implications for language testing practice should similar results be transferable, and as such raise concerns about scoring validity, which is also a fairness issue (Kunnan, 2010). Empirical studies in language attitude have revealed that being identified as non-white in itself (Rubin, 1992) causes non-native speakers to be viewed unfavorably by listeners, regardless of whether they share the same variety (Lindemann, 2002; McKenzie, 2008; Rubin, 1992). The identification generates a negative evaluation of speaker competency (Lindemann, 2002; Rubin, 1992). Furthermore, the attitude–behavior relationship has also been established in psychology (Ajzen & Timko, 1986; Albarracin, Johnson, Fishbein, & Muellerleile, 2001; Fazio, Powell, & Williams, 1989; Hrubes, Ajzen, & Daigle, 2001). Despite these unfavorable judgments on competency, Giles and Billings (2004) in their extensive review of listener attitudes report that non-standard speech tends to be evaluated more highly in terms of “solidarity” when compared to varieties of standard speech. Non-standard variety speakers are generally rated highly on traits such as honesty and friendliness, particularly when the listeners are learners/speakers of a non-standard variety themselves. Studies on listeners from different ethnic groups also support this attitude and identity relationship (Barona, 2008; Bresnahan, Ohashi, Nebashi, Liu, & Shearman, 2002; Lindemann, 2003; McKenzie, 2008).

Language attitude is a complex construct. In psychology, the structure of attitude has been identified as comprising one or a combination of the three components of feeling, belief and behavioral tendency (Ajzen & Fishbein, 1980; Albarracin, Johnson, & Zanna, 2005; Cargile, Giles, Ryan, & Bradac, 1994). When attitude towards a specific language influences a listener’s judgment of a speaker, language attitude is formed (Cluver, 2000). To capture the complexity of language attitude formation in specific situations, a “social process model of language attitude” was proposed by Cargile et al. (1994). In this model, factors that evoke the formation of language attitude include characteristics of the speakers and listeners, and wider contextual factors, such as a speaker’s culture as perceived by listeners, interpersonal history, immediate social situations or a combination of all.

Attitude studies within language testing

Unlike the constructs being conceptualized in the field of psychology, recent studies on rater attitude towards WE have revealed mixed and inconclusive findings. Kim (2005) examined the language backgrounds of raters, their attitudes toward WE, and how they scored the speech performance of six Korean students on the Test of Spoken English (TSE) picture description task, using holistic and analytic scales. Although their ratings on the holistic scales were fairly similar, their different attitudes towards WE significantly affected their analytic ratings on grammar, rate of speech, and task fulfillment, with those labeled as “positive” giving more lenient ratings.

Chalhoub-Deville and Wigglesworth (2005) investigated the rating performance of raters from inner circle countries, including Australia, Canada, the UK and the US, and found no significant difference in their evaluation of ESL test-takers’ speaking performance. Similarly, Kang (2008), replicating the study design of Rubin (1992), compared college student raters’ attitudes towards Asian and Caucasian groups, and noted no significant effect of ethnicity on their attitudes, suggesting that the findings of similar studies may not always be applicable in different contexts. Touching on issues of norm selection, Zhang and Elder (2011) compared native and non-native ESL/EFL teacher raters’ ratings of national College English Test-Spoken English Test (CET-SET) samples. The results revealed similar outcomes in holistic scores between the two groups of raters, leading the authors to suggest that “these norms may not be as distinct as is sometimes claimed” (p. 14).

Studies focusing on test-takers’ attitudes generally show a welcome embrace of linguistic diversity, but a reservation about bringing the difference to the test. Hamid (2014) examined test-takers’ viewpoints on inclusion of WE in the English oral test as elicited by IELTS data. The findings were mixed: though WE was favored by the majority on a conceptual level, the inclusion of WE-based linguistic features in tests was less supported for reasons of “maintaining standards, fairness, equality, and test-taker interests” (p. 277). A similar test-takers’ orientation to WE was also found in Gu and Yo’s (2014) study using four stakeholder groups, including test-takers, ESL/EFL teachers, score users, and language testing professionals. Looking at the test-takers’ attitude–test performance relationship, Harding (2008) showed that in general test-takers’ attitudes towards speakers with L2 accents of English on an academic listening test were reasonably positive, and that there was no clear relationship between the attitudes towards speakers and their performance on a listening test.

Overall, language assessment research on attitudes indicates a growing interest in placing rater attitude or psychological traits as potential variables affecting arguments on test-taker language proficiency. Nevertheless, the findings may not be generalizable because the studies covered different contexts and the instruments used to investigate rater attitudes were different. It therefore seems necessary to develop a unified instrument that is theoretically sound and empirically driven within language assessment research to measure rater attitudes towards WE.

Construction and validation of a rater attitude instrument

The scale development included three phases: construct elicitation, exploration, and verification. To generalize findings across varieties of English and contexts, the scale includes general statements about issues in rating WE speakers and specific statements on Indian English, the stimulus used in phases two and three. Except for phase 1.1, different groups of raters of operational English oral tests were recruited for each phase of the study. Data collection over a 28-month period covered participants aged from 20 to 45 years, and ethical approval was obtained from the Institutional Review Board of the university with which the researcher was affiliated at the time of the study.

Construction phase 1.1: construct elicitation

To capture in fuller detail the features of the tripartite attitude constructs as asserted by psychologists (see aforementioned literature), two scaling methods were selected. Following a common practice in language attitude study, a seven-point semantic differential scale was selected to evaluate raters’ immediate feeling on speakers accompanied by their voices. To measure rater belief and rating tendency, declarative statements were necessary to depict and formulate various issues. As such, a five-point Likert scale (with values ranging from one = strongly disagree to five = strongly agree) was selected. An “un-ratable” option was included to allow any uncertainty in responding to items. Combining the two scales, a “Rater Attitude Instrument” (hereafter RAI) was created.

Sample and procedure

The construct elicitation was generated from two sources. For the Likert scale construction, three American male raters of a phone-based oral proficiency test were each interviewed either once or twice over the phone for approximately 40 minutes each time regarding their viewpoints of WE. A rich interview data set was theme-analyzed to screen out potential items for scale construction. Additional items were drawn from extensive literature review regarding stakeholder attitudes toward WE.

Next, for the semantic differential scale items, a “varietal speaker evaluation” (see Hsu, 2012) was conducted with 40 undergraduate student raters, most of whom were American (85%), from a large university in the US Midwest. Eight speech samples covering varieties of outer circle (including Indian English) and expanding circle English were used to collect an item pool representing raters’ immediate feelings towards WE. Student raters each provided up to three adjectives to describe their feeling about each speaker by completing the sentence, “The speaker sounds …”. A keyword analysis of the adjectives was conducted, and adjective pairs with a higher frequency were integrated and classified.
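
The keyword analysis described above amounts to a frequency count over the adjectives that raters supplied. The following is a minimal sketch of such a count, not the procedure actually used in the study; the function name and the sample responses are hypothetical and serve only as illustration.

```python
from collections import Counter


def adjective_frequencies(responses):
    """Count how often each adjective occurs across all raters' responses.

    `responses` is a list of lists; each inner list holds the (up to three)
    adjectives one rater gave for one speaker.
    """
    counts = Counter()
    for adjectives in responses:
        # Normalize case and surrounding whitespace so that "Clear" and
        # "clear " are counted as the same keyword; skip empty entries.
        counts.update(a.strip().lower() for a in adjectives if a.strip())
    return counts


# Hypothetical responses, invented for illustration only.
sample = [["clear", "friendly", "fluent"], ["Clear", "nervous"], ["fluent", "clear "]]
freq = adjective_frequencies(sample)
top_two = freq.most_common(2)  # the two most frequent adjectives with counts
```

The most frequent adjectives from such a count could then be paired with their antonyms to form candidate scale items.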

Construction phase 1.2: item construction

Development of an item pool

Rater feeling

The 25 most frequent adjectives and their opposites in construction phase 1.1 were selected and randomly placed in the seven-point semantic differential scale (see Table 1), following the common practice of scale construction, that is, positive adjectives were not always placed on the right side of the scale.

Table 1.

Adjective-and-antonym pairs.

Clear/unclear	Certain/uncertain	Confident/unsure
Calm/nervous	Comfortable/uncomfortable	Difficult/easy
Enthusiastic/indifferent	Experienced/inexperienced	Fluent/not fluent
Friendly/unfriendly	Interesting/boring	Intelligent/unintelligent
Informative/unhelpful	Knowledgeable/uneducated	Kind/unkind
Good natured/disagreeable	Good pronunciation/unclear	Natural/unnatural
Organized/disorganized	Quick/slow	Rushed/easy
Sure/hesitant	Thoughtful/inconsiderate	Timid/happy
Talkative/quiet

Rater belief

Guided by the Social Process Model (Cargile et al., 1994), 60 items were developed and categorized to represent two of the attitude constructs: belief and behavior tendency. The hypothetical structures went through a verification examination in construction phase three.

Perceived culture factor. Eight statements concerning raters’ perceived cultural features were constructed. The questions referred to the role of the varieties in second language assessment and institutional and cross-cultural settings. Another five statements measured raters’ perception of WE in general that included their attitudes towards WE status, knowledge of WE speakers’ demographic strengths, and concerns about WE in ESL/EFL teaching and learning.

Expectation of Indian English. Language expectancy theory (Burgoon & Miller, 1985) claimed that discrepancy between expected and actual language use leads to negative evaluations of the speaker. As such, 11 items measuring raters’ expectation of Indian English were included.

Rater behavioral tendency

Rating tendency. Twenty-one items investigated raters’ handling of unfamiliar varieties and their role as active listeners. Items on their actual behaviors were omitted as they are influenced by factors other than attitude, and therefore are not always accurate indicators of attitudes (Mueller, 1986).

Interpersonal history. Fifteen items in this category were designed to evaluate the extent of raters’ exposure to and the degree of familiarity with the varieties.

Rater biasing factors

Apart from attitude constructs, five rater biasing factors reviewed in the language testing (LT) literature were included. These were rater educational and professional experience (Chalhoub-Deville, 1995), residency (Chalhoub-Deville & Wigglesworth, 2005; Kim, 2005), rater nationality and native language (Brown, 1995), and gender (McKenzie, 2008; O’Loughlin, 2002), some of which have been reported as extraneous variables that affected scores. Hence, this section seeks to identify which biasing factor is associated with rater attitudes and ultimately affects rater scoring decisions.

The 60 items were reviewed by a group of experts (N = 4) to maximize the content validity of the instrument (DeVellis, 2003). Included were two doctoral candidates specializing in second language assessment and two researchers with a background in second language acquisition and sociolinguistics. Items were reviewed for clarity, representativeness and comprehensiveness of the constructs and the possibility of bias. Here 15 Likert-scale items were further refined for clarity and conciseness.

Construction phase 2: exploration

Sample and procedure

A new group of 20 raters who participated in this phase of the study had more than six months of experience rating oral proficiency tests for international students seeking placement at their respective universities in the US Midwest. There were 15 female and five male participants with an average age of 35. Of these, 45% held their highest degree in TESOL, 20% in English literature and 35% in other areas. The majority of the participants were native English speakers (55%), 30% were Asians and 15% Russian.

Access was graciously granted to the IELTS speech samples by Cambridge English Language Assessment. Descriptive tasks extracted from the IELTS oral test component were used in phases two and three. The speech samples cover a range of IELTS score bands (i.e., band four, five, six, seven, eight, and nine) awarded to the six Indian test-takers from actual IELTS data. Indian English was used as the sole stimulus across the three construction phases given that it has been documented as an established variety and figures prominently in the WE research agenda (Kachru et al., 2006; McArthur, 2003). Each descriptive task lasted 90 seconds.

The RAI was administered on-line. The URL address to access the RAI, speech samples and consent form was sent out to the raters. Each rater provided ratings on the 25 semantic differential scale items for each of the six Indian test-takers, which yielded a total of 120 rater-by-speaker observations (20 raters × 6 test-takers) on each item. Raters then proceeded to the Likert questionnaire measuring rater belief and rating tendency. Each scale was accompanied by a comment section to enable raters to provide feedback on qualities such as clarity. The entire study took approximately an hour, and each rater received $20 upon completion of the study.

Data analysis and results

Rater feeling

Descriptive statistics and internal consistency

The mean for the dataset is 4.84, and only three of the 25 items have a mean less than four. To test the assumption of univariate normality, skewness and kurtosis were checked. A more liberal recommendation on the acceptable levels as proposed by Kline (2005) was used where cutoffs of −3 to +3 for skewness and −10 to +10 for kurtosis were applied respectively. The skewness of the 25 semantic differential scale items ranged from −.984 to .477, and the values for kurtosis ranged from −1.479 to .906, indicating the responses were normally distributed and well within the liberal recommendation.
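
The normality screening described above can be sketched as follows. This is an illustrative sketch only: it uses simple moment-based skewness and excess kurtosis, which differ slightly from the bias-corrected estimates that SPSS reports, and the function names are hypothetical. The default cutoffs are the ±3 and ±10 values attributed to Kline (2005) in the text.

```python
def skewness_and_kurtosis(values):
    """Return moment-based skewness and excess kurtosis for one item.

    Note: these are the uncorrected population-moment versions, not the
    bias-corrected estimates reported by SPSS. Constant data (zero
    variance) would raise a ZeroDivisionError.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt


def within_liberal_cutoffs(values, skew_limit=3.0, kurt_limit=10.0):
    """Check one item against the liberal cutoffs cited in the text."""
    skew, kurt = skewness_and_kurtosis(values)
    return abs(skew) < skew_limit and abs(kurt) < kurt_limit


# Hypothetical ratings on a seven-point scale, for illustration only.
ratings = [1, 2, 3, 4, 5, 6, 7]
skew, kurt = skewness_and_kurtosis(ratings)
ok = within_liberal_cutoffs(ratings)
```

In practice the check would be applied to each of the 25 items in turn.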

Cronbach’s Alpha was calculated to assess internal consistency. A value of .880 was obtained, indicating a high level of internal consistency (de Vaus, 2002; George & Mallery, 2003).

The examination of correlation matrix for item consistency revealed that several items had low correlations with most other items. These were evaluated again against other criteria when running factor analysis to verify whether the low correlations were spurious or alternatively helped to clarify the factor structure.

Exploratory factor analysis

A principal component analysis (PCA) was performed to obtain preliminary information on the underlying dimensions of rater feeling, that is, the latent factors representing the items in the scale. Ratings on the 25 semantic differential scale items were assessed for suitability using SPSS 15.0. Evaluation of the correlation matrix indicated that the data were factorable, with a Kaiser-Meyer-Olkin index of .856, which is “meritorious” (Pett, Lackey, & Sullivan, 2003). Bartlett’s test of sphericity was significant (p < .001), indicating that the correlation matrix was not an identity matrix, and all measures of sampling adequacy were deemed appropriate for data analysis.
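
For readers who wish to see how the two factorability checks are computed, the sketch below implements the standard textbook formulas for Bartlett’s test statistic and the overall Kaiser-Meyer-Olkin index from a correlation matrix. It assumes NumPy is available; the function names and the small example matrix are hypothetical, and the study itself relied on SPSS output rather than code of this kind.

```python
import numpy as np


def bartlett_sphericity(R, n_obs):
    """Chi-square statistic and degrees of freedom for Bartlett's test.

    R is a p x p correlation matrix; n_obs is the number of observations.
    The p-value (not computed here) comes from a chi-square distribution
    with the returned degrees of freedom.
    """
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    stat = -(n_obs - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return stat, df


def kmo_index(R):
    """Overall KMO: squared correlations relative to squared correlations
    plus squared partial correlations, over off-diagonal entries."""
    R = np.asarray(R, dtype=float)
    S = np.linalg.inv(R)
    scale = np.sqrt(np.outer(np.diag(S), np.diag(S)))
    partial = -S / scale  # partial correlations (off-diagonal entries)
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)


# Hypothetical two-item correlation matrix, for illustration only.
R_example = [[1.0, 0.5], [0.5, 1.0]]
stat, df = bartlett_sphericity(R_example, n_obs=100)
kmo = kmo_index(R_example)
```

A KMO value closer to one and a significant Bartlett statistic both indicate that the correlation matrix is suitable for factoring.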

Factor extraction and rotation

The PCA was conducted with the oblimin rotation method. The choice of oblimin rotation is based on the assumption that items or factors of rater feeling are most likely correlated to some degree (cf. Pett et al., 2003). The number of factors to extract was determined according to two criteria: results of the scree plot and eigenvalues greater than one (Hayton, Allen, & Scarpello, 2004). Inspection of the scree plot seemed to suggest three or four factors, whereas the eigenvalues suggested five initial factors. The PCA was conducted again to force extraction of only four and three components respectively. Criteria that determined the acceptable number of factors included: (1) items load substantively (> .30) on only one factor, (2) items load at approximately zero (−0.10 to 0.10) on some other factor (Tabachnick & Fidell, 1989) and (3) interpretability. That is to say, the ultimate decision on the number of factors to extract was based on simple structure and the interpretative clarity of the loadings. As a result, the three-factor model attained satisfactory results and was selected. Next, each item was evaluated for possible removal so as to maximize the explained variance. Item communality, which measures the proportion of variance of a particular item that is explained by all the factors jointly, was used as a guideline for item deletion (Worthington & Whittaker, 2006). Any item with a communality of less than 0.50 was removed because it was not highly correlated with one or more of the factors in the solution (Costello & Osborne, 2005). Consequently, six items/pairs were removed, and PCA was carried out two more times on the remaining items/pairs, with six more items/pairs removed, until all item communalities were above .50.
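
The extraction criteria above can be illustrated with a short sketch that derives eigenvalues, the eigenvalue-greater-than-one count, and item communalities from a correlation matrix. It assumes NumPy, works on unrotated components (oblimin rotation is not implemented here), and uses a hypothetical function name and example matrix, so it should be read as an approximation of the logic rather than a reproduction of the SPSS analysis.

```python
import numpy as np


def pca_summary(R, n_components):
    """Summarize an unrotated PCA of a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(R)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]     # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Number of components with an eigenvalue greater than one.
    kaiser_count = int(np.sum(eigvals > 1.0))

    # Unrotated loadings for the retained components.
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])

    # Communality: variance of each item explained by retained components.
    communalities = np.sum(loadings ** 2, axis=1)

    # Percentage of total variance explained by each retained component.
    pct_variance = 100.0 * eigvals[:n_components] / R.shape[0]
    return eigvals, kaiser_count, communalities, pct_variance


# Hypothetical four-item correlation matrix with two correlated pairs.
R_example = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]
eigvals, kaiser_count, communalities, pct_variance = pca_summary(R_example, 2)

# Items falling below the .50 communality guideline would be flagged here.
low_items = [i for i, h2 in enumerate(communalities) if h2 < 0.50]
```

Items flagged in `low_items` would be candidates for removal before the analysis is re-run, mirroring the iterative deletion described in the text.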

The three factors extracted, each with an eigenvalue greater than one, accounted for 68.28% of the total variance in the items (see Table 2). Note that only positive adjectives in each pair were listed in the table. Factor one was generally characterized by voice quality, factor two by attractiveness of the speaker, and factor three by speaker’s competence. For ease of interpretation, two of the items in factor one (Intelligent and Knowledgeable), with loadings of .752 and .670, seem to fit better in factor three, which implies a speaker’s competency. Sure, with a loading of .612 on factor one, may be regrouped into factor one from factor three. The original and alternative proposed item distributions (see Table 3) facilitated constructing a CFA model in the next phase. Items in italics in model two are the alternative changes.

Table 2.

Results of principal component analysis.

	Factor loadings
	1	2	3
Clear	.853	.091	−.138
Good pronunciation	.840	−0.18	−.026
Intelligent	.752	.268	−.372
Fluent	.731	−0.34	−.575
Knowledgeable	.670	.312	−.519
Good natured	.191	.868	−.252
Kind	.011	.865	−.134
Considerate	.136	.865	−.321
Talkative	.214	.271	−.807
Quick	.105	.100	−.782
Sure	.612	.027	−.710
Experienced	.336	.430	−.710
Informative	.159	.504	−.668
Eigenvalues	4.886	2.406	1.585
% of variance accounted for	37.587	18.505	12.192
Cronbach’s Alpha	.839	.851	.798

Table 3.

Two models for confirmatory factor analysis.

Model one (current PCA results)
Factor	1	Clear, Good pronunciation, Fluent, Intelligent, Knowledgeable
	2	Good natured, Kind, Considerate
	3	Talkative, Quick, Sure, Experienced, Informative
Model two (alternative model based on interpretability)
Factor	1	Clear, Good pronunciation, Fluent, Sure
	2	Good natured, Kind, Considerate
	3	Talkative, Quick, Experienced, Informative, Intelligent, Knowledgeable

For the final 13 items on rater feeling, Cronbach’s Alpha was above .80 for the total scale and each factor demonstrated an acceptable degree of internal consistency. The final three-factor model accounted for 68.284% of the total variance.

Rater belief and rating tendency

Negatively worded items were reverse coded prior to the analysis so that higher scores indicated a more positive belief or rating tendency. Cronbach’s Alpha was calculated to examine internal consistency, that is, whether the scale items all measure the same underlying attributes. The reliability estimates for the variables ranged from .260 to .557, with a Cronbach’s Alpha of .609 for the overall measure. As a minimum Cronbach’s Alpha of .70 is recommended to demonstrate good internal consistency (de Vaus, 2002; George & Mallery, 2003), all items were re-examined to identify those that lowered reliability below the desirable value. The alpha-if-item-deleted values, along with the qualitative input provided by the raters, were examined and 13 problematic items across sections were removed to improve questionnaire clarity, resulting in 32 items remaining in the revised scale. Cronbach’s Alpha for the entire questionnaire improved to .738. As illustrated in Table 7, the Cronbach’s Alpha for each variable also improved, though only the Expectation of Indian English section satisfied the .70 cutoff value. The other three sections, along with their new Cronbach’s Alpha, were as follows: Rating Tendency (.597), Perceived Cultural Factor (.590), and Interpersonal History (.457).

Construction phase 3: verification

Sample and procedure

A fresh sample of 96 raters participated in phase three. They were ESL/EFL instructors at private or university ESL programs in New York City, San Francisco, and India, and resided in those areas at the time of the study. There were 13 Indian and 83 non-Indian participants, comprising 90% Caucasian, 8% Asian, and 2% African. Twenty-three of the raters had experience in rating large-scale English proficiency tests, and more than half (54%) had more than six years of ESL/EFL teaching experience, with 75% holding their highest degree in TESOL, 9% in education, and 10% in other areas.

The procedures were consistent with those in phase two, allowing raters to access all required materials, including the RAI, speech samples and consent form.

An email reminder was sent to respondents who had not yet completed the study two weeks after they received the link. The entire study took approximately one hour, and each respondent received $15 for participating.

Data analysis and results

Rater feeling

Descriptive statistics and internal consistency

The mean, standard deviation, skewness, and kurtosis were computed (see Table 4). Note that only positive adjectives in each pair were listed in the table. The mean score was 5.13 which provided some implications of raters’ positive feelings of the Indian speakers. The assumption of univariate normality was met as shown in the skewness and kurtosis indexes.

Table 4.

Means and standard deviations for feeling attributes.

Pair	Mean	Standard deviation	Skewness	Kurtosis
Clear	4.97	1.644	−.418	−.969
Experienced	4.96	1.485	−.468	−.488
Intelligent	5.23	1.373	−.448	−.490
Quick	4.87	1.233	−.233	−.302
Educated	5.30	1.271	−.281	−1.061
Kind	5.34	1.121	−.657	.020
Fluent	5.12	1.637	−.605	−.449
Good natured	5.54	1.114	−.490	.357
Considerate	5.38	1.178	−.309	.028
Talkative	4.94	1.365	−.647	−.012
Good pronunciation	4.70	1.629	−.503	−.908
Sure	5.05	1.513	−.402	−.356
Informative	5.29	1.428	−.298	−.540

Cronbach’s Alpha was .904 for the semantic differential scale, well above the recommended .70 cutoff for good internal consistency reliability (de Vaus, 2002; George & Mallery, 2003).

Confirmatory factor analysis (CFA)

To provide further evidence of the construct validity of the measure of rater feeling, CFA was conducted using AMOS 7.0. As the current set of data contained 36 missing values, imputation of the missing data was considered so that the modification indices (MI) could be used to improve the model fit (Byrne, 2001); the data must be complete, without missing values, for the MI to be examined. The expectation-maximization (EM) algorithm was used to impute the missing data, as recommended by Schafer and Graham (2002), to minimize bias when only a small amount of data is missing. As a result, the full sample of 576 observations (96 raters × 6 speakers) was retained for further CFA.

The two models derived from the EFA were evaluated by CFA. The three latent variables were the three factors identified by the previous EFA. The 13 observed variables were the items measuring rater feeling. According to the fit indices, neither model fit the data adequately. An examination of the squared multiple correlations, which indicate the variance accounted for in each of the 13 items, revealed that two items, Talkative and Quick, were problematic in each model because little of their variance was explained. These two items were then removed from each model and the CFAs were re-run. Table 5 provides a summary of CFA goodness-of-fit indices by analysis for the two models. The chi-square statistic was statistically significant for both models, which taken alone would suggest an inadequate fit of the models to the data (Model one: χ² = 325.900, df = 41, p = .000; Model two: χ² = 198.208, df = 41, p = .000). Other fit indices were therefore examined to determine the best model. As shown in Table 5, the fit indices for model two yielded the better model fit and met the cutoff criteria for acceptable levels.

Table 5.

Summary of CFA goodness-of-fit indices for the two a priori models.

Model	χ²	df	p	RMSEA	CFI	TLI
One (original)	325.900	41	.000	.110	.926	.901
Two (alternative)	198.208	41	.000	.082	.959	.945

Note: RMSEA = root mean-square error of approximation. CFI = Comparative Fit Index. TLI = Tucker-Lewis Index.
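
As a rough check on how the RMSEA values in Table 5 relate to the chi-square statistics, the sketch below applies the common single-group formula RMSEA = sqrt(max(χ² − df, 0) / (df × (N − 1))). The sample size of 576 (96 raters × 6 speakers) is an assumption made for this sketch, and the function name is hypothetical; AMOS may differ in computational details.

```python
import math


def rmsea(chi_square, df, n_obs):
    """Root mean-square error of approximation for a single-group model."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n_obs - 1)))


# Chi-square and df values taken from Table 5; N = 576 is an assumption.
N_ASSUMED = 576
rmsea_model_one = rmsea(325.900, 41, N_ASSUMED)
rmsea_model_two = rmsea(198.208, 41, N_ASSUMED)
```

Under that assumed sample size, the rounded results agree with the RMSEA values of .110 and .082 reported in Table 5.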

The Comparative Fit Index (CFI; Bentler, 1990) value was 0.959, the TLI was 0.945, and the RMSEA of 0.082 was within the recommended range of model fit (Byrne, 2001). The chi-square difference between the two models is 127.692, indicating a significant improvement (p < .001) in model fit. Thus, the results suggest model two better fits the data.

Factor one, labeled “Speech Competency”, contained four items relevant to speech performance. Factor two was composed of three qualities related to kindness that reflected the speakers’ attractiveness for listeners, and was labeled “Kind-heartedness”. Factor three included four items related to a speaker’s intelligence, as perceived by listeners, and was thus labeled “Intelligence”.

The three factors were correlated to varying degrees. The highest correlation was between “Speech Competency” and “Intelligence” (r = .899), followed by a moderate correlation between “Intelligence” and “Kind-heartedness” (r = .481), while the lowest was between “Speech Competency” and “Kind-heartedness” (r = .284). The factor loadings were moderately high to high, ranging from .624 to .937. The largest and smallest coefficients (i.e., Considerate and Kind) both belonged to indicators of a speaker’s kind-heartedness.

Reliability estimate

Based on the results of the CFA, reliability analyses were run on the final 11 items. Cronbach’s Alpha for the reconstructed 11-item instrument was 0.897 versus 0.904 for the original 13 items, which is considered satisfactory.

Rater belief and rating tendency

Descriptive statistics and test of normality

Table 6 presents raters’ responses to the items. The un-ratable option is not shown here, as it constitutes a small portion of responses to each question. Eight missing data points were detected and replaced by mean substitution. The normality assumption was inspected using skewness and kurtosis indices. As before, the acceptable range for normality is an absolute skewness index lower than three and an absolute kurtosis index lower than 10 (Kline, 2005).
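The screening steps described above can be illustrated with a minimal sketch. It assumes simple moment-based estimators of skewness and excess kurtosis, which may differ slightly from the bias-corrected indices reported by statistical packages; the function names and the example responses are hypothetical and are not taken from the study.

```python
from statistics import mean


def mean_substitute(values):
    """Replace missing responses (None) with the mean of the observed responses."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def _central_moment(values, order):
    """Central moment of the given order (divides by n, not n - 1)."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (assumes the item is not constant)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (assumes the item is not constant)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3


def within_kline_range(values, max_skew=3.0, max_kurtosis=10.0):
    """True if |skewness| < 3 and |kurtosis| < 10, the range cited from Kline (2005)."""
    return abs(skewness(values)) < max_skew and abs(excess_kurtosis(values)) < max_kurtosis


# Hypothetical responses to one five-point item; None marks a missing answer.
item = [4, 5, None, 3, 4, 2, 5, 4]
filled = mean_substitute(item)
print(within_kline_range(filled))
```

Mean substitution keeps the item mean unchanged but shrinks its variance, which is one reason it is usually reserved for small amounts of missing data such as the eight points reported here.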

Table 6.

Raters’ responses to Likert scale items (%).

Belief: Expectation of Indian English		SD	GD	N	GA	SA
1	I have no problem understanding Indian speakers in non-test situations.	2.1	33.0	16.5	39.2	8.2
2	Indian English is a steady variety that has its own linguistic features.	1.0	5.2	9.3	48.5	35.1
3	I have experience in rating Indian examinees.	37.1	12.4	13.4	10.3	25.8
4	Indian speakers may be treated as native speakers of English nowadays.	10.3	21.6	24.7	34.0	8.2
5	Indian speakers should not be exempted from English proficiency tests.	4.1	5.2	21.6	36.1	29.9
6	I need to make more effort to understand Indian examinees.	6.2	22.7	25.8	29.9	12.4
Belief: Perceived cultural factors
1	Standard English (e.g., American English) should be used to judge examinees’ performance in the test setting.	4.2	12.5	27.1	41.7	11.5
2	Varieties of English are not appropriate to use in cross-cultural communication.	32.3	37.5	15.6	9.4	4.2
3	Native speakers of English do not best serve as raters of oral English test (e.g., TOEFL, IELTS).	31.3	37.5	21.9	6.3	3.1
4	Varieties of English are not appropriate in everyday communication.	56.3	27.0	7.3	7.3	2.1
6	Language learners should develop an awareness of the global spread of English.	0.0	1.0	8.3	35.4	55.2
7	Unless varieties of English are promoted via educational efforts, such as by being codified in the dictionary, they can’t obtain legal status and become standard.	5.2	13.5	35.4	32.3	12.5
8	Language learners should be exposed to different varieties of English.	1.0	8.3	9.4	37.5	42.7
9	Native speakers of English do not best serve as English language teachers.	34.4	39.6	16.7	9.4	0.0
10	Speakers of non-standard varieties (i.e., not British or American English) currently outnumber native speakers of standard English.	1.0	10.4	27.1	33.3	27.1
12	Raters of speaking tests (e.g., TOEFL, IELTS) should have opportunities to be exposed to varieties of English during training.	3.1	2.1	4.2	27.1	62.5
13	Raters of speaking tests (e.g., TOEFL, IELTS) should develop an awareness of the global spread of English.	0.0	1.0	4.2	26.0	68.7
Rating Tendency		SD	GD	N	GA	SA
1	The differences between standard English and varieties of English are creative and as correct as standard English.	7.23	27.1	18.8	32.3	12.5
2	Examinees do not need to speak like a native speaker in order for me to assign high scores.	6.2	10.3	9.3	45.4	25.8
3	I do not grade down examinees that speak a variety, as long as they express themselves well.	0.0	2.1	21.9	50.0	25.0
4	I do not penalize examinees who use negotiation strategies (e.g., asking for clarification, rephrasing).	2.1	13.5	41.7	39.6	3.1
5	When examinees use less familiar expressions, it suggests that they have not fully mastered English yet.	16.7	40.6	15.6	19.8	5.2
6	The rater is not responsible for examinees’ intelligibility.	4.2	16.7	31.3	30.2	16.7
7	I give high scores to examinees that use expressions as used by the native speakers of English.	2.1	5.2	25.0	45.8	20.8
Rating tendency: Interpersonal history
1	I am comfortable listening to varieties of English.	1.0	2.1	5.2	40.6	51.0
2	I can’t communicate well with people who speak a variety different from mine.	51.0	41.6	2.1	3.1	2.1
3	The use of varieties can cause cross-cultural misunderstandings.	3.1	12.5	15.6	50.0	18.8
4	English has evolved into different steady varieties	0.0	5.2	7.3	44.8	40.6
5	Features of varieties are developed in the same way as American English developed from British English.	9.4	7.3	15.6	37.5	26.0

Note: SD = strongly disagree, GD = generally disagree, N = neutral, GA = generally agree, SA = strongly agree.

Internal consistency

Table 7 shows the reliability estimate for the total Likert scale and each subscale. Cronbach’s Alpha of .602 for the total scale shows somewhat acceptable internal consistency of the items (de Vaus, 2002; George & Mallery, 2003). The reliability estimate for each subscale in the current phase is generally lower than in the exploratory phase, except for the last subscale, Interpersonal History, which improved from .457 to the current .518. Cronbach’s Alpha values for the other subscales are as follows: Expectation of Indian English (.474), Perceived Cultural Factor (.383), and Rating Tendency (.361). The most likely reason for the low reliability is the small number of items in each section. Another potential reason is the wide range of item difficulty (Symonds, 1928). In the current study, this could be explained by the fact that raters’ beliefs about WE and their rating tendencies differed greatly from one another, which led to low reliability. To improve internal consistency, the “alpha if item deleted” statistic was checked, which suggested that removing item 24 would improve the alpha to .628 for the total scale. Thus, this item was discarded.

Table 7.

Reliability coefficients.

	Variables	Cronbach’s Alpha in second phase	Cronbach’s Alpha in third phase
Belief	Expectation of Indian English	.726	.474
	Perceived cultural factor	.590	.383
Rating tendency	Rating tendency	.597	.361
	Interpersonal history	.457	.518
	Overall	.738	.602
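The internal-consistency check described in this section can be sketched as follows. This is an illustrative implementation of the standard Cronbach’s Alpha formula and of the “alpha if item deleted” statistic, not the software procedure used in the study; the response matrix is hypothetical, and the sketch assumes complete data with sample variances.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's Alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("Cronbach's Alpha needs at least two items")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn (one value per item)."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != i] for row in rows])
        for i in range(k)
    ]


# Hypothetical five-point responses: five raters (rows) by four items (columns).
responses = [
    [4, 5, 4, 2],
    [2, 2, 3, 5],
    [5, 4, 5, 1],
    [3, 3, 3, 4],
    [1, 2, 1, 3],
]
print(round(cronbach_alpha(responses), 3))
print([round(a, 3) for a in alpha_if_item_deleted(responses)])
```

An item whose removal raises alpha above the full-scale value is a candidate for deletion, which is the logic behind discarding item 24. In the hypothetical data, the fourth item runs against the other three, so leaving it out gives the largest alpha.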

Establishing a confirmatory factor model for RAI

Up to this phase, the construction and data analysis for the RAI were conducted in separate measures representing different dimensions of the attitude construct. Attempts were made to bring the conceptualized three-dimensional structure of the RAI together and evaluate whether a confirmatory factor model could be established. To this end, the minimum number of observations required to perform CFA needed to be satisfied (Byrne, 2001). The two measures that constitute the RAI yield scores on scales with different numbers of points, which need to be standardized so that observations can be compared on a like basis. As such, summated subscale scores were used, where each subscale was treated as an indicator (e.g., perceived cultural factor) and the three conceptualized components of the attitude construct were treated as latent factors (e.g., belief). This resulted in seven indicators for 96 observations, which is close to the minimum sample size requirement for running factor analysis (MacCallum, Widaman, Zhang, & Hong, 1999). The conceptualized RAI model can be visualized in Figure 1. Next, all subscale scores were standardized by dividing them by their respective perfect scores and multiplying by 100 to yield proportional scores. Each of the three conceptualized attitude components was then weighted equally, contributing one-third of the total RAI score. The following equation shows how each rater’s RAI score was calculated:

RAI composite score = (SC + I + KH) * 1 / 3 + (EIE + PCF) * 1 / 3 + (RT + IH) * 1 / 3
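As a worked illustration of the scoring rule above, the following sketch converts raw subscale scores to proportional scores and applies the equal component weights. The raw and perfect scores are hypothetical placeholders, since the maximum score of each subscale is not restated here, and the function names are introduced only for this example.

```python
def proportional_score(raw, perfect):
    """Standardize a summated subscale score: raw / perfect score * 100."""
    return raw / perfect * 100


def rai_composite(components):
    """Apply the composite equation literally.

    `components` holds one list of proportional subscale scores per attitude
    component; each component's subscale scores are summed and multiplied by
    an equal weight (1/3 when there are three components).
    """
    weight = 1 / len(components)
    return sum(weight * sum(subscales) for subscales in components)


# Hypothetical (raw, perfect) score pairs for one rater.
feeling = [proportional_score(20, 28), proportional_score(21, 28), proportional_score(15, 21)]  # SC, I, KH
belief = [proportional_score(18, 30), proportional_score(40, 55)]  # EIE, PCF
tendency = [proportional_score(21, 35), proportional_score(18, 25)]  # RT, IH

print(round(rai_composite([feeling, belief, tendency]), 1))
```

Because the equation sums the subscale scores within each component before weighting, a component with three subscales can contribute more points than one with two. Averaging within components would be an alternative design, but it is not what the equation states.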

Figure 1.

A hypothetical three-factor model of the RAI.

CFA was conducted using AMOS 6.0. As shown in Table 8, all the indices indicate that the model did not fit the data well (see recommendations in Hu & Bentler, 1999). Inspection of the modification indices, standardized residuals, and factor loadings indicated that a better fit could be obtained by allowing the error terms of two indicators (i.e., expectation of Indian English and interpersonal history) to correlate. With this modification the fit indices improved significantly (χ² = 16.559, p = .167, RMSEA = 0.063, CFI = 0.968, TLI = 0.944), so that the model fit the data better.

Table 8.

Goodness-of-fit indices for one-, two-, and three-factor measurement models before vs. after modification.

n-factor model	Modification	χ²	df	p	RMSEA	CFI	TLI
Three factor	Before	22.311	13	.000	.087	.934	.894
	After	16.559	12	.167	.063	.968	.944
Two factor	Before	26.965	14	.019	.099	.916	.874
	After	20.052	13	.094	.076	.954	.926
One factor	Before	27.184	14	.018	.101	.904	.855
	After	17.848	13	.163	.063	.966	.945

Note: RMSEA = Root mean-square error of approximation. CFI = Comparative Fit Index. TLI = Tucker-Lewis Index.

However, the correlation between belief and rating tendency was strong (r = .925), suggesting that these two factors may actually be represented by a single factor. As such, the two-factor model was tested, placing “rater belief” and “rating tendency” together as one latent factor. Raters’ RAI composite scores were re-calculated by weighting each latent factor equally and applying the equation below:

RAI composite scores = (SC + KH + LC) * 1 / 2 + (EIE + PCF + RT + IH) * 1 / 2

The initial fit statistics for the two-factor model did not meet the standards of a well-fitting model. A modest, though not significant, improvement in the fit indices was obtained by allowing the error terms of two indicators (i.e., expectation of Indian English and interpersonal history) to correlate (χ² = 20.052, p = .094, RMSEA = 0.076, CFI = 0.954, TLI = 0.926). The factor loadings of the indicators on the latent factor of feeling were strong, ranging from .750 (kind-heartedness) to .932 (level of confidence). Other factor loadings were either low or moderate, with the last indicator, interpersonal history, loading negatively on its latent factor. The correlation between the two latent factors was .164.

Before determining the best-fitting model, the feasibility of a one-factor measurement model was examined, because the attitude research literature also proposes a one-dimensional attitude construct. The initial fit statistics for the one-factor model did not indicate a good fit. An acceptable fit was obtained after model modification (χ² = 17.848, p = .163; CFI = .966; TLI = .945; RMSEA = .063). After modification, the factor loadings of the indicators ranged from −0.197 (expectation of Indian English) to 0.922 (level of confidence). The two negative factor loadings add to the difficulty of interpreting the relationship between the latent factor and its indicators.

To determine which model best fit the data, χ² difference tests were conducted between pairs of models. These yielded p values of .062 and above, suggesting that none of the models provides a significantly better fit to the data. Thus, the selection of the best model had to be determined according to interpretability. The three-factor measurement model cast the conceptualized tripartite attitude construct as a confirmatory factor model. All indicators had moderate to strong factor loadings on their respective primary factor. The correlation between two of the factors, belief and rating tendency, however, was very high, suggesting factor overlap and indicating that the items in these two factors may need revision to avoid duplication. The one-factor model is in line with the literature on a unified attitude construct. Nevertheless, the CFA results show that some indicators (i.e., Expectation of Indian English and Perceived Cultural Factor) loaded negatively on the latent factor, which may cause interpretation difficulty. The two-factor model supports the multi-dimensional attitude construct while avoiding the factor redundancy seen in the three-factor model. Thus, in comparing the three models, the two-factor measurement model appears to best represent the constructs of rater attitude towards WE for the current data set (see Figure 2).
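The χ² difference test described above can be reproduced from the values in Table 8 with a short sketch. It uses only the standard library and a closed-form recurrence for the chi-square upper-tail probability with integer degrees of freedom; the function names are hypothetical, and the sketch is not the AMOS procedure used in the study.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x non-negative")
    t = x / 2
    # Start from the closed form for df = 1 (odd) or df = 2 (even) ...
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(t))
    else:
        a, q = 1.0, math.exp(-t)
    # ... then step up with Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1).
    while a < df / 2:
        q += t ** a * math.exp(-t) / math.gamma(a + 1)
        a += 1
    return q


def chi_square_difference(chi2_a, df_a, chi2_b, df_b):
    """Return (delta chi-square, delta df, p) for two nested models."""
    delta_df = abs(df_a - df_b)
    if delta_df == 0:
        raise ValueError("models with equal df cannot be compared with this test")
    delta_chi2 = abs(chi2_a - chi2_b)
    return delta_chi2, delta_df, chi_square_sf(delta_chi2, delta_df)


# Three-factor vs. two-factor model after modification (values from Table 8).
delta, ddf, p = chi_square_difference(16.559, 12, 20.052, 13)
print(round(delta, 3), ddf, round(p, 3))
```

For this pair the difference is 3.493 on one degree of freedom, which gives a p value of roughly .06, in line with the lower bound reported in the text. The test is only defined when the two models differ in degrees of freedom, so the sketch raises an error for pairs with equal df, such as the modified two-factor and one-factor models.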

Figure 2.

Measurement structure of the RAI.

Discussion and implications

The RAI is an empirically and theoretically grounded means of self-measuring a rater’s attitude towards Indian English and WE in general. Evidence from psychometric analysis of the data supports a two-factor rater attitude measurement model that includes the two dimensions of rater feeling and rater belief. As such, the objective of identifying and establishing a psychometrically driven structure of a rater attitude model within the language assessment context was achieved. The RAI represents a new initiative to examine raters’ psychological traits as a source of validity evidence in English speaking tests to strengthen arguments about Indian test-takers’ English language proficiency. Furthermore, the RAI appears to hold potential research and diagnostic utility for testing agencies and professionals interested in monitoring and evaluating attitudes held by raters of oral tests. By comparing raters’ RAI scores with the actual rating scores they award to Indian test-takers, it may also help uncover whether the attitude–behavior tendency relationship attested in the language attitude literature holds true in the testing situation. The findings may be used to guide the design of rater training programs.

The RAI results reveal the complexity of the relevance of WE to English oral tests. Although a general pattern emerging from the data showed raters endorsing linguistic diversity, raters’ preferences on norm selection, one of the controversial issues in the WE and LT literature, do not reveal a clear pattern (i.e., Belief: PCF1 of the RAI). Slightly more than half of the raters (53.2%) indicate a preference for standard English in judging test-takers’ spoken English, and more than a quarter of the respondents (27.1%) express uncertainty about this selection. Raters’ preference skewed towards standard English resonates with recent literature about test-takers’ attitudes towards WE (Abeywickrama, 2013; Gu & So, 2014; Hamid, 2014). On the other hand, a large majority of raters (71.2%) indicate that their rating decisions are not based on the extent to which test-takers speak like a native speaker (i.e., Rating Tendency 2 of the RAI), and 75% do not grade down test-takers who speak varieties, provided they express themselves well. This contrast resembles the behavioral paradox in language revitalization, which arises from a mismatch between ideology and commitment in practice (Eggington, 2010). That is, one may feel obliged to do something (i.e., prefer standard English) but do the opposite in practice (i.e., give a rating that does not follow native speaker norms).

Similar to test-takers’ concerns about being disadvantaged through unfamiliarity if non-standard varieties are used in the test, raters’ preference for standard English may be attributed to their unfamiliarity with varieties, which leads to uncertainty in scoring decisions. In monologue tasks, allowing varieties gives raters no opportunity to clarify questionable responses. Tasks that require interlocutor and test-taker interaction may involve more active participation and greater listener effort in negotiating intended meanings (Canagarajah, 2006; Elder & Davies, 2006; Jenkins, 2006). The extent to which a fair scoring result can be achieved therefore depends on interlocutors’ willingness to engage in meaningful interaction and on their interviewing styles, which may affect test-takers’ performances and scores. Raters’ preferences for standard English may be explained by their concern for fairness in scoring, which is less about dynamic negotiation and more about the discrepancy between the native speaker model and test-taker performance. Another factor put forth by Elder and Harding (2008) concerning the appeal of standard English among stakeholders is the widely held perception that standard English is more prestigious for teaching and testing purposes.

In contrast, raters’ tendencies as revealed in the RAI indicate that test-takers’ successful communication and task fulfillment, regardless of variety, seem to be an underlying factor in scoring decisions. This is consistent with ELF’s proposition that one needs negotiation skills “to shuttle between different varieties of English and different speech communities” (Canagarajah, 2006, p. 233). Some ELF literature on simulated conversational or telephone data (House, 2003) has documented that speakers resort to communication and negotiation strategies such as topic changes or the “let-it-pass principle”, on the assumption that misunderstanding does not impede the ongoing conversation. Further research may explore whether these strategies are applicable to raters in the testing situation and, if so, what linguistic and non-linguistic features are “passed” and what features are decisive in scoring decisions. This will help testing communities understand how much varietal features affect oral test scores, how English oral proficiency should be defined in response to the TLU domain in a new geopolitical context, and whether the norm selection debate remains relevant in English oral tests when test-takers’ communication strategies are endorsed by raters.

In terms of the RAI structure, further item modification on the five-point Likert scale measuring rater belief may be needed to increase validity of the RAI. Following the common practice of scale construction, more items than currently remain could be deleted owing to low alpha. Nevertheless, being among the first instruments to measure rater attitude towards Indian English and WE in general in language assessments, the RAI aims for comprehensiveness of concerns addressed by raters, testing professionals and the literature. Newton and Asimakopoulou (2009) argue that “measures of internal consistency give information on reliability, not validity” (p. 580). As such, items that covered the different dimensions of rater attitude constructs were not sacrificed as a result of internal consistency checks. On the contrary, the low alpha in the sections provides valuable indications of rater uncertainty about the questions, and shows the need for further study to unveil related issues.

The two-factor rater attitude measurement model has valuable implications. First, the dimension of rater feeling presents a mix of findings, some consistent with and some contradicting previous language attitude research. Factor two, “Kind-Heartedness”, is in many ways similar to factors in other language attitude studies (Carranza & Ryan, 1975; Ryan, Carranza, & Moffie, 1977), with overlapping items. It suggests raters’ concern for the attractiveness of speakers, leading to a broader dimension of evaluating the quality of speech. The third factor, “Intelligence”, includes items from Zahn and Hopper’s (1985) superiority factor, consists of elements such as “educated” and “experienced”, and displays rater perception of a speaker’s social status and intellectual achievement. This factor was highly correlated (r = .899) with the first factor, “Speech Competency”, suggesting that it is aligned with raters’ feelings towards the level of education of test-takers.

Some results contrasted with previous language attitude research. The traits in factor one, “Speech Competency”, and factor three, “Intelligence”, which are labeled “Superiority” in Zahn and Hopper (1985), were linked to the standard variety in that study. Nevertheless, these two factors were rated favorably in the current study when Indian English was used as a stimulus. These contrasting findings may be attributed to two reasons. First, as opposed to the predominant use of undergraduate students in language attitude studies, this study involved raters who were ESL/EFL teachers and who may have greater awareness of WE. Second, most raters were located in metropolitan areas with greater exposure to diverse language learners. Both may lead to more accepting views and feelings towards test-takers of Indian English. To consolidate the current findings, contrasting outer- and expanding-circle varieties may be used simultaneously to compare responses to traits related to standard and “non-standard” varieties, as claimed by previous attitude studies.

Limitations and implications for future research

The choice of stimulus is among the limitations of this study. As only Indian English was used, the results need to be interpreted with caution, as they may not apply to attitudes towards test-takers of other WE varieties, although items relevant to Indian English constitute a small portion of the RAI. Extending the current study using alternative stimuli, such as single outer- or expanding-circle varieties or a combination, would also provide insights into the generalizability of these findings. Another limitation is the role of descriptive tasks as an elicitation stimulus of rater attitude and rating performance. Given that task types may affect test scores (Chalhoub-Deville, 1995; Iwashita, McNamara, & Elder, 2001; Wigglesworth, 2001), further research may focus on interaction-oriented speech tasks or a combination of different task types to seek comparable results and broaden understanding of how test-takers’ communication strategies are perceived in relation to raters’ attitude towards WE and, most importantly, to what extent that attitude matters in the scores awarded. This will also clarify whether the interaction strategies highlighted in cross-cultural communication by WE researchers are indeed relevant in the testing situation.

Footnotes

Acknowledgements

This paper is based on my PhD dissertation, completed at the University of Illinois at Urbana-Champaign. I would like to thank my dissertation advisor, Fred Davidson, for his guidance and support throughout the study. I would also like to thank Nick Saville, Gad Lim, April Ginther and Berlitz International Incorporated, who helped facilitate the data collection process, and the two anonymous reviewers for LT, who made insightful and useful comments.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by TOEFL small grants for doctoral research in second or foreign language assessment, Educational Testing Service, in 2010.

References

Abeywickrama

(2013). Why not non-native varieties of English as listening comprehension test input? RELC Journal, 44, 59–74.

Ajzen

Fishbein

(1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice Hall.

Ajzen

Timko

(1986). Correspondence between health attitudes and behavior. Basic and Applied Social Psychology, 7(4), 259–276.

Albarracin

Johnson

B. T.

Fishbein

Muellerleile

P.A.

(2001). Theories of reasoned action and planned behavior as models of condom use: A meta-analysis. Psychology Bulletin, 127, 142–161.

Albarracin

Johnson

B. T.

Zanna

M. P.

(Eds.) (2005). Handbook of attitudes. Mahwah, NJ: Lawrence Erlbaum.

Barona

D.B.

(2008). Native and non-native speakers’ perceptions of non-native accents. Language and Literature Journal, 3(2). Retrieved March 20, 2011 from http://ojs.gc.cuny.edu/index.php/lljournal/article/viewArticle/430/428.

Bentler

P.M.

(1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–46.

Berns

(2008). World Englishes, English as a lingua franca, and intelligibility. World Englishes, 27(3/4), 327–334.

Bresnahan

M. J.

Ohashi

Nebashi

Liu

W. Y.

Shearman

S. M.

(2002). Attitudinal and affective response toward accented English. Language and Communication, 22,171–185.

10.

Brown

(1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–15.

11.

Brown

(2000). An investigation of rater’s orientation in awarding scores in the IELTS interview. In Tulloch

(Ed.), IELTS research reports, 3 (pp. 1–19).

12.

Brown

Iwashita

McNamara

(2005). An examination of rater orientations and test taker performance on English for academic purposes speaking tasks. (Monograph Series MS-29). Princeton, NJ: Educational Testing Service.

13.

Burgoon

Miller

G. R.

(1985). An expectancy interpretation of language and persuasion. In Giles

St. Clair

R. N.

(Eds.), Recent advances in language communication and social psychology (pp. 199–229). London: Lawrence Erlbaum.

14.

Byrne

B. M.

(2001). Structural equation modeling with AMOS. Mahwah, NJ: Lawrence Erlbaum.

15.

Canagarajah

(2006). Negotiating the local in English as a lingua franca. Annual Review of Applied Linguistics, 26, 197–218.

16.

Cargile

A. C.

Giles

Ryan

E.B.

Y Bradac

J.J.

(1994). Language attitudes as a social process: A conceptual model and new directions. Language & Communication, 14, 211–236.

17.

Carranza

Ryan

(1975). Evaluation reactions of bilingual Anglo and Mexican-American adolescents toward speakers of English and Spanish. International Journal of the Sociology of Language, 6, 83–104.

18.

Chalhoub-Deville

(1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12(1), 16–33.

19.

Chalhoub-Deville

Wigglesworth

(2005). Rater judgment and English language speaking proficiency. World Englishes, 24(3), 383–391.

20.

Cluver

A.D.

(2000). Changing language attitudes: The stigmatization of Khoekhoegowap in Namibia. Language Problems and Language Planning, 24(10), 77–100.

21.

Costello

Osborne

(2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research and Evaluation, 10(7), 1–9.

22.

Davidson

(1993). Testing English across cultures: Summary and comments. World Englishes, 13(1), 113–115.

23.

Davidson

(2006). World Englishes and test construction. In Kachru

Kachru

Nelson

(Eds.), The handbook of world Englishes (pp. 709–717). Malden, MA: Blackwell.

24.

Davies

(2003). The native speaker: Myth and reality. Clevedon: Multilingual Matters.

25.

Davies

Hamp-Lyons

Kemp

(2003). Whose norms? International proficiency tests in English. World Englishes, 22(4), 571–584.

26.

Derwing

Munro

(2005). Second language accent and pronunciation teaching: A research-based approach. TESOL Quarterly, 39(3), 379–397.

27.

Deterding

Kirkpatrick

(2006). Emerging South-East Asian Englishes and intelligibility. World Englishes, 25(3), 391–409.

28.

De Vaus

(2002). Analyzing social science data: 50 key problems in data analysis. Los Angeles, CA: SAGE Publications.

29.

DeVellis

R. F.

(2003). Scale development: Theory and applications. Thousand Oaks, CA: SAGE Publications.

30.

Eggington

(2010). Towards accommodating the “tragedy of the commons” effect in language policy development. Current Issues in Language Planning, 11(4), 360–370.

31.

Elder

Davies

(2006). Assessing English as a lingua franca. Annual Review of Applied Linguistics, 26, 282–301.

32.

Elder

Harding

(2011). Language testing and English as an international language: Constraints and contributions. Australian Review of Applied Linguistics, 34(1–34), 11.

33.

Fazio

R.H.

Powell

M.C.

Williams

C.J.

(1989). The role of attitude accessibility in the attitude-to-behavior process. Journal of Consumer Research, 16(3), 280–288.

34.

Field

(2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39(3), 399–423.

35.

Firth

Wagner

(1997). On discourse, communication, and (some) fundamental concepts in SLA research. Modern Language Journal, 81, 285–300.

36.

Friederici

A. D.

Kotz

S. A.

Scott

S. K.

Obleser

(2010). Disentangling Syntax and Intelligibility in Auditory Language Comprehension. Human Brain Mapping 31, 448–457.

37.

George

Mallery

(2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston, MA: Allyn & Bacon.

38.

Giles

Billings

A. C.

(2004). Assessing language attitudes: Speaker evaluation studies. In Davies

Elder

(Eds.), The handbook of Applied Linguistics (pp. 187–209). Malden, MA: Blackwell.

39.

(2014). Voices from stakeholders: What makes an academic English test “international”? Journal of English for Academic Purposes, 18, 9–24.

40.

Hamid

M. O.

(2014). World Englishes in international proficiency tests. World Englishes, 33(2), 263–277.

41.

Harding

(2008). The use of speakers with L2 accents in academic English listening assessment: A validation study. Unpublished doctoral dissertation. The University of Melbourne, Australia.

42.

Hayton

J. C.

Allen

D. G.

Scarpello

(2004). Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7, 191–205.

43.

House

(1999). Misunderstanding in intercultural communication: Interactions in English as a lingua franca and the myth of mutual intelligibility. In Gnutzmann

(Ed.), Teaching and learning English as a global language (pp.73–89). Tubingen: Stauffenburg.

44.

House

(2002). Pragmatic competence in lingua franca English. In Knapp

Meierkord

(Eds.), Lingua franca communication (pp. 245–267). Frankfurt: Peter Lang.

45.

House

(2003). English as a lingua franca: A threat to multilingualism. Journal of Sociolinguistics, 7(4), 624–630.

46.

Hrubes

Ajzen

Daigle

(2001). Predicting hunting intentions and behavior: An application of the theory of planned behavior. Leisure Sciences, 23(3), 165–178.

47.

Hsu

H.L.

(2012). The impact of World Englishes on language assessment: Perception, rating behavior and challenges. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign, Urbana, IL.

48.

Bentler

P.M.

(1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation modeling, 6, 1–55.

49.

Iwashita

McNamara

Elder

(2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design. Language Learning, 21, 401–36.

50.

Jenkins

(2006). Current perspectives on teaching world Englishes and English as a lingua franca. TESOL Quarterly, 40(1), 157–181.

51.

Kachru

(1985). Standards, codification and sociolinguistic realism: The English language in the outer circle. In Quirk

Widdowson

H.G.

(Eds.), English in the world: Teaching and learning the language and literatures (pp. 11–30). Cambridge: Cambridge University Press.

52.

Kachru

Nelson

(2006). The handbook of world Englishes. Oxford: Blackwell.

53.

Kang

(2008). Ratings of L2 oral performance in English: Relative impact of rater characteristics and acoustic measures of accentedness. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6, 181–205.

54.

Kim

H.J.

(2005). World Englishes and language testing: The influence of rater variability in the assessment process of English oral proficiency. Unpublished doctoral dissertation. University of Iowa, Iowa city, IA.

55.

Kline

R. B.

(2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford Press.

56.

Kortmann

Schneider

E. W.

(Eds.). (2005). A Handbook of varieties of English: A multi-media reference tool. Berlin and New York: Walter de Gruyter.

57.

Kunnan

A. J.

(2010). Test fairness and Toulmin’s argument structure. Language Testing, 27(2), 183–189.

58.

Levis

(2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39, 369–377.

59.

Lindemann

(2002). Listening with an attitude: A model of native-speaker comprehension of non-native speakers in the United States of America. Language in Society, 31, 419–441.

60.

Lindemann

(2003). Koreans, Chinese or Indians? Attitudes and ideologies about non-native English speakers in the United States. Journal of Sociolinguistics, 7(3), 348–364.

61.

Lowenberg

P.H.

(2002). Assessing English proficiency in the expanding circle. World Englishes, 21(3), 431–435.

62.

MacCallum

R. C.

Widaman

K. F.

Zhang

Hong

(1999). Sample size in factor analysis. Psychological Methods, 4, 84–99.

63.

McArthur

(2003). The Oxford guide to world Englishes. Oxford: Oxford University Press.

64.

McKenzie

R. M.

(2008). Social factors and non-native attitudes towards varieties of spoken English: A Japanese case study. International Journal of Applied Linguistics, 18(1), 63–88.

65.

Meierkord

(2000). Interpreting successful lingua franca interaction: An analysis of non-native-/non-native small talk conversations in English [Electronic version]. Linguistik online, 5(1/00). Retrieved July 20, 2010, from www.linguistik-online.com/1_00/index.html.

66.

Mueller

D.J.

(1986). Measuring social attitudes: A handbook for researchers and practitioners. New York: Teachers College Press.

67.

Newton

Asimakopoulou

(2009). Health status measurement, reliability and internal consistency. In Kattan

(Ed.), Encyclopedia of medical decision making (pp. 579–582). Thousand Oaks, CA: SAGE Publications.

68.

O’Loughlin

(2002). The impact of gender in oral proficiency testing. Language Testing, 19(2), 169–192.

69.

Orr

(2002). The FCE Speaking test: Using rater reports to help interpret test scores. System, 30(2), 143–154.

70.

Pett

Lackey

N.R.

Sullivan

J.J.

(2003). Making sense of factor analysis: The use of factor analysis for instrument development in health care research. London: SAGE Publications.

71.

Rubin

D. L.

(1992). Nonlanguage factors affecting undergraduates’ judgments of nonnative English-speaking teaching assistants. Research in Higher Education, 33, 511–531.

72.

Ryan

E. B.

Carranza

M.A.

Moffie

R.W.

(1977). Reactions toward varying degrees of accentedness in the speech of Spanish–English bilinguals. Language and Speech, 20(3), 24–26.

73.

Schafer

J.L.

Graham

J.W.

(2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.

74.

Seidlhofer

(2004). Research perspectives on teaching English as a lingua franca. Annual Review of Applied Linguistics, 24, 209–239.

75.

Spolsky

(1993). Testing across cultures: An historical perspective. World Englishes, 12(1), 87–93.

76.

Symonds

(1928). Factors influencing test reliability. The Journal of Educational Psychology, 19(2), 73–87.

77.

Tabachnick

B.G.

Fidell

L.S.

(1989). Using multivariate statistics (2nd ed.). New York: HarperCollins.

78.

Taylor

(2006). The changing landscape of English: Implications for language assessment. ELT Journal, 60, 51–60.

79.

Widdowson

H.G.

(1994). The ownership of English. TESOL Quarterly, 28(2), 377–389.

80.

Wigglesworth

(2001). Influences on performance in task-based oral assessment. In Bygate

Skehan

Swain

(Eds.), Researching pedagogic tasks: Second language learning, teaching and testing (pp. 186–209). Harlow, UK: Longman.

81.

Worthington

R. L.

Whittaker

T.A.

(2006). Scale development research: A content analysis and recommendations for best practices. The Counseling Psychologist, 34, 806–838.

82.

Zahn

C. J.

Hopper

(1985). Measuring language attitudes: The speech evaluation instrument. Journal of Language and Social Psychology, 4(2), 113–123.

83.

Zhang

Elder

(2011). Judgments of oral proficiency by non-native and native English speaking teacher raters: Competing or complementary constructs? Language Testing, 28, 31–50.