Addressing Bias in Large Language Models and Its Impact on Non-Native English Speakers

TL;DR
Generative AI models, such as ChatGPT, have an English language bias due to the disproportionate prevalence of English content on the internet. This bias can negatively impact non-native English speakers, limit cultural diversity and innovation, and potentially lead to misinformation. Addressing this bias requires more data, time, and computational power, which can be costly. However, some start-ups are trying to address the issue of language bias from an accessibility standpoint. While individual initiatives are commendable, mitigating the English language bias requires a collective industry response to prevent AI-induced discrimination against non-English speakers.

Researchers worldwide are sounding alarms about Artificial Intelligence (AI) favouring the English language.[1] Generative AI models, such as ChatGPT, struggle in non-English languages when responding to factual questions or performing complex reasoning tasks such as recognising names, places, and organisations or summarising text.[2] Consequently, they are more prone to producing fabricated or factually incorrect information: one study found that an AI chatbot responded to a translation query with “This is an English sentence, so there is no way to translate it to Vietnamese.”[3]

The bias towards English is rooted in the models that power generative AI applications: Large Language Models (LLMs).[4] LLMs are machine learning models that enable computers to generate “human-like” text by estimating the probability of the next likely word or phrase in a longer sentence.[5] This probability is calculated over “tokens”. A token is the basic unit of text that an LLM processes and generates; it can be a whole word, a common fragment of a word, or even a single character.[6]
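To make the idea of next-token probability concrete, the toy sketch below builds a simple bigram model over a made-up corpus: it counts which word follows which and converts those counts into probabilities. This is not how production LLMs work, since they use neural networks over subword tokens trained on vast datasets, but it illustrates the core mechanic of estimating what comes next; the corpus, function name, and probabilities shown are purely illustrative.

```python
# A toy, illustrative sketch of next-token prediction -- not how production
# LLMs actually work. Real models use neural networks over subword tokens;
# here we simply count which word follows which in a tiny made-up corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams: how often each word is followed by each other word.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word_probabilities(word):
    """Return the estimated probability of each word that may follow `word`."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probabilities("the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```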

To estimate these probabilities, researchers train LLMs on vast amounts of words, sentences, and texts gathered from the internet. OpenAI's ChatGPT, for instance, draws from five different datasets, including Common Crawl[7], WebText[8], and Wikipedia.[9] However, due to the disproportionate prevalence of English content on the internet,[10] training datasets are predominantly composed of English text. For instance, 46 percent of Common Crawl's monthly data is in English[11], followed by Russian, Dutch, and French, each constituting less than 6 percent.[12] Widely spoken languages such as Hindi, Turkish, and Malay each contribute less than 1 percent to this dataset.[13]

As a result of being predominantly trained on English text, generative AI applications excel at processing prompts in English. Studies indicate that training LLMs on other languages is time-consuming, and that using them in those languages is more expensive for end-users.[14] Less-resourced languages require a greater number of tokens to represent the same text, leading to slower responses and less usable input space for users.[15] Even GPT-4, which boasts higher proficiency in low-resourced languages than GPT-3, is more effective at solving complex mathematical problems in English than in other languages.[16]
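The token-count gap is easy to observe directly. The sketch below uses OpenAI's open-source tiktoken tokenizer, assuming the package is installed and that the cl100k_base encoding used by GPT-3.5/GPT-4-era models is representative; the sample sentences are illustrative, and exact counts will vary with the tokenizer and the text chosen.

```python
# A minimal sketch, assuming the open-source `tiktoken` package is installed
# (pip install tiktoken). The sample sentences are illustrative only; exact
# token counts depend on the tokenizer and the text chosen.
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Vietnamese": "Xin chào, hôm nay bạn khỏe không?",
    "Hindi": "नमस्ते, आज आप कैसे हैं?",
}

for language, sentence in samples.items():
    tokens = enc.encode(sentence)
    # Roughly equivalent meaning, but non-English text typically splits into
    # more tokens, which means slower responses and less room in the prompt.
    print(f"{language:11s} {len(tokens):3d} tokens")
```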

The bias towards the English language can negatively impact non-native English speakers. Research by Stanford University reveals that essays written by non-native English speakers are more frequently, and incorrectly, flagged as “AI-made.”[17] This misidentification can have serious consequences for students, academics, and job applicants, potentially resulting in reprimands for perceived unethical AI use.[18] Similarly, websites built by non-English speakers are more likely to rank lower in search engines such as Google, which downgrade AI-generated content.[19] Thien Huu Nguyen, a computer scientist at the University of Oregon, suggests that generative AI's English bias could undermine cultural diversity and innovation as individuals are pressured to communicate primarily in English.[20]

The responsibility for addressing this bias rests with the developers of generative AI models. Some efforts have been made to rectify the issue, with GPT-4 incorporating a multilingual approach.[21] However, mitigating the English bias in LLMs would require more data, time, and computational power. Further, it would entail substantial costs for businesses looking to offer generative AI services in low-resourced languages. For instance, it would cost a company 10 times more to offer ChatGPT in Burmese than in English, as Burmese text generates a far larger number of tokens for the AI to process.[22]
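Because most commercial LLM APIs price usage per token, token inflation maps almost directly onto cost. The sketch below is a back-of-the-envelope illustration under assumed numbers: the flat price per 1,000 tokens is hypothetical rather than actual OpenAI pricing, and the roughly tenfold token inflation for Burmese is the figure cited above, not measured here.

```python
# A back-of-the-envelope illustration under assumed numbers: the price per
# 1,000 tokens is hypothetical, and the tenfold token inflation for Burmese
# is taken from the figure cited above, not measured here.
PRICE_PER_1K_TOKENS = 0.002   # hypothetical flat rate in USD

english_tokens = 1_000        # tokens to process a passage in English
burmese_tokens = 10_000       # ~10x more tokens for the same passage in Burmese

english_cost = english_tokens / 1_000 * PRICE_PER_1K_TOKENS
burmese_cost = burmese_tokens / 1_000 * PRICE_PER_1K_TOKENS

print(f"English: ${english_cost:.3f}   Burmese: ${burmese_cost:.3f} "
      f"({burmese_cost / english_cost:.0f}x the cost)")
```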

A few start-ups are trying to address the issue of language bias from an accessibility standpoint. For instance, Lelapa AI is an African start-up building machine learning tools for languages native to South Africa. Although its functionalities are simple, such as converting voice to text and detecting names, its tools work with three African languages - isiZulu, Afrikaans, and Sesotho - alongside English. Similarly, Devnagri is an Indian AI translation platform that works with 22 Indian languages, making English websites accessible to India's non-English speakers.[23]

While individual smaller-scale initiatives are commendable, they cannot substitute for a scalable solution at the industry level, which requires the availability of more data in different languages. At a United States Congressional hearing in May 2023, OpenAI's CEO, Sam Altman, was asked about steps to bridge the language gap in ChatGPT. In response, he said he hoped to partner with governments and other organisations to acquire datasets that would bolster ChatGPT's language skills and broaden its benefits to “as wide of a group as possible.”[24] As individual governments and businesses scramble to formulate policies on AI, whether a collective industry response can be achieved remains to be seen.

[1] Wired (2023). “ChatGPT Is Cutting Non-English Languages Out of the AI Revolution”.

[2] Lai et al (2023). “ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning”. arXiv preprint. Available at: https://doi.org/10.48550/arXiv.2304.05613

[3] Ibid. Wired (2023).

[4] Ibid. Lai et al (2023).

[5] Google (n.d.). “Introduction to Large Language Models”.

[6] As per OpenAI’s website, 1,000 tokens is about 750 words.

[7] Common Crawl is a free, open access repository of data scraped from the internet. According to the Common Crawl website, it serves as the primary training dataset for every LLM, and has provided 82% of the tokens used to train ChatGPT.

[8] WebText is OpenAI’s internal corpus of web pages scraped from the internet. It consists of about 8 million documents, totalling roughly 40 GB of text. Source: Radford et al (2019). “Language Models are Unsupervised Multitask Learners”. OpenAI. Available at: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

[9] Brown et al (2020). “Language Models are Few-Shot Learners”. OpenAI. Available at: https://arxiv.org/pdf/2005.14165.pdf

[10] According to the Center for Democracy and Technology, the English language accounts for 63% of all websites on the internet. Source: Nicholas and Bhatia (2023). “Lost in Translation”. Center for Democracy and Technology. Available at: https://cdt.org/wp-content/uploads/2023/05/non-en-content-analysis-primer-051223-1203.pdf

[11] Based on the past three “crawls” or data scrapes.

[12] commoncrawl.github.io

[13] Ibid.

[14] Medium (2023). “ChatGPT Bias: 3 ways Non-English Speakers are being Left Behind”.

[15] Ibid. Medium (2023).

[16] Jun (2023). “GPT-4 can solve math problems - but not in all languages”. Available at: https://www.artfish.ai/p/gpt4-project-euler-many-languages

[17] Liang et al (2023). “GPT detectors are biased against non-native English writers”. Patterns. Vol. 4, Issue 7, 100779. Available at: https://doi.org/10.1016/j.patter.2023.100779

[18] Guardian (2023). “Programs to Detect AI Discriminate Against Non-native English Speakers, Shows Study”.

[19] Ibid. Liang et al (2023); developers.google.com (n.d.)

[20] Ibid. Wired (2023).

[21] Medium (2023). “ChatGPT-4 is here: What’s new compared to GPT-3.5?”.

[22] Ibid. Medium (2023). “ChatGPT Bias: 3 ways Non-English Speakers are being Left Behind”.

[23] IndiaAI (n.d.). “Devnagri”. Available at: https://indiaai.gov.in/startup/devnagri

[24] Ibid. Wired (2023).