The Challenges of Mitigating Bias in Generative AI

An academic perspective from a University of Tennessee professor specializing in deep learning who is also a research scientist at Amazon

Michel Ballings, University of Tennessee Professor and Research Scientist at Amazon

September 21, 2023


Generative AI has reshaped the economics of content production by offering efficiency gains, cost savings and new opportunities for copywriting. It can quickly generate high volumes of content for various purposes, such as social media posts, product descriptions and blog articles. This volume of content can help businesses maintain a consistent online presence without extensive human intervention. Because generative AI can produce this content in a fraction of the time a human would need to produce content of similar quality, businesses can save time and reduce costs, resulting in a step change in the economics of content production.

But the economics of content generation have not only changed for well-intentioned businesses and professionals; they have also changed for domestic and foreign bad actors who want to influence anyone who consumes content on the internet. This malicious influence can come in the form of election manipulation, social engineering and fake news in general. Before generative AI existed, the limiting factors on bad actors producing biased content were research, language skills and copywriting time. With generative AI, bad actors can mass-produce high-quality content in any language.

The learning loop

What will happen if a small group of bad actors starts flooding the internet with biased content? To understand the risks, we need to talk about the learning loop of generative AI. The first step in this loop is to download a large corpus of content from the internet, including websites, blogs, articles and social media. Second, generative AI is trained or updated on all of these data. Third, users generate content with the model and publish it to the internet, on social media, blogs and news sites. This new content is then downloaded in turn, and the whole cycle repeats. Where can this go wrong? Since the responses of the model are based on the patterns and information present in the text it was trained on, the updated model will generate more biased text if the new training data contain more biased text.
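The feedback loop above can be made concrete with a toy simulation. Everything here is an illustrative assumption, not a measurement: we assume the model reproduces the bias rate of its training corpus, and that bad actors inject a fixed volume of fully biased content before each retraining cycle.

```python
# Toy simulation of the generative-AI learning loop described above.
# Assumptions (illustrative only): the model emits biased text at the
# same rate as its training corpus, and bad actors inject a fixed
# amount of fully biased content before each retraining cycle.

def simulate_loop(cycles, corpus_size, injected_per_cycle, initial_bias_rate=0.01):
    """Return the corpus bias rate observed at each retraining cycle."""
    biased = corpus_size * initial_bias_rate
    total = corpus_size
    rates = []
    for _ in range(cycles):
        # Bad actors publish biased content; it gets scraped into the corpus.
        biased += injected_per_cycle
        total += injected_per_cycle
        rate = biased / total
        rates.append(rate)
        # The retrained model now emits new content at this bias rate,
        # and its outputs are published and scraped in turn.
        model_output = corpus_size * 0.1  # model adds 10% new content
        biased += model_output * rate
        total += model_output
    return rates

rates = simulate_loop(cycles=10, corpus_size=1_000_000, injected_per_cycle=50_000)
```

Under these assumptions the bias rate rises every cycle: each injection pushes the corpus further from its starting point, and the model's own outputs lock that rate in for the next round of training.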

This means that if a small but productive group of bad actors uses generative AI to produce and publish high volumes of biased content, the model will become infected with bias. Good actors who use normal prompts will then inadvertently also start producing biased content, in a never-ending cycle.

What can we do to prevent this from happening?

There are three places where we can intervene. First, we could filter out biased data from the training data. If the training data are not biased, the model will not produce biased content. Second, we could filter or block user prompts that generate biased content before sending the prompt to the model. Third, we could filter and block biased generated content before sending it to the user.

Unfortunately, filtering turns out to be very hard, except when bias is obvious. To see why, we need to understand that generative AI produces open-ended content that varies with repeated invocations, even if the prompt is identical. To learn how to do this, it must be trained on open-ended content. This is different from traditional applications of machine learning, where inputs and outputs are structured, and where outputs do not vary if inputs are held constant.

Consider a prediction model to assist in approving or denying credit card applications. We would want the model to yield fair predictions. For example, if two people have the exact same credit history and profile except for gender, the model should return the exact same approval probability. This is straightforward to accomplish with structured data and models. We could simply remove gender from the inputs, calibrate the model for gender, and audit the model to ensure that the error rate of the model is the same for male and female applicants.
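The audit step described above is simple enough to sketch. In this hypothetical example, gender is kept out of the model's inputs and used only afterward, to compare error rates across groups; the records and the fairness threshold are invented for illustration.

```python
# A minimal sketch of the fairness audit described above: gender is
# excluded from the model's inputs and used only post hoc to compare
# error rates across groups. Data and threshold are hypothetical.

def error_rate_by_group(records, group_key="gender"):
    """records: dicts with 'prediction', 'actual' and a group attribute."""
    errors, counts = {}, {}
    for r in records:
        g = r[group_key]
        counts[g] = counts.get(g, 0) + 1
        errors[g] = errors.get(g, 0) + (r["prediction"] != r["actual"])
    return {g: errors[g] / counts[g] for g in counts}

def is_fair(records, max_gap=0.05):
    """Flag the model if per-group error rates differ by more than max_gap."""
    rates = error_rate_by_group(records)
    return max(rates.values()) - min(rates.values()) <= max_gap

applications = [
    {"gender": "F", "prediction": 1, "actual": 1},
    {"gender": "F", "prediction": 0, "actual": 0},
    {"gender": "M", "prediction": 1, "actual": 1},
    {"gender": "M", "prediction": 0, "actual": 0},
]
```

With structured data, this check is mechanical: group, compare, flag. The next paragraphs show why nothing so mechanical exists for open-ended text.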

Can we apply a similar approach to generative AI? For example, consider the prompt 'Professor Hewett started teaching class and'. To ensure that the model does not sustain stereotypes, we may want to enforce that its completions use female and male pronouns with equal probability. Of course, we should then probably also do this for other jobs, such as garbage(wo)man, police officer, gynecologist, nurse and landscaper. This can quickly become intractable. The model also needs to keep its answers consistent with the context of the prompt. What if the prompt mentions long hair? Should this result in completions that are more likely to use female pronouns?

Because of these challenges, filtering will almost certainly have to involve guardrail models. Such models predict the level of bias in content and require human-annotated data. For example, humans would have to create a table with two columns: one containing open-ended text, and the other denoting whether (yes or no) the text is biased. Creating the training data will be a daunting task, because humans would have to read the text and decide whether it is biased. Building consensus on the definition of bias will be even harder.
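To make the guardrail idea concrete, here is a minimal sketch: a naive-Bayes bag-of-words classifier trained on the two-column annotated table described above. The four-row dataset is invented for illustration; a real guardrail would need vastly more data and a far stronger model, and would inherit every disagreement the annotators had about what counts as biased.

```python
# A minimal guardrail-model sketch: naive Bayes over a bag of words,
# trained on a tiny hypothetical human-annotated table of
# (text, biased yes/no) rows. Illustrative only.
import math
from collections import Counter

def train(rows):
    """rows: list of (text, label) pairs with label in {'yes', 'no'}."""
    word_counts = {"yes": Counter(), "no": Counter()}
    label_counts = Counter()
    for text, label in rows:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the more probable label for the given text."""
    word_counts, label_counts = model
    vocab = set(word_counts["yes"]) | set(word_counts["no"])
    scores = {}
    for label in ("yes", "no"):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out a class.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

annotated = [
    ("group x is lazy and dishonest", "yes"),
    ("group x always causes trouble", "yes"),
    ("the meeting starts at noon", "no"),
    ("the report covers quarterly sales", "no"),
]
guardrail = train(annotated)
```

The hard part is not this code; it is filling the annotated table with examples that humans agree on, which is exactly the consensus problem the paragraph above describes.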

So, what can we do? If we see generative AI as a large-scale natural experiment involving all people who have access to the internet, estimated at five billion, then we should make sure that we can undo the effects of the experiment by wiping the internet clean of AI-generated content. We could say we need an insurance policy in case the experiment does not go as hoped. This brings us to the following question: How can we identify AI-generated content? The answer is watermarking.

Insurance policy: watermarking

It is easy to understand watermarking when the output of the model is an image. As a last step in image generation, we could place an identifier somewhere in the image. We can think of this identifier as a specific pattern of pixels, arranged to be visible to the naked eye or not. We can then run the image through a simple pattern recognizer to determine whether it was generated by AI. But watermarking is more complex when the generated content is text. Let's first discuss how text generation works.
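The image case is simple enough to sketch end to end. The 2x2 pixel pattern and corner location below are invented for illustration; real image watermarks are spread through the image and made robust to cropping and compression, which this toy is not.

```python
# A minimal sketch of the image-watermarking idea above: as a last
# step of generation, stamp a fixed pixel pattern into a corner of the
# image, then detect it with a simple pattern check. Pattern and
# location are hypothetical; real schemes are far more robust.

PATTERN = [[255, 0], [0, 255]]  # a 2x2 checkerboard identifier

def watermark(image):
    """image: 2D list of grayscale pixels, modified in place."""
    for i, row in enumerate(PATTERN):
        for j, value in enumerate(row):
            image[i][j] = value
    return image

def is_generated(image):
    """Return True if the identifier pattern is present."""
    return all(
        image[i][j] == value
        for i, row in enumerate(PATTERN)
        for j, value in enumerate(row)
    )

photo = [[128] * 4 for _ in range(4)]              # a plain human-made image
generated = watermark([[128] * 4 for _ in range(4)])
```

A fixed pixel stamp has an obvious detector, which is exactly why the text case discussed next is harder: there is no "corner" of a sentence to stamp.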

Consider the prompt 'A long time ago in a galaxy far, far away' from Star Wars. Imagine we go to the training data and find one hundred instances of this sentence. We can then make a list of all the immediate next words, count how many times out of a hundred each word occurs, and turn that into a percentage. For example, we could have 'rebels: 30%, imperial: 50%, the: 5%, princess: 15%.' Next, imagine we sample from this list, giving words with a higher percentage a higher chance of being selected.

In this case we would likely end up with 'A long time ago in a galaxy far, far away imperial.' We can then take this resulting sentence and repeat the process. This is quite similar to how a generative model works, except that the model can generalize beyond the training data. For example, instead of selecting a word from the four-word list above, it may select another word that it deems acceptable.
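The sampling step above fits in a few lines. The counts are the hypothetical percentages from the Star Wars example, not real training-data statistics.

```python
# A sketch of the next-word sampling step described above, using the
# hypothetical percentages from the Star Wars example.
import random

next_word_probs = {"rebels": 0.30, "imperial": 0.50, "the": 0.05, "princess": 0.15}

def sample_next_word(probs, rng=random):
    """Pick a word, giving higher-percentage words a higher chance."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

prompt = "A long time ago in a galaxy far, far away"
completion = prompt + " " + sample_next_word(next_word_probs)
```

Run repeatedly, 'imperial' wins about half the time and 'the' almost never, which is why the article says we would "likely" end up with the imperial continuation; appending the sampled word and re-running is the repeat-the-process step.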

How can we watermark this process? In a recent study, researchers at the University of Maryland propose randomly dividing all possible words into two lists, a green list and a red list, each time the algorithm selects a word. The model is then only allowed to select from the green list. Humans would never know what these lists are, so text written by humans would essentially never consist of words drawn only from the green lists (people simply do not choose words this way). How can we then check a given text for a watermark? The watermarking procedure would store, for each next-word selection step, all the words that come before it along with the green and red lists. We can then take any text and count how many of its words come from the red lists. If no words come from the red lists, the text was generated by the model; if many do, the text was written by a human.
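A toy version of this scheme follows. One simplification differs from the stored-list description above: here the green/red split is re-derived from a hash of the preceding word (closer in spirit to the Maryland paper), so detection needs no stored state. The eight-word vocabulary and the always-pick-the-first-green-word generator are invented for illustration.

```python
# A minimal green/red-list watermark sketch. The split is re-derived
# from a hash of the previous word rather than stored, so the detector
# needs no saved state. Vocabulary and generator are toy assumptions.
import hashlib

VOCAB = ["the", "empire", "rebels", "princess", "fought", "fled", "won", "lost"]

def green_list(previous_word):
    """Deterministically, pseudo-randomly split the vocabulary in half."""
    ranked = sorted(
        VOCAB,
        key=lambda w: hashlib.sha256((previous_word + w).encode()).digest(),
    )
    return ranked[: len(VOCAB) // 2]  # first half green, rest red

def generate(prompt_word, length):
    """Always pick a green-list word (the first one, for determinism)."""
    text = [prompt_word]
    for _ in range(length):
        text.append(green_list(text[-1])[0])
    return text

def red_word_count(text):
    """Count red-list words; human text will hit red roughly half the time."""
    return sum(
        word not in green_list(prev)
        for prev, word in zip(text, text[1:])
    )

generated = generate("the", 10)
```

Watermarked output scores zero red words by construction, while each human-chosen word lands on the red list with probability one half, so even short human passages accumulate red hits quickly.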

What is the fine print of this insurance policy? Well, somebody could create their own generative AI and leave out the watermarking. How easy would that be? Training such a model is prohibitively expensive, and since the technology is evolving rapidly, it would require continuous investment. Of course, this does not prevent an entity such as a foreign nation from training a generative model and using it to influence another nation, but at least it is going to cost them.

What else can be done? One could train models to detect generated text: take known human-written text and known AI-generated text, and train a model that discriminates between the two. There are two downsides. First, it is becoming impossible to say whether an article or piece of text is completely free of AI involvement. Human authors may have used AI to assist them in writing, and this will confuse the model. Second, generative models can be trained to explicitly evade detection. Creating such a detector will therefore be challenging.


We are in the middle of a large-scale experiment in which we are unleashing a new technology into the world, giving unprecedented productivity to internet users who write text or generate images. This technology can be used to inflict damage on others at all levels of society, including nations. Bad actors can poison the training data on which future models will be trained. Therefore, in the near term, we would be well advised to watermark generated text and images so that we can identify generated content. In the long term, continued investment will be necessary to identify unmarked generated text.


About the Author(s)

Michel Ballings

University of Tennessee Professor and Research Scientist at Amazon

Michel Ballings is an associate professor at the University of Tennessee specializing in deep learning, reinforcement learning and data engineering. He is also director of the university's JTV Center Intelligence Lab and a research scientist at Amazon.
