by Alyssa Simpson Rochwerger


SAN FRANCISCO – I’ll never forget my “aha” moment with bias in AI. I was working at IBM as the product owner for Watson Visual Recognition. We knew that the API wasn’t the best in class at returning “accurate” tags for images, and we needed to improve it.

I was nervous about the possibility of bias creeping into our models. Bias in machine learning (ML) models is the exact sort of problem the ML community has seen time and again, from poor facial recognition of diverse individuals to an AI beauty pageant gone awry and countless other instances. We looked long and hard at the data labels we used for our project and, at first blush, everything seemed fine.

Just prior to launch, a researcher on our team brought something to my attention. One of the image classes that had trained our model was “loser.” And a lot of those images depicted people with disabilities.

I was horrified. We started wondering, “What else have we overlooked?” Who knows what seemingly innocuous label might train our model to exhibit inherent or latent bias? We gathered everyone we could, from engineers to data scientists to marketers, to comb through the tens of thousands of labels and millions of associated images and pull out everything we found objectionable according to IBM’s code of conduct. We pulled out more than a handful of other classes that didn’t reflect our values.

My “aha” moment helped avert a crisis. But I also realize that we had some advantages in doing so. We had a diverse team (different ages, races, ethnicities, geographies, experience, etc.) and a shared understanding of what was and wasn’t objectionable. We also had the time, support, and resources to look for objectionable labels and fix them.


Not everyone who is building an ML-enabled product has the resources of the IBM team. For teams without the advantages we had, and even for organizations that do, the prospect of unwanted bias looms. Here are a few best practices for teams of any size as they embark upon their ML journey. Hopefully they help avoid unintended negative consequences like those we almost experienced.

To avoid these sorts of biases in AI, it’s best to start by defining the decision you’re asking your model to make. With Watson, our problem was very broad, so we used very broad training data, and that led directly to some problems. But that’s just the start. Ask yourself:

● Is your AI making decisions that are usually made by humans? If so, there could be biases in that set of decisions you’ll need to carefully look at.

● Is the decision simple, with black and white answers? Often, when you peel back the layers and look at real-world examples, there is context you might’ve missed. Think about the “loser” label I referenced above. We couldn’t have predicted that, but it was in our data regardless.

● Is there a protected class or status involved in this decision? Some cases are easy (is this a specific crop or a weed?). Some are harder (is this clothing intended for a man or a woman?). Some can be hidden (is the bias in the data only indirectly? Think here about sentencing algorithms whose geographic data is effectively demographic data in disguise).

Those questions can help you deduce what might go wrong before you start building your model. Next, you’ll want to define the attributes upon which you’d like decisions to be made.


Define attributes to create guidelines

For example, if you’re creating a computer vision model that’s answering a fairly straightforward question like “is this a human?” you need to actually define what you mean by “human.” Do cartoons count? What about court sketches? What if the person is partially occluded? Should a torso count as “human” for your model? What about just a hand? This all matters. You need clarity on what “human” means for this model. If you’re unsure, just ask people the same question about your data. You might be surprised by the ambiguities present and the assumptions you made going in.
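
One way to surface those ambiguities is to have several people answer the same question about the same images and look at where they disagree. The following is a minimal sketch, not a prescribed method: the item names, the five answers per image, and the 0.8 agreement threshold are all hypothetical values chosen for illustration.

```python
from collections import Counter


def agreement(labels):
    """Fraction of annotators who chose the most common answer for one item."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def ambiguous_items(answers, threshold=0.8):
    """Return the ids of items whose agreement falls below the threshold."""
    return sorted(
        item for item, labels in answers.items() if agreement(labels) < threshold
    )


# Hypothetical answers to "is this a human?" from five annotators per image.
answers = {
    "photo_of_person": ["yes", "yes", "yes", "yes", "yes"],
    "cartoon": ["yes", "no", "no", "yes", "no"],
    "hand_only": ["yes", "yes", "no", "yes", "no"],
}

# Items listed here are the ones the labeling guidelines should address.
print(ambiguous_items(answers))
```

Items that fall below the threshold are good candidates for an explicit rule in your labeling guidelines.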

At this point, you should know both what you’re solving for and what could go wrong. In essence, you should know “what is this thing we’re building?” and “what are some things that might go wrong for our end user?” Once you have a framework here, it’s of paramount importance to deeply review your data.

After all, this is where bias is often hidden. A few years back, researchers at the University of Washington and the University of Maryland found that doing an image search for certain jobs revealed serious underrepresentation and bias in results. Search “nurse,” for example, and you’d see only women. Search “CEO” and it’s all men. The search results were accurate in certain ways (the pictures were indeed of nurses and CEOs), but they painted a world in which those jobs were uniformly held by women or men, respectively. This is just one example, but it shows how bias can lurk in data without you being able to readily identify it.

You need to think about these issues when you’re reviewing your data. It’s one of the reasons why having a diverse team involved is crucial. Diverse backgrounds help ensure that your team will be asking different questions, thinking about different end users, and, hopefully, creating a technology with some empathy in mind.


A few things that are must-asks:

● Where did you get your data from? Could there be sourcing bias? A facial recognition model built in a college lab with collegiate faces might have issues with children or the elderly, for example.

● Do you have enough examples of edge cases? Without them, your model will have trouble identifying unlabeled examples that are outside the norm. Showing a model more male nurses and more female CEOs will help it produce less biased results.

● Are you thinking about your end user enough? While I realize I mentioned this above, it’s really crucial. Don’t just think of yourself as the end user; think of as many end users as possible for your project. Test with those end users. Doing so will help you find the solvable problems now, before your model is in production.
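
The sourcing and edge-case questions above can be partly checked by counting. The following is a rough sketch, assuming each training example carries a label and a group attribute; the field names, the toy data, and the 20% minimum share are hypothetical, and the right threshold depends entirely on your problem.

```python
from collections import Counter


def group_shares(examples, label_key, group_key):
    """For each label, return the share of examples that belong to each group."""
    counts = {}
    for ex in examples:
        counts.setdefault(ex[label_key], Counter())[ex[group_key]] += 1
    return {
        label: {g: n / sum(c.values()) for g, n in c.items()}
        for label, c in counts.items()
    }


def underrepresented(shares, groups, min_share=0.2):
    """List (label, group) pairs below min_share; a missing group counts as 0."""
    return sorted(
        (label, g)
        for label, by_group in shares.items()
        for g in groups
        if by_group.get(g, 0.0) < min_share
    )


# Hypothetical toy dataset echoing the nurse/CEO example.
examples = (
    [{"label": "nurse", "gender": "female"}] * 9
    + [{"label": "nurse", "gender": "male"}] * 1
    + [{"label": "ceo", "gender": "male"}] * 10
)

shares = group_shares(examples, "label", "gender")
print(underrepresented(shares, groups=["female", "male"]))
```

A check like this only covers attributes you thought to record, so it complements a diverse review team and testing with real end users rather than replacing them.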

At this point, hopefully, two themes are crystallizing. One is that you need to define your problem and your end users carefully and plan for your outcomes. The second is that a lot of potential issues can be solved with careful attention to your training data.

Better Data Begets Better Models

After all, machine learning models learn from data. Good data makes good models, bad data makes bad models, and biased data makes biased models. In fact, the steps customers take to tune models to remove bias are directly analogous to how a customer tunes a model to account for changing business conditions or algorithmic uncertainty generally. It all boils down to getting better data.

Knowing what I know now, I’d argue it’s both negligent and reckless to launch an AI system into production without accounting for bias with some basic best practices. Essentially, they boil down to:

● Be transparent and open regarding what data trained the system, where it was collected, how it was labeled, what the benchmark for accuracy was, and how that’s measured.

● Declare the purpose of the decision making and the criteria through which that decision is made.

● Be empathetic. Understand that you will have different end users and they’ll all use your system differently. Imagine what their experiences might be and build for those, in addition to the ones you inherently expect.

● Take feedback! Ensure there is a mechanism to question an answer, get a human judgment, or gracefully fall back in low-confidence situations, so that you are not overly reliant on an AI system. Just like humans, it’s ok for the robot to say “I’m not sure.”

● Learn! When an outcome is questioned, ensure there is a way to give feedback, retrain, and ensure that the model is actively learning from new examples and real-world data.
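
The graceful fallback described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the model returns a score per label, and the `Decision` type, the `decide` function, and the 0.75 threshold are hypothetical names and values, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    label: str
    source: str  # "model" or "human_review"


def decide(scores, threshold=0.75):
    """Return the top label if the model is confident enough; otherwise defer."""
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return Decision(label=label, source="model")
    # Low confidence: say "I'm not sure" and route the item to a person.
    return Decision(label="unsure", source="human_review")


confident = decide({"cat": 0.92, "dog": 0.08})
uncertain = decide({"cat": 0.55, "dog": 0.45})
print(confident)
print(uncertain)
```

Answers collected from human review can then be fed back as new training examples, which covers the "learn" step as well.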

It’s not impossible to reduce unwanted bias in your models. It takes some grit and hard work, sure, but it comes down to being empathetic, iterating throughout the model building and tuning processes, and taking great care with your data.


Alyssa Simpson Rochwerger is VP of Product for Figure Eight