Handling Bias in Data Science

Ignoring it won't make it go away, folks!


Data is biased, models are biased, humans are biased - pay attention and hire a diverse workforce!

Here’s the deal:

I think about diversity and the engineering workforce, like, a lot.  It’s been shown approximately 7 billion times that businesses pull in more revenue, make better products, and are better places to work when they have a diverse employee base and an inclusive culture.  So yeah there’s a ton of research on this and I’m not going to rehash it.

But what about your data itself?  Shouldn’t data and data science be, you know, math?  And therefore kind of immune to prejudice?  Data is really just 1s and 0s - electrons!  Electrons aren’t biased, right?  

But - as we all know by now - that’s not true. Bias can happen in the data itself, and everyone in data-related fields should be thinking about this a lot, especially as data-based fields grow in popularity and adoption.  There are a lot of ways skew can be introduced, from how the data is collected to how it’s presented. 

This is an incredibly complex and hairy topic, and like many people I get overwhelmed by the breadth and depth of the problem 😫 But that doesn’t mean we can ignore it, so in the interest of brevity I’m going to home in on three basic sources of bias here:

  1. The data itself 📋

  2. Algorithms 📈

  3. Humans 🤓

Your data might be a problem 😱

Datasets can be incredibly straightforward and transactional (e.g. invoice data) but they can also be skewed in several ways - due to how they’re collected, how they’re sampled, how they’re interpreted… think about any push poll as an obvious example. 

Data issues can be more subtle, too. Think about something that seems simple like a “gender” field ⚤ or a “race” field. Most datasets are split into only two or three genders and a handful of races, and ignore the wide range of identities people actually have, leaving a skewed dataset that is ignorant of the experiences of a subset of people.  That dataset can then be used to measure or predict behavioral patterns, and it will present biased results.  

Bottom line: if your data is problematic, then those issues can be reflected or even amplified in models.  Yikes!
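To make that a little more concrete, here's a rough sketch of one way to check whether a categorical field is over- or under-representing a group. Everything in it is made up for illustration: the `representation_gaps` helper, the sample responses, and especially the reference shares (you'd need a real benchmark for your own population). It won't fix anything, it just flags where to look.

```python
from collections import Counter


def representation_gaps(values, reference_shares):
    """Return observed share minus reference share for each category.

    A negative number means the category is under-represented in
    `values` relative to the (assumed) reference share.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no values to evaluate")
    return {
        category: counts.get(category, 0) / total - expected
        for category, expected in reference_shares.items()
    }


# Hypothetical data: 100 survey responses and invented reference shares.
responses = ["woman"] * 20 + ["man"] * 75 + ["nonbinary"] * 5
reference = {"woman": 0.50, "man": 0.48, "nonbinary": 0.02}

gaps = representation_gaps(responses, reference)
for category, gap in gaps.items():
    print(f"{category}: {gap:+.2f}")
```

A big negative gap is a prompt to ask how the data was collected and sampled, not proof of anything on its own. And note that this only sees the categories your field allows, so it can't tell you who was left out of the options entirely.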

Modeling bad behavior 📊

Have you heard the one about a certain company named after a South American river that built a recruiting tool that was biased against women?  Turns out, when you’ve only hired men and you train your models using data from your predominantly-male workforce, you end up with an inequitable model 🙄 Go figure! 

Algorithms trained on skewed data can find false correlations, which lead to incorrect or prejudiced business decisions and products.  Don’t be like this - make sure you start with adequately diversified datasets, and then check that you don’t end up with biased algorithms anyway.  If you skip those checks, there could be serious life-and-death consequences!  
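One simple way to start checking a model's output is to compare how often each group gets the favorable outcome. Here's a minimal sketch, assuming you have (group, selected) pairs for each decision. The function names and the numbers are hypothetical, and this is only one of many possible fairness measures, not the definitive one.

```python
from collections import defaultdict


def selection_rates(records):
    """Compute the share of favorable outcomes per group.

    `records` is an iterable of (group, was_selected) pairs,
    where was_selected is True or False.
    """
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in records:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}


def selection_rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group was selected at all, so the rates are equal
    return min(rates.values()) / highest


# Hypothetical screening results: 100 candidates per group.
records = (
    [("group_a", True)] * 40 + [("group_a", False)] * 60
    + [("group_b", True)] * 10 + [("group_b", False)] * 90
)

rates = selection_rates(records)
ratio = selection_rate_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0 means one group is getting the favorable outcome far less often than another. That doesn't explain why, but it tells you the model deserves a much closer look before anyone acts on it.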

The solution here is to have data scientists checking for prejudice in the data and models.  But if those folks are themselves biased, then what?  So let’s talk about humans.

Congratulations! You’re part of the problem, too! 🏆

We are all biased, even if unconsciously, and we should work on it. You should work on your own internal judgements, and also learn how to call out other people on theirs. In the world of data science, this becomes even more important, because it affects the data itself, the algorithms and analytics based on that data, and then the interpretation of that data. Let’s be clear - computers do in fact operate in binary but our world is infinitely more complex, and it’s the humans interpreting that binary-based code that actually bring data to life. 

OK so what can we do about this? 👀

I wrote you a Python script to un-bias your entire data environment! 🎉

...just kidding, that’s impossible, sorry 😬  This is clearly a complex problem and I’ve just scratched the surface here, but it’s important to think about and it deserves even more discussion across the board.  With more accurate and less skewed data and data science, companies can make better products and gain better market fit. So in the spirit of being action-oriented, here are three things you can do to address this:

  1. Make sure your data is evaluated for bias when it’s collected, sampled, and interpreted 📋

  2. Do your research, learn how to identify inequalities in algorithms and their results, and avoid them 📈

  3. Build a diverse and inclusive data team. By having a workforce with broad experience and backgrounds, you can avoid the echo chamber that comes with a homogeneous data team 🙎🏾‍♂️🙎🏻‍♀️🙎🏿🙎🏼‍♂️🙎🏽‍♀️

How does your data organization evaluate and address bias in data? I’d love to hear more. Hit me up on Twitter @ctartow and let me know!