This weekend, I was helping a friend with his startup. He was really frustrated: after making a few adjustments based on the data he had collected, his profits seemed to have gone down instead of up.
Since he had "followed the data", he was convinced that he had made the right decision. He asked me to help him figure out what went wrong.
"His data was LYING to him"
Turns out... his data was "lying" to him. This wasn't a case of "garbage in = garbage out", but one of those rare, tricky incidents where your data can actually trick you into making the opposite decision.
Data, data, data... We’re increasingly becoming a society obsessed with data! Important decision-making meetings will often parrot the phrase “Well... what does the data say?” Being “data-informed” is all well and good, but taking data at face value to drive decision making can be rather dangerous. In this post, we will discuss one of the ways data can trick you into making the wrong decision: Simpson's Paradox.
The Simpson's Paradox
In 1973, UC Berkeley was sued for sex discrimination. Of all the female students who applied, only 35% were admitted, while 44% of the male applicants were admitted.
The data raised a lot of eyebrows. And the witch-hunt was on! UC Berkeley set out to find the main culprits of this gender discrimination. To do this, they broke open the data to see which departments were mainly responsible for this GENDER BIAS. And here is what they found:
Now this is where the data gets funny. After breaking open the data, we see a different story. Out of the 6 departments, 4 admitted women at a higher rate than men. There definitely was a gender bias, but it was in favour of the women, not against them!
But that raises the question: why did the aggregated data tell a different story?
This is a classic case of Simpson’s Paradox: the aggregated data tells the opposite story from the same data broken into groups. This happens because of a confounding factor that is hidden from sight WITHIN the data. So what's this "hidden factor" that's causing all the mischief? Take a look at the first and the last rows of the table:
You’ll notice that Department A has a pretty high acceptance rate, especially for women at 82%! However, out of the 4000+ women, only 108 applied to this department. That’s only about 2% of all women who applied across departments.
On the other hand, 825 of the men applied to Department A. That’s 10% of all the male applicants. You may have already spotted the mischief. But let’s go on. Take a look at the last row. Again, the women have a higher acceptance rate than the men. But over here - Department F, in contrast to Department A, has a very LOW acceptance rate.
And this is where it goes wrong. Compared to the men, a much larger portion of the women applied to this low-acceptance department: around 4% of all the men applied to Department F, while 8% of all the women did.
So in truth, women weren’t being discriminated against. It just so happened that a large proportion of them were applying to a department with a low acceptance rate, while a large proportion of men were applying to a department with a high acceptance rate. That skewed the overall results. This sort of data mischief, Simpson's Paradox, can happen anywhere, even in businesses that use data to make decisions.
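To make the arithmetic concrete, here is a minimal Python sketch of the effect. It assumes the commonly reproduced figures for Berkeley's six largest departments (applicant counts and rounded admission rates); only the Department A numbers quoted above come from this post, so treat the rest as illustrative. Because it covers just these six departments, the overall rates it produces (roughly 45% for men and 30% for women) differ from the campus-wide 44% and 35%.

```python
# Assumed figures for the six largest departments (not taken from this post,
# apart from Department A): (dept, men applied, men's admit rate,
# women applied, women's admit rate). Rates are rounded.
DEPARTMENTS = [
    ("A", 825, 0.62, 108, 0.82),
    ("B", 560, 0.63, 25, 0.68),
    ("C", 325, 0.37, 593, 0.34),
    ("D", 417, 0.33, 375, 0.35),
    ("E", 191, 0.28, 393, 0.24),
    ("F", 373, 0.06, 341, 0.07),
]


def overall_rate(groups):
    """Admission rate across departments, weighted by number of applicants."""
    total_applied = sum(applied for applied, _ in groups)
    total_admitted = sum(applied * rate for applied, rate in groups)
    return total_admitted / total_applied


men_overall = overall_rate([(m, m_rate) for _, m, m_rate, _, _ in DEPARTMENTS])
women_overall = overall_rate([(w, w_rate) for _, _, _, w, w_rate in DEPARTMENTS])

# Departments where women were admitted at a higher rate than men.
women_ahead = [dept for dept, _, m_rate, _, w_rate in DEPARTMENTS if w_rate > m_rate]

print(f"Overall: men {men_overall:.1%}, women {women_overall:.1%}")
print(f"Departments where women had the higher rate: {women_ahead}")
```

The overall rate is just a weighted average of the department rates, and the weights are where each group chose to apply. Change the weights and the overall ranking can flip, even though no individual department changes.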
Here’s a business-case example. A CEO and his team were deliberating whether to use a Single-Click advertisement campaign or a Double-Click campaign. That’s when the marketing manager, who happened to support the Double-Click campaign, showed him some data:
Single-Click had more users allocated to it, and thus more revenue, but the RPM (revenue per thousand users) was higher for Double-Click. Looking at this data, the decision seems obvious: Double-Click is generating more money per user, so they should go with Double-Click, correct?
Turns out, picking the Double-Click campaign would have been a costly mistake. Let’s break open the data again - into its subgroups of International users and Local users:
Local Users SubGroup
International Users SubGroup
Suddenly, the data tells a different story. Single-Click is outperforming Double-Click in both subgroups, Local AND International. How is this possible?
It's Simpson’s Paradox at play again. The aggregated data hides a factor that makes it tell the opposite story from the data broken into subgroups.
In this case, the hidden factor was the uneven split between campaigns: only 33% of international users were shown the Double-Click page, compared with 58% of local users. Local users also had a much higher RPM than international users in general. Because the Double-Click group contained a much larger share of high-RPM local users, its overall RPM was pulled up, even though Single-Click performed better within each subgroup.
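Here is a rough Python sketch of how that can play out. The user counts and RPM values are hypothetical, invented purely for illustration, since the original tables are not reproduced here. Only the 33% and 58% Double-Click shares come from the example above.

```python
# Hypothetical figures: (users, RPM in dollars) per segment and campaign.
# Chosen so that 58% of local users and 33% of international users see the
# Double-Click page, as in the example; everything else is made up.
SEGMENTS = {
    "local": {"single": (42_000, 10.00), "double": (58_000, 9.50)},
    "international": {"single": (268_000, 2.00), "double": (132_000, 1.80)},
}


def revenue(users, rpm):
    """Revenue in dollars, given RPM as revenue per thousand users."""
    return users / 1000 * rpm


def blended_rpm(campaign):
    """RPM for one campaign across all segments combined."""
    total_users = sum(SEGMENTS[seg][campaign][0] for seg in SEGMENTS)
    total_revenue = sum(revenue(*SEGMENTS[seg][campaign]) for seg in SEGMENTS)
    return total_revenue / total_users * 1000


for name, campaigns in SEGMENTS.items():
    single_rpm = campaigns["single"][1]
    double_rpm = campaigns["double"][1]
    print(f"{name}: Single-Click RPM {single_rpm:.2f}, Double-Click RPM {double_rpm:.2f}")

print(f"combined: Single-Click RPM {blended_rpm('single'):.2f}, "
      f"Double-Click RPM {blended_rpm('double'):.2f}")
```

With these made-up numbers, Single-Click wins inside each segment, yet Double-Click shows the higher combined RPM, simply because more of its traffic comes from the high-RPM local segment.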
Simpson’s Paradox can be tricky - the key is to look out for any hidden variables that may be influencing your data!
Don’t rely too much on your data. If something smells fishy - look into it. Do not trust your data blindly.