Friday, November 3, 2017

Averages Lie

Averages lie. Also called the arithmetic mean by pretentious math types¹, the average is what you get when you add a bunch of values together, then divide by the total number of items.

Yep, those averages. They lie.

Let's look at a concrete example. A Dutch hip-hop artist and wannabe entrepreneur, let's call him Double-V, has an idea for a product. For the sake of discussion, let's say that his product is a diet scam called Raps, by Vit Voorks.² He decides to structure his sales as a multi-level marketing scheme³. Let's say he has 11 sales reps, and his recruiting information says:
The average salesperson made $500 in commissions.
That sounds like a pretty good deal. But is it?

Intuitively, when most people hear that the average commission total was $500, they assume that if you pick a person at random from the collection of salespeople, their commission should be somewhere around $500. That isn't necessarily true. The distribution of commissions could look like this:
500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500
Which, if you add them together and divide by 11, like they taught you in grade school, gives you an average commission of $500. This is the view that Vit Voorks wants you to take. He wants you to think everyone makes a bunch of money. In reality, the commissions could also look like this:
4000, 400, 200, 200, 100, 100, 100, 100, 100, 100, 100
Or even worse:
5500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Where one person makes $5500, and the rest make nothing. If you do the math again, you'll note that the average in all three cases is $500, but in the last case, the majority of people made absolutely nothing.
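
If you'd rather not do the math by hand, here's a minimal Python sketch that checks all three lists. The variable and function names are mine, made up for illustration; the numbers are the ones from the lists above.

```python
from statistics import mean

# The three commission lists from above. Names are invented for this sketch.
everyone_equal = [500] * 11
top_heavy = [4000, 400, 200, 200, 100, 100, 100, 100, 100, 100, 100]
winner_takes_all = [5500] + [0] * 10

scenarios = {
    "everyone equal": everyone_equal,
    "top heavy": top_heavy,
    "winner takes all": winner_takes_all,
}


def reps_at_or_above_mean(commissions):
    """Count how many reps earned at least the mean commission."""
    average = mean(commissions)
    return sum(1 for amount in commissions if amount >= average)


for name, commissions in scenarios.items():
    print(
        f"{name}: mean = {mean(commissions)}, "
        f"reps at or above the mean = {reps_at_or_above_mean(commissions)} "
        f"of {len(commissions)}"
    )
```

Same mean every time, but in the second and third lists only one rep out of 11 actually reaches it.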

The average tells you nothing about the distribution of those values. Like I said, averages lie.

What if we have this data, and we want to be honest about it? We want to tell people how the salaries are distributed, not just the average. To do that we'll use something called the standard deviation. (For the rest of this, I'm going to switch to the more correct term mean instead of average.)

The standard deviation is a measure of how far from the mean the values typically sit. Let's look at the standard deviations⁴ of our three examples:
  1. Salaries: 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500
    Standard Deviation: 0
  2. Salaries: 4000, 400, 200, 200, 100, 100, 100, 100, 100, 100, 100
    Standard Deviation: 1110.28
  3. Salaries: 5500, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    Standard Deviation: 1581.14

In the best case the standard deviation is 0, which means that there is no variation in the salaries. In the worst, it sits at $1581.14, which means that the salaries are spread much more widely. Thanks to some statistical rules⁵, we can say that at least 8 out of every 9 people (about 88.9%) make a salary within 3 standard deviations of the mean. Let's look at the numbers:
  1. Skipping this one, everyone makes the same amount of money.
  2. Mean: 500
    Standard Deviation: 1110.28
    3 Standard Deviations: 3330.85
    At least 88.9% of people should make: 3830.85 to -2830.85
  3. Mean: 500
    Standard Deviation: 1581.14
    3 Standard Deviations: 4743.42
    At least 88.9% of people should make: 5243.42 to -4243.42
In case the numbers aren't clear, here's a rule of thumb: the larger the standard deviation gets, the less likely it is that any random person makes something close to the mean.
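
For anyone who wants to check the spread themselves, here's a rough sketch using Python's built-in statistics module. It uses `pstdev`, the population standard deviation (it divides by the number of reps, 11), and then counts how many reps land inside the mean plus or minus 3 standard deviations. The function name `three_sigma_summary` and the dictionary keys are mine, not anything standard.

```python
from statistics import mean, pstdev

# Same three commission lists as earlier in the post.
scenarios = {
    "everyone equal": [500] * 11,
    "top heavy": [4000, 400, 200, 200, 100, 100, 100, 100, 100, 100, 100],
    "winner takes all": [5500] + [0] * 10,
}


def three_sigma_summary(commissions):
    """Return (std_dev, low, high, share_inside) for mean +/- 3 std devs."""
    center = mean(commissions)
    std_dev = pstdev(commissions)  # population standard deviation (divides by N)
    low = center - 3 * std_dev
    high = center + 3 * std_dev
    inside = sum(1 for amount in commissions if low <= amount <= high)
    return std_dev, low, high, inside / len(commissions)


for name, commissions in scenarios.items():
    std_dev, low, high, share = three_sigma_summary(commissions)
    print(
        f"{name}: std dev = {std_dev:.2f}, "
        f"range = {low:.2f} to {high:.2f}, "
        f"share inside = {share:.1%}"
    )
```

Note that the range is about where the values are allowed to sit, not where they actually cluster. A range that runs from a few thousand dollars below zero to a few thousand above tells you very little about what a typical rep takes home.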

A quick sidebar about the two other things you probably learned in elementary school, then quickly forgot: median and mode.

Median is the center value. If you arrange the salaries in order from largest to smallest, then pick the one in the middle of the list, that's the median. Medians are better suited for data that is mostly clustered around one value, with a few values much higher or much lower.⁶

Mode is the most common value. If you put people into groups based on how much money they made from this scheme, the salary of the people in the largest group would be the mode. (That angry group of 10 people who made $0. Sorry imaginary people.)
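
Here's a small sketch of both, again with Python's statistics module, run against the same three commission lists. The dictionary and its keys are just labels I picked for the example.

```python
from statistics import mean, median, mode

# Same three commission lists as earlier in the post.
scenarios = {
    "everyone equal": [500] * 11,
    "top heavy": [4000, 400, 200, 200, 100, 100, 100, 100, 100, 100, 100],
    "winner takes all": [5500] + [0] * 10,
}

for name, commissions in scenarios.items():
    # median: the middle value once the list is sorted
    # mode: the value that shows up most often
    print(
        f"{name}: mean = {mean(commissions)}, "
        f"median = {median(commissions)}, "
        f"mode = {mode(commissions)}"
    )
```

When the median and mode sit far below the mean, that's a decent hint that a few big earners are dragging the average up.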

This post is starting to get long,⁷ so we'll leave the topic here for now. Hopefully this overview gave you some information that will help you think critically about averages. If this was useful to you, please share it. If you are a mathsy person and have a correction, feel free to leave a comment below, or send me a message on Twitter @thebecwar.

1 Myself included.
2 No, this wasn't chosen to be similar to some product.
3 Also known colloquially as a pyramid scheme.
4 For those who have been exposed to this before, I'm using the population standard deviation, because we have the entire population.
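5 There's some handwaving here. The rule is Chebyshev's inequality, which works for any data with a finite mean and standard deviation: at least 1 - 1/9 of the values (about 88.9%) sit within 3 standard deviations of the mean. It's a floor, not an estimate, so the real share is often higher. Sorry math nerds.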
6 This is why it's often used when talking about home prices in a city. The median doesn't get dragged around by the really expensive houses or the really cheap ones.
7 He says after writing too many words already.
Yes it was
