Monday, April 29, 2013

Simpson's Paradox

A North Slope real estate broker (named North) is trying to convince you that North Slope is a more affluent neighborhood than South Slope.  To prove it, he explains that professionals in North Slope earn a median income of $150,000, versus only $100,000 in South Slope.  Working class folks fare better in North Slope also, with hourly workers making $30,000 a year to South Slope's $25,000.

The South Slope real estate broker (named South) explains that North is crazy.  South Slope is much more affluent.  The median income in South Slope is $80,000 versus the North Slope median of $40,000.

Question: Who is lying, North or South?
Answer: It could be neither.
Consider the breakdown of income shown below.

We can see that North is not lying.  Half the hourly South Slope workers earn $20K and half earn $30K, for a median of $25K.  A similar calculation for the North Slope workers yields an hourly median of $30K.  For professionals in the South Slope, the median is $100K, with half earning $80K and half earning $120K.  In the North Slope, a similar calculation yields a median of $150K.

South is not lying either.  For the South Slope, the median is $80,000, since more than half of the workers make $80,000 or less and more than half make $80,000 or more (by the definition of the median, at least half of the values must be at or below it and at least half at or above it).  For the North Slope, the median is $40,000.

What happened here?  The wages by type of work conflict with the overall wages because the percentage of residents in each category differs between the two neighborhoods.  Although both professionals and hourly workers make more in the North Slope, hourly workers make up a far larger share of residents in the North Slope than in the South Slope.  As a result, the overall median (or mean) income is lower in the North Slope.
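To make the reversal concrete, here is a minimal Python sketch.  The numbers are hypothetical, chosen only so that they reproduce the medians quoted above: the South Slope income levels follow the text, while the North Slope income levels and all of the head counts are my assumptions for illustration.

```python
from statistics import median

# Hypothetical incomes in dollars. Only the resulting medians come from the
# post; the individual values and head counts are illustrative assumptions.
incomes = {
    "North Slope": {
        "hourly": [20_000] * 3 + [40_000] * 3,
        "professional": [100_000, 200_000],
    },
    "South Slope": {
        "hourly": [20_000, 30_000],
        "professional": [80_000] * 3 + [120_000] * 3,
    },
}


def summarize(groups):
    """Return per-group medians, the pooled median, and the hourly share."""
    everyone = [x for values in groups.values() for x in values]
    summary = {name: median(values) for name, values in groups.items()}
    summary["overall"] = median(everyone)
    summary["hourly_share"] = len(groups["hourly"]) / len(everyone)
    return summary


results = {area: summarize(groups) for area, groups in incomes.items()}

for area, summary in results.items():
    print(area, summary)
```

In this toy data the North Slope wins within each group (30,000 versus 25,000 for hourly workers, 150,000 versus 100,000 for professionals) yet loses overall (40,000 versus 80,000), because 75% of its residents are hourly workers compared with 25% in the South Slope.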

While Wikipedia has an entry for Simpson's Paradox, a specific example of which I described above, it seems that most people are unaware of it.  My motivation for writing about it is not the made-up example I present above but the fact that I encounter it so often in my everyday work.  I either make my clients very happy by explaining that the 'bad' effect they have found may well be spurious, or anger them by explaining that the interesting relationship they have found is a mere statistical anomaly.