Tuesday, April 15, 2008

What's the chance of rain?

Everyday probability barely shows up in weather forecasts these days. For example, Yahoo's weather will say something like "a few showers" to mean that there is a chance of showers. However, if you are a purist, go to the National Weather Service (NWS) site, where they still make predictions using probabilities. See the Chicago forecast for this week in Yahoo and at the NWS site (the online NY Times can barely be bothered to give any information at all, unless you can interpret their icons).

When it comes to data, I've always felt more is more, and so if I am really interested in the weather, I go to the NWS site where I'll get more than just "chance of rain" or "a few showers."

But what does it mean when the weather forecast says there is a 30% chance of rain Wednesday and a 50% chance of rain Thursday? If we focus on a single time period, say Thursday, the conclusion is pretty clear: there's a 50-50 chance of rain. Put another way, when conditions like this were encountered in the past, the NWS model data shows rain half the time and no rain half the time.

The inference becomes more difficult when we want to ask a more complex question. For example, suppose I'm going to Chicago Wednesday and returning Thursday night. I want to know whether to bring an umbrella. Since I hate lugging an umbrella along, I only want to bring one if there is at least a 75% chance of rain at some point while I'm there.

It turns out that the answer to this question cannot be determined with the information given (don't you just love that choice on multiple choice tests?).

Before we explain why, though, we need some definitions and notation.
To do the math for this, we generally define each possible outcome as an event. In this case, we have the following events:
Event A: Rains Wednesday
Event B: Rains Thursday

We are interested in the chance that either Event A or Event B occurs. We have a shorthand for expressing the probability of Event A: "P(A)".

There is a simple probability formula that is very useful here:
P(A or B) = P(A) + P(B) - P(A and B)
This formula says that the probability of Event A or Event B happening is the probability of A plus the probability of B minus the probability that A and B both happened (the event that A and B occurred is called the intersection of Events A and B). This makes sense because if we just added them (as you might intuitively do) we are double counting the times both events occur, and thus we need to subtract out the intersection once at the end.
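To see the double counting concretely, here is a minimal sketch that checks the formula by counting. The ten-trip record below is made-up data chosen only so the numbers come out round; it is not NWS data, and the names `history` and `prob` are my own.

```python
from fractions import Fraction

# Hypothetical record of 10 past Wednesday/Thursday pairs (made-up data).
# Each tuple is (rained Wednesday, rained Thursday).
history = [
    (True, True), (True, False), (True, True), (False, True), (False, True),
    (False, True), (False, False), (False, False), (False, False), (False, False),
]

def prob(event):
    """Fraction of recorded pairs for which `event` is true."""
    return Fraction(sum(1 for days in history if event(days)), len(history))

p_a = prob(lambda d: d[0])               # P(A): rain Wednesday
p_b = prob(lambda d: d[1])               # P(B): rain Thursday
p_both = prob(lambda d: d[0] and d[1])   # P(A and B)
p_either = prob(lambda d: d[0] or d[1])  # P(A or B), counted directly

# Adding P(A) and P(B) counts the both-days pairs twice, so subtract them once.
assert p_either == p_a + p_b - p_both
print(p_a, p_b, p_both, p_either)
```

Counting the pairs with rain on either day gives the same answer as the formula, while simply adding P(A) and P(B) would overshoot by exactly P(A and B).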

In some cases P(A and B)=0. In other words, Events A and B never occur together. You may have noticed this comes up when you toss a coin: it is never both heads and tails at the same time (except for that time I got the coin to stand on its side). Events like these are called mutually exclusive.

In other cases P(A and B)=P(A)*P(B). This means the probability of A and B is the product of the probabilities of A and B. In this case, the two events can both occur, but they have nothing to do with each other. Events like these are called independent events.

In still other cases, P(A and B) is neither 0 nor P(A)*P(B).

If we assume the events A and B are mutually exclusive, then there's an 80% chance (30+50) of rain either Wednesday or Thursday. This assumption seems unlikely, though, because a storm can easily last more than one evening.

If we assume the events A and B are independent, then there's a 65% chance of rain either Wednesday or Thursday. This is a little more complicated to calculate, because we need to figure out the chance of it raining both Wednesday and Thursday, which under the independence assumption is P(A)*P(B)=30%*50%=15%. Thus:
P(A or B)=P(A) + P(B) - P(A and B)=30% + 50% - 15%=65%.

[We could also figure out the chance of no rain on either night. Since the chance of rain is independent, the chance of no rain is also independent. Also, the chance of rain plus the chance of no rain must be one. Thus P(no rain Wednesday)=P(not A)=1-P(A)=100%-30%=70%. Similarly, P(not B)=100%-50%=50%. Then, the chance of no rain in either time period = P(not A and not B)=70%*50%=35%. Thus, there is a 35% chance it will not rain either night, and we can conclude there would be a 65% chance of rain on at least one of the nights, of course all hinging on the independence assumption.]
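As a quick check of the arithmetic, here is a minimal sketch that computes the independent case both ways, directly from the formula and through the "no rain either night" route. The variable names are my own, and the whole thing assumes independence, just as the text does.

```python
import math

p_a = 0.30  # P(A): rain Wednesday
p_b = 0.50  # P(B): rain Thursday

# Route 1: general formula, with P(A and B) = P(A) * P(B) under independence.
p_both = p_a * p_b
p_either_direct = p_a + p_b - p_both

# Route 2: one minus the chance of no rain on either night.
p_neither = (1 - p_a) * (1 - p_b)
p_either_complement = 1 - p_neither

# Both routes should agree, up to floating-point rounding.
assert math.isclose(p_either_direct, p_either_complement)
print(f"P(both) = {p_both:.0%}, P(neither) = {p_neither:.0%}, "
      f"P(either) = {p_either_direct:.0%}")
```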

Okay. So finally, we have two probabilities, 80% and 65%, based on two different and rather extreme assumptions. On the high side, 80% is the most extreme. We can see this by noting that in order to get a larger number, we'd have to plug in a negative probability for the value of P(A and B) in the general formula (which does not assume independence or anything else):
P(A or B) = P(A) + P(B) - P(A and B)

Since probabilities must always be at least 0 and at most 100%, we cannot have a negative number for P(A and B). So at most, the chance of rain Wednesday or Thursday is 80%.

But what about the least the chance might be? Independence seems a pretty extreme assumption in the other direction, but in fact it is not. What would lead to the smallest probability is if the two events A and B were highly related, in fact so related that B is certain to occur whenever A occurs (that is, the probability of B given A is 1). This would mean that P(A and B)=P(A)=30% (the smaller of P(A) and P(B)). This would lead to a probability that it rains either Wednesday or Thursday of just 50%:
P(A or B) = P(A) + P(B) - P(A and B) = 30% + 50% - 30% = 50%
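Putting the two extremes together answers the umbrella question. The sketch below is my own illustration, not an NWS method: `union_bounds` and `umbrella_decision` are hypothetical helper names. It uses only the facts that P(A and B) cannot be negative and cannot exceed the smaller of P(A) and P(B).

```python
def union_bounds(p_a, p_b):
    """Smallest and largest possible P(A or B) given only P(A) and P(B)."""
    # Largest overlap, min(P(A), P(B)), gives the smallest union: max(P(A), P(B)).
    low = max(p_a, p_b)
    # Smallest overlap gives the largest union, capped at 100%.
    high = min(1.0, p_a + p_b)
    return low, high

def umbrella_decision(p_a, p_b, threshold):
    """Decide only when the bounds settle the question either way."""
    low, high = union_bounds(p_a, p_b)
    if low >= threshold:
        return "bring the umbrella"
    if high < threshold:
        return "leave the umbrella"
    return "cannot be determined"

low, high = union_bounds(0.30, 0.50)
print(f"P(rain Wednesday or Thursday) is between {low:.0%} and {high:.0%}")
print(umbrella_decision(0.30, 0.50, threshold=0.75))
```

With a 75% threshold sitting inside the 50% to 80% range, the forecast alone cannot settle the question, which is why the answer was "cannot be determined with the information given."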

So now that we've got rain down, let me go back to the original impetus for this blog: it is easy to make the wrong inference when given information about the chances of a series of events.

The recent Freakonomics blog about chances of birth defects addresses this issue. In it, Steven Levitt describes a couple who was told that there was a 1 in 10 chance that a test, which showed an embryo was not viable, was wrong. The test was done twice on each of two embryos, and all four times the outcome was that the embryos were not viable. Thus, the lab told them that the chances of having healthy twins from two such embryos was 1 in 10,000. Of course, after reading about rain above, you recognize this as the application of the independence assumption (1/10 times 1/10 times 1/10 times 1/10 equals 1 in 10,000). The couple didn't listen to the lab though, and, nine months later, two very viable and very healthy babies were born.

Post hoc, it seems the lab should have (at least) said the chances were somewhere between 1 in 10 and 1 in 10,000. In addition, the 1 in 10 seems like an awfully round number--could it have been rounded down from some larger probability (1 in 8, 1 in 5, 1 in 3, who knows?). Levitt wonders whether the whole test is just nonsense in the first place.
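The two ends of that range can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning in this post, not the lab's actual method: the "linked" case assumes, hypothetically, that whatever makes one result wrong makes all four wrong together.

```python
from fractions import Fraction

p_wrong = Fraction(1, 10)  # chance a single "not viable" result is wrong (as quoted)
n_tests = 4                # two tests on each of two embryos

# Independence assumption: each wrong result is unrelated to the others.
p_all_wrong_independent = p_wrong ** n_tests

# Opposite extreme (my assumption): the errors share one cause, so if one
# result is wrong, all four are wrong.
p_all_wrong_linked = p_wrong

print(f"independent errors: 1 in {p_all_wrong_independent.denominator}")
print(f"fully linked errors: 1 in {p_all_wrong_linked.denominator}")
```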

So what do you do when confronted with a critical medical problem and a slew of probabilities? There's no easy answer, of course, but I believe gathering as much hard data as possible is important. Then make sure you distinguish between the inferences made by your doctor, nurse, or lab technician (which are more subject to error) and the underlying probabilities associated with the drug, the test, or the procedure (which are less subject to error).