## Friday, December 5, 2008

### Are we entering "unprecedented" territory?

The reports of gloom and doom are abounding, and, I must admit, I believe most of them.

I am going to focus purely on the stock market, because the data is readily available, and because I believe the broader economic problems are only just beginning. My March blog pointed out that in the stocks versus bonds 20-year view, stocks almost always won, but the results are much more mixed over shorter periods. I also need to point out that I overestimated the results for stocks by assuming dividends were not included in the indices. For the Dow indices, the subject of much of that discussion, dividends are included (see the Dow Jones site), so the graphs in that blog are correct, but the numbers should not be adjusted further for dividends, meaning that stocks' edge over bonds is less impressive.

Today's post, though, is really about the graph above, showing 1, 10, and 20 year returns on the Dow since 1928 (from December to December). From December 3, 2007 through December 1, 2008, the Dow lost 37% of its value. This horrible run is beaten only once, from December 1930 to December 1931, when the Dow lost 53%. The years 1930, 1937, and 1974 (again, December to December) were the only other years where the 12 month loss was more than 20%.

Thus, historically, though not unprecedented, the yearly drop in the Dow is, well, statistically "improbable" (that is, if you base your probabilities only on history). While the 10 and 20 year numbers are much more in line with history, they are still on the low end of the distribution. The last time the 10-year change was negative, as it is now, was 30 years ago, in 1978, in the waning years of very tough economic times.

The next few months will start to indicate how deep an economic hole we've dug for ourselves, but the stock market numbers are not encouraging, and the extent to which the economy is dependent on the market (in the sense that assets are tied to it) seems much more like the 20s and 30s than like the 70s. Let's hope I'm wrong.

## Friday, October 31, 2008

### Election Prediction Explained

So here's the explanation.

I am following 3 major websites now:

www.electoral-vote.com – This consolidates polls by state to predict the count. Electoral-vote apparently uses simple averaging to consolidate its data. I prefer this method because it requires little interpretation on their part. Interpretation involves assumptions about bias in the polls, and I believe it is hard to figure out the exact impact of the bias or even the direction. Electoral-vote has Obama at 364. At this time in 2004, they had Kerry at 283 (see this page), whereas his election day total was 252, with the main difference being Florida. More telling, the "strong" Obama States total 264 votes, as opposed to 95 for Kerry at this point.

www.fivethirtyeight.com – This consolidates polls by state to predict the count using some complex weighting system. It’s a neat idea but its end result is about the same as averaging, and I am not at all convinced it is better.

They’ve got Obama at 346.5, much more than the 270 needed to win.

www.gallup.com – This well-established survey company is different from the two above in that they actually conduct the polls.

My conclusion from the above: Obama will be the next president.

So why the change from before, when I said polls are difficult to trust and spoke of biases?

Three reasons:

1) the closer we get to the election, the better the correlation between intentions and actions

2) the closer we get to the election, the fewer undecided voters. A recent Reuters poll shows this at about 2%. Even if it is 5% and the undecided break 4 to 1 for McCain, he's going to lose.

3) The biases appear to lean in Obama’s favor: more younger voters are likely to turn out, and there are more early voters. Very biased reporting from Grandma in S.C. says that lots of young people were out voting early (she spent 2 hours on line to vote early, by the way).

## Thursday, October 30, 2008

### Election Prediction

Obama wins, with 401 Electoral votes.

I'll explain why tomorrow.

## Wednesday, October 8, 2008

### Election Polls

Election polls differ in at least four significant ways from actual voting.

First, polls are typically of around 1,000 people or less, which means that at best, they are statistically precise to within plus or minus three percent. This means that a six point difference between 2 candidates may be nothing more than sampling error (i.e., a statistical anomaly).
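As a rough check on the "plus or minus three percent" figure, here is a minimal sketch of the usual margin-of-error formula. It assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst case of a 50/50 split; real polls use weighting that this ignores. The function name and the doubling rule for a lead are illustrative assumptions, not something taken from any particular pollster.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for one candidate's share.

    Assumes a simple random sample of size n; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(1000)

# Rule of thumb (assumption): when two candidates split nearly all of the
# vote, the margin on the gap between them is about twice the margin on
# either candidate's share.
lead_moe = 2 * moe

print(f"Margin on one candidate's share: +/- {moe:.1%}")
print(f"Approximate margin on the lead:  +/- {lead_moe:.1%}")
```

With 1,000 respondents the margin on a single share comes out at about three points, so a lead of around six points sits right at the edge of what sampling error alone could produce.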

Second, polls tend to be of the general population and not of likely Electoral College votes, which is how the election is counted (but see electoral-vote.com for a count of Electoral votes, according to polls). As we know from recent elections, the Electoral vote percentages frequently (and seemingly increasingly) do not correspond to popular vote percentages.

Third, polls are snapshots of how people feel on a certain day. Americans seem to be particularly fickle in their opinions recently, perhaps due to the economic turmoil, so don't trust that today's lead won't disappear tomorrow.

Finally, many polls do not remove unlikely voters (though you do see some figures concerning "likely voters"). Polls of people who do not vote are fairly useless, but pollsters haven't been very successful in predicting who will actually vote. Thus, the tendency is to include respondents who are registered and say they plan to vote, without looking at their demographics to see what they've done in the past.

For all these reasons, if you're an Obama supporter, you should be worried and if you're a McCain supporter, you should have some hope. Either way, vote!

## Thursday, August 28, 2008

### The Atlantic Monthly is criminally misusing statistics

The article talks of the recent increase in violent crime in mid-sized cities. In many of these cities, government housing projects (called "Section 8" housing) have been torn down. In their place, the government has provided the poor with rent subsidies so that they can move to private housing. Rosin describes how Phyllis Betts and Richard Janikowski, of the University of Memphis, tie the increase in crime in these cities to the destruction of these projects. A striking quote in the article is from the Memphis police chief: '“It used to be the criminal element was more confined,” said Larry Godwin, the police chief. “Now it’s all spread out."'

The primary statistical evidence given in the article of an association between crime and former Section 8 residents is a map that shows areas with high incidents of crime correspond to areas with a large number of people with Section 8 subsidies (i.e., former residents of housing projects). As convincing as this might sound, it has a fatal flaw: the map looks at total incidents rather than crime rate. This means that an area with 10,000 people and 100 crimes (and 100 Section 8 subsidy recipients) will look much worse than an area with 100 people and 1 crime (and 1 Section 8 subsidy recipient). However, both areas have the same rate of crime, and, presumably, the same odds of being a victim of crime (see my earlier blog about the safest place to live for some explanation of the use of rates in measuring crime). Yet in Betts and Janikowski's analysis, the area with 10,000 people has a higher number of Section 8 subsidy recipients and higher crime, thus "proving" their theory of association.

Of course, there will be both a greater number of Section 8 subsidy recipients and a greater number of crimes in the area with 10,000 people than in the area with 100 people. Thus, while the map presented in the Atlantic article does indeed seem to indicate that there is higher crime in areas where there are more Section 8 subsidies, this differential might be entirely an artifact of population density, and, in fact, the crime rate may be completely unrelated to where Section 8 subsidy recipients reside. Without an adjustment for population density, the inferences made from the association are statistically meaningless.
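To make the counts-versus-rates point concrete, here is a small sketch using the two hypothetical areas from the example above. The area names and the per-1,000 scaling are my own choices for illustration; the numbers are the made-up ones from the text, not data from the article.

```python
# Hypothetical areas from the example above (not real data).
areas = {
    "large": {"population": 10_000, "crimes": 100, "section8": 100},
    "small": {"population": 100, "crimes": 1, "section8": 1},
}

def per_1000(count, population):
    """Convert a raw count into a rate per 1,000 residents."""
    return 1000 * count / population

for name, a in areas.items():
    crime_rate = per_1000(a["crimes"], a["population"])
    subsidy_rate = per_1000(a["section8"], a["population"])
    print(f"{name}: {a['crimes']} crimes, {a['section8']} subsidies, "
          f"{crime_rate:.0f} crimes and {subsidy_rate:.0f} subsidies per 1,000")

large_rate = per_1000(areas["large"]["crimes"], areas["large"]["population"])
small_rate = per_1000(areas["small"]["crimes"], areas["small"]["population"])
```

On a map of raw counts the large area looks a hundred times worse on both measures, yet once the counts are divided by population the two areas are identical.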

## Wednesday, July 2, 2008

### Statistics in Politics - Lies and Damn Lies

Brooks takes issue with Obama’s claim that his fundraising is from a broad base of small donors, and goes on to compare Obama money raised to McCain money raised by special interest group. I am not going to attack the actual dollar figures that Brooks gives. He cites no sources whatsoever, so that makes them hard to attack anyway. Instead, I am going to show how presenting raw numbers without proper context creates a biased picture.

Let’s take Brooks’ first claim: He says “lawyers account for the biggest chunk of Democratic donations” and have donated $18 million, as compared to $5 million for McCain. This sounds like 1) Obama is getting most (“biggest chunk”) of his donations from one big special interest group (lawyers) and 2) Obama is getting 3 times as much of his donations from this group as McCain.

Here’s the problem: Obama has outraised McCain by more than 2 to 1. According to CBS News, Obama’s total amount raised is $295.5 million compared to McCain’s $121.9 million. Thus, the $18 million raised from lawyers represents only 6% of the money raised. Still a lot of money, but it puts the “biggest chunk” in context. McCain’s $5 million raised from lawyers, on the other hand, represents 4% of the total money he raised. Thus, Obama *is* getting more as a percentage from lawyers, but instead of 18 to 5, or 3 times as much, it’s 6% to 4%, or 50% more.

Another issue is that there is a difference between individuals who are lawyers and public interest groups for lawyers. Brooks is trying to blur those lines by grouping all their donations together (to be fair, he does not say "special interest groups"). Sure, a lot of lawyers certainly support some of the public interest groups, but others do not. Also, these groups can be at odds with one another, so grouping all lawyers together gives you the bigger number but is inaccurate.
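The lawyer-donation comparison above is just two divisions, sketched here with the dollar figures quoted in the text (Brooks' lawyer totals and the CBS News campaign totals, in millions). The variable names are mine.

```python
# Figures quoted above, in millions of dollars.
obama_total, mccain_total = 295.5, 121.9
obama_lawyers, mccain_lawyers = 18.0, 5.0

obama_share = obama_lawyers / obama_total      # share of Obama's money from lawyers
mccain_share = mccain_lawyers / mccain_total   # share of McCain's money from lawyers

raw_ratio = obama_lawyers / mccain_lawyers     # comparison in absolute dollars
share_ratio = obama_share / mccain_share       # comparison as a share of each total

print(f"Obama: {obama_share:.1%} of funds, McCain: {mccain_share:.1%} of funds")
print(f"Dollar ratio: {raw_ratio:.1f}x, share ratio: {share_ratio:.2f}x")
```

The dollar ratio is well over 3 to 1, while the ratio of shares is only about 1.5 to 1, which is where the "50% more" figure comes from.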

Brooks goes on to compare several other groups of professions. In each of these areas, Obama receives more money in absolute dollars. However, in terms of percentage of total donations, McCain is usually receiving more: from financial securities workers, McCain gets 36% more as a percent of his total; from real estate workers, McCain gets 94% more; from bank workers, McCain gets 82% more; from hedge fund workers, McCain gets 29% more; from medical/health care workers, McCain gets 4% more.

There are two other areas (in addition to lawyers) where Obama is receiving more in percentage terms. The first is “communications and electronics”, where Obama is getting 106% more in percentage terms. The second is “Professors and other people who work in education.” In that area, Obama gets a whopping 4 times as much as McCain as a percentage of total funds raised. Brooks implies that these are "part of a spontaneous movement of small-money enthusiasts," but he doesn't support that with any evidence showing that these groups are anything more than an unorganized group of individuals--all the polls have indicated that more educated people lean toward Obama, so why wouldn't they give more?

The last thing that Brooks points out is that although, as Obama claims, 90% of his donors gave less than $200, only 45% of his donated money comes from such small donors. This is a good point, and Obama, who has been claiming this for a while, should be called to the mat on it.

However, it would be more interesting to look at the percent of small donors and money from small donors in McCain’s campaign as a comparison. You can bet that it’s less than 45% of donated money and less than 90% of donors. Yet, a comment on Brooks’ article by a New Republic blogger puts it into context, pointing out that “31 percent of Bush's money in 2004 came from donations of $200 or less (compared to 16 percent in 2000). Kerry, meanwhile, raised 37 percent...” (the blog cites this article on 2004 donations (by Joseph Graf) as its source). Thus, 45% is a lot, but the number has been increasing for both parties, with the most obvious reason that the Internet has allowed candidates to easily reach out to everyone, rather than raising most of their money through $1,000 a plate dinners and the like (campaign finance reform, which limits individual contributions, also had a role in bringing up the percentage raised through smaller donors).

The lesson here is, of course: “don’t believe the numbers.” David Brooks is going to make them look good for McCain--he’s a columnist, not a reporter--just as other columnists are going to make them look good for Obama.

## Thursday, June 12, 2008

### Let’s Make a Deal problem

Monty Hall, the host, allows you to choose one of three curtains. Behind one of the curtains is a new car or another big prize, while behind the other two is a year’s supply of shampoo or the equivalent. You choose Curtain 1. Monty opens Curtain 2 and shows you it has a year’s supply of the shampoo. Then he gives you a choice:

a) stick with your original decision, or

b) switch to Curtain 3.

If Monty:

1) always shows a curtain with the shampoo behind it;

2) never reveals the curtain you chose; and

3) randomly decides which of the remaining two curtains to reveal if the curtain you chose contains the car,

then switching to Curtain 3 gives you a 2/3 chance of winning while sticking to Curtain 1 gives you a 1/3 chance of winning.

*Why?*

This problem, like many probability problems, is one of information. Initially, you have no information about any of the three curtains so each choice gives you a 1/3 chance of winning. When Monty shows you the curtain with the shampoo, you learn nothing new about the curtain you originally chose, because there was no way, whether your curtain had the car or the shampoo, that Monty was going to show you what was behind your curtain. Your curtain had, and still has (as far as you know), a 1/3 chance of containing the car. However, you did get information about Curtain 3: Monty did not choose to reveal it. This could mean one of two things:

A. Curtain 3 has the car, and therefore Monty had to show you Curtain 2, as he would never reveal the curtain with the car (1/3 chance, calculated by taking the 1/3 chance that Curtain 3 has the car and multiplying by the 100% chance that he reveals Curtain 2 when the car is behind Curtain 3); or

B. Curtain 1 has the car, and Monty chose to reveal Curtain 2 (1/6th chance, calculated by taking the 1/3 chance that curtain 1 has the car and multiplying by the ½ chance that Monty reveals curtain Number 2 when the car is behind Curtain 1).

These probabilities do not sum to 1, because we are excluding the outcomes, now impossible, where Monty reveals Curtain Number 3. In order to revise the probabilities to take into account what was revealed by Monty, we need to divide the probabilities in A (1/3) and B (1/6) above by the chances of the two possible remaining outcomes (1/3 plus 1/6 = 1/2). Thus, outcome A (car is behind Curtain 3) has a probability of (1/3) / (½)= 2/3, while outcome B (car is behind Curtain 1) has a probability of (1/6)/(1/2) = 1/3.

The intuition is as follows: Monty always reveals Curtain 2 or 3 when you choose 1, so you do not get any more information about whether it is behind 1 by this revelation, but you do gain information about 2 and 3 from this revelation, since he never reveals 3 if the car is behind it but does sometimes reveal 3 if the car is not behind it. Thus, the fact that Monty did not reveal Curtain 3 tells you something.
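If the argument still feels slippery, a quick simulation can stand in for it. This is a minimal sketch that assumes the three rules listed earlier (Monty always shows shampoo, never opens your curtain, and chooses at random when he has a choice) and that you always start with Curtain 1. The function name, trial count, and seed are arbitrary choices of mine.

```python
import random

def simulate(trials=200_000, seed=0):
    """Play many games in which you pick Curtain 1.

    Returns (share of games where Monty opens Curtain 2,
             share of those games where the car is behind Curtain 3).
    """
    rng = random.Random(seed)
    opened_2 = 0
    car_at_3 = 0
    for _ in range(trials):
        car = rng.choice([1, 2, 3])
        # Monty may only open Curtain 2 or 3, and never the one with the car.
        options = [c for c in (2, 3) if c != car]
        opened = rng.choice(options)  # random when both are available
        if opened == 2:
            opened_2 += 1
            if car == 3:
                car_at_3 += 1
    return opened_2 / trials, car_at_3 / opened_2

p_reveal_2, p_switch_wins = simulate()
print(f"Monty opens Curtain 2 in about {p_reveal_2:.1%} of games")
print(f"Switching to Curtain 3 then wins about {p_switch_wins:.1%} of the time")
```

Only the games where Monty actually opens Curtain 2 are counted, which mirrors the step of dividing by 1/2 in the calculation above. The second number should land close to 2/3.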

[Note: this problem has been around for a while, but was made famous by Marilyn vos Savant’s discussion of it and the subsequent outcry by those who insisted her answer, the correct one, was wrong. See, for example: http://www.letsmakeadeal.com/problem.htm]

*Technical Explanation*

There is a whole class of problems in probability that involve updating the chances based on new information. These problems are solved using Bayes’ Rule, a law in probability that specifies how to update probabilities with new information (for a full discussion, including discussion of whether the Reverend Bayes was actually the first to discover this theorem, see the Wikipedia entry: http://en.wikipedia.org/wiki/Bayes'_theorem).

To understand Bayes’ Rule, we need to first know the notation used for conditional probability. We use the vertical line ( | ) to denote a condition and, as in prior blogs, P(A) is the probability that event A occurs. Thus, P(A|R2) is the probability that A occurs, given that R2 already occurred. Bayes' Rule is:

P(A|R2) = P(R2|A)*P(A) / P(R2)

So let:

A=event that prize is under Curtain 3

R2= event that Monty reveals the curtain 2 contents

C=event that prize is under Curtain 1

Now we can figure out the right side of the Bayes’ Rule equation, in order to figure out P(A| R2).

We know P(R2|A) = 1, because Monty won’t reveal curtain 3 when it contains the prize and he won’t reveal curtain 1 because you chose curtain 1.

P(A) = P(C) = 1/3 ==> remember, this one is unconditional, so given three curtains, there’s a 1/3 chance of the prize being behind each.

To figure out P(R2), it is useful to note that for any events R2 and A, P(R2 and A) = P(A) * P(R2|A)

In our case, the P(R2) is the sum of the probabilities of 2 exclusive events:

1) prize is under curtain 3 (event A) and Monty reveals curtain 2 (event R2): 1/3 * 1=1/3

2) prize is under curtain 1 (event C) and Monty reveals curtain 2 (event R2) 1/3*1/2 = 1/6.

This sum, 1/3 plus 1/6 is ½=P(R2).

Thus, by Bayes’ Rule, P(A| R2) = (1*1/3) / ½ = 2/3

Just for fun, now you can compute P(C|R2) = P(prize is under Curtain 1 given that Curtain 2 is revealed) = 1/3 using Bayes’ Rule.
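The same Bayes' Rule arithmetic can be done with exact fractions, which also answers the "just for fun" exercise. This is a sketch using the events defined above (A: prize under Curtain 3, C: prize under Curtain 1, R2: Monty reveals Curtain 2); the variable names are mine, and I add the case where the prize is under Curtain 2 only to show it contributes nothing.

```python
from fractions import Fraction

prior = Fraction(1, 3)            # P(A) = P(C) = P(prize under Curtain 2)

p_r2_given_a = Fraction(1)        # prize under 3: Monty must reveal 2
p_r2_given_c = Fraction(1, 2)     # prize under 1: Monty picks 2 or 3 at random
p_r2_given_two = Fraction(0)      # prize under 2: Monty never reveals it

# P(R2) is the sum over the mutually exclusive places the prize can be.
p_r2 = (p_r2_given_a + p_r2_given_c + p_r2_given_two) * prior

p_a_given_r2 = p_r2_given_a * prior / p_r2   # Bayes' Rule for Curtain 3
p_c_given_r2 = p_r2_given_c * prior / p_r2   # Bayes' Rule for Curtain 1

print(f"P(R2) = {p_r2}")
print(f"P(A|R2) = {p_a_given_r2}, P(C|R2) = {p_c_given_r2}")
```

Using `Fraction` rather than floating point keeps the results as exact thirds and halves, so they can be compared directly with the hand calculation.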

*False Positives in Cancer Diagnoses*

The outcome of Bayes’ Rule can be very confusing, and it matters in problems far more serious than the Let’s Make a Deal problem. For example, suppose an MRI for breast cancer has a false negative rate of 1/100, meaning that 1 time in 100 the test will incorrectly indicate that you do not have cancer when in fact you do. Similarly, the test might also have a false positive rate of 1 in 100, meaning that 1 time in 100 the test will incorrectly indicate that you do have cancer when in fact you do not (false positive rates for MRIs over time can be much higher, because they are frequently done once or twice a year: see the recent article about a study of false positives in MRIs for breast cancer screening, which were around 25% over time).

Suppose your MRI result just came out positive for breast cancer. What are the chances you actually have breast cancer?

First, it’s useful to know that around 250,000 women a year get breast cancer (see this site) and there are about 60 million women above the age of 40 (see census site), when most cases occur. This represents an annual incidence rate of roughly 1 in 240, which I round to 1 in 200 below.

P(C) = Probability of breast cancer in a given year = 1/200 = 0.005

P(D| not C) = Probability that MRI diagnosed cancer given that you do NOT have cancer = false positive = 1/100 =0.01

P(N|C) = Probability that MRI did not diagnose cancer given that you have cancer = false negative = 1/100 = 0.01

P(D|C) = 1-P(N|C) = Probability that MRI diagnosed cancer given that you have cancer = 99/100 =0.99

We want P(C|D) = Probability of cancer, given a cancer diagnosis by MRI.

Before using Bayes’ Rule, we can first define P(D) as the sum of the probabilities of all exclusive events that include D. In English, the chance of diagnosis is the sum of 1) the chance that you have cancer and are diagnosed and 2) the chance that you do not have cancer and are diagnosed. Thus, P(D) = P(D|C)*P(C) + P(D|not C)* P(not C) = 0.99*0.005 + 0.01* .995 =.0149

Using Bayes’ Rule:

P(have cancer given the MRI result shows cancer) = P(C|D) = P(D|C)*P(C)/ P(D) = 0.99 * .005 / .0149 = .33 or about 1/3.

Thus, a very effective MRI test for cancer, which gives the wrong result only 1% of the time, is still suspect when it gives a result of cancer. In fact, an MRI diagnosis of cancer indicates only a 1/3 probability of actually having cancer (keep in mind that the error rates I used here for the MRI are made up, though there are indications that false positives are at least in the 1% range).

It’s easy to understand what happens logically when you imagine that 200 women come in for screening. Only 1 will probably have cancer, since the cancer rate is about 1 in 200. The MRI will almost surely diagnose her (99% chance). For the other 199, the MRI will indicate no cancer for all but about 1%, which means it will indicate cancer for about 2 of them. Thus, of the 3 cases where the MRI indicates cancer, 2 of them will be false indications.
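The calculation above is short enough to wrap in a function, which makes it easy to see how the answer moves when the inputs change. This is a sketch using the made-up 1% error rates and the 1-in-200 prior from the text; the function and parameter names are my own.

```python
def p_cancer_given_positive(prior, false_pos, false_neg):
    """Bayes' Rule: P(C|D) = P(D|C)*P(C) / P(D)."""
    p_d_given_c = 1 - false_neg                               # P(D|C)
    p_d = p_d_given_c * prior + false_pos * (1 - prior)       # P(D)
    return p_d_given_c * prior / p_d

# Values used in the text (the error rates are illustrative, not measured).
posterior = p_cancer_given_positive(prior=0.005, false_pos=0.01, false_neg=0.01)
print(f"P(cancer | positive MRI) = {posterior:.2f}")

# A higher false positive rate pushes the answer down further.
worse_test = p_cancer_given_positive(prior=0.005, false_pos=0.05, false_neg=0.01)
print(f"With a 5% false positive rate: {worse_test:.2f}")
```

The second call is only there to show the direction of the effect: because healthy women vastly outnumber those with cancer, even a modest rise in the false positive rate swamps the true positives.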

## Thursday, May 8, 2008

### Why are there too many boys in China?

It's clear to most that the combination of the one child law, preventing most Chinese couples from having more than one child, and the preference in China for boys, is driving this (though there are other explanations, including the possibilities of different effects of some diseases: see this BusinessWeek article).

There are two sinister mechanisms for ensuring that your only child is a boy: selective abortion or infanticide. Yet there is another option: just have another baby if the first is a girl, and don't tell the government. I think this third option is more likely, because I do not think most families can afford an abortion (illegal for sex selection) and very few mothers would kill their babies.

So how much does this non-reporting need to happen to change the ratio from the normal 106 to 100 male to female births to the abnormal 120 to 100?

The answer to this is the combination of three things: 1) percent of births that are girls (with no intervention), 2) percent of families that have another baby (hoping for a boy), given the first is a girl, and 3) the percent of families that do not tell the government about the first baby.

Let's call these percentages Pg, P2, and Ps (for girl, 2nd child, and secret). Let's also call Pr the reported percent of girls, which is 100/220, or 45.45%. We'll assume also for simplicity that families quit trying when they have a boy or have 2 children, whichever comes first. Also, we'll assume families always report the first child if it is a boy or if they have no more children.

Pg is known at around 100/206=48.54%

P2 and Ps are unknown.

We want to figure out which values of P2 and Ps could lead to Pr being 45.45% when Pg is 48.54%.

First, consider that, given the ground rules above, the following are the types of families that can exist (in birth order):

B (boy, one child only)

G (girl, one child only)

GB (girl boy, two children)

GG (girl girl, two children)

To figure out the percent of girls reported, we need the total girls reported divided by the total children reported. This is easy to figure out for each combination above:

B = 0 girls / 1 child

G = 1 Girl / 1 child

GB = 0 girls / 1 child (Ps percent of the time) and 1 girl / 2 children (1-Ps percent of the time)

GG = 1 girl / 1 child (Ps percent of the time) and 2 girls / 2 children (1-Ps percent of the time)

We are almost there. Now we just need to sum the numerators multiplied by their probabilities and the denominators multiplied by their probabilities. Here are the probabilities of each family combination:

B = 1-Pg

G = Pg*(1 - P2) ==> It's just Pg times the percent of families who do not have more children

GB = Pg*(P2)*(1-Pg) ==> It's the chances of a girl, followed by the decision to have a 2nd, followed by having a boy.

GG = Pg*(P2)*Pg=Pg^2*P2

Thus the numerator (number of girls reported on average) is:

Num = (1-Pg)*0 +

Pg*(1-P2)*1 +

Pg*P2*(1-Pg)*Ps*0 +

Pg*P2*(1-Pg)*(1-Ps)*1 +

Pg^2*P2*Ps*1 +

Pg^2*P2*(1-Ps)*2

and the denominator (number of children reported on average):

Den = (1-Pg)*1 +

Pg*(1-P2)*1 +

Pg*P2*(1-Pg)*Ps*1 +

Pg*P2*(1-Pg)*(1-Ps)*2 +

Pg^2*P2*Ps*1 +

Pg^2*P2*(1-Ps)*2

We know that, in China, Pr= Num/Den = 45.45% and that, in general, Pg=48.54%. Thus, we can solve .4545=Num/Den in terms of Ps and P2.

Since we have 1 equation and 2 unknowns, there are an infinite number of solutions, but here are a few possibilities:

0% have a second child --impossible

10% have a second child -- impossible

15% have a second child and 85% of those keep the first a secret from the government

20% have a second child and 65% of those keep the first a secret from the government

30% have a second child and 45% of those keep the first a secret from the government

40% have a second child and 35% of those keep the first a secret from the government

50% have a second child and 30% of those keep the first a secret from the government
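The possibilities listed above can be reproduced with a short script. This is a sketch that builds Num and Den term by term as written above, then uses the fact that Num - Pr*Den is linear in Ps to solve for the secrecy rate directly. The function names are mine, and the solving shortcut is my own choice rather than necessarily how the original figures were produced.

```python
PG = 100 / 206   # natural share of girls at birth
PR = 100 / 220   # reported share of girls in China

def girls_and_children(pg, p2, ps):
    """Expected reported girls (Num) and reported children (Den) per family."""
    b = 1 - pg                  # family B
    g = pg * (1 - p2)           # family G
    gb = pg * p2 * (1 - pg)     # family GB
    gg = pg * p2 * pg           # family GG
    num = g * 1 + gb * (1 - ps) * 1 + gg * ps * 1 + gg * (1 - ps) * 2
    den = (b * 1 + g * 1
           + gb * ps * 1 + gb * (1 - ps) * 2
           + gg * ps * 1 + gg * (1 - ps) * 2)
    return num, den

def reported_girl_share(pg, p2, ps):
    num, den = girls_and_children(pg, p2, ps)
    return num / den

def secrecy_needed(p2, pg=PG, pr=PR):
    """Ps that makes Num/Den equal pr, or None if no Ps in [0, 1] works."""
    def gap(ps):
        num, den = girls_and_children(pg, p2, ps)
        return num - pr * den
    f0, f1 = gap(0.0), gap(1.0)
    if f0 == f1:                # Ps has no effect (nobody has a second child)
        return None
    ps = f0 / (f0 - f1)         # gap() is linear in ps, so this is its root
    return ps if 0.0 <= ps <= 1.0 else None

for p2 in (0.0, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50):
    ps = secrecy_needed(p2)
    result = "impossible" if ps is None else f"{ps:.0%} keep the first a secret"
    print(f"{p2:.0%} have a second child: {result}")
```

The computed secrecy rates come out close to the rounded figures in the list, and setting Ps to 0 returns exactly Pg regardless of P2.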

One thing to note (that is not necessarily obvious in these calculations) is that if everyone reports all the children they have (Ps=0), then the percent of girls will be exactly 48.54%, the same as if everyone had one child, as long as infanticide and selective abortion are not occurring.

But the main point here is that a small number (15%) of couples having second children and not reporting the first girl leads to the warped percentages of baby girls, if there is high under-reporting of these first children. You do not need to assume that infanticide or selective abortion plays a role at all.

## Tuesday, April 15, 2008

### What's the chance of rain?

When it comes to data, I've always felt more is more, and so if I am really interested in the weather, I go to the NWS site where I'll get more than just "chance of rain" or "a few showers."

But what does it mean when the weather forecast says there is a 30% chance of rain Wednesday and a 50% chance of rain Thursday? If we focus on a single time period, say Thursday, the conclusion is pretty clear: there's a 50-50 chance of rain. Put another way, when encountering conditions like this in the past, the NWS model data shows rain half the time and no rain half the time.

The inference becomes more difficult when we want to ask a more complex question. For example, suppose I'm going to Chicago Wednesday and returning Thursday night. I want to know whether to bring an umbrella. Since I hate lugging an umbrella along, I only want to bring one if there is at least a 75% chance of rain at some point while I'm there.

It turns out that the answer to this question cannot be determined with the information given (don't you just love that choice on multiple choice tests?).

Before we explain why, though, we need some definitions and notation.

To do the math for this, we generally define each possible outcome as an *event*. In this case, we have the following events:

Event A: Rains Wednesday

Event B: Rains Thursday

We are interested in the chance that either Event A or Event B occurs. We have a shorthand for expressing the probability of Event A: "P(A)".

There is a simple probability formula that is very useful here:

P(A or B) = P(A) + P(B) - P(A and B)

This formula says that the probability of Event A or Event B happening is the probability of A plus the probability of B minus the probability that A and B both happened (the event that A and B occurred is called the *intersection* of Events A and B). This makes sense because if we just added them (as you might intuitively do) we are double counting the times both events occur, and thus we need to subtract out the intersection once at the end.

In some cases P(A and B)=0. In other words, Events A and B never occur together. You may have noticed this comes up when you toss a coin: it is never both heads and tails at the same time (except for that time I got the coin to stand on its side). Events like these are called mutually exclusive.

In other cases P(A and B)=P(A)*P(B). This means the probability of A and B is the product of the probabilities of A and B. In this case, the two events can both occur, but they have nothing to do with each other. Events like these are called independent events.

In still other cases, P(A and B) is neither 0 nor P(A)*P(B).

If we assume the events A and B are mutually exclusive, then there's an 80% chance (50+30) of rain either Wednesday or Thursday. This seems unlikely though, because most storms could last more than an evening.

If we assume the events A and B are independent, then there's a 65% chance of rain either Wednesday or Thursday. This is a little more complicated to calculate, because we need to figure out the chances of it raining both Wednesday and Thursday, which we assume are independent, so that P(A and B)=P(A)*P(B)=30%*50%=15%. Thus:

P(A or B)=P(A) + P(B) - P(A and B)=50% + 30% -15%=65%.

[We could also figure out the chances of not raining either night. Since the chance of rain is independent, the chance of no rain is also independent. Also, the chance of rain plus the chance of no rain must be one. Thus P(no rain Wednesday)=1-P(A)=100%-30%=70%. Similarly, P(not B)=100%-50%=50%. Then, the chance of no rain either time period = P(not A and not B)=70%*50%=35%. Thus, there is a 35% chance it will not rain either night, and we can conclude there would be a 65% chance of rain one of the nights, of course all hinging on the independence assumption].

Okay. So finally, we have two probabilities, 80% and 65%, based on two different and rather extreme assumptions. On the high side, 80% is the most extreme. We can see this by seeing that in order to get a larger number, we'd have to plug in a negative probability for the value of P(A and B) in the general formula (which does not assume independence or anything else):

P(A or B) = P(A) + P(B) - P(A and B)

Since probabilities must always be at least 0 and at most 100%, we cannot have a negative number for P(A and B). So at most, the chance of rain Wednesday or Thursday is 80%.

But what is the smallest the chance might be? Independence seems a pretty extreme assumption in the other direction, but in fact it is not. What would lead to the smallest probability is if the two events A and B were highly related, in fact so related that B always occurs if A occurs. This would mean that P(A and B)=30% (the smaller of P(A) and P(B)). This would lead to a probability that it rains either Wednesday or Thursday of just 50%:

P(A or B) = P(A) + P(B) - P(A and B) = 30% + 50% - 30% = 50%
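The three scenarios worked through above all use the same formula with a different value plugged in for P(A and B). A minimal sketch, with function and variable names of my own choosing:

```python
def p_either(p_a, p_b, p_both):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_both

p_wed, p_thu = 0.30, 0.50

# Mutually exclusive: it never rains on both days.
upper = p_either(p_wed, p_thu, 0.0)

# Independent: rain on one day says nothing about the other.
independent = p_either(p_wed, p_thu, p_wed * p_thu)

# Fully dependent: whenever it rains Wednesday it also rains Thursday.
lower = p_either(p_wed, p_thu, min(p_wed, p_thu))

print(f"Chance of rain on the trip is between {lower:.0%} and {upper:.0%}")
print(f"Under independence it is {independent:.0%}")

needs_umbrella = lower >= 0.75   # the 75% threshold from the umbrella question
```

Since the 75% threshold falls inside the range from the lowest to the highest value, the forecast alone cannot settle the umbrella question, which is the "cannot be determined" answer from earlier.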

So now that we've got rain down, let me go back to the original impetus for this blog: it is easy to make the wrong inference when given information about the chances of a series of events.

The recent Freakonomics blog about chances of birth defects addresses this issue. In it, Steven Levitt describes a couple who was told that there was a 1 in 10 chance that a test, which showed an embryo was not viable, was wrong. The test was done twice on each of two embryos, and all four times the outcome was that the embryos were not viable. Thus, the lab told them that the chances of having healthy twins from two such embryos was 1 in 10,000. Of course, after reading about rain above, you recognize this as the application of the independence assumption (1/10 times 1/10 times 1/10 times 1/10 equals 1 in 10,000). The couple didn't listen to the lab though, and, nine months later, 2 very viable and very healthy babies were born.

Post hoc, it seems the lab should have (at least) said the chances were somewhere between 1 in 10 and 1 in 10,000. In addition, the 1 in 10 seems like an awfully round number--could it have been rounded down from some larger probability (1 in 8, 1 in 5, 1 in 3, who knows?). Levitt wonders whether the whole test is just nonsense in the first place.
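The range from 1 in 10 to 1 in 10,000 comes from the same two extremes as in the rain example. Here is a sketch of that arithmetic, assuming (as the lab's figure implies) four test results each with a 1 in 10 chance of being wrong; the "fully dependent" case is my own label for the situation where all four errors share a single cause.

```python
from fractions import Fraction

p_wrong = Fraction(1, 10)   # chance that any one test result is wrong
num_tests = 4               # two tests on each of two embryos

# Independent errors: all four must happen to be wrong separately.
all_wrong_independent = p_wrong ** num_tests

# Fully dependent errors: if one result is wrong, they all are.
all_wrong_dependent = p_wrong

print(f"Independent errors: {all_wrong_independent}")
print(f"Fully dependent errors: {all_wrong_dependent}")
```

The truth presumably lies somewhere between the two, depending on how much the four results share a common source of error.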

So what do you do when confronted with a critical medical problem and a slew of probabilities? There's no easy answer, of course, but I believe gathering as much hard data as possible is important. Then make sure you distinguish between the inferences made by your doctor, nurse, or lab technician (which are more subject to error) and the underlying probabilities associated with the drug, the test, or the procedure (which are less subject to error).

## Wednesday, March 19, 2008

### Stocks or Bonds?

If you look at data from 1929-2007 for the Dow Jones Industrial Average (DJIA), a 20 year investment in stocks yielded an inflation-adjusted return of about 2.6% annually. This is before taking dividends into account, which add at least a couple percent (recently, the dividend yield has been closer to 2% while in the past 4-5% was the norm, see this article for relevant charts). The net return for stocks, after inflation, is around 6% annually.

For bonds, the returns also generally beat inflation, but are not as good. Their average is around 2% annually, after subtracting out inflation.

The following chart shows inflation-adjusted DJIA returns for 10 and 20 year investments (in blue and purple, downloaded from Yahoo Finance), versus Treasury bonds (in yellow). The US Treasury bond return is based on 20-year bonds when available from the Fed, and on 10-year bonds or estimates from the article cited above when 20-year returns were not available.

The year shown is the final year of the investment. Thus, if you made a 20 year investment beginning in 1985, you can look at the points corresponding to 2005 to find out that you would have earned approximately 10% annually, whether it was in bonds (yellow) or stocks (purple for 20 year) by the start of 2005, and that is after inflation.


The yellow line, denoting Treasury bond returns, is mostly below the blue and purple lines, indicating that, for most years, a 10 or 20 year investment in the Dow Jones Index is better than a similar investment in Treasury bonds. In fact, for 57 of the 68 ten-year investments, stocks do better. For 20 year investments, the numbers are even more promising for stocks, which perform better for 56 out of 57 20-year investments.

Treasury bonds, on the other hand, are considered risk-free. The idea is that there is no risk that the U.S. Treasury will default on its loan (or, at least, if it does default, there are far more serious problems to worry about). With stocks, by contrast, there are no guarantees that they will not go down.

This basic idea, that stocks have broadly out-performed bonds and beaten inflation in the long run, seems to be well-understood. This fact, however, does not imply that individual stock investments will outperform bonds, or that a shorter term investment will outperform.

The "truth" about stocks that is implied by this graph, however, is a little deceiving, for three reasons.

**1. The graph tells you little about what might happen to a 20-year investment beginning in a particular year**

The graph shows that for most time periods, stocks do well, but there is an enormous amount of variation, even for the long-term investments considered here. If you happened to make some long-term investments in 1962, at the end of a recession and the beginning of a long boom, you'd still be out of luck if you needed that money in 1982, when its real value would have been far less than when you invested it (the pink point corresponding to 1982 shows a 1% annual loss for the 20 prior years, amounting to about an 18% total loss in real dollars).

[The "safe" Treasury bond portfolio, however, would have done far worse, losing about twice as much over the same period. This is not because the US defaulted on its bonds. Instead, it's because of inflation. The 20-year bond you bought in 1962, yielding 4%, did not keep up with inflation. This is the long-term, somewhat hidden risk for bonds.]
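The step from a 1% annual loss to an 18% total loss is just compounding. A minimal sketch in Python (the function name `total_change` is mine):

```python
def total_change(annual_rate, years):
    """Cumulative change from compounding a constant annual rate."""
    return (1 + annual_rate) ** years - 1

loss = total_change(-0.01, 20)
print(f"{loss:.1%}")
```

The same function works for gains: plug in a positive real return to see what a 20-year holding period does to it.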

**2. The graph averages over a portfolio of stocks**

The other issue is that despite the graph's very clear implication that stocks are better, even in bad times, this does not necessarily imply that individual stocks are better. Returns on an individual stock, or even on a small portfolio of stocks, vary much more wildly than the Dow Jones average shown in the graph. Also, in recent years, the Treasury has issued inflation-indexed bonds, which guarantee a *real* return above 0, thus ensuring the yellow line in a future graph will be more than 0 (see information on inflation-indexed bonds).

**3. The future is not now**

While it seems convincing that 80 years of history show stocks in a very positive light, we only need to look at Japan, whose Nikkei average, since 1985, has lost about 15% adjusting for inflation (the loss is far more than 50% for an investment made near Japan's stock market peak). There are some indications that the current U.S. problems (real estate boom and bust, credit problems) are worse than Japan's. Much of U.S. investment in the last few decades has been fueled by the safety of the dollar. The rise of the Euro and of globalization has already begun to change that, and a continuing fall in the dollar will almost certainly cause inflation, which was devastating to the stock market in the 1970s.

**Bottom Line**

There's a lot of evidence that, in the past, a long-term stock investment paid off, relative to both risk-free bonds and inflation. However, there is no guarantee this party will continue, especially if there is a sea change in dollar investments.

Where's my retirement money? Almost all in stocks...but almost all foreign stocks.

## Sunday, February 24, 2008

### 7 letter words and 8 card suits

As all Scrabble players know, if you use all 7 of your letters, you get a bonus. However, except for the first turn, you'd need an 8-letter word to achieve this--ETAERIO would not do. Thus, I am going to try to answer the question: "What are the chances of getting the letters ETAERIO on your initial turn in Scrabble (you also have to hope you are first)?" I have no hope of finding out whether this is the most likely seven letter word, because I can't automatically check all letter combinations, but I will try to give some guidance there as well.

To find the chances of getting ETAERIO, we need the number of combinations that produce these letters divided by the total number of combinations. In other words, we have to go back to 12th-grade math, where we all learned (or sort-of learned) permutations and combinations.

There are 100 tiles in Scrabble and we are choosing 7. Thus, there are 100 ways to choose the first tile, 99 ways to choose the second, and so forth down to 94. If we chose them in order, we'd have 100*99*98*97*96*95*94 permutations. However, we don't care about the order, so we have to take the above product and divide by the number of ways we can permute the 7 tiles, which is 7*6*5*4*3*2*1. The shorthand way to express this number of combinations is "100 choose 7" or

$$\binom{100}{7} = \frac{100 \cdot 99 \cdot 98 \cdot 97 \cdot 96 \cdot 95 \cdot 94}{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}$$

By dividing the first product above by the second ((100*99*98*97*96*95*94) / (7*6*5*4*3*2*1)), we come up with 16,007,560,800. Since most letters appear multiple times, the number of possible letter combinations is far less, and to know the chances of getting ETAERIO, I need to know how many times each letter appears.

Thus, I found the letter distributions on Wikipedia (counting our own Scrabble pieces would probably not do with a three-year old distributing them around the house). The most common ones are as follows:

E - 12 tiles

A, I - 9 tiles

O - 8 tiles

N, R, T - 6 tiles

D, L, S, and U - 4 tiles

other letters - 3 or fewer, but not relevant here

To figure out the chances of getting ETAERIO, we need to know the number of combinations that produce it. We need 2 E's, 1 T, 1 A, 1 R, 1 I, and 1 O. It turns out that the number of ways is the product of each of these implied combinations. Thus, it is "12 choose 2" (E's) times "6 choose 1" (T) times "9 choose 1" (A) and so forth. This comes out to 1,539,648 ways to get the letters in ETAERIO. If we divide this by the total number of combinations (16,007,560,800), we find that there is about a 1 in 10,000 chance of getting ETAERIO as your first 7 letters. Of course, from there, you have to know it is a word and figure out that you can make that word from those letters, since they are not likely to appear in that order.

I could not find another word with a higher probability, but I did find TREASON and TRAINED (both about 1 in 20,000). However, it's clear from the distribution of letter tiles that in order to find a word that beats ETAERIO, you can only use letters appearing in 6 or more tiles.
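The counting above is easy to check with Python's `math.comb`. This is a sketch under the same simplification used in the text: the tile counts are the ones listed above, the two blank tiles are ignored as possible stand-ins for letters, and the helper name `ways_to_draw` is mine.

```python
from collections import Counter
from math import comb, prod

# Tile counts quoted above (100 tiles in the full set).
TILES = {"E": 12, "A": 9, "I": 9, "O": 8, "N": 6, "R": 6, "T": 6,
         "D": 4, "L": 4, "S": 4, "U": 4}

TOTAL_RACKS = comb(100, 7)  # all ways to draw 7 tiles from 100

def ways_to_draw(word):
    """Number of 7-tile draws that give exactly the letters of `word`."""
    needed = Counter(word)  # e.g. ETAERIO needs 2 E's and 1 each of T, A, R, I, O
    return prod(comb(TILES[letter], count) for letter, count in needed.items())

for word in ("ETAERIO", "TREASON", "TRAINED"):
    ways = ways_to_draw(word)
    print(f"{word}: {ways:,} ways, about 1 in {TOTAL_RACKS / ways:,.0f}")
```

Swapping in other words only requires that every letter appear in `TILES`; add the rarer letters to the dictionary to test words that use them.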

**BRIDGE HANDS**

Now that we all remember the mechanics of combinations (or at least, we are on the subject of them), let's investigate another oft-asked question around here: what's the chance of being dealt a 7 card suit in bridge? This would be 4 (number of suits) times "13 choose 7" (ways to choose 7 from a suit) times "39 choose 6" (ways to choose the other 6 from the other 3 suits) divided by "52 choose 13" (ways to choose 13 cards from 52). This comes out to about 3.5%, or 3 or 4 times in every 100 hands.

For an 8-card suit, it is 1 in about 200. For a 9-card suit, it is about 1 in 2,700. Of course, my kids are always asking about the chances of being dealt a 10 card suit or even a 13-card suit:

10-card suit: 1 in 60,738

11-card suit: 1 in 2,746,693 (less than 1 in a million)

12-card suit: 1 in 313,123,057 (less than 1 in 300 million)

13-card suit: 1 in 158,753,389,900 (less than 1 in 150 billion)

The chances aren't too great, but with some really poor shuffling, they've managed the 13-card suit once or twice.
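For anyone who wants to check the bridge numbers, here is a short Python sketch of the same formula. The function name is mine, and the formula as written is only valid for suits of 7 or more cards, since a 13-card hand cannot hold two of them.

```python
from math import comb

def odds_of_long_suit(length):
    """Return N such that the chance of a suit of exactly `length` cards is 1 in N.

    Valid for length >= 7, where a 13-card hand can hold only one such suit.
    """
    # 4 suits, times ways to pick the long suit, times ways to fill the rest of the hand.
    hands_with_suit = 4 * comb(13, length) * comb(39, 13 - length)
    return comb(52, 13) / hands_with_suit

for length in range(7, 14):
    print(f"{length}-card suit: 1 in {odds_of_long_suit(length):,.0f}")
```

For shorter suits the same hand can contain two or three suits of the given length, so those cases would need an extra correction for double counting.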

## Tuesday, February 12, 2008

### Rent or Buy?

**Housing as an Investment**

Forgetting for the moment about the psychological advantages and disadvantages of buying versus renting, let's look at the economics, and, of course, the probabilities.

One of the most fascinating pieces of information to answer this question is a chart put together by Robert Shiller (Yale economist) as part of his book Irrational Exuberance (he also has an article on the housing market with similar charts in Economists' Voice, March, 2006). Shiller looks at inflation-adjusted housing prices from 1890 to the present. I got the chart from this site, and the whole article is free and downloadable here (search on Shiller).

Shiller sets the price of a house in 1890 at 100, and shows how the value varies over time, adjusting for inflation. Thus, in 1947, soon after the war, the value is 110, 10% higher than in 1890. In 1989, the peak of the last boom, it is around 125. And now? Around 200!

Besides the obvious "irrational exuberance" of the housing market that is indicated in this graph, another interesting fact comes out: Housing goes up and down over time and in any given 20 or 30 year period, can be either a good investment or a bad one. Sure, if you timed the last couple of booms correctly, you could have made a killing, but the fact is that a house bought in 1960 was basically the same price in 1995, after accounting for inflation. Of course, the person who bought that house could have lived there for 35 years, paying only the cost of upkeep, and, presumably the mortgage.

**Total Return to Renting Versus Buying**

That brings us to the next topic: knowing that a house may or may not give you any real capital appreciation, is it better to buy or rent?

One of the big arguments I hear from my mother-in-law against renting is that "you are just throwing your money away." Seems like a good point. With renting, you get absolutely nothing out of it, but with buying, after 30 years, you own a house. The problem with this argument is it ignores two things: 1) the down payment, and 2) interest.

When you buy a house, you put around 20% down. That money then cannot be invested elsewhere. In addition, you pay interest on a loan, whose proceeds are invested in the house. The good thing is that you are using the proceeds from your loan to buy an asset 5 times the value of what you invested in cash. For example, if you buy a $500,000 house, you only have to put down $100,000 of your own money. Thus, if you are in one of the boom times in Shiller's graph, you get 5 times what his graph shows in return on your $100,000 investment. The flip side, of course, is that in the bust times, you get 5 times the losses.
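The "5 times" effect is simple arithmetic, sketched here in Python. This is only an illustration of the leverage: it ignores interest, taxes, upkeep, and selling costs, and the 10% rise and fall are made-up figures, not numbers from Shiller's chart.

```python
def return_on_down_payment(price, down_payment, appreciation):
    """Gain or loss relative to the cash invested, ignoring interest, taxes, and fees."""
    gain = price * appreciation
    return gain / down_payment

for change in (0.10, -0.10):  # hypothetical 10% rise and 10% fall in the house's value
    r = return_on_down_payment(500_000, 100_000, change)
    print(f"house value {change:+.0%} -> return on down payment {r:+.0%}")
```

A 10% move in the house price becomes a 50% move in your own cash, in either direction.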

Compare this to renting. Here, you keep your $100,000, perhaps investing it in safe 5-year Treasury notes, where you can expect an inflation-adjusted return of about 2.5% (see the Fed site for T-note rates and the Bureau of Labor Statistics site for inflation rates--the real return is lower if you go back more than 40 years).

Now, let's look at renting or buying with specific numbers. Suppose that you have a $500,000 home in mind. For buying the house, your total costs are your mortgage and your return is the amount you get back after the sale, minus the $100,000 you paid as a down payment. For renting, your costs are your rent and your return is the cash you get in interest from your $100,000 investment.

The following table lays out 6 scenarios. I've adjusted for the tax benefits of the mortgage as well as inflation.

I varied the mortgage interest rate and the (annual) house appreciation. The two numbers at the bottom are your out of pocket monthly costs after 5 and 10 years for owning the house. Presumably, this is the amount in rent you should be prepared to pay. Thus, for a house that appreciates at the rate of inflation (roughly what has happened with housing since 1890--it is the 3% appreciation in the 2nd and 5th columns), your monthly costs are $633 if you have a 6% mortgage and sell after 5 years and $1,134 if you have an 8% mortgage. You do much better if you hold the house for 10 years. This is not true for the house that does not appreciate; in that case, you do better if you hold it for less time.

To really answer the question well, we'd need an accurate prediction of inflation and mortgage rates. In the short term, both of these are fairly easy to predict. The second thing we'd need is an idea of how long you will hold the house. If the house appreciates at the rate of inflation, then you are better off holding onto it for longer. If it does not, then you are better off selling sooner.

What is apparent is how radically the value changes depending on the assumptions about how much the house will appreciate. Put in some negative numbers and it really gets scary. If the current boom results in a 2% annual nominal depreciation over the next 5 years, then your 5-year monthly cost goes to about $2,500. At 5% a year depreciation (something that occurred over several years with NYC apartments in the early 90s), then your out of pocket is $3,500 per month if you sell after 5 years.

So, should you rent or buy?

Well...it depends.

## Monday, January 21, 2008

### Throw away your cold medicine?

The Times gets this first part wrong. The "seawater" group got both standard medications and seawater, as explained by the journal article on the study. So, right away, we are not talking about throwing away our cold medicines. Instead we might have to buy (for those of us not close to the ocean) seawater (more below about saline vs. seawater).

But the study does have some interesting results.

**Here's the good part**

The results for the preventative success are most striking: at week 12, 25% of the children in the control group had reported illnesses that caused an absence from school versus only 8% in the treatment group. This is statistically significant, meaning the results were too large to be explained away by mere chance. This does not mean, however, that biases in the study could not have caused the difference (no matter how statistically significant, bias, if it exists, can mean an otherwise statistically significant difference is spurious).
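To give a feel for what "too large to be explained by mere chance" means, here is a sketch of a standard two-proportion z-test in Python. The group sizes are an assumption on my part (195 children per group, purely for illustration); the actual split is in the journal article, not in this post, and the result depends on it.

```python
from math import erfc, sqrt

def two_proportion_z(p1, n1, p2, n2):
    """z statistic and two-sided p-value for the difference of two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail area of the Normal curve
    return z, p_value

# ASSUMPTION: 195 children per group. The real split is not given in this post.
z, p_value = two_proportion_z(0.25, 195, 0.08, 195)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A z value beyond about 1.96 is the usual threshold for significance at the 95% level. Note that the test says nothing about bias: it only rules out luck of the draw.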

**Here's the bad part**

1. The study was not blind. This means that the children (and physicians and parents) were aware of whether the kids were taking the saline solution or not, subjecting the study to a "placebo" effect: kids who were taking the saline might have 'felt' better, but had no lower incidence of colds. The study's authors make an error in the journal article by stating: "the large number of participants, multi center design, and consistence of results between individual parameters (assessed by physician, patient, and parent) lower the risk of bias."

Bias is not mitigated by sample size -- that is, a large biased group is no better than a small biased group (imagine trying to figure out average height of all men by taking an NBA team, then doing a second study with the average of all NBA teams, saying this lessens the bias).

Similarly, having three biased parties (physician, patient, and parent) would only reduce the bias if we were comparing it to the bias of the most biased party (say, the parent).

2. The treatment is no fun.

Perhaps as important as the questionable effect of the study due to bias is the fact that the treatment involves the solution being squirted into the kid's nose 3 times a day for 12 weeks. I was sort of amazed that the study had so few dropouts (only 11 out of 401 patients in the study dropped out). The unpleasantness seems barely worth the avoidance of a cold or two.

3. Seawater, water or saline-- does it matter? The study does not provide any comparison of the seawater spray to other sprays, or even a simple water spray. To its credit, the journal article does not focus on the fact that the solution was seawater but instead on the comparison of nasal wash versus no nasal wash. In the journal article, there is little indication that seawater would be any better than salt water (the word seawater is mentioned 10 times in the journal article as opposed to saline, which is mentioned 64 times). BTW, the seawater in the study was processed (and presumably sterilized), so don't take it literally and go to your neighborhood polluted beach for your solution.

The Times article, however, focuses on the idea of seawater, as opposed to a simple saline (salt-water) solution.
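The NBA example from point 1 above can be simulated in a few lines of Python. All the heights here are hypothetical numbers chosen for illustration (an average man at 175 cm, NBA players drawn from a bell curve centered at 200 cm); the point is only that the error does not shrink as the biased sample grows.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# HYPOTHETICAL numbers, in centimetres, chosen only for illustration.
TRUE_MEAN_ALL_MEN = 175

def nba_sample(n):
    """Draw n heights from a made-up 'NBA player' distribution."""
    return [random.gauss(200, 8) for _ in range(n)]

one_team = mean(nba_sample(15))        # a single roster
whole_league = mean(nba_sample(450))   # every roster in the league

print(f"one team:     off by {one_team - TRUE_MEAN_ALL_MEN:.1f} cm")
print(f"whole league: off by {whole_league - TRUE_MEAN_ALL_MEN:.1f} cm")
```

The larger sample gives a more precise estimate, but it is a more precise estimate of the wrong thing.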

**Net Net**

It does seem that washing your nose out with saline 3 times a day will make your kids feel better and they will miss less school. But it's unclear whether they are actually any less sick or they just think they are less sick (to that end, maybe giving them a sugar pill each day, telling them it was a special cold pill, would have the same effect).

## Thursday, January 10, 2008

### Election Math - Update

Unfortunately, there is really no way to tell. However, there are two interesting things to note.

One is that one local poll shows the percentages both including and excluding those who are leaning toward a candidate but are still undecided. In this case, Obama received 2% more of the vote if the leaners are counted, indicating there is some bias in the method that most polls use, which is to count the leaners (who are actually still undecided) rather than exclude them (the polls do exclude the truly undecided, which has hovered in the 5-10% range, but implicitly assume they will vote the same way as the decided voters).

The second thing to note is that the polls very accurately predicted Obama's percentage and inaccurately predicted Hillary's. Obama's actual percentage was 36%, and the seven major polls predicted 36%, 35%, 39%, 41%, 34%, 38%, and 39% (an average of 37%). Hillary's percentages (same web page) were 28%, 34%, 28%, 28%, 31%, 29%, and 29% (average of 30%) versus her actual of 39% (a difference that is outside of the zone for mere statistical error). Hillary apparently picked up votes from undecideds, people who were leaning for Obama, and from the Edwards/Richardson camps.
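The averages are quick to verify. A small Python sketch using the seven poll numbers quoted above (the variable names are mine):

```python
from statistics import mean

obama_polls = [36, 35, 39, 41, 34, 38, 39]
clinton_polls = [28, 34, 28, 28, 31, 29, 29]
obama_actual, clinton_actual = 36, 39

for name, polls, actual in (("Obama", obama_polls, obama_actual),
                            ("Clinton", clinton_polls, clinton_actual)):
    avg = mean(polls)
    print(f"{name}: poll average {avg:.1f}%, actual {actual}%, miss {actual - avg:+.1f}")
```

The miss for Obama is within a point or two, while the miss for Clinton is several times larger than anything sampling error alone would normally produce.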

## Monday, January 7, 2008

### Election Math

Did so many people change their minds in one day? If we ignore other polls, then the statistical evidence is not conclusive. Why? Because the margin of error in the 39-29 poll is 5%, meaning that Obama's percentage is likely somewhere between 34 and 44%. The Saturday poll, also with a 5% margin of error, indicates his percentage is between 28 and 38%. The overlap between these two ranges means the numbers might not have changed at all. Rather, the difference may be mere statistical error, which is an artifact of the sampling. By luck of the draw, the Sunday poll may have found more Obama supporters, even though no one changed their mind.

When comparing two polls taken independently (as above), the error is more than the stated error (5% above) but less than the sum of the stated error in the two polls (5%+5%=10% above). We can compute the error of the difference in these two polls as 7%, which means that Obama's numbers, which appeared to go up by 6% (from 33% to 39%), may have gone down by as much as 1% or up by as much as 13%.

*Some math: The 7% is computed as 1.96 multiplied by the square root of the sum of the squared standard deviations of the polls. The standard deviation of each poll is the error rate (5%) divided by 1.96, or about 2.5%. In a standard probability distribution called a Normal Distribution, 95% of the data falls between plus or minus 1.96 standard deviations from the mean. Thus, in the latest poll, the mean for Obama was 39%, with a standard deviation of 2.5%.*
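The calculation in the note above can be written out in a few lines of Python. This is a sketch of that same arithmetic, assuming two independent polls; the function name is mine.

```python
from math import sqrt

Z_95 = 1.96  # number of standard deviations that covers 95% of a Normal curve

def margin_of_difference(moe_a, moe_b):
    """95% margin of error for the difference between two independent polls."""
    sd_a, sd_b = moe_a / Z_95, moe_b / Z_95   # back out each poll's standard deviation
    return Z_95 * sqrt(sd_a ** 2 + sd_b ** 2)

moe = margin_of_difference(5, 5)
change = 39 - 33  # Obama's apparent gain, in points
print(f"margin on the difference: {moe:.1f} points")
print(f"plausible change: {change - moe:+.1f} to {change + moe:+.1f} points")
```

Because the errors add in squares rather than directly, the combined margin lands between 5 and 10 points, as described above.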

Several implicit assumptions are made in computing the error rate in these polls, primarily summarized as: 1) the Normal Distribution is appropriate, 2) the sample is a random sample of all who will vote in the Tuesday Democratic primary, and 3) the answers in this poll are reflective of the way the voters will actually vote come Tuesday.

**Assumption Review**

1) Normal Distribution. This one is easy. For a large, random sample and a multiple choice question (who will you vote for), this assumption is always close to reality except when the number polled is very small or the percentages (as are those for say, Kucinich) are close to 0% or 100%. For Clinton and Obama, there is no real issue here, since the sample size is moderately large and their percentages are in the 30-40% range (a neat demo that shows how close a distribution is to Normal, depending on the sample size and percentage, is at http://www.ruf.rice.edu/~lane/stat_sim/binom_demo.html).

2) Random Sample. This is more difficult. Suppose Clinton voters go to church on Sunday, followed by lunch, while the Obama voters are home-bodies. This problem can be called selection bias. If there is church-lunch/home-body selection bias, then, in the Sunday poll, a random dialing of phone numbers would have surfaced more Obama voters and would not have been a random sample of Tuesday's voters, as opposed to Saturday, where you might have gotten more equality. [There is generally a second underlying issue of refusals--people who refuse to be polled. If these voters are more likely to vote for Clinton, then Clinton's numbers will be under-stated, but this would be true for both polls that we are comparing.]

3) The difference between how people said they felt in today's poll versus how they will vote in Tuesday's election. I find this difference, which can be called measurement error, the most troublesome. Take, for example, the fact that, in the CNN Saturday poll, one-fourth of voters stated that they had not yet decided, another quarter were only leaning toward someone, and just half had definitely decided. This is, of course, only what people are saying, and often people do not want to admit indecision, so the true number of undecideds may be even higher. Still, if the undecideds vote even 60-40 in favor of Hillary, it would erase Obama's 10% lead.

The 5% error rate (and 7% error of difference rate) does not take the above issues into account. It implicitly assumes they will have no effect. Thus, the true error rate in election polls is likely far higher.

If we consider other polls, the Obama lead, and the change, seem clearer. In the seven polls published the 6th of January, Obama has an average lead of about 2-3%. In the 5 polls published Friday and Saturday, Clinton led in all of them, by around 5 points. We can show that this change, from pre-Sunday to Sunday, is statistically significant. However, because of the selection bias and measurement error issues above, it may not be indicative of the outcome on Tuesday.

My personal guess? Obama by a good margin...but a lot can happen in a day.

## Tuesday, January 1, 2008

### Where is the safest place to live?

Crime rates and murder rates are typically adjusted for population by reporting the number of crimes per 100,000 people. In Israel this rate (for murders) is about 3, as compared to about 6 per 100,000 in the U.S. Even including terror attacks, the rate has never been as high in Israel as in the U.S. In England, the rate is less than 2 per 100,000, which is in line with most of Western Europe.

All this is fine and good if you are willing to live abroad just to enjoy a lower crime rate, but, assuming you are looking for a nice place in the US to live, what city might suit your needs vis-a-vis lack of crime?

I considered the four cities I have lived in (see the FBI site for all the stats): Columbia, SC; Washington, DC; Philadelphia; and New York. Of these, New York (sorry Mom) is clearly the safest, with a murder rate of 6 per 100,000 and a violent crime rate of about 0.7% (less than 1 in 100). The worst is DC (murder: 29, violent crime 1.5%), but Philadelphia is a near tie (26, 1.5%). Columbia, my home town and my parents' current and future town, is in the middle (13 murders per 100,000 and 1.1% violent crime rate).

These stark differences mask three important facts. First, the chances of a random individual being a victim of a violent crime in any given year are very low, no matter what the city (though you are more likely to be killed by a person than by a car in Philly and DC, according to National Safety Council statistics--*originally linked here but apparently not on the web anymore as of at least 11/28/2012*).
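To put the "per 100,000" murder figures into everyday terms, here is a short Python sketch that converts the four city rates quoted above into a yearly percentage and a "1 in N" chance. It treats every resident as equally at risk, which, as the next two points argue, is not realistic.

```python
# Murders per 100,000 residents per year, as quoted above.
murders_per_100k = {"New York": 6, "Columbia": 13, "Philadelphia": 26, "Washington, DC": 29}

for city, rate in murders_per_100k.items():
    yearly_chance = rate / 100_000
    print(f"{city}: {yearly_chance:.3%} per year, or about 1 in {100_000 / rate:,.0f}")
```

Even in the worst of the four cities, the yearly figure works out to a few hundredths of one percent.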

Second, most violent crimes are committed by someone we know, and most of the people reading this blog do not have violent acquaintances, plus most murders are committed with guns, and most people reading this blog do not have acquaintances with guns.

Third, crime is highly localized, and no matter which city we live in, we are likely to live in a more affluent area, and these areas have much lower crime.

So, in conclusion: don't worry, Mom, New York is safer than Columbia, as long as I can keep making a living!