Monday, March 23, 2009

How big a sample?

Suppose we want to figure out what percentage of BIGbank's 1,000,000 loans are bad. We also want to look at smallbank, with 100,000 loans. Many people seem to think you'd need to look at 10 times as many loans from BIGbank as you would for smallbank.

The fact is that you would use the same size sample, in almost all practical circumstances, for the two populations above. Ditto if the population were 100,000,000 or 1,000.

The reasons for this, and the concept behind it, go back to the early part of the 20th century, when modern experimental methods were developed by (Sir) Ronald A. Fisher. Though Wikipedia correctly cites Fisher in its entry on experimental design, the seminal book, The Design of Experiments, is out of stock at Amazon (for $157.50, you can get a reprint of this and two other texts together in a single volume). Luckily, for a mere $15.30, you can get David Salsburg's (no relation, and he spells my name wrong! ;-) ) The Lady Tasting Tea, which talks about Fisher's work. Maybe this is why no one knows this important fact about sample size--because we statisticians have bought up all the books that you would otherwise be breaking down the doors (or clogging the internet) to buy. Fisher developed the idea of using randomization to create a mathematical and probability framework for making inferences from data. In English? He figured out a great way to do experiments, and this idea of randomization is what allows us to make statistical inferences about all sorts of things (and the lack of randomization is what sometimes makes it very difficult to prove otherwise obvious things).

Why doesn't (population) size matter?
To answer this question, we have to use the concept of randomization, as developed by Fisher. First, let's think about the million loans we want to know about at BIGbank. Each of them is no doubt very different, and we could probably group them into thousands of different categories. Yet, let's ignore that and just look at the two categories we care about: 1) good loan or 2) bad loan. Now, with enough time studying a given loan, suppose we can reasonably make a determination about which category it falls into. Thus, if we had enough time, we could look at the million loans and figure out that G% are good and B% (100% - G%) are bad.

Now suppose that we took BIGbank's loan database (OK, we need to assume they know whom they loaned money to) and randomly sampled 100 loans from it. Now, stop for a second. Take a deep breath. You have just entered probability bliss -- all with that one word, randomly. The beauty of what we've just done is that we've taken a million disparate loans and formed from them a set of 100 "good"s and "bad"s that are identical in their probability distribution. This means that each of the 100 sampled loans that we are about to draw has exactly a G% chance of being a good one and a B% chance of being a bad one, corresponding to the actual proportions in the population of 1,000,000.

If this makes sense so far, skip this paragraph. Otherwise, envision the million loans as quarters lying on a football field. Quarters heads up denote good loans and quarters tails up denote bad loans. We randomly select a single coin. What chance does it have of being heads up? G%, of course, because exactly G% of the million are heads up and we had an equal chance of selecting each one.

Now, once we actually select (and look at) one of the coins, the chances for the second selection change slightly: where we had exactly G%, there is now one less quarter to choose from, so we have to adjust accordingly. However, that adjustment is very slight. Suppose G were 90%. Then, for the second selection, if the first were a good coin, we'd have an 899,999/999,999 chance of selecting another good one (that's an 89.99999% chance instead of a 90% chance). For smallbank, we'd be looking at a whopping reduction to an 89.9999% chance from a 90% chance. This gives an inkling of why population size, as long as it is much bigger than the sample size, doesn't much matter.
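If you'd like to see the quarters-on-a-football-field idea in action, here is a quick simulation sketch in Python (the 90% good rate and the sample size of 100 are just this post's running example). It draws 100 loans without replacement from a 1,000,000-loan population and from a 100,000-loan population, each 90% good:

```python
import random

def sample_good_fraction(pop_size, good_frac, n, rng):
    # Pretend the first good_frac of the population are the good loans;
    # a random sample of indices then hits a good loan with the same
    # chance as a random draw from the real, jumbled population.
    cutoff = int(pop_size * good_frac)
    picks = rng.sample(range(pop_size), n)  # sampling without replacement
    return sum(1 for i in picks if i < cutoff) / n

rng = random.Random(2009)
for pop_size in (1_000_000, 100_000):
    est = sample_good_fraction(pop_size, 0.90, 100, rng)
    print(f"population {pop_size:>9,}: sample proportion good = {est:.2f}")
```

Run it a few times with different seeds: the sample proportions hover around 90% for both banks, and neither population size produces systematically better estimates than the other.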

So, now we have a sample set of 100 loans. We find that 80 are good and 20 are bad. Right off, we know that, whether dealing with the 100,000 population or the 1,000,000 population, our best guess for the percentage of good loans, G, is 80%. That is because of how we selected our sample. It doesn't matter one bit how different the loans are. They are just quarters on a football field. It follows from the fact that we selected them randomly.

We also can calculate several other facts, based on this sample. For example, if the actual proportion of good loans were 90% (900,000 out of 1,000,000), we'd get 80 or fewer good loans in our sample of 100 only 0.1977% of the time. The corresponding figure, had we sampled from the population of 100,000 (with 90,000 good loans), would be 0.1968%. What does this lead us to conclude? Very likely, the proportion of good loans is less than 90%. We can continue this calculation for different possible values of G:
If G were 89%: 0.586% of the time would you get 80 or fewer.
If G were 88%: 1.47% of the time would you get 80 or fewer.
If G were 87%: 3.12% of the time would you get 80 or fewer.
If G were 86.3%: 5.0% of the time would you get 80 or fewer.
If G were 86%: 6.14% of the time would you get 80 or fewer.
In each of the above cases, the difference between a population of 1,000,000 and 100,000 loans makes a difference only at the second decimal place, if that.
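These tail probabilities come from the hypergeometric distribution: the chance of drawing exactly k good loans in a sample of 100 is C(good, k) * C(bad, 100-k) / C(population, 100). Here is a sketch in Python that reproduces the list above for the 1,000,000-loan population, using exact integer arithmetic (the G values are the ones from the list):

```python
from fractions import Fraction
from math import comb

def prob_at_most(k_max, pop, good, sample):
    # Hypergeometric tail: P(at most k_max good loans in the sample)
    # when drawing `sample` loans without replacement from `pop` loans,
    # `good` of which are good.
    bad = pop - good
    numer = sum(comb(good, k) * comb(bad, sample - k) for k in range(k_max + 1))
    return float(Fraction(numer, comb(pop, sample)))

for good in (900_000, 890_000, 880_000, 870_000, 863_000, 860_000):
    p = prob_at_most(80, 1_000_000, good, 100)
    print(f"If G were {good / 10_000:.1f}%: {100 * p:.3f}% of the time would you get 80 or fewer.")
```

Swapping in a population of 100,000 (with the same proportions) changes these figures only in the later decimal places, which is the whole point.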

Such a process allows us to create something called a confidence interval. A confidence interval kind of turns this calculation on its head and says, "Hey, if we'd get 80 or fewer good loans in a sample only 1.47% of the time when the population is 88% good, and I got only 80 good loans in my sample, it doesn't sound too likely that the population is 88% good." The question then becomes: at what percentage would you start to worry?

For absolutely no reason at all (and I mean that), people seem to like to limit this percent to 5%. Thus, in the example above, most would feel comfortable with any estimate of G such that 5% (or more) of the time, 80 or fewer of 100 sampled loans would be good (80 being the number of good loans in our sample). Thus, for the above, we would say, with "95% confidence, 86.3% or fewer of the loans in the population are good."

If we also want a lower bound on the percentage of good loans, we can calculate the value of G such that there is a 5% chance that 80 or more loans in a sample of 100 would be good. This percentage is 72.3%, and we could say that "with 95% confidence, 72.3% or more of the loans in the population are good." We can combine these two 95% confidence intervals into a 90% confidence interval, since the percentages excluded, 5% by each of the two intervals, add to 10%. We can thus say: "with 90% confidence, between 72.3% and 86.3% of the loans in the population are good." We can calculate the highest and lowest percentages of good loans we estimate there to be in the population at any level of confidence between 0 and 100%; we could state the above in terms of 99% confidence or in terms of 50% confidence. The higher the confidence, the wider the interval; the lower the confidence, the narrower the interval.
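One way to find these bounds is simply to scan candidate values of G and keep those not ruled out at the 5% level. A sketch in Python: it uses the binomial (infinite-population) version of the tail probability, which, as this post argues, is essentially identical to the hypergeometric one for populations this large; the 0.1% grid step is an arbitrary choice of mine.

```python
from math import comb

def binom_cdf(k_max, n, p):
    # P(at most k_max successes in n independent draws, each good with prob p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

# We observed 80 good loans in a sample of 100.
grid = [g / 1000 for g in range(500, 1000)]  # candidate G values, 0.1% steps

# Upper bound: largest G that still gives at least a 5% chance of 80 or fewer.
upper = max(g for g in grid if binom_cdf(80, 100, g) >= 0.05)

# Lower bound: smallest G that still gives at least a 5% chance of 80 or more.
lower = min(g for g in grid if 1 - binom_cdf(79, 100, g) >= 0.05)

print(f"90% confidence interval for G: {100 * lower:.1f}% to {100 * upper:.1f}%")
```

The scan lands near the 72.3% and 86.3% figures quoted above; a finer grid, or a proper root-finder, would pin the boundaries down more precisely.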

Back to sample size versus population. As stated above, the population size, though 10 times bigger, doesn't make a difference. For each probability above, we are using the hypergeometric distribution to calculate the exact figure (the mathematics behind it are discussed some in my earlier post).

Here are the chances that a sample of 100 yields 80 or fewer good loans when G is 85%, for various population sizes.
Population infinite:  10.65443%
Population 1,000,000: 10.65331%
Population 100,000:   10.64%
Population 10,000:    10.54%
Population 1,000:     9.49%
Population 500:       8.21%
This example follows the rule of thumb: you can ignore the population size unless the sample is at least 10% of the population.
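The table above can be reproduced with the same hypergeometric calculation; the infinite-population row is the binomial limit. A self-contained sketch in Python:

```python
from fractions import Fraction
from math import comb

def hyper_tail(k_max, pop, good, sample):
    # P(at most k_max good loans in a sample drawn without replacement).
    bad = pop - good
    numer = sum(comb(good, k) * comb(bad, sample - k) for k in range(k_max + 1))
    return float(Fraction(numer, comb(pop, sample)))

def binom_tail(k_max, n, p):
    # Infinite-population limit of the same tail probability.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

print(f"Population infinite:  {100 * binom_tail(80, 100, 0.85):.5f}%")
for pop in (1_000_000, 100_000, 10_000, 1_000, 500):
    p = hyper_tail(80, pop, int(pop * 0.85), 100)
    print(f"Population {pop:>9,}: {100 * p:.5f}%")
```

Notice how slowly the figures drift until the sample becomes a sizable fraction of the population, which is exactly the rule of thumb.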

A note to the commenter regarding population size:
Anonymous, the reason that population size barely matters is that the statistical inferences are based on the random behavior of the sample, and this behavior does not depend on the population size. Suppose you randomly selected 20 people for a survey regarding preference for a black-and-blue dress versus a white-and-gold dress, and all 20 preferred black and blue. Whether I told you these people were randomly selected from the state of NY or from the whole US, in either case you would think that the preferences (state or national) clearly favor the black and blue. That intuitive feeling arises because the inference you make in your mind is about the behavior of the sample. Statistics takes it a bit further and figures out, if the selection is truly random, how likely such an outcome would be under different scenarios. However, the key is the observed randomness in the sample and its size, not the population size. In other words, the sample size IS important, because that is where you make your observations, but the population size is not, as long as the sample is representative of it (and it will be, as long as the sample is random).

Wednesday, March 18, 2009

7 letter scrabble word redux

A recent article by the Wall Street Journal's "Numbers Guy" has resurfaced one of my old posts regarding Scrabble. In it, I said that after the first turn, you must play an 8-letter word to use all your letters, because your seven letters need to connect to an existing word.

This, of course, is not correct, as was pointed out in comments on the Numbers Guy's blog (and also by my sister). All you need to do to use all your letters with a 7-letter word is find a place to play it parallel to an existing word. For example, 'weather' could be played parallel to a word ending in 'e', since 'we' is a word.

Maybe that's why my sister won so many scrabble games against me when I was a kid.

Wednesday, March 11, 2009

Are same-sex classes better?

Yesterday's New York Times had an article, "Boys and Girls Together, Taught Separately in Public School," about same-sex classes in New York City. In particular, the article focused on P.S. 140 in the Bronx. The article looks upon such classes favorably, despite the fact that there is, as far as I can tell, no evidence that such classes lead to better achievement.

In particular, the article states: "Students of both sexes in the co-ed fifth grade did better on last year’s state tests in math and English than their counterparts in the single-sex rooms, and this year’s co-ed class had the highest percentage of students passing the state social studies exam."

In other words, the City is continuing this program even though the evidence indicates that not only are students in same-sex classes doing no better, they are doing worse! The principal, who has introduced some programs that have achieved material results, said: "We will do whatever works, however we can get there...we thought this would be another tool to try." This seems reasonable, but the article states, "...unlike other programs aimed at improving student performance, there is no extra cost." There may not be a monetary cost, but making these students laboratory rats in someone's education research project doesn't help them and, apparently in this case, hurts them. Not to mention the opportunity cost of not exposing these children to other programs that might actually help.

To be fair, the scholarly literature is not consistent in its conclusions about whether same-sex classes improve achievement. However, many of the U.S. studies showed little or no improvement. See, for example:
Singh and Vaught's study
LePore and Warren

On the other hand, some English and Australian studies indicate that, at least for girls, same-sex classes or schools may result in higher achievement (see, for example, Gillibrand E.; Robinson P.; Brawn R.; Osborn A.) while others indicate that there are no differences (see Harker).

So the literature seems to be mixed, and I would imagine there are numerous confounding factors that make this hard to measure--for example, typical single-sex classes in New York City consist of low-income minority students, where the boys are seen as more at risk than the girls. Contrast this with the British and other foreign studies, where the girls are the greater concern for under-achievement.

Despite this, it's questionable how long it is ethical to continue a program, like the one at P.S. 140, where the current known outcome is that boys and girls are doing worse in same-sex classes.