Monday, March 23, 2009

How big a sample?

Suppose we want to figure out what percentage of BIGbank's 1,000,000 loans are bad. We also want to look at smallbank, with 100,000 loans. Many people seem to think you'd need to look at 10 times as many loans from BIGbank as you would for smallbank.

The fact is that you would use the same size sample, in almost all practical circumstances, for the two populations above. Ditto if the population were 100,000,000 or 1,000.

The reasons for this, and the concept behind it, go back to the early part of the 20th century when modern experimental methods were developed by (Sir) Ronald A. Fisher. Though Wikipedia correctly cites Fisher in its entry on experimental design, the seminal book, Design of Experiments, is out of stock at Amazon (for $157.50, you can get a re-print of this and two other texts together in a single book). Luckily, for a mere $15.30, you can get David Salsburg's (no relation and he spells my name wrong! ;-) ) A Lady Tasting Tea, which talks about Fisher's work. Maybe this is why no one knows this important fact about sample size--because we statisticians have bought up all the books that you would otherwise be breaking down the doors (or clogging the internet) to buy. Fisher developed the idea of using randomization to create a mathematical and probability framework for making inferences from data. In English? He figured out a great way to do experiments, and this idea of randomization is what allows us to make statistical inferences about all sorts of things (and the lack of randomization is what sometimes makes it very difficult to prove otherwise obvious things).

Why doesn't (population) size matter?
To answer this question, we have to use the concept of randomization, as developed by Fisher. First, let's think about the million loans we want to know about at BIGbank. Each of them is no doubt very different, and we could probably group them into thousands of different categories. Yet, let's ignore that and just look at the two categories we care about: 1) good loan or 2) bad loan. Now, with enough time studying a given loan, suppose we can reasonably make a determination about which category it falls into. Thus, if we had enough time, we could look at the million loans and figure out that G% are good and B% (100% - G%) are bad.

Now suppose that we took BIGbank's loan database (ok, we need to assume they know who they loaned money to), and randomly sampled 100 loans from it. Now, stop for a second. Take a deep breath. You have just entered probability bliss -- all with that one word, randomly. The beauty of what we've just done is that we've taken a million disparate loans and, from them, formed a set of 100 "good"s and "bad"s that are identical in their probability distribution. This means that each of the 100 loans we are about to draw has exactly a G% chance of being a good one and a B% chance of being a bad one, corresponding to the actual proportions in the population of 1,000,000.

If this makes sense so far, skip this paragraph. Otherwise, envision the million loans as quarters lying on a football field. Quarters heads up denote good loans and quarters tails up denote bad loans. We randomly select a single coin. What chance does it have of being heads up? G%, of course, because exactly G% of the million are heads up and we had an equal chance of selecting each one.

Now, once we actually select (and look at) one of the coins, the chances for the second selection change slightly: where we had exactly G%, there is now one fewer quarter to choose from, so we have to adjust accordingly. However, that adjustment is very slight. Suppose G were 90%. Then, for the second selection, if the first were a good coin, we'd have an 899999/999999 chance of selecting another good one (that's an 89.99999% chance instead of a 90% chance). For smallbank, we'd be looking at a whopping reduction to an 89.9999% chance from a 90% chance. This gives an inkling of why population size, as long as it is much bigger than sample size, doesn't much matter.
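The arithmetic above can be checked exactly. Here is a minimal sketch in Python, assuming G is exactly 90%; the function name second_draw_good is made up for this illustration.

```python
from fractions import Fraction

def second_draw_good(population, good_share=Fraction(9, 10)):
    """Chance the second randomly drawn loan is good, given the first was good."""
    good = population * good_share        # number of good loans before any draw
    return (good - 1) / (population - 1)  # one good loan has already been removed

for name, size in [("BIGbank", 1_000_000), ("smallbank", 100_000)]:
    chance = second_draw_good(size)
    print(f"{name}: {chance} = {float(chance):.7%}")
```

Using exact fractions rather than floating point keeps the tiny difference from 90% from being lost to rounding.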

So, now we have a sample set of 100 loans. We find that 80 are good and 20 are bad. Right off, we know that, whether we are dealing with the 100,000 population or the 1,000,000 population, our best guess for the percentage of good loans, G, is 80%. That is because of how we selected our sample. It doesn't matter one bit how different the loans are. They are just quarters on a football field. It follows from the fact that we selected them randomly.

We can also calculate several other facts based on this sample. For example, if the actual percentage of good loans were 90% (900,000 out of 1,000,000), we'd get 80 or fewer good loans in our sample of 100 only 0.1977% of the time. The corresponding figure, if we had sampled from the population of 100,000 (and had 90,000 good loans), would be 0.1968%. What does this lead us to conclude? Very likely, the proportion of "good" loans is less than 90%. We can continue to do this calculation for different possible values of G:
If G were 89%: you would get 80 or fewer 0.586% of the time.
If G were 88%: you would get 80 or fewer 1.47% of the time.
If G were 87%: you would get 80 or fewer 3.12% of the time.
If G were 86.3%: you would get 80 or fewer 5.0% of the time.
If G were 86%: you would get 80 or fewer 6.14% of the time.
In each of the above cases, switching between a population of 1,000,000 loans and one of 100,000 changes the figure only at the second decimal place, if that.
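For readers who want to reproduce figures like these, here is a rough sketch in Python. It assumes the loans are drawn without replacement, so the count of good loans in the sample follows a hypergeometric distribution; the function name chance_at_most is my own choice for this illustration.

```python
from math import comb

def chance_at_most(k, population, good, sample):
    """P(at most k good loans) in a random sample drawn without replacement."""
    total = comb(population, sample)  # all possible samples of this size
    ways = sum(comb(good, i) * comb(population - good, sample - i)
               for i in range(k + 1))  # samples containing 0..k good loans
    return ways / total

# G is given in tenths of a percent so the number of good loans stays a whole number.
for g_tenths in (900, 890, 880, 870, 863, 860):
    big = chance_at_most(80, 1_000_000, 1_000_000 * g_tenths // 1000, 100)
    small = chance_at_most(80, 100_000, 100_000 * g_tenths // 1000, 100)
    print(f"G = {g_tenths / 10:.1f}%: BIGbank {big:.4%}, smallbank {small:.4%}")
```

Python's integers are exact at any size, so the huge binomial coefficients involved cause no overflow; only the final division is rounded.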

Such a process allows us to create something called a confidence interval. A confidence interval kind of turns this calculation on its head and says, "Hey, if we only get 80 or fewer in a sample 1.47% of the time when the population is 88% good, and I got only 80 good loans in my sample, it doesn't sound too likely that the population is 88% good." The question then becomes, at what percentage would you start to worry?

For absolutely no reason at all (and I mean that), people seem to like to set this threshold at 5%. Thus, in the example above, most would feel comfortable with any value of G for which 80 or fewer of 100 sampled loans would be good at least 5% of the time (where 80 is the number of good loans in our sample). The largest such G is 86.3%, so we would say, with "95% confidence, 86.3% or fewer of the loans in the population are good." If we also want a lower bound on the percent of good loans, we could calculate the G such that there is a 5% chance that 80 or more loans in a sample of 100 would be good. This percentage is 72.3%, and we could say that "with 95% confidence, 72.3% or more of the loans in the population are good."

We can combine these two one-sided 95% confidence intervals into a 90% confidence interval, since the percentages not included, 5% in each of the two intervals, add to 10%. We can thus say: "with 90% confidence, between 72.3% and 86.3% of the loans in the population are good." We can calculate the highest and lowest percent of good loans we estimate there to be in the population with any level of confidence between 0 and 100%. We could state the above in terms of 99% confidence or in terms of 50% confidence. The higher the confidence, the wider the interval; the lower the confidence, the narrower the interval.
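The two bounds can be found by searching for the value of G that makes the relevant tail probability exactly 5%. The sketch below is one way to do that in Python. It assumes the population is large enough to treat as infinite (so it uses the binomial distribution rather than the exact finite-population calculation), and the function names binom_cdf and solve_for_g are invented for this example.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) when X counts good loans among n independent draws."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_for_g(tail, target, lo=0.0, hi=1.0, iters=60):
    """Bisection: find p in [lo, hi] where the monotone function tail(p) equals target."""
    increasing = tail(hi) > tail(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (tail(mid) < target) == increasing:
            lo = mid  # the solution lies above mid
        else:
            hi = mid  # the solution lies below mid
    return (lo + hi) / 2

n, seen_good, alpha = 100, 80, 0.05
# Upper bound: largest G for which 80 or fewer good loans still happens 5% of the time.
upper = solve_for_g(lambda p: binom_cdf(seen_good, n, p), alpha)
# Lower bound: smallest G for which 80 or more good loans still happens 5% of the time.
lower = solve_for_g(lambda p: 1 - binom_cdf(seen_good - 1, n, p), alpha)
print(f"90% confidence interval for G: {lower:.1%} to {upper:.1%}")
```

Changing alpha changes the confidence level: a smaller alpha gives a wider interval.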

Back to sample size versus population. As stated above, the population size, though 10 times bigger, makes almost no difference. For each probability above, we are using the hypergeometric distribution to calculate the exact figure (the mathematics behind it are discussed somewhat in my earlier post).

Here are some of the chances associated with a G of 85% and a sample size of 100 that yields 80 good loans or fewer.
Population infinite: 10.65443%
Population 1,000,000: 10.65331%
Population 100,000: 10.64%
Population 10,000: 10.54%
Population 1,000: 9.49%
Population 500: 8.21%
This example follows the rule of thumb: you can ignore the population size unless the sample is at least 10% of the population.
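A table like the one above can be regenerated with a few lines of Python. This is only a sketch: it assumes the hypergeometric distribution for finite populations and the binomial distribution as the infinite-population limit, and the function names are mine.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for an effectively infinite population (binomial)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def hypergeom_cdf(k, population, good, sample):
    """P(X <= k) when sampling without replacement from a finite population."""
    ways = sum(comb(good, i) * comb(population - good, sample - i)
               for i in range(k + 1))
    return ways / comb(population, sample)

print(f"Population infinite: {binom_cdf(80, 100, 0.85):.5%}")
for population in (1_000_000, 100_000, 10_000, 1_000, 500):
    good = population * 85 // 100  # exactly 85% of the loans are good
    chance = hypergeom_cdf(80, population, good, 100)
    print(f"Population {population:,}: {chance:.5%}")
```

The chance only moves noticeably once the sample of 100 becomes a sizeable share of the population, which is the rule of thumb in action.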

A note to the commenter regarding population size:
Anonymous, the reason that population size barely matters is that the statistical inferences are based on the random behavior of the sample, and this behavior does not depend on the population size. Suppose you randomly selected 20 people for a survey regarding preference for a black and blue dress versus a white and gold dress and all 20 preferred black and blue. Whether I told you these people were randomly selected from the state of NY or from the whole US, you would think that the preferences (state or national) clearly favor the black and blue. That intuitive feeling arises because the inference you make in your mind concerns the behavior of the sample. Statistics takes it a bit further and figures out, if the selection is truly random, how likely such an outcome would be under different scenarios. However, the key is the observed randomness in the sample and its size, not the population size. In other words, the sample size IS important, because that is where you make your observations, but the population size is not, as long as the sample is representative of the population (and it will be, as long as the sample is random).


Potomoc said...

Great post! I enjoyed the read! Can you recommend a good book on predictive modeling using SAS? I'm looking for something practical.

Alan Salzberg said...

I'm glad you like the post, but I am sorry to say I really don't know what's out there about SAS. I expect there would be a lot, but I almost never use SAS, so I don't know what's good and what's not.

Anonymous said...

thanks for this information. I am still confused a bit about why population size doesn't matter. Why would the sample size be nearly the same regardless of whether the local population is 1,000,000 or 10,000,000? Sorry for the confusion-- not a quant person!