Sampling and Estimation

What’s sampling for?

Well, I told you one use in the last section. You can get well behaved statistics from badly behaved populations.
But the usual reason for sampling is convenience. If you’re studying a very large population of subjects, it may be inconvenient, or downright impossible, to look at every individual in the population. So you may want to look at only a smaller subset of the population.

Can you do that? A sample isn’t a population. Can you generalize from a part of a population to the whole population?

Population and sample

Every ten years, the United States cranks up a massive effort to count every resident of the United States and determine specific characteristics of each. It’s called a census and it’s supposed to collect data on every member of the United States population. Few other organizations have the kind of resources necessary to poll every member of such a huge collection of people. Businesses would like to, to understand how consumers view their products. Politicians would like to, to understand how likely they are to be elected or reelected. It would be nice to be able to study every cardiac patient in the world to fully understand heart disease. But only governments have the machinery to undertake such a huge task. So, if you don’t have the resources, what do you do?

Well, censuses collect information about populations. Of course, even the United States census misses a few people but the intent is to, as nearly as possible, count everyone. If you can’t reach everyone, or even nearly everyone (over 320 million people is a lot of people), you opt for a survey. Censuses collect information about populations – surveys collect information about samples. Samples are subgroups of populations. The trick is to draw a sample to study that looks, as much as possible, like the population being studied.

What is random, again?

If you place all the values from a data distribution into an urn (there’s that urn again) and draw one value out, making very sure that every individual data value has an equal chance of being drawn, the draw is random. If you begin drawing values in the same way, placing them in a line until all the values are out, the resulting order is random.
So, how can there be a variety of different kinds of variates? For instance, what is a normal variate?

That has to do with how many data values are in the mix in certain intervals. If there is a middling value around which a lot of values cluster, and the number of values in particular intervals peters out as they move farther away from that central value, then you will be more likely to draw one of the central values – because there are more of them present in the mix. There is still an equal chance that any one individual data value will be drawn.

Random and nonrandom samples

I remember hearing a report on National Public Radio in the first decade of the 21st century that the medical community had realized that most medical studies to determine drug dosage had been performed on male subjects. What’s wrong with that? Female physiology can differ significantly from male physiology. But men are easier to collect as subjects and, until recently, you didn’t subject women to the risky business of research, just like you didn’t hit a lady.

It has long been a college tradition to offer undergraduate students extra credit for taking part in graduate student and professorial research. The result is that a lot of the research done in academic venues has used college students as subjects – hardly a representative sample of the human population.

Prisoners, students, school children – these have always been easy samples to draw, but they’re not random samples and, therefore, they cannot be assumed to be representative of any larger group, and results of studies using them as subjects cannot safely be generalized to any larger population. They are used because they are easy. Such samples are called “convenience samples.” They are an example of nonrandom samples.

Another kind of nonrandom sample that is sometimes used, especially in ethnographic research is the snowball sample. The idea is that you find one subject from a population and let them direct you to another individual from that population, and so on. This is a useful strategy when studying populations that are difficult to find or identify, for instance, the Were community.

The problem with nonrandom samples is that a population usually has considerable internal diversity and, if you are going to study that population, you need to capture as much of that diversity as possible in order to be able to say anything about the population as a whole. If you want to study the human population and you use college students as subjects, you are not going to be able to say anything about older segments of the population, self-supporting individuals, people with permanent jobs, etc. The diversity in the greater population will be greatly restricted. In order to have a representative sample for a population, all the attributes of all the individuals must have an equal chance of being studied in the sample and that requires a random sample.

A rather naive approach to nonrandom sampling that has been tried in the past is to try to foresee all the variables that might affect the study and try to include them all in the sample in approximately the same proportions as in the population. That doesn’t really work because it’s rarely possible to predict all the variables that are significant to the study.

I’m not saying that a nonrandom sample is useless. Studies using nonrandom samples can be used to look for topics to study in greater detail in later studies or for studies to generate ideas for hypotheses to test or strategies to use for further studies.

Really, for generalizable studies using samples, random selection is pretty much necessary.

There are many, many schemes for drawing random samples. Simple random samples are, predictably, common because they are … simple.

A simple random sample consists of individuals drawn from a population in some manner such that each individual has an equal chance of being picked. You might do that by going through the population, generating a random number for each case and, if the random number is above a certain value, select that case, until you have the number of cases you want.

That wouldn’t work because the ones that are left below where you stopped would have never had the chance to be picked.

A better option is to scramble the list of cases so that they are in random order and then pick the top however-many-you-want cases. There are actually many different possible schemes for drawing random samples from large populations and, of course, there is software out there that will do it for you. DANSYS has such a procedure.
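
The scramble-and-take-the-top idea can be sketched in a few lines of Python. This is only an illustration, not the DANSYS procedure: the function name, the case IDs, and the seed are all made up for the example.

```python
import random

def simple_random_sample(cases, k, seed=None):
    """Return k cases, chosen so every case has the same chance of selection."""
    rng = random.Random(seed)   # a seed makes the draw repeatable
    shuffled = list(cases)      # copy, so the original list is left alone
    rng.shuffle(shuffled)       # scramble the cases into random order
    return shuffled[:k]         # pick the top however-many-you-want

# Hypothetical population: case IDs 1 through 1000
population = list(range(1, 1001))
sample = simple_random_sample(population, 25, seed=42)
print(sample)
```

Python's standard library also offers random.sample(population, k), which draws k distinct cases directly without shuffling the whole list.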

Simple random sampling is just the beginning. There are many other ways to randomly sample from a population and you can do it as simply or as complicatedly as you like. Actually, that should be “as you need”, because there are very good reasons for complicating the procedure and we will be running into those reasons as we go along.

How many?

Intuitively, larger samples are more representative of the population from which they are drawn. After all, if the sample is the same size as the population, it is the population. The statistics will be exactly the same as the population parameters. But how big does a sample have to be to capture the essence of the population?

We have seen that the larger a sample, the less sampling error is present. Standard error formulas usually have a square root of the sample size in the denominator so, as the sample size increases, the standard error, say, of the mean decreases.

Another issue is how confident you want to be that the actual measure falls within a certain interval. The larger the sample is, the narrower the interval will be.
It also should be fairly obvious that, the more variance there is in your measurements – the more spread out they are – the more likely a large or small value will knock a small sample off kilter.

One formula for determining an appropriate sample size is:
sample size = Z_conf^2 * sd * (1 - sd) / ci^2
where Z_conf is the Z (standard normal) score for your preferred level of confidence, sd is the expected standard deviation of your sample, and ci is your chosen confidence interval (the + or – amount of error you will be comfortable with in your results).

Let’s say you want to be at least 90% sure that you have captured the mean response in a survey study. You don’t know what the variability of your responses will be, but 0.5 is a safe number to assume at the outset. And you want your results to be within 5% of the true mean in either direction (an interval 10% wide in all).

The value at 90% of a standard normal curve is 1.645 so the calculation gives you:

(1.645)^2 * 0.5 * 0.5 / 0.05^2 = 270.6025

so you need 271 respondents for the sample.
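
Here is a small sketch of that arithmetic in Python, so you can try other confidence levels and error margins. The function name and argument names are my own labels for the terms in the formula, not part of any package.

```python
import math

def sample_size(z_conf, sd, ci):
    """Sample size from the formula: Z^2 * sd * (1 - sd) / ci^2."""
    return z_conf ** 2 * sd * (1 - sd) / ci ** 2

# The worked example: 90% confidence (Z = 1.645), sd = 0.5, error of 0.05
raw = sample_size(1.645, 0.5, 0.05)
needed = math.ceil(raw)   # always round up: you can't poll part of a person
print(raw, needed)
```

Halving the acceptable error to 0.025 quadruples the required sample, because ci is squared in the denominator.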

Notice that the population size doesn’t even play into this formula.

Population size is an issue, though. Smaller populations tend to display more variability so smaller populations need proportionally larger samples to capture that variability. For populations of 100 or less, it is suggested that a sample not be taken and that the whole population be polled.

Strata and clusters

What makes medical residents miserable? That’s the simple question a team of researchers sought to answer in a 2009 Brazilian study (Macedo, Paula Costa Mosca, et al. (2009) Preditores de qualidade de vida relacionada à saúde durante a residência médica em uma amostra randomizada e estratificada de médicos residentes. Revista Brasileira de Psiquiatria, vol. 31, no. 2 São Paulo, June 2009 – there is an English translation at http://www.scielo.br/scielo.php?pid=S1516-44462009000200007&script=sci_arttext accessed 7/22/2017). The physical condition of residents seemed to be better than their mental state, but mental state improved in the second and third years of residency. Presumably, in Brazil, as in the United States, the new guy does all the work. One of the big factors found was whether the “ambitions and hopes” of the residents matched their personal experience and perception of their position in life. Other factors were sufficient leisure time and fewer than 30 hours a week caring for critical patients.
The study looked at a stratified sample of residents at Universidade Federal de São Paulo. In other words, random samples were drawn from three different populations – residents in each of the three years of residency. Why did they do that?

The well composed research report is explicit about that. “Randomization was carried out for each year of residency to guarantee representation over all years and maintain the total population proportions of medical residents.” Those are the usual reasons for stratifying samples. If a study involves a well structured population containing several subgroups that differ in factors important to the study, then it’s a good idea to draw separately from each subgroup in a way that creates a study group that has approximately the same proportional representation of each subgroup as in the population being sampled. The subgroups in the population should, together, make up the whole population, and there should not be overlap between the subgroups.

A well designed stratified sample can also show less variability of its sample mean than a simple random sample.
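
A proportional stratified draw might look like the following sketch. The strata sizes and resident labels are invented for illustration (they are not the numbers from the Brazilian study), and the function name is hypothetical.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion
    to that stratum's share of the whole population."""
    rng = random.Random(seed)
    pop_size = sum(len(members) for members in strata.values())
    picked = {}
    for name, members in strata.items():
        # Proportional allocation; rounding can leave the total off by one
        k = round(total_n * len(members) / pop_size)
        picked[name] = rng.sample(members, k)
    return picked

# Hypothetical strata: non-overlapping, and together they are the population
residents = {
    "year 1": [f"R1-{i}" for i in range(200)],
    "year 2": [f"R2-{i}" for i in range(150)],
    "year 3": [f"R3-{i}" for i in range(150)],
}
picked = stratified_sample(residents, 50, seed=1)
for year, group in picked.items():
    print(year, len(group))
```

With 200, 150, and 150 residents, a sample of 50 comes out as 20, 15, and 15, the same 40/30/30 split as the population.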

In a stratified sample, the strata are clearly different – that’s why they’re selected. In another form of sampling, cluster sampling, the population is divided into subgroups (clusters) that are similar. There should be as much diversity as possible inside each group, though. Each group should look like the population. The clusters are what are sampled.

In a single-step cluster sampling, the clusters to be studied are chosen at random from the group of clusters. All the individuals in the chosen clusters are included in the study. In a two-step cluster sampling, individuals from each of the selected clusters are randomly sampled.
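
The two flavors differ only in what happens after the clusters are chosen, which a short sketch makes plain. The clinics and patient labels below are hypothetical, as is the function itself.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Pick n_clusters clusters at random.
    Single-step: keep everyone in each chosen cluster (per_cluster=None).
    Two-step: randomly sample per_cluster individuals within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # sample cluster names
    result = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            result[name] = list(members)
        else:
            result[name] = rng.sample(members, min(per_cluster, len(members)))
    return result

# Hypothetical population: 8 clinics with 30 patients each
clinics = {f"clinic {c}": [f"{c}-{i}" for i in range(30)] for c in "ABCDEFGH"}

one_step = cluster_sample(clinics, 3, seed=7)
two_step = cluster_sample(clinics, 3, per_cluster=10, seed=7)
print({name: len(group) for name, group in one_step.items()})
print({name: len(group) for name, group in two_step.items()})
```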

The primary reason for cluster sampling is to reduce the time, work, and cost of a study. Unlike stratified sampling, it is not done to reduce the expected random error in the study; in fact, cluster samples usually carry somewhat more random error than simple random samples of the same size. Cluster sampling is very useful when a sampling frame is not available for the population.

Parameters and statistics

Now a word from our sponsor, “Terminology”.

Technically, if a “statistic” refers to a sample, then it is, indeed, a statistic. If it refers to a population, it’s called a “parameter”. So an average can be either a statistic (if it is the average of the members of a sample) or a parameter (if it is the average of the members of a population). Statistics are usually abbreviated with Latin letters (for instance, m for mean). Parameters usually get Greek letters (for instance, mu – µ – for mean).

The differences are actually a little more than just convention because statistics are often calculated a little differently than parameters. For instance, the formula for the standard deviation of a population is the square root of a fraction with the number of members of the population as the denominator. When the standard deviation is calculated for a sample, the denominator is usually the degrees of freedom (number of members minus 1) instead. Degrees of freedom are used to offset the bias introduced by measuring deviations from the sample mean rather than from the unknown population mean, which makes the spread look a little smaller than it really is. In fact, even with small populations (containing fewer than 10 members) it is better to use the “sample” standard deviation.
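
Python's standard statistics module keeps the two calculations separate, which makes the difference easy to see. The six data values here are arbitrary, chosen only for the demonstration.

```python
import statistics

values = [4, 8, 15, 16, 23, 42]   # arbitrary illustrative data

# "Population" formula: divide by the number of members, n
pop_sd = statistics.pstdev(values)

# "Sample" formula: divide by the degrees of freedom, n - 1
sample_sd = statistics.stdev(values)

print(pop_sd, sample_sd)
```

The sample version is always the larger of the two, by a factor of the square root of n/(n - 1), and that factor shrinks toward 1 as the sample grows.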

Sampling distributions

I have a program in DANSYSX called “Monte”. Not “full monte”, that’s a whole other thing, although this one does strip – it strips samples from data sets automatically. The procedure is called “Monte Carlo” because, like what happens in casinos, Monte does a lot of random things. Specifically, Monte randomizes a data set and selects a specified number of data points off the top, then it continues doing that a specified number of times.
Monte Carlo methods are numerical, meaning that you don’t have to juggle equations to get answers (those methods are called “analytic methods”), so there are a lot of problems that are untouchable by algebraic means that you can solve by Monte Carlo methods. In essence, Monte Carlo methods are simulations of probabilistic, nondeterministic processes. There are so many probabilistic processes in the real world that it’s no surprise that Monte Carlo methods are popular.

They offer another approach to calculus – summations, integration, differentiation, and optimization. They also provide a novel, but very intuitive approach to statistics, and that is how we will be using them here.

Means

We’re getting to the point that we need to talk about things before we actually address them. Statistics can be a rather convoluted field sometimes. But I will make some brief inroads into some areas that we will explore in more detail later.
You’ve likely heard of means, or averages, before – they’re almost the same things. I will give you my favorite explanation. If you dumped all the data points in a big data set into a bag and then reached in and drew a data point out at random, what would be your best prediction of what that data point would be before you looked at it? The answer is, “the mean.” That is why the mean is often called the “expected value.”

But there are several kinds of means. If you hear someone say, “the mean,” or “the average” without qualification, you are justified in understanding them to be talking about the arithmetic mean, or arithmetic average. That’s the most common one. To find the arithmetic mean, you add all the values of a variable in a data set and then you divide the result by the number of values. That will give you a “middle value”.

Now, let’s look at the Old Faithful data again, specifically the intereruption times. The arithmetic mean is 71 minutes and here is a histogram of the intereruption times.

Obviously, this mean isn’t very faithful to my favorite explanation. If you were to bet on a data value selected at random, 71 would be a very poor value to bet on. Let’s use MONTE to see what happens when we take repetitive samples of different sizes and calculate the average of the averages of the samples. Here are histograms for sample sizes of 5, 10, 50, and 150. All include averages of 400 samples.

The averages remain fairly stable, around the average of the raw data. But the histograms change. There is a ghost of the old bimodality when samples of 5 or 10 are taken but, for larger samples, it’s hard to see even a hint of bimodality. With such prominent bimodality, it is surprising how quickly it goes away using MONTE.

But look at the standard deviations of the sample means! The standard deviation of the means (also called the standard error of the mean) tells how far the means wander from the population average. For samples of size 5, the standard deviation is 5.22. For samples of size 10, it’s 3.9. For samples of size 50, it’s 1.54. And for samples of 150, the standard deviation is 0.6. The sample means cluster closer and closer to the population mean as the samples get larger. In fact, the standard error of the mean is equal to the population standard deviation divided by the square root of the size of the sample. So, now you can see why, if you have to draw samples from a population, larger samples give you better statistics. A large sample mean is more likely to hit the population mean than a small sample mean and, even if it doesn’t, it will tend to fall closer to the population mean.

This phenomenon is called the Central Limit Theorem. As larger samples are taken repeatedly from data that deviates strongly from normal, the distribution of the averages derived from the samples becomes more and more normal. This concept is often used to make statistics “behave”.
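
You can repeat the experiment yourself with a few lines of Python. This is a rough stand-in for what Monte does, not its actual code, and the bimodal population below is synthetic (two made-up humps), not the Old Faithful data, so the numbers will not match the ones quoted above.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic bimodal population: a smaller hump near 55 and a larger one near 80
population = ([rng.gauss(55, 6) for _ in range(3500)] +
              [rng.gauss(80, 6) for _ in range(6500)])
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

def monte(data, sample_size, repetitions, rng):
    """Draw random samples over and over and return the mean of each one."""
    return [statistics.fmean(rng.sample(data, sample_size))
            for _ in range(repetitions)]

results = {}
for n in (5, 10, 50, 150):
    means = monte(population, n, 400, rng)
    observed_se = statistics.stdev(means)   # spread of the sample means
    predicted_se = pop_sd / n ** 0.5        # population sd / sqrt(sample size)
    results[n] = (statistics.fmean(means), observed_se, predicted_se)
    print(n, round(results[n][0], 2), round(observed_se, 2), round(predicted_se, 2))
```

The average of the sample means stays near the population mean at every sample size, while the observed spread shrinks roughly in step with the predicted standard error.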

Proportions

Other statistics also follow the Central Limit Theorem. For instance, if you are looking at what proportion of a population has a characteristic and you start taking samples, the collected proportions will approximate a normal distribution. The standard error of the proportions will, like the standard error of the mean, be the standard deviation of the proportion divided by the square root of the sample size, which works out to the square root of p(1-p)/n, where p is the proportion and n is the sample size.
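
As a quick check on that formula, the sketch below compares it with a brute-force simulation. The proportion (0.3), sample size (200), and number of repetitions are arbitrary choices for illustration.

```python
import math
import random

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

p, n = 0.3, 200                      # hypothetical proportion and sample size
formula_se = se_proportion(p, n)

# Simulate 2000 samples of size n and collect the sample proportions
rng = random.Random(3)
proportions = []
for _ in range(2000):
    hits = sum(1 for _ in range(n) if rng.random() < p)
    proportions.append(hits / n)

mean_p = sum(proportions) / len(proportions)
simulated_se = math.sqrt(
    sum((x - mean_p) ** 2 for x in proportions) / (len(proportions) - 1)
)
print(formula_se, simulated_se)
```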

Sums and differences

The sum of statistics (like the mean) is sometimes of interest to researchers. For instance, if you take the mean of a control group and the mean of a study group, and add them, the sampling distribution of these sums, you guessed it, is normal.

The standard error of these sums is easy – just add the squared standard errors of the two means (each sample variance divided by its sample size) and take the square root. Yep, variances are additive like that.

Differences look like they should be more complicated, but the variances still add. To find the standard error of the difference between two means, you divide each variance by the size of its sample, add the results, and then take the square root of the sum.
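
A minimal sketch of that last calculation: divide each sample variance by its sample size, add, and take the square root. The variances and sample sizes below are invented for the example.

```python
import math

def se_of_combined_means(var1, n1, var2, n2):
    """Standard error for the sum or difference of two independent sample means."""
    return math.sqrt(var1 / n1 + var2 / n2)

# Hypothetical groups: variance 4.0 with 25 cases, variance 9.0 with 36 cases
se = se_of_combined_means(4.0, 25, 9.0, 36)
print(se)   # sqrt(0.16 + 0.25) = sqrt(0.41)
```

This assumes the two samples are independent of each other; paired or otherwise related samples need a different calculation.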

Variances

The standard error of the standard deviation of a sample is just the standard deviation of the sample divided by the square root of double the sample size. The problem is that this formula can be very inaccurate for even moderate sized samples. The sampling distribution of the standard deviation converges to a normal shape very slowly as sample sizes grow.

Unbiased estimates

What if you do that Monte Carlo trick of taking random samples over and over and the average statistic does not tend to the true value of that statistic for the population? Such a statistic is called “biased”, and biased statistics are used, but they should only be used with an understanding of the bias involved. But how do you, the statistic user, know whether a statistic is biased or not?

Look up the Wikipedia article on “Cramer’s V” and in it you will see this, “Cramér’s V can be a heavily biased estimator of its population counterpart and will tend to overestimate the strength of association.” The answer is, “look it up.”

But how do they know?

That’s the difference between practical and theoretic statisticians. Practical statisticians don’t need to know calculus to do their job; theoretical statisticians do. Theoretical statisticians are the ones that create the statistical tools that the practical statisticians use. They also have methods for testing the statistics they (and others) create, so statistics journals have lots of information on how particular statistics are or are not biased.

Statistics are there to give you accurate estimates of population parameters, to some degree of accuracy. A statistic’s bias tells you how well it does that. An accurate estimator is a statistic that has a small bias.
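
The Monte Carlo trick can expose bias directly. The sketch below, using a synthetic normal population and arbitrary sizes of my own choosing, compares two versions of the sample variance: one dividing by the sample size and one dividing by the degrees of freedom.

```python
import random
import statistics

rng = random.Random(1)

# Synthetic population with a known variance (about 100)
population = [rng.gauss(0, 10) for _ in range(20000)]
true_var = statistics.pvariance(population)

divide_by_n = []
divide_by_df = []
for _ in range(4000):
    sample = rng.sample(population, 5)
    divide_by_n.append(statistics.pvariance(sample))   # denominator n
    divide_by_df.append(statistics.variance(sample))   # denominator n - 1

avg_n = statistics.fmean(divide_by_n)
avg_df = statistics.fmean(divide_by_df)
print(true_var, avg_n, avg_df)
```

With samples of 5, the divide-by-n version averages about four fifths of the true variance, while the degrees-of-freedom version averages close to the true value. That persistent shortfall is what "biased" means.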

One classical statistical problem is how to choose from among the many measures of association when working with nominal data. Most commercial packages for dealing with nominal data will spit out a dizzying array of such numbers intended to tell you just one thing – how strongly are the variables related. And all the numbers are invariably different – often very different. How can this be good?

Actually, the different statistics often measure different things and they measure them in different ways. You have to know what the statistic you are using measures and how it measures it. Also, different measures of association may be biased to different degrees, and one of them may either overestimate or underestimate the population parameter.

The moral is, be familiar with your tools.

Point and interval estimates

We will be looking at how to test research questions given observed data later but we are at a point now to understand a very basic concept underlying two kinds of such tests.

Research is driven by questions. Questions generate hypotheses.

Here’s an excerpt from an abstract of the study “Association of Depression and Diabetes Complications: A Meta-Analysis” (de Groot, Mary, Ryan Anderson, Kenneth Freedland, Ray Crouse, and Patrick Lustman (2001) Psychosomatic Medicine, July-August 2001, Vol. 63, Issue 4, pp619-630).

“A total of 27 studies (total combined N = 5374) met the inclusion criteria. A significant association was found between depression and complications of diabetes (p < .00001, z = 5.94). A moderate and significant weighted effect size (r = 0.25; 95% CI: 0.22–0.28) was calculated for all studies reporting sufficient data (k = 22). Depression was significantly associated with a variety of diabetes complications (diabetic retinopathy, nephropathy, neuropathy, macrovascular complications, and sexual dysfunction). Effect sizes were in the small to moderate range (r = 0.17 to 0.32).”

What does all that mean? Well, for our purposes here, the “p<” and “CI:” parts are what we want to get a grasp of. As for the technicalities, the research question for these folks was “Is depression associated with medical complications for persons with diabetes?” This was a meta-analysis, so they were looking through related journal articles and trying to make sense of the various reports. When they pulled all the data together, they found that “A significant association was found between depression and complications of diabetes (p < .00001, z = 5.94).” and that “A moderate and significant weighted effect size (r = 0.25; 95% CI: 0.22–0.28) was calculated for all studies reporting sufficient data (k = 22).”

The research question generated two major hypotheses and two minor ones.

There will always be at least two hypotheses: a null hypothesis and an alternative hypothesis. The primary null hypothesis for this study is “There is no association between depression and medical complications for patients with diabetes.” The alternative hypothesis is, “There is an association between depression and medical complications for patients with diabetes.”

You see, in modern science, researchers try to disprove their pet hypothesis, so the hypothesis they test is the one they don’t want to see – the null hypothesis. They want to know if they can safely dismiss that hypothesis. If they can, then they feel justified in thinking that their first guess (the alternative hypothesis) is correct.

The results of any statistical test should always contain two parts, a measure for the size of an effect and how much you should trust the result, what is called “statistical significance”. There are two approaches for testing statistical significance and they differ in method and philosophy.

As you should know by now, the average (or any other statistic) measured for a sample is not necessarily the average of the population the sample was taken from. The relevant question is, how far off might we be? And can we ignore the difference?

Meta-analyses have a complication not present in single studies. The researchers have to combine results from several studies, probably several different kinds of measures, to represent how “big” the results were. There are ways to do that and the measure that results is called an effect size. Here, the effect size is z=5.94. It’s an average measure of the association seen in all the studies. A z of 0 would mean that no association existed between depression and complications of diabetes.

Statistics like z (which is a standard normal score so it is from a normal distribution with a mean of 0 and standard deviation of 1) are predictable. The most likely z score is 0. The farther it is from 0 in either direction, the less likely it is. So, is a z score of 5.94 surprising? The z score lets us measure how likely a result this extreme would be if the null hypothesis were true (remember, we test the null hypothesis), and the probability of finding a z of 5.94 or more from a standard normal curve is less than 0.00001. If we reject the null hypothesis on this evidence, that is the probability that we are rejecting it when it is, in fact, true. 1 time out of 100,000 is a very small probability and the probability that the researchers found was even less. So we should reject the null hypothesis, right?
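
If you want to see just how small that probability is, the tail area of the standard normal curve can be computed with Python's math module. This is a generic calculation, not the method the study's authors used, and the function name is my own.

```python
import math

def upper_tail_p(z):
    """Probability that a standard normal score is at least z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_one_tail = upper_tail_p(5.94)
p_two_tail = 2 * p_one_tail     # counts scores this far from 0 in either direction
print(p_one_tail, p_two_tail)

# For comparison: about 5% of the curve lies above z = 1.645
print(upper_tail_p(1.645))
```

Either way you count it, the result is far below the 0.00001 quoted in the abstract.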

This process of finding a measurement, a single point value, and checking its probability – the amount that it surprises us – is called a point estimate. z is a statistic. It comes from a sample so it is not the value that we would get if we were measuring the whole population. It is an estimate, but this study indicates that the effect size measurement obtained is a good measurement of the true effect size.

Another test is provided for all the studies that reported sufficient data. “r” is usually used to represent another measure of association called a correlation coefficient. Here, r is 0.25, again indicating a positive association. Correlation coefficients usually run from -1 to 1 with negative values indicating that, as one variable (say, depression) increases, another one (complications from diabetes, for example) consistently decreases. Positive values indicate that, as one variable increases, the other variable also increases. A correlation coefficient of 0 indicates that the two variables do not change in any related way.

Like all the other statistics we’ve seen, correlation coefficients taken from random samples of reasonable size tend to follow a normal distribution, so if you know what the average sample coefficient is (and we do, it’s 0.25), and we know how these sample values vary (and, although the variation isn’t reported here, the researchers knew), we should be able to figure out what values lie within, say, 95% of the normal curve from the true average, and that’s what the study reports. The researchers could be 95% sure that the true average lay between r=0.22 and r=0.28. The technical term for the 95% is the confidence level of the interval (the leftover 5% is called alpha). The important thing here is that what isn’t between 0.22 and 0.28 is 0. The researchers could state that they could be 95% confident that the true population value of the correlation coefficient was not 0, that there was less than 5% probability that the variables they were looking at were not related.

This kind of reported measure is called an interval estimate.

It is not too common that point estimates and interval estimates of a statistic disagree, but because that sometimes happens, I, personally, believe that both point and interval estimates should always be reported if at all possible and, with modern statistical software, it is almost always possible. Given both forms of test results, with an understanding of what point and interval estimates mean, it should be fairly easy to understand conflicts between them.

We will be looking closely at point estimates derived from statistical tests later, but let’s look a little more closely at confidence intervals and how they are calculated.

Confidence intervals

Perhaps an easy way to see what’s going on with these confidence limits is to look at one.

Is 98.6 normal body temperature? Does body temperature differ for males and females? A study was done and reported in “A Critical Appraisal of 98.6 Degrees F …,” by Mackowiak et al., in the Journal of the American Medical Association (vol 268, pp. 1578-80, 1992). There were 65 healthy men and 65 healthy women in the study.
I loaded the body temperatures recorded for the two groups of subjects, male and female, into DANSYS. I won’t reproduce the raw data here but will start with the descriptive statistics. Keep in mind, these are statistics, not parameters. We’re trying to generalize from this study group to the larger human population.

The average male body temperature was 98.1 with a standard deviation of 0.7 and variance of 0.49. The average female body temperature was 98.4 with a standard deviation of 0.74 and a variance of 0.55. The averages differ by about 0.3 degrees, but, given the variability of the two groups, is that enough to indicate a real difference in the population?

This is a random sample, so we can assume that, if random samples are drawn over and over and collected, the sample average and variability will follow a normal distribution, and that the reported statistics would be somewhere in that mix. So we already have a lot of information to play with.

We know that about 68% of the sample means will be within one standard deviation of the true mean and that 95% will be within two standard deviations. Actually, 95% of the means will be within 1.96 standard deviations and that’s the confidence limit we will be aiming for. We want to be sure to capture 95% of the sample means. That’s the confidence level for our test – 95% (which leaves an alpha of 5%).

We also know how the difference between sample means behaves. The standard error for the difference of these sample means is calculated by dividing the variances by the sample sizes and adding them, then taking the square root. The result is 0.13. That is the standard deviation for sample differences. We can expect 95% of the differences between sample means to fall within 1.96 standard deviations of the population difference, and that is exactly what the confidence interval is. Since we don’t know the population difference between the means, we use what we do know – the sample difference.

Thus, the 95% confidence interval is from about 0.04 to 0.54. Remember that this is the interval within which you expect to find 95% of the differences between sample means. So, how do we use this?

Well, what are we trying to find? Our alternative hypothesis is that there is a difference – a real difference, not just one caused by random error. The null hypothesis that we are testing is that there is no real difference, or that the real population difference between the average body temperatures of healthy men and women is zero. If 95% of the sample differences are between 0.04 and 0.54, then we can be 95% sure that 0 isn’t a real population difference. In other words, we can be 95% sure that there really is a difference between male and female body temperatures.

If you want to be more than 95% confident, say 99% confident, you simply use a different value in the confidence interval calculation: 2.58 instead of 1.96. Then your confidence interval will be from -0.04 to 0.62. Oops! Zero is in that interval.

You can’t be 99% sure that there is a real difference between male and female body temperatures. In point of fact, you can’t be 100% statistically sure that the difference isn’t any value whatsoever. Physically, you can be pretty sure that any body temperature isn’t less than 0 degrees Kelvin because that isn’t physically possible.
The question you have to ask before you begin looking at the numbers is, “How certain do you want to be?”
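
To play with that question, here is a sketch of the confidence interval calculation using the rounded descriptive statistics quoted above. Because the means are rounded to one decimal place, the endpoints come out slightly different from the ones in the text, which were presumably computed from the unrounded data. The function name is hypothetical.

```python
import math

def difference_ci(mean1, var1, n1, mean2, var2, n2, z):
    """Confidence interval for the difference between two sample means."""
    diff = mean2 - mean1
    se = math.sqrt(var1 / n1 + var2 / n2)   # standard error of the difference
    return diff - z * se, diff + z * se, se

# Rounded values from the text: males 98.1 (variance 0.49), females 98.4 (variance 0.55)
low95, high95, se = difference_ci(98.1, 0.49, 65, 98.4, 0.55, 65, z=1.96)
low99, high99, _ = difference_ci(98.1, 0.49, 65, 98.4, 0.55, 65, z=2.58)

print(round(se, 2))
print(round(low95, 2), round(high95, 2))   # zero falls outside this interval
print(round(low99, 2), round(high99, 2))   # zero falls inside this one
```

The story is the same as in the text: at 95% confidence the interval excludes zero, and at 99% it does not.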