Distributions

Nine out of ten doctors
“I’m not a doctor but I play one on TV” (an advertisement I remember from my childhood.) “Nine out of ten doctors prefer….”

What does that mean? Well, if someone went out and asked 10 doctors and 9 preferred A and the other schmuck preferred B, it means that those 9 doctors preferred A and the other one preferred B – nothing more.

But if they went out and randomly chose 10 doctors out of the 854,698 doctors in the United States and asked them what they preferred, it means that there’s a good chance that, if you asked any 10 doctors, 9 would prefer A and the other would prefer something else.

Better still, if you asked a tenth of the 854,698 doctors in the United States (that would be a sample of 85,469.8 doctors) about their preference and 76922.8 (that’s nine tenths of 85,469.8) answered that they preferred A and all the others preferred B, you could answer pretty confidently that there is a probability of 9/10 that any doctor will prefer A to B.

A probability is a fraction. The numerator of the fraction is the number of events of interest; the denominator is the number of total events. In the above example, there are 9 events of interest – a doctor preferring A – and 10 events total – 10 doctors asked. If you want to know the probability of throwing a 3 using a six sided die, out of 6 events (a face showing), only 1 event is of interest (that of rolling a 3), so the probability is 1/6.

Probabilities work just like all other fractions. For instance, a complicated probability can be simplified just like a complicated fraction. 76922.8/85469.8 is the same fraction as 9/10 so the probability 76922.8/85469.8 is the same as 9/10.

Like fractions, probabilities can take different forms. 9/10 is the same as 0.9, is the same as 90%. Also, “nine out of ten” is the same as “9 to 1” or 9:1, as in “odds are nine to one that….”.
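The different forms can be sketched in a few lines of Python. This is only an illustration; the function name describe is my own invention, and it assumes the counts are whole numbers.

```python
from fractions import Fraction

def describe(successes, total):
    """Return one probability as a fraction, a decimal, a percent, and odds."""
    p = Fraction(successes, total)           # Fraction reduces automatically
    failures = p.denominator - p.numerator   # events that are not of interest
    return {
        "fraction": f"{p.numerator}/{p.denominator}",
        "decimal": float(p),
        "percent": f"{float(p) * 100:.0f}%",
        "odds": f"{p.numerator}:{failures}",
    }

print(describe(9, 10))
print(describe(90, 100))   # a more complicated fraction reduces to the same 9/10
```

Using Fraction instead of plain division keeps the numerator and denominator visible, which is the point of this section.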

Simple probabilities are easy; complicated ones are just a little more difficult.

What’s in a Grecian urn? The terms

Statisticians, especially folks that teach statistics, talk a lot about urns filled with different colored marbles, coins, and card decks. People probably get the idea that statisticians are always playing games, and that might not be too far from the truth, but urns, marbles, dice and cards are just the trappings. Let’s look at a few common concepts in probability theory.
An experiment consists of one or more repeated trials.
A trial is one probabilistic event – a roll of a die, a flip of a coin. One or more outcomes can occur. For the roll of a six sided die, one of six outcomes can occur. If a desired event is specified beforehand, such as a three coming up on a six sided die, the outcome can be success or failure (or hit or miss).

Certain letters are fairly consistently used to designate specific numbers that appear in probability.

p is usually used to denote the probability of a success in a particular trial.

q is usually used for the probability of a failure on a particular trial. (If p and q are added, the sum should be 1, since the outcome can only be a success or a failure and the probability of certainty is 1.)

x is the number of successes in the whole experiment.

n is the number of trials conducted in the experiment.

Do you add or multiply?

If you throw five twenty sided dice, what is the probability that you will turn up five sevens? If you’ve had a course in probability, you probably know that you should figure out what the probability is that you throw a seven on one die (that would be 1/20) and then either add or multiply that probability by itself four times. Well, is it 1/20 + 1/20 + 1/20 + 1/20 + 1/20 or (1/20)(1/20)(1/20)(1/20)(1/20)?

With fractions, addition gives you a larger result and multiplication gives you a smaller result, so you can ask yourself if five sevens would be more likely or less likely. I would say that five sevens would be very unlikely, so the answer is probably “multiplication”.

The other situation is, “What is the probability that you would obtain an odd number on the roll of a twenty sided die?” This would be a much greater probability than 1/20, so you should guess that addition is the right operation, and you would be right.

There are also grammatical clues as to which operation to use. In the first instance, you could rephrase the problem as “What is the probability of getting a 7 and a 7 and a 7 and a 7 and a 7 when five twenty sided dice are thrown?” The second could be stated “What is the probability of throwing a 1, or a 3, or a 5, or any other odd number up to 19 when you throw a twenty sided die?”

Typically (keeping in mind the vagaries of the English language), if the individual probabilities are separated by “and” you should use multiplication to calculate the compound probability; if “or” is used, you should add.
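Here is a small sketch of both calculations for the twenty sided dice. The variable names are mine. Note the assumptions hiding behind the grammar: multiplying works here because the five dice don't affect each other, and adding works because one die can't show two faces at once.

```python
from fractions import Fraction

SIDES = 20
p_face = Fraction(1, SIDES)        # probability of any one face on one die

# "and": a 7 and a 7 and a 7 and a 7 and a 7 on five dice, so multiply
p_five_sevens = p_face ** 5

# "or": a 1 or a 3 or ... or a 19 on one die, so add
odd_faces = range(1, SIDES + 1, 2)
p_odd = sum(p_face for _ in odd_faces)

print(p_five_sevens, float(p_five_sevens))
print(p_odd)
```

Multiplication gives 1/3200000, which is tiny, and addition gives 1/2, which matches the intuition that an odd roll should be far more likely than any single face.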

I like my independence

What really determines how you calculate a compound probability is something called independence. Two events are independent if the occurrence of one does not influence the probability of the other at all. For instance, throwing a 3 on a six sided die doesn’t influence the probability of throwing a three on a second roll at all. On the other hand, drawing a particular card from a deck changes the probability of drawing a particular card on the second draw because the number of cards in the deck has changed.

Let’s look at that.

What is the probability of drawing two red cards from a deck of cards? You know that you will be looking at a product because you want the probability of drawing a red card and, then, another red card. The “and” is the giveaway. There are 52 cards in a deck and four suits. There are two colors: red and black. Half of the cards are red so the probability of drawing a red card on the first draw is 0.5. The probability of drawing a red card on the second draw is not 0.5 because there are no longer 52 cards in the deck and, if the first card is red, there are no longer 26 red cards in the deck. There are 25 red cards in the deck of 51 cards so the probability of drawing a red card on the second draw is 25/51. The probability of drawing two red cards, then, is 1/2 * 25/51 or about 0.245.

That brings up another concept of probability. If you have a collection of different colored marbles in an urn and you want to determine the probability of drawing two marbles of a specific color, you could draw one, put it back, shake the urn up, and draw another. That would be drawing marbles “with replacement”. If you did not put the first marble back, then the drawing would be “without replacement”. The two cases would lead to different probabilities. For instance, in the above example of drawing two red cards, with replacement, the probability of drawing two red cards is 0.5 * 0.5 or 0.25.
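The two cases for the red cards can be put side by side in a short sketch. The function name p_two_red and its replace flag are my own, made up for illustration.

```python
from fractions import Fraction

def p_two_red(replace):
    """Probability of drawing two red cards from a 52-card deck."""
    red, total = 26, 52
    first = Fraction(red, total)
    if replace:
        second = Fraction(red, total)          # the deck is restored
    else:
        second = Fraction(red - 1, total - 1)  # one red card is gone
    return first * second

print(float(p_two_red(replace=True)))             # 0.25
print(round(float(p_two_red(replace=False)), 3))  # 0.245
```

With replacement the two draws are independent; without replacement the second probability depends on the first draw.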

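So, what is the probability of drawing a straight flush in a hand of poker? A straight flush is five cards of sequential rank of the same suit. To keep things simple, suppose the cards come off the deck from highest to lowest and leave the aces out of it. The first card can then be any card above 5, but once that card is drawn, all the others are set. If a 7 of hearts is drawn first, the rest of the cards must be 6, 5, 4, and 3 of hearts. So the first card can be any of 32 cards and the probability of drawing this is 32/52 or 8/13. Each of the other cards is one specific card out of the remaining deck, so their probabilities are 1/51, 1/50, 1/49, and 1/48. Multiplying the five probabilities gives about 1.03×10^-7.

That is the probability of drawing one of the 8*4=32 straight flushes in descending order. The five cards of a hand can arrive in any of 5*4*3*2*1 = 120 orders, so the probability of drawing any of the 32 straight flushes is 120 times larger, or about 1.23×10^-5. (Poker also lets the ace sit at either end of a straight, which makes 40 straight flushes and a probability of about 1.54×10^-5.)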

Just on the condition

Often, the question is, “Just on the condition that event A has already happened, what is the probability that event B will happen?” For instance, “My thumb hurts, what’s the probability that it’s arthritis?” This is called a conditional probability. The probability of event B is conditional on event A.

This situation is symbolized by P(B|A), which is read “The probability of B given A.”

It seems intuitive that the probability of two events occurring together (A and B) will be equal to the probability that the unconditional event (A) occurs multiplied by the probability that, given that A occurred, the conditional event (B) occurs. Mathematically, that would lead to:

P(A and B)=P(B|A)P(A)

Usually “the probability of A and B” is symbolized “P(A ∩ B)”.

That leads to the usual formula for a conditional probability:

P(B|A)=P(A ∩ B)/P(A)

So, let’s say the probability that a sore thumb and arthritis will occur together is 15%, and you are 100% certain that your thumb hurts. This formula gives the completely sensible result that you have a 15% chance of having arthritis.

It looks so sensible, in fact, that you might wonder what good a conditional probability is.

Well, let’s say that you can estimate that, since five out of fifteen people in your last generation had sore thumbs, there’s a 33% chance that you will also have a sore thumb. What’s the chance you will develop arthritis? (You don’t know what caused your ancestors’ sore thumbs, you only know a reported correlation.) That works out to .15/.33 or about a 45% chance that you will have arthritis.
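Rearranging P(A and B)=P(B|A)P(A) gives P(B|A) as the joint probability divided by P(A), and both sore thumb cases can be run through that. This is a minimal sketch; the function name conditional and its sanity checks are my own additions.

```python
def conditional(p_a_and_b, p_a):
    """P(B|A) = P(A and B) / P(A)."""
    if not 0 < p_a <= 1:
        raise ValueError("P(A) must be greater than 0 and at most 1")
    if not 0 <= p_a_and_b <= p_a:
        raise ValueError("P(A and B) must lie between 0 and P(A)")
    return p_a_and_b / p_a

# You are certain your thumb hurts: P(A) = 1
print(conditional(0.15, 1.0))

# Five of fifteen relatives had sore thumbs: P(A) = 5/15
print(round(conditional(0.15, 5 / 15), 2))
```

The second check is worth keeping in real code: the probability of A and B happening together can never be larger than the probability of A alone.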

Counting is fundamental

You may have noticed that, not only is a probability a fraction, but the numerator and denominator are both counts. That places the mathematics of counting, combinatorics, into intimate contact with probability.
For instance, there is only one way to draw four aces from a deck of cards, but how many four card hands are possible?

For the first card, you have 52 possibilities. For the next, you have 51 cards to choose from, so there are 52*51 = 2652 two card hands. Then, for the next card, you have 50 possibilities, so there are 52*51*50 or 132,600 three card hands. Of course, you see what’s coming – there are 52*51*50*49 or 6,497,400 ways to draw all the four card hands. But we’re not interested in the order the cards are drawn, so, since there are four ways to draw the first card, three ways to draw the next, two for the next, and then the one left, we need to divide the 6,497,400 hands by 4*3*2*1 = 24, giving us a count of 270,725 hands.

If you’ve been sharp, you might have noticed something that looks like our old friend, the factorial, in these calculations. In fact, you should be able to see why the number of ways to choose k things from a set of n elements is n(n-1)…(n-k+1)/k!. With a little algebra, that works out to the usual formula for counting such combinations: n!/(k!(n-k)!).
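The four card hand count can be checked both ways in Python. The helper falling_product is a name I made up for the n(n-1)…(n-k+1) part; math.comb is the standard library's version of n!/(k!(n-k)!).

```python
import math

def falling_product(n, k):
    """n(n-1)...(n-k+1): ordered ways to draw k things from n."""
    result = 1
    for i in range(k):
        result *= n - i
    return result

ordered = falling_product(52, 4)        # ordered four card draws
hands = ordered // math.factorial(4)    # divide out the 4*3*2*1 orderings
print(ordered, hands)
print(hands == math.comb(52, 4))        # same count from n!/(k!(n-k)!)
print(1 / hands)                        # chance that a four card hand is four aces
```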

Combinatorics is a very large and intricate field of mathematics but, given the above, I think you can see why it comes up over and over in both probability theory and statistics, so, when you see factorials popping up in the formulas for nonparametric statistics later, you’ll have a feeling for why they’re there.

Deviates

Questions like, “What is the probability of drawing two cards of the same suit in two draws from a shuffled deck of playing cards?” are easy to grasp because playing cards are easy to count. How do statements like, “One in sixty-eight children has an autism spectrum disorder,” or “A randomly selected adult human has a 66% chance of being between 61 inches and 67 inches tall,” happen? Did someone go out and count all the children and those with autism spectrum disorders? Is there a census of the heights of all adult humans?

The answer is that random items are distributed in predictable ways. A random variable is a characteristic that can take any value within a certain range and the value of which arises sequentially in an unpredictable fashion. For instance, if you decide to walk down your street and ask the fifth person you meet their height, you will not be able to predict the exact value. People’s heights are a random variable. Any particular value of a random variable is a random deviate.

What does a random deviate deviate from? Why, an expected value. If you had to guess the height of a randomly selected human, your best bet would be the average height of all humans, which is 64 inches. You should probably be surprised if any particular human, randomly selected, has a height of exactly 64 inches. You would expect any particular height to deviate from the expected height.

Looking at distributions

You can know the probability of an event by understanding the pattern that events like the one you’re interested in follow. Remember that there are two kinds of statistics. One describes data and the other tests data to tell you whether what you think you see happening is actually happening or not.
The task of descriptive statistics is to tell you the patterns that data follow. One kind of descriptive statistics is exploratory statistics – statistics that allow you an initial look at your data to give you an idea of what you’re looking at. Graphics are exploratory statistics that help you grasp patterns in data by translating them into the most “graspable” sense – vision.

James L. Bruning and B. L. Kintz include the following data set in the second edition of their classic, “Computational Handbook of Statistics”. It collects the heights of twenty twelve-year-old school children. To help visualize the data, I have broken them down into frequency classes and then graphed those. You’ve seen the process before in the StatFiles. I’ll go over it again.

Take each data value and throw it into the bin that contains data values of a particular range. So 57 inches is tossed into the bin of values between 55 inches and 58 inches. 67 ends up in the bin of values from 64 inches to 67 inches and so on.

The first column of the data contains the actual data values. The second contains the upper limits for each bin, and the third column is the number of values in each bin. The graph is a bar graph of the frequencies – you know that kind of graph as a histogram.

This graph allows you to see what the data are doing. Does it surprise you that the histogram looks somewhat like the data for Old Faithful?

At this age, male and female children are growing at drastically different rates so what this histogram tells you is that we actually have two different groups of data.
Histograms are the specific tool we use to look at data. But how do you get probabilities from that?

In this instance, you have 20 children and 6 of them are between 58 and 61 inches in height, so you can say that, according to this sample, you can expect there to be a 6/20 or 30% probability that a random 12 year old child will be between 58 and 61 inches tall. The caveat, and it is a strong caveat given the small sample size, is “according to the sample”. Generalizing from a sample this small is dangerous.
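The binning process can be sketched like this. Be warned that the heights below are numbers I made up for illustration; they are not the Bruning and Kintz values. The function name bin_counts and the choice of upper limits are also my own assumptions. Each value goes into the first bin whose upper limit it does not exceed, so 57 lands in the bin ending at 58 and 67 lands in the bin ending at 67.

```python
from bisect import bisect_left

def bin_counts(values, upper_limits):
    """Count the values falling at or below each upper limit
    (and above the previous one). upper_limits must be sorted."""
    counts = [0] * len(upper_limits)
    for v in values:
        i = bisect_left(upper_limits, v)   # first bin whose limit is >= v
        if i == len(upper_limits):
            raise ValueError(f"{v} is above the last bin limit")
        counts[i] += 1
    return counts

# Hypothetical heights in inches (NOT the data set from the handbook)
heights = [52, 54, 55, 56, 57, 57, 58, 59, 59, 60,
           60, 61, 61, 62, 63, 65, 66, 66, 67, 68]
upper_limits = [55, 58, 61, 64, 67, 70]

counts = bin_counts(heights, upper_limits)
for limit, count in zip(upper_limits, counts):
    print(f"up to {limit}: {'#' * count} ({count})")

# Share of this made-up sample in the bin ending at 61 inches
print(counts[2] / len(heights))
```

The share in any bin is just its count divided by the total number of values, which is the sample's estimate of the probability for that range.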

Being discrete

We will look at a few common probability distributions but, before we do that, we should understand one more concept concerning these distributions. Some processes can be very limited in the kind of values they can take. For instance, regardless of how you throw a six sided die, there are only six values that can result: 1, 2, 3, 4, 5, or 6. If you are throwing more than one die, you can only come up with whole numbers. If you start adding 0.5 to 1 over and over again, the subtotals are all going to be multiples of 0.5. Processes like these are called “discrete”, and probability distributions generated by such processes are called “discrete probability distributions.” Discrete distributions are quite common when dealing with processes involving counts. The first distributions we will look at will be discrete distributions.

The binomial distribution

Consider the following experiment (or actually do it. All you need is a coin.) Toss the coin 6 times. What is the probability that you will obtain exactly three heads?

There are 6 trials here. The number of hits is three, and the probability of a hit is 0.5. We talked about combinations earlier. You can calculate your probability by taking the combination of trials (6) taken 3 at a time, and multiplying the result by the probability of a hit (0.5) raised to the number of hits times the probability of a miss (0.5) raised to the number of misses. Crunch the numbers and you should get 0.3125.

That should make some sense. The calculation simply gives you the probability of getting three hits and three misses – that is, the probability of getting a hit three times (0.5*0.5*0.5) and the probability of getting three misses (0.5*0.5*0.5). But the number of ways to do that is the combination of six trials taken three at a time, so the probability of getting three hits and three misses one way must be multiplied by the number of ways it can happen – hence the calculation we did.
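The coin toss calculation translates almost word for word into code. This is a sketch using the p, q, x and n letters introduced earlier; the function name binomial_pmf is my own.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x hits in n independent trials,
    each with hit probability p."""
    q = 1 - p                                   # probability of a miss
    return math.comb(n, x) * p**x * q**(n - x)  # ways * hits * misses

print(binomial_pmf(3, 6, 0.5))   # 0.3125

# The probabilities for 0 through 6 heads should add up to 1
print(sum(binomial_pmf(x, 6, 0.5) for x in range(7)))
```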

The number of hits (or misses) is a binomial deviate. If you repeat the experiment many times and record the results, you will have a binomial distribution. To make it clear, let me list the characteristics of a binomial distribution.

There are repeated trials in the experiment.

Each trial can only have two outcomes (hit or miss).

The probability of a hit is the same for all trials.

The trials are independent. The result of one trial does not affect the result of any other trial.

If you’ve read my explanation of a normal distribution on the Therian Timeline, you may say here – “Are you sure you were not talking about a binomial distribution?” Okay, I admit it. What I was describing was indeed compound binomial distributions. They were certainly discrete processes, being built up from counts. But you will find out later that the normal distribution is intimately related to the binomial distribution.

Binomial distributions appear quite often in nature, so often that most statistical software, including the functions that show up in spreadsheets, include commands and functions specifically for binomial distributions. We will run into the binomial distribution again and again.

You may have heard another term for a binomial experiment – “Bernoulli process”. A single trial of such an experiment is called a Bernoulli trial.

The Poisson distribution

Another kind of discrete experiment involves a “region”. It can be a region such as a geographic area, or it can be a conceptual region such as a Cartesian plane or data set with two variables. Again, we’re dealing with hits and misses. Those are the only two outcomes. The average number of hits in the region is known. The probability that a hit occurs is proportional to the size of the region. As the size of the region becomes very small, the chance of a hit occurring approaches zero.

The number of hits in a period of time follows a distribution known as a Poisson distribution. The Poisson distribution, for reasons that should now be clear, is often called the distribution of rare events. Imagine a big ranch with many cows. The number of cows that might be struck by lightning over a period of time could be predicted if records had been kept of lightning struck cows in the past because it would follow a Poisson distribution and Poisson distributions behave in very predictable ways.

For instance, if m is the average number of cows struck by lightning in a given year, then the chance that x cows will be struck over the next year will be:
(e^-m)(m^x)/x!

e is approximately 2.71828

So, if the ranch loses 3 cows a year to lightning, the chances that it will lose exactly 1 cow next year will be e^-3*3^1/1! or 0.15. The probability that 2 cows will be hit is e^-3*3^2/2! or 0.22.
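The cow calculation can be run for several values of x at once. This is a minimal sketch of the (e^-m)(m^x)/x! formula; the function name poisson_pmf is my own.

```python
import math

def poisson_pmf(x, m):
    """Probability of exactly x hits when the average number of hits is m."""
    return math.exp(-m) * m**x / math.factorial(x)

# A ranch that averages 3 lightning-struck cows a year
for cows in range(5):
    print(cows, round(poisson_pmf(cows, 3), 2))
```

The line for 1 cow should read 0.15 and the line for 2 cows should read 0.22, matching the hand calculation.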

Binomial and Poisson distributions have distinctive looking histograms and they have their own set of statistics for working with them, but I feel that the most important thing to know is how they arise – the kinds of processes or experiments that produce them. That way, if you know what kind of process created the data at hand, you already have an idea about the kind of distribution you’re working with.

The geometric distribution

We know how to figure out the probability of getting a hit out of, say, 6 trials. That would follow a binomial probability distribution, but how could we figure out how many trials it will take to get a hit? Let’s say we wanted to know the probability that it will take 3 trials to roll a six on a six sided die. Said another way, what is the probability of 2 failures before we get a six?

That would be called a geometric probability distribution and we could calculate that probability using the formula:

x(1-x)^k

where x is the probability of success in one trial and k is the number of failures before the first success. In this case, (1-1/6)^2*1/6 is about 0.12.

But, there is a problem. Theoretical statisticians get sloppy occasionally and, it turns out, there are two different kinds of geometric distributions. There’s the one we just talked about, and then there is the number of Bernoulli trials needed to get one hit. The difference might be subtle but it’s real. The probability that the nth trial is the first success is:

x(1-x)^(n-1)

So, before you decide that a process is giving you a geometric distribution, you first have to know which kind of geometric distribution you’re talking about!
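Here is a sketch of both conventions side by side, keeping x as the probability of success in one trial. The function names are mine. Two failures before the first six and "the third trial is the first six" describe the same event, so the two functions should agree once you shift the count by one.

```python
def geometric_failures(k, x):
    """P(exactly k failures before the first success)."""
    return x * (1 - x) ** k

def geometric_trials(n, x):
    """P(the n-th trial is the first success)."""
    return x * (1 - x) ** (n - 1)

p_six = 1 / 6
print(geometric_failures(2, p_six))   # two misses, then a six
print(geometric_trials(3, p_six))     # first six on the third roll
```

Feeding the number of trials into the failures version, or the other way around, is the mistake the two conventions invite.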

Some other discrete distributions

Binomial distributions arise when you are tracking numbers of hits or misses in a process. Data that are proportions of hits vs. misses often show a binomial distribution also. For instance, is the number of minority students in the schools of a district vs. non-minority students a matter of chance or is there some other process leading to the proportions?

Binomial distributions deal with two state systems – hits/misses, minority/majority, component A/component B, etc. What if you have more than two components in the mix? For instance, what if you have an urn containing 5 red balls, 7 green balls and 15 blue balls? What is the probability of drawing a red, a green, and a blue ball on the first three draws? What about two reds and three blues out of seven draws? When each ball is put back before the next draw, these kinds of experiments follow a multinomial probability distribution. The multinomial probability distribution is a generalization of the binomial probability distribution. The binomial distribution is a special case of the multinomial distribution.

Another distribution related to the binomial distribution is the negative binomial distribution. In a negative binomial process, like the binomial case, you have n repeated trials with only two possible outcomes – hit or miss. The trials are independent and there is the same probability of a hit on each trial. The difference is that you continue until there are x hits. n is the negative binomial variable.
That may sound a little like a geometric distribution and, indeed, the geometric distribution is a special case of the negative binomial distribution. If you study these distributions in some detail, you will find that just about all of them are related to all the rest in some way.

In a hypergeometric experiment, a sample of x individuals is randomly selected, without replacement, from a population of X individuals. Each individual can be categorized according to some characteristic as a hit or a miss. There are k hits in the population and X-k misses. The number of hits drawn in a random sample would follow a hypergeometric distribution. Later, we’re going to run into the hypergeometric distribution when considering a statistical procedure for small nominal samples called Fisher’s Exact Test.
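The hypergeometric probabilities come straight out of counting combinations. In this sketch X, k and x are the letters used above; h, the number of hits that turn up in the sample, is a letter I am adding, and the function name is my own. The example reuses the two red cards problem: 52 cards, 26 of them red, 2 drawn.

```python
import math
from fractions import Fraction

def hypergeometric_pmf(h, X, k, x):
    """P(exactly h hits in a sample of x drawn without replacement
    from a population of X individuals that contains k hits)."""
    if h < 0 or h > x:
        return Fraction(0)
    ways_hits = math.comb(k, h)            # choose the hits
    ways_misses = math.comb(X - k, x - h)  # choose the misses
    return Fraction(ways_hits * ways_misses, math.comb(X, x))

p = hypergeometric_pmf(2, X=52, k=26, x=2)
print(p, round(float(p), 3))
```

The answer, 25/102 or about 0.245, agrees with the 1/2 * 25/51 worked out by hand for two red cards without replacement.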

Usually, the nature of a discrete distribution is discovered by considering the mechanics behind it and working out the probabilities, like we did for the binomial case.

Being indiscrete…and continuous

Other distributions might not be so straightforward. I’ve said that the binomial distribution sometimes looks a lot like the normal distribution that we’ve already met. When binomial experiments stack up in a long chain of causes and effects, it becomes harder and harder to see the process as counts and easier to see it as straight measurement.

Height happens over a long chain of causes, molecular events occurring over a long period. Each event might look something like a binomial experiment but how would you add up all those counts? Generally, we don’t. We simply measure a person’s height and that height could be anything between the height of the shortest person who ever lived and the tallest person. Heights don’t have to be whole numbers. At some quantum level, they are, but the way we measure them, they can be any fractional amount. Where discrete distributions typically dealt with count data – ordinal or nominal level data – we are now moving into the realm of measurements – continuous data.

We’ve looked at frequency distributions in the form of histograms. The way continuous distributions are typically discovered is by studying these frequency distributions – not counts of specific counts but counts of measurements in equal sized intervals. No two people are exactly the same height so, if you counted the number of people in a group that have exactly one precise height, that number would always be one. The discrete distribution would be flat – each category would be the same size – one. The probability of having a particular height would be virtually zero.

More normality

Much of what is known about the discrete distributions is known from mathematical proofs derived from studying the processes that created them. All the formulas for working with the binomial distribution came from asking what would happen in a binomial experiment.
Most of what we know about the continuous distributions, on the other hand, happened the other way around. Someone saw a strange distribution that didn’t look like the known distributions and they started exploring what it acted like. For instance, a theoretical statistician would develop a test, run it many times to see what the frequency distribution looked like, and then work out the details by observation.

The normal distribution is a very well behaved distribution. Its frequency distribution is very distinctive. It is very consistent.

Not only does the normal distribution appear frequently in nature, but it is very easy to work with. Any normal distribution is completely defined by two numbers – the average (arithmetic mean) and the variance (or standard deviation). The first number tells you the value that all the other data points collect around, and the second tells you how the data values spread out. Every normal distribution with a mean of 3 and a standard deviation of 1.5 looks exactly the same, and any major divergence from normality will stand out like a sore thumb.

In a normal distribution, the frequency interval with the most data points will be the one with the average right in the center. You can say right off that about 68% of the data values will be within one standard deviation from the mean. In a normal distribution with an average value of 3 and a standard deviation of 1.5, that would mean that 68% of the data values would be between 1.5 and 4.5. You can, further, say that about 95% of the values are within two standard deviations from the mean, and that 99.7% are within three standard deviations.
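Those percentages don't have to be taken on faith. For a normal distribution, the share of values within k standard deviations of the mean is erf(k/√2), where erf is the error function from Python's math module. A small sketch, with a function name of my own choosing:

```python
import math

def within_k_sd(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

mean, sd = 3, 1.5
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    print(f"{low} to {high}: {within_k_sd(k):.4f}")
# the three shares are 0.6827, 0.9545 and 0.9973
```

Notice that the mean and standard deviation only set where the interval ends fall; the shares themselves are the same for every normal distribution.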

Here’s another weird but very useful fact about normal distributions.

Sampling distribution and the Central Limit Theorem

If you take a data set, whether it follows a normal distribution or not, and take a lot of random samples of equal size (without replacement), taking the average of each sample, and then look at the frequency distribution of the averages, you will find that, as the sample size increases, the resulting frequency distribution (called the sampling distribution) will become more and more normal looking. This is called the Central Limit Theorem. Of course, if the sample size is the same as the data set size, you will only get one average across the board.
That gives you an idea why the larger a random sample is drawn from a population, the better results you will get in a statistical analysis.

And there’s a bonus! Not only averages work like this but so do other statistics like medians, variances, and standard deviations, although measures of variance seem to converge more slowly to a normal distribution (you need larger sample sizes) than for measures of central tendency.
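You can watch this happen with a small simulation. This is only a sketch: the skewed data set is made up from an exponential distribution, the sample sizes and the seed are arbitrary choices of mine, and skewness (which is 0 for a perfectly symmetric distribution) is used here as a rough stand-in for "normal looking".

```python
import random
import statistics

def sample_means(data, sample_size, n_samples, rng):
    """Average of each of n_samples random samples drawn without replacement."""
    return [statistics.fmean(rng.sample(data, sample_size))
            for _ in range(n_samples)]

def skewness(values):
    """Rough measure of lopsidedness; near 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([(v - m) ** 3 for v in values]) / s ** 3

rng = random.Random(42)   # fixed seed so the run can be repeated
data = [rng.expovariate(1.0) for _ in range(10_000)]   # deliberately non-normal

print("data skewness:", round(skewness(data), 2))
for size in (2, 10, 50):
    means = sample_means(data, size, 2_000, rng)
    print(size, round(statistics.stdev(means), 3), round(skewness(means), 2))
```

As the sample size grows, the averages should bunch more tightly around the true mean and their skewness should shrink toward zero.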

Statistical procedures that utilize repeated random samples are called Monte Carlo methods. For instance, if you want a well behaved (normal) set of statistics for an unruly (non-normal) data set, you might want to take the statistics of a lot of random samples from the data. There is about a 68% chance that the true average will be within one standard deviation of the sampling distribution of the average you get.

Being confident

Much of statistics is a confidence game – how confident can you be that the answer you come up with is right? The p value we talked about under The Basics is about confidence.

Statisticians also talk a lot about confidence limits (and so will we, often) and the Central Limit Theorem has a lot to say about them.

Since sampling distributions of many statistics follow a normal distribution, we can define an interval in which a true average of a population will likely fall with a certain degree of confidence and that interval is called a confidence interval. The values at the end points of the interval are called the confidence limits of the average at a specified probability. These values can be calculated and I will tell you how as we go along, but you can probably already predict that there’s a 68% chance that the true population mean will lie between one standard deviation of the sample means below and above a sample mean. And that gives you a hint about how confidence intervals work.

Why you can’t have a probability greater than one and what that means with continuous distributions.

A probability of 0 means that there is absolutely no chance of an event happening. A probability of 1 means that the event is certain to occur – absolute certainty. You cannot be more certain than that. If you get a probability of 1.5 or 150%, you can be absolutely certain that there is something wrong with your calculations or that there is a bug in your program. Probability always ranges from 0 to 1.

That means that, in a probability distribution, all the possibilities for the events covered exist somewhere below the graphed curve. If you have the histogram of all human heights, every human height is represented somewhere beneath the curve of that histogram. The area underneath a curve on the graph of a probability distribution is always 1.

From here on out, we will be looking at a lot of probability distributions, most of them, unsurprisingly, normal curves. A question we will run into over and over is, “What is the area under the curve from the left end of the distribution to this particular value?”, or “…from the right end…”, or “…from this particular value to that particular value…”, or “…to the mean…” These are questions of probability. For instance, “What is the probability that a person will be between 50 inches tall and 60 inches tall?” is the same as saying, “In the normal probability distribution of human heights, what is the area under the curve between 50 inches height and 60 inches height?” If this isn’t very clear to you now, it soon will be as we continue through the StatFiles.
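As a preview, here is how such an area can be looked up with Python's statistics.NormalDist. The mean of 64 inches is the figure used earlier; the standard deviation of 3 inches is purely my assumption for the sake of the sketch, and the function name area_between is my own.

```python
from statistics import NormalDist

# mu is the average height used earlier; sigma=3 is an assumed value
heights = NormalDist(mu=64, sigma=3)

def area_between(dist, low, high):
    """Area under the curve between two values, i.e. P(low < X < high)."""
    return dist.cdf(high) - dist.cdf(low)

print(round(area_between(heights, 61, 67), 4))   # 0.6827
print(round(heights.cdf(60), 4))                 # left end up to 60 inches
print(round(1 - heights.cdf(67), 4))             # 67 inches to the right end
print(round(area_between(heights, 50, 60), 4))   # between 50 and 60 inches
```

The cdf method gives the area from the left end of the distribution up to a value, so every one of the questions above reduces to one cdf lookup or the difference of two.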

Some other useful distributions

There are many other useful continuous probability distributions. Arguably, the most common (until you get into a specialized field like, say, quality control,) are closely related to the normal curve.

For instance, when I said that repeated sampling of data leads to a set of averages that follow a normal distribution, I was not exactly correct. In fact, sample averages, and other sample statistics follow a distribution known as the Student’s t distribution, but as sample sizes get larger, the t distribution quickly begins to look almost exactly like a normal distribution. That is why you will find that small sample statistics, that is, statistics that are used with samples of size 20 or less, commonly use the t distribution instead of the normal distribution.

Again, if you take two independent samples from normal distributions with the same variance and look at the ratio of their sample variances, you will find that it follows a distribution called the F distribution. Ratio tests are very common in statistics because you can easily tell, looking at a fraction, whether the numerator is larger than the denominator, smaller, or if they are equal. And since you know what the distribution looks like (or you will), you will be able to figure out if the difference is significant or not.

The other Top 4 distribution used by statisticians (number five is probably the uniform distribution) for hypothesis testing is the chi square distribution. That is the distribution of the sum of squared standard normal deviates, and you will see that sums of squares are also used a lot in statistics.

So, watch out! There are a lot of deviates out there and we’re going to be looking at many of them.

Next, we’ll be looking at how to take samples and how samples behave.