Testing hypotheses

What is a hypothesis?

Most of what we’ve been exploring in past pages of the Stat Files has been descriptive statistics, but we’ve also taken short excursions into the world of inferential statistics. Now, it’s time for us to jump headlong into the other side of statistics (how dramatic!).

Descriptive statistics keeps you pretty close to the data itself – it clarifies things that are already there, things you can see if you look from the right angles. But inferential statistics carry you beyond all that – from samples that are accessible to you to populations that are way too big for you to take in, from what seems to be happening to what actually is happening, from what has happened and what is happening to what probably will happen. They let you infer things you have no access to from things you do have access to.

But first, you need a guess. If you want to know about something but you can’t get to it to study it, you need to start with a guess. Take the Old Faithful data. There’s a weird, bimodal distribution – what in the world could cause that? The first guess is that Old Faithful doesn’t just erupt – it erupts in two different ways. We’ve seen that the data points, when graphed, look like they form two completely different groups. It looks like it, but can we be certain that that’s what’s happening?

Our guess, then, is that what looks like two different kinds of eruptions are actually two different kinds of eruptions.

The two biggies: alternate and null hypotheses.

There is a protocol for modern research – a “right way” – and there are both imaginary (philosophical) and real (probabilistic) reasons why modern researchers do things the way they do them. We actually need at least two hypotheses, and the one I stated above is far too sloppy to be of much use. First, in that form, it can’t be measured – what do we even mean by “two different kinds of eruption”? Well, the data we have gives us a clue. The interval between two eruptions seems to be related to the duration of the first eruption, and the data seems to divide at 3 minute eruptions, so let’s speculate given what we have and come up with something more number-like and testable – more “quantitative”.

We suspect two groups – there must be something different about them. There are three characteristics of data that present themselves as testable properties – means, variances, and shapes of distributions. Let’s work with the means: “The difference between the < 3 minute group mean and the >3 minute group mean is not zero.”

That is called the alternate hypothesis because the hypothesis that we are actually going to test, the null hypothesis, is: “The difference between the < 3 minute group mean and the >3 minute group mean is zero.”

Now, why would we want to try to prove that what we think is happening is not actually happening?

Actually, there are at least two reasons: one philosophical and one practical.

Philosophically, it keeps researchers humble. If you do your best to disprove your pet theory and you can’t, you can at least say you tried, but, oh well, you were right all along (sigh).

The practical reason is that it is easier to disprove something than to prove it. All you need is a counter example.

Types of errors

There are two types of statistical errors that can occur when testing hypotheses. You can reject a null hypothesis when it is, in fact, true. That’s a type I error. A type II error is when you do not reject a null hypothesis which should be rejected.

The probability of a type I error is called alpha. The probability of a type II error is called beta. The probability of not making a type II error – in other words, the probability of correctly rejecting a null hypothesis that is false – is equal to 1-beta, and is called the power of a test.
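To make alpha a little more concrete, here is a minimal Python sketch (it is not part of DANSYS, and the function name and settings are made up for illustration). It draws samples from a population where the null hypothesis really is true, runs a simple two-tailed z test on each, and counts how often the null gets rejected anyway. It assumes a normal population with a known standard deviation of 1.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def type_one_error_rate(alpha=0.05, n=30, trials=2000, seed=1):
    """Fraction of trials in which a true null hypothesis is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    rejections = 0
    for _ in range(trials):
        # The null is true by construction: population mean 0, sd 1.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / (1 / sqrt(n))  # standard error uses the known sd
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

print(type_one_error_rate())
```

The printed rate should land close to 0.05; it will not be exactly 0.05 because the simulation itself is a sample.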

The inscrutable p-value

If you’re reading this, I’m going to assume that you’re interested in statistics and that you have seen the term “p value”. It’s probably the most common term in research reports and, evidently, getting across what it is is one of the most difficult things in teaching statistics. Put simply, it is the probability of getting a result at least as extreme as the one you observed if the null hypothesis were true. A p value of 0.05 means that, if the null hypothesis is true, a result this extreme would turn up by chance only 5% of the time. If you only reject null hypotheses when the p value falls below a cutoff (your alpha), then that cutoff is your probability of making a type I error. The smaller the p value, the safer you are in rejecting the null hypothesis and accepting your alternate hypothesis.

In fact, the arbitrary cutoff usually accepted in the research world for a “good p value” is 0.05. Most statistical software packages assume that you want a p value of 0.05 or smaller unless you tell them differently. They also usually provide an optional 0.01 or let you choose your own (in the really sophisticated packages.) Of course, you can always just look at the actual p value and decide if it’s good enough.

In cases where you really, really want to be sure, as in medical trials where people’s lives are at stake, the more stringent p value of 0.01 is often required.

So, if you didn’t know what a p value was before, you do now. Don’t forget it because you most certainly will see it again.

Significance and numbers of tests performed – why you plan ahead

It is considered appropriate to plan the groups and variables you are going to compare beforehand – these are called a priori comparisons. But, what if you don’t find what you’re looking for? You just keep looking, right?

Wrong. The other kind of comparisons are called post hoc comparisons – those are group differences you decide to check out after your planned comparisons are done. There are some reasons that you might appropriately look at other group differences after your planned study is over but, “I didn’t find what I wanted,” is not one of them. There are a couple of problems with this approach.

First, it’s just dishonest. A researcher plans a study in a way that they intentionally try to disprove their alternate hypothesis – if they have disproven it, they have met their goal. If they turn tail and go back to beating that dead horse, it can only be to try and revive it (whoa! Was that a mixed metaphor?!)

The second reason has more substance. By pure chance, if you keep on looking hard enough, you’re going to find what you’re looking for, even if it’s not really there. After all, even with a cutoff of 0.01, each test has a 1 out of 100 chance of rejecting a null hypothesis that is actually true.

There are multiple comparison procedures (that’s what they’re called – multiple comparisons) that correct outcomes according to the number of analyses performed. If you finish your study and you can’t reject your null hypothesis, go ahead and report your results, but you might notice something in your graphs that piques your interest. Should you ignore that?

No, but be sure you take into consideration the number of tests you have run when checking it out, and report how you did so.
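One common correction, the Bonferroni adjustment, is simple enough to sketch in a few lines of Python. Treat this as an illustration under the assumption that the tests are independent, not as a description of what any particular package does; the function names are mine.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test cutoff that keeps the overall error rate at or below alpha."""
    return alpha / n_tests

for k in (1, 5, 20):
    print(k, round(familywise_error(0.05, k), 3), bonferroni_alpha(0.05, k))
```

With 20 tests at a cutoff of 0.05, the chance of at least one spurious "finding" is roughly 64%, which is why the number of tests has to be reported along with the result.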

Significance and strength of relationship

The significance obtained in a test is not the strength of the relationship between the variables tested, but significance and strength of relationship are not completely unrelated. Another way to look at significance is as a measure of the reliability of the test outcome. If you run the same test over and over (say, in a Monte Carlo simulation), the p value of the test tells you how strongly you should expect the same outcome, and how confident you can be in generalizing from the sample to the population.

If two variables are strongly related in a sample, there is a good chance that they will be strongly related in the population; if they are poorly related in the sample, there is a poor chance that they will be strongly related in the population. Therefore, significance and strength of relationship between variables are not independent.

Significance and size of samples

The problem with what was just said is that it’s only reliable when sample sizes are kept constant. Depending on the sample size, a strong relationship may be very significant or not significant at all. Think about a correlation between two variables in a scatterplot. The correlation coefficient tells you (literally) how tightly data points cluster around a straight line.

Take a sample of 2 data points. Regardless of which two you choose, they could be from two completely random variables and you could always draw a straight line between them. With two data points from completely independent variables, the correlation coefficient will be 1 or -1, indicating a perfect relationship. What does that tell you? Absolutely nothing!

As you keep adding data points, if the relationship remains strong, then the significance of the result increases.
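You can watch this happen with a small Python sketch. It is only an illustration: the helper names are hypothetical, and the data are uniform random numbers that are independent by construction.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def r_of_independent_sample(n, seed=0):
    """Correlation between two unrelated random variables, n points each."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    return pearson_r(xs, ys)

for n in (2, 5, 50, 500):
    print(n, round(r_of_independent_sample(n), 3))
```

With two points the coefficient is always a perfect 1 or -1; as points are added, the coefficient for unrelated variables drifts toward zero.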

What if there’s no relationship?

Typically, “no relationship” is a nonevent – a failure. The implication is that the researcher thought that they knew what was going on but was “blown out of the water” by the statistics.

That is most certainly a sad state of affairs, that a professional could be so let down by such an exceptional thing. “No relationship”, on the contrary, is an exciting result.

When you expect a result that doesn’t happen, you should be driven to ask why. What is really going on? What keeps research from being boring is that any good study should always generate more questions.

“No relationship” should be a surprise.

Sample to population

The primary purpose of inference is to answer questions about populations. Samples are what you’re stuck with – inference is the magical link from samples to populations. But the only way that works is if the samples you use are random and large enough to “fill in” for the larger populations.

How confident are you?

So how confident are you that what you see is actually what is there? Our discussion on Sampling and estimation gives us what we need to answer that. Sample statistics behave in predictable ways.

Let’s say that you have a sample with a mean of 5 and a standard deviation of 2. There are 35 data points. You already know the error inherent in this mean – it’s the standard error and it’s the standard deviation divided by the square root of the number of cases: 0.34.

(Technically, the standard error is the population standard deviation divided by the square root of the number of cases but, since we don’t know the population standard deviation, we assume that the sample standard deviation, since it’s a “good sample” is “good enough”)

So we know that the true mean of the population will be between 5+0.34 and 5-0.34 68% of the time – we are 68% confident that the mean will be between 4.66 and 5.34. In addition, we are 95% confident that the true mean will be within 2 standard errors of the sample mean: 4.32 to 5.68. We could go on to say that we are 99.7% sure that our mean is within 3 standard errors: between 3.98 and 6.02.

Our favorite confidence level is 95% (an alpha of 0.05), so where is our confidence interval where we can be 95% certain that our mean is captured? 95% of the sample means will be within 1.96 standard errors of the mean we have. That means our 95% confidence interval is 5±1.96*0.34 or 4.33 to 5.67.
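The arithmetic is easy to check in Python. This is a minimal sketch using the example numbers from the text (mean 5, standard deviation 2, 35 cases); the function name is mine, and it assumes the sample standard deviation is a good enough stand-in for the population value.

```python
from math import sqrt

def confidence_interval(mean, sd, n, z=1.96):
    """Lower and upper confidence limits for a mean."""
    se = sd / sqrt(n)  # standard error of the mean
    return mean - z * se, mean + z * se

low, high = confidence_interval(5, 2, 35)
print(round(low, 2), round(high, 2))
```

Because the code carries the unrounded standard error, its limits can differ in the second decimal place from a hand calculation that rounds the standard error to 0.34 first.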

You should memorize that 1.96. It’s a popular number in statistics. It is the number of standard deviations on either side of the mean that captures 95% of a normal distribution. Actually, you can calculate the confidence interval for any confidence level by looking up the matching value in a standard normal distribution. On a Calc spreadsheet, the function NORMSINV will give you the number. Because the leftover percentage is split between the two tails, you pass the confidence level plus half of what remains: =NORMSINV(0.975) returns 1.96 and, if you want the number of standard deviations that captures the central 99% of a normal distribution, =NORMSINV(0.995) returns 2.58. (=NORMSINV(0.99) returns 2.33, which leaves 1% in the upper tail alone.) To calculate a confidence interval, multiply the number of standard deviations by the standard error and then add the product to the mean for the upper confidence limit and subtract it from the mean for the lower limit.
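If you do not have a spreadsheet handy, Python's standard library has the same inverse normal function. The sketch below assumes a two-sided interval, so the leftover probability is split between both tails; critical_z is a name I made up.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Standard deviations either side of the mean that capture `confidence`."""
    tail = (1 - confidence) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.68, 0.95, 0.99):
    print(level, round(critical_z(level), 2))
```

A level of 0.95 gives the familiar 1.96, and 0.99 gives about 2.58.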

Two ways to test significance – p-value and confidence intervals

Now we have the basis for two different kinds of tests of statistical significance. Let’s take the last steps in developing statistical tests and use them to test our null hypothesis: “The difference between the < 3 minute group mean and the >3 minute group mean is zero.”

For the first kind of test, you need to know that well behaved statistical tests – the ones that are actually used (because who would want to use a poorly behaved statistical test) – produce scores that follow well understood distributions. In the case of traditional tests (like the one we are about to use), they follow normal distributions, and since scores from a normal distribution are predictable, we can determine the probability that we will obtain a specific score.

We already know (from the page on Sampling and estimation) that the difference between two means is normal if the samples are large enough. Actually, a nice number for “large enough” is 30 cases. Below that, scores from statistical tests tend to follow a similar distribution called the Student’s t distribution. It really doesn’t matter when there are more than 30 cases because then the t distribution looks almost exactly like the normal distribution. How do we know what the distribution looks like? Because statisticians have used Monte Carlo methods like we used in the Sampling and estimation page and looked at the results.

We will use a t test to check the probability that the difference between the means of the two groups of intereruption times is actually zero. The idea is fairly simple: standardize the difference between the two means by dividing by the standard error. The simplest case is when the group sizes are equal and the variances of the two distributions are equal. We can’t say either of those things since the group sizes are not equal and we don’t know the variances. We could calculate them (and you can look back at the last section and see that they’re 6.6 and 6.9 – close enough) but, since I have a computer that will do the work, I might as well let it do all the drudge work. I’ll go into more detail below.

Here is DANSYS’ version of the t test for the two groups of intereruption times.

It contains both the results of the t test for unrelated groups and the confidence intervals for the means. The means for the two groups are 55.6 minutes and 80.6 minutes. The combined means aren’t important for us right now (pooled means and variances can be used for other procedures). The difference is a whopping 25 minutes. That’s big, but is it big enough?

The standard deviations are different – 6.3 minutes and 8.1 minutes – and one assumption for the t test is homogeneity of variance between the two groups. The t test that DANSYS uses corrects for that. But the standard errors are very similar – around 0.6.

The value of the t test is 27 and, for a two tailed t test, the probability of getting a value that extreme when the null hypothesis is true is very (very!) small – 1.1×10^-82. Assuming that these two groups are “normal enough”, we can safely reject this null hypothesis and say that, according to the independent groups t test, the two groups really are different. The difference between their means is not zero.

What about the confidence intervals? This procedure, by default, gives you confidence intervals for a confidence level of 95% (an alpha of 0.05), so we can be confident that we’ve captured 95% of the sample means within those limits. 95% of the group 1 sample means will be between 55.2 and 56 and the means of the other group will be between 80.2 and 81. The important thing here is that these confidence intervals do not even come close to overlapping. We can be 95% sure that the two groups do not share a common mean.

By the way, DANSYS allows you to choose your alpha in the TTEST2 function.

We have confirmation by two procedures – t test and confidence limits – that the two intereruption times are actually two different groups, driven by two different underlying dynamics.

Now to throw a glitch into the picture.

The assumption of normality

This is a parametric test, meaning that it relies on the distribution of residuals in the groups being compared being normal. We know that one of the groups, the < 3 minute eruption group, is not normal, although it somewhat resembles a normal distribution. Is it normal enough?

In fact, the t test is fairly robust and the departure from normality has to be pretty severe to interfere. Still, if there is any doubt, there are two options. First, you could use a nonparametric test like the Mann-Whitney test we will be looking at below. Second, there are transformations that can be used to bring some non-normal data to normality. Often taking the logarithm of a badly skewed distribution will pull the heavy tails in and make the distribution normal.

If sample sizes are large, 30 cases or more, the t test is generally robust enough to ignore the assumption of normality because, as we’ve seen, as samples get larger, sample statistics tend to approach normality regardless of the underlying distribution.

There are tests of normality, generally called “goodness of fit” tests, that compare data values to a normal distribution with an equivalent mean and standard deviation. We can use one on group one to see how normal it is.

Here are the results of the Kolmogorov-Smirnov test of normality (and, no, that is not the Smirnov of vodka fame – this one was a mathematician).

It’s a little confusing but the null hypothesis for this test is that there is no difference between the tested distribution and the normal distribution with the same mean and variance. So a p value of 0.36 means that we cannot reject the idea that this distribution is normal: a sample from a truly normal distribution would depart from normality at least this much about 36% of the time.

t-test for independent samples

Before I go into how the t test I used to determine whether the two groups of eruptions were really different works, I need to introduce you to one more very central concept – that of degrees of freedom.

Generally, degrees of freedom are the number of ways that a dynamic system can move. In statistics, the system is made up of individual data points so it’s not surprising that the degrees of freedom of a population is the number of data points. What confuses people is why the degrees of freedom of a sample is the number of data points minus 1.

The reason is that, in an estimate of a statistic, all the data points are counted except that of the statistic itself and since that is calculated from (is dependent on) all the other points, it cannot change. So that one data point (the mean) must be subtracted from the count of the others.

The degrees of freedom is the parameter that specifies a t distribution just as a normal distribution is specified by means and standard deviation, so, to determine a p-value for a t test, you need the value of the t test and the degrees of freedom.

There are three cases of independent groups that determine how a t statistic is calculated. In the simplest, the two groups have the same number of individuals and the variances in the two groups can be assumed to be equal. If the two groups have different numbers of individuals, the standard error must be weighted by the proportion of variance contributed by the unequal groups. If the variances are different, that really complicates things. One assumption of the t test is that the groups compared have equal variances and, if that is not the case, each group’s variance must be used separately and a correction must be applied to the degrees of freedom.

In each case, the t statistic is calculated by dividing the difference between the means by the standard error. Whether the population variances are assumed to be equal or not, the sample standard deviations will not be because of statistical error and the group standard deviations must be pooled to calculate the standard error (which is still the standard deviation divided by the square root of the number of individuals). Since the three cases differ in how many different group sizes and variances have to be dealt with, that is where the complications arise.

The simplest case is where the group sizes are equal, so there is only one group size to consider, and the variances are assumed to be equal. Here are the formulas for the t statistic and the degrees of freedom for the t distribution needed to calculate the p value.

The numerator is obvious. That’s what we’re testing – the difference between the two group means. The denominator scales the difference into a statistic that follows the t distribution. The “pivot point” is when the standard error is 1. In that case, the t statistic is just the difference between the two means. When the standard error is larger, the t statistic becomes smaller. When the standard error is smaller, the t statistic becomes larger.

As the standard error becomes larger and larger, which happens as the variance becomes larger or the sample size becomes smaller, the t statistic approaches zero, regardless of what the difference between the means is. So, what this statistic is actually measuring is how the sample size and variance interferes with how the sample statistics reflect the population parameters. As explained above, the error introduced by taking a sample from a population is from the loss of information when you sample. That’s why you can’t just take the difference between the sample means as the actual difference – there’s error.

Looking back at the section on Sampling and estimation, particularly the part about the standard error of the difference between two means, you’ll see that that is what the denominator is.
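Here is how I would sketch the equal group size case in Python, using the standard textbook form: the difference between the means divided by the square root of the summed variances over the common group size, with twice the group size minus 2 degrees of freedom. The function name and the toy data are mine, not DANSYS output.

```python
from math import sqrt
from statistics import mean, variance

def t_equal_groups(a, b):
    """t statistic and degrees of freedom for two equal-sized groups."""
    if len(a) != len(b):
        raise ValueError("this form assumes equal group sizes")
    n = len(a)
    # Standard error of the difference between the two means.
    se = sqrt((variance(a) + variance(b)) / n)
    t = (mean(a) - mean(b)) / se
    return t, 2 * n - 2

t, df = t_equal_groups([5, 7, 9], [1, 2, 3])
print(round(t, 3), df)
```

The t value and the degrees of freedom together are what you take to a t table (or a spreadsheet function) to get the p value.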

If the two groups being compared have different group sizes, the amount of variance contributed by the two groups must be balanced. Here are the formulas for that.

Each group variance is multiplied by the degrees of freedom of that group before they are added. That weights the variances of the groups by their group size. Also, the total degrees of freedom is the pooled degrees of freedom – the sum of the group sizes minus 2 (1 for each group), instead of the common group size doubled. This formula can be used in situations where group sizes are different or equal – in the latter case, it just boils down to the simpler t test.
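The weighting is easier to see in code. The following Python sketch uses the usual pooled-variance form of the t test; the name t_pooled and the small data sets are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def t_pooled(a, b):
    """Pooled-variance t statistic and degrees of freedom (unequal sizes allowed)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Weight each group's variance by its own degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, df

print(t_pooled([5, 7, 9], [1, 2, 3, 4, 5]))
print(t_pooled([5, 7, 9], [1, 2, 3]))  # equal sizes: same as the simpler form
```

When the groups are the same size, the weights cancel and the result matches the equal group size calculation.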

Finally, if the variances are different, a correction must be applied to equalize them. The following formula is called the Welch-Satterthwaite equation.

The standard error is exactly the one described in Sampling and estimation for the difference between two means. The big difference is the monster of a formula used to calculate the degrees of freedom to use for getting a p value. You might be able to see that this formula just back calculates the degrees of freedom from the standard error (which is in the numerator).

This is the form used by TTEST2 and it can be used for all three cases of t test for two independent groups.
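For the curious, here is a minimal Python sketch of the Welch form with the Welch-Satterthwaite degrees of freedom. It follows the standard published equation; I have not seen the TTEST2 source, so do not read it as that function's implementation.

```python
from math import sqrt
from statistics import mean, variance

def t_welch(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1 = variance(a) / n1  # squared standard error of the first mean
    v2 = variance(b) / n2  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    # The squared standard error of the difference goes in the numerator.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t, df = t_welch([5, 7, 9], [1, 2, 3, 4, 5])
print(round(t, 3), round(df, 3))
```

Notice that the degrees of freedom are usually not a whole number here, and they can never be larger than the pooled value of the two group sizes minus 2.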

So, what if the two groups are dependent? For instance, what if you’re given the scores on a test and retest of the same subjects? The matched scores are dependent because, if the test is reliable, you would expect them to be close. This is an alternative approach to the test-retest correlation coefficient we talked about on the Descriptive and exploratory statistics page. The t test to use is the t test for dependent samples.

t-test for dependent samples

There are two situations where you might want to use a t test for dependent samples. Two groups may be composed of different individuals but they may be matched on certain variables so that, if there is a difference, then it’s only in the variable being tested – we’ve talked about matching in the Sampling and estimation page. The other, more common, situation is when the same subjects are measured on two different quantities – either as a test-retest situation, or two different variables that you want to compare their performance on. The important distinction in the latter case is that there are two scores for each subject.

You’ll recognize the form of the formula for calculating the t test value and degrees of freedom.

The numerator of the t value is simply the average difference between the matched pairs. The difference from the independent measures t test is that the unit being analyzed here is not individual scores of individual subjects, but the difference between the paired scores. Similarly, the denominator is not the standard error of the mean of the measures, but the standard error of the mean of the differences. As usual, the degrees of freedom used to derive a probability for the test is the number of data points (in that case, the number of pairs) minus 1.

Using this formula, you test the null hypothesis that the difference between matched pairs is 0. If you want to test for a nonzero value, you can easily do that by first subtracting the value you want to test for from the mean difference.
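A dependent samples t test can be sketched in Python like this. The function name, the hypothesized argument, and the made-up test and retest scores are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

def t_paired(first, second, hypothesized=0.0):
    """t statistic and degrees of freedom for matched pairs."""
    if len(first) != len(second):
        raise ValueError("every subject needs two scores")
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    return (mean(diffs) - hypothesized) / se, n - 1

test = [10, 12, 9, 11]
retest = [12, 15, 10, 12]
print(t_paired(test, retest))
```

Passing a nonzero hypothesized value tests whether the mean difference departs from that value instead of from 0.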

For all these t tests, there is a more sophisticated test that allows you to test differences between more than two groups at a time. That boils down to analyzing the relationship between a continuous variable (such as time between eruptions) and a nominal variable with more than two categories (actually, these procedures can be used with nominal variables of only two categories, but the t test tends to be quicker and easier.) We’ll be looking at them in more detail in a later section but, as a preview, they do not compare means directly. They test for differences between groups by comparing the variance between groups with the variance within groups, so they are called analysis of variance (or ANOVA) and analysis of covariance (or ANCOVA).

Until researchers realized that you can analyze variance and a whole lot more using regression techniques, ANOVA was the workhorse of research statistics. I will wait to look at a dependent variables example until we get there, but keep in mind that, for simple cases where there are only two categories in the nominal variable, the t test can be a lot easier to run and interpret.

Is the sample mean the same as the population mean?

You can also use a t test to see if a sample mean is different from a population mean. For instance, is the rate of beer drinking among college students at a particular college greater than in the general population? The type of t test is a single sample t test and the formula is pretty simple. You will recognize it immediately.

Divide the difference between the sample mean and the population mean by the standard error of the mean. The degrees of freedom to use to determine a p value is, of course, the number of individual cases in the sample minus 1.
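In Python, that description comes out as a very short function. This is a sketch with invented numbers; t_one_sample is a hypothetical name.

```python
from math import sqrt
from statistics import mean, stdev

def t_one_sample(sample, population_mean):
    """t statistic and degrees of freedom for a sample against a known mean."""
    n = len(sample)
    se = stdev(sample) / sqrt(n)  # standard error of the sample mean
    return (mean(sample) - population_mean) / se, n - 1

print(t_one_sample([4, 6, 5, 7, 8], 5))
```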

Censuses and medical studies generate a lot of population data so, if you can get the mean of a sample of a particular subpopulation, you can often determine if they are “special”. It makes for a ridiculously easy study.

The multiple group t-test

Analysis of variance was developed as a test to evaluate whether more than two groups are different, but there is a difference between it and the t test. The t test compares the means of the different groups to see if they are really different. Analysis of variance calculates how much of the variance in the whole set of data is contributed by each group.

Between the t test and the analysis of variance is the multiple group t test. In DANSYS it is provided by the MeanComp function.

This procedure tests whether each group mean is the same as the overall mean. It’s useful if that’s all you want. ANOVA gives you a lot more information – sometimes a lot more than you want. You’ll see what I mean when we look more closely at ANOVA and ANCOVA.

The significance of a proportion (also PCONF)

I’ve mentioned that a good place to start in dealing with proportions is the binomial distribution since binomial data is usually proportions. A binomial process produces one of two outcomes – hit or miss, and the data are the proportion of hits to total outcomes. So let’s look back at the binomial distribution.

A particular binomial distribution is characterized by two parameters – the number of trials in an experiment (“Throw a dice 5 times.”), and the probability of a hit on any given trial (“The probability of getting a 4 on a throw of a dice is 1/6.”) The mean of a binomial experiment is the number of trials times the probability of a hit. The variance of a binomial distribution is the product of the number of trials, the probability of a hit, and the probability of a miss (which is 1 minus the probability of a hit.)

When we are talking about a proportion, we are talking about a single trial. The test for a proportion is analogous to that for a mean – divide the difference you are testing by the standard error. Usually, when a proportion is tested, it is tested against a suspected proportion.

Say that, in a class, you expect 80% of 50 students to get a passing grade on a test but you are rather disappointed to find that only 76% pass. Should you be disappointed? The difference is only 4%.

That will be the numerator of your test statistic. The denominator will be, as usual, the square root of the variance divided by the number of cases. The variance is the proportion of hits times the proportion of misses or 0.76*0.24 = 0.18. That divided by 50 is 0.0036. And the square root of that is 0.06. Setting the difference of the observed proportion against the expected proportions, you get 0.66. That is your test statistic and, not surprisingly, given the central limit theorem, it follows a normal distribution (in fact a standard normal distribution – it is a z score.)

So, looking up this value in a standard normal distribution table or using a standard normal distribution function on a spreadsheet, about 0.74 of the distribution falls below a z score of 0.66, which makes the two-tailed probability of seeing a difference this size by pure chance about 0.51, so this difference is not significant.

If you want a confidence interval, you can multiply the denominator of the test statistic (the standard error of the difference of the proportions) by 1.96 to find how far from the mean difference you can expect a score 95% of the time. That will give you 0.12 and that would be a confidence interval of -0.08 to 0.16. As you can see, 0 (no difference at all) is well within these confidence limits.
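The whole classroom example fits in a few lines of Python. As in the text, this sketch builds the standard error from the observed proportion (many textbooks use the expected proportion instead), and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def proportion_test(observed, expected, n):
    """z score, two-tailed p value, and 95% limits for the shortfall."""
    se = sqrt(observed * (1 - observed) / n)  # uses the observed proportion
    diff = expected - observed
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, (diff - 1.96 * se, diff + 1.96 * se)

z, p_value, limits = proportion_test(0.76, 0.80, 50)
print(round(z, 2), round(p_value, 2), [round(x, 2) for x in limits])
```

The z score comes out near 0.66, and the confidence limits straddle zero, which is another way of saying the difference is not significant.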

Testing the difference between proportions

The form of the test for the significance of the difference between two proportions should not be a surprise since you know how a binomial distribution works and you know how to test the difference between two means.

The numerator is the difference being tested. The denominator is the square root (as usual) of the sum of p(1-p) over the size of group one and p(1-p) over the size of group two. In this case p is the combined probabilities – (n1p1+n2p2)/(n1+n2). n1 and p1 are the size and proportion of group 1, and n2 and p2 are the size and proportion of group 2.

As before, this is a z score, so you don’t need the degrees of freedom.
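Here is a minimal Python sketch of that description, with the combined proportion weighted by group size. The function name and the example proportions are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """z score and two-tailed p value for the difference of two proportions."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)  # combined proportion
    se = sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

print(two_proportion_z(0.76, 50, 0.60, 50))
```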

Comparing proportions

You can compare the proportions from several groups. DANSYS has a GroupComp function to do it automatically.

It tests the hypothesis that there is no difference between each group proportion and the overall proportion of all the groups, therefore, you get a significance level for each group.

The procedure is somewhat more complicated so I won’t go into details, but you end up with statistics that follow a chi square distribution and a p value.

A nonparametric test – Mann-Whitney

One popular nonparametric test is the Mann-Whitney U test.

The idea is that you throw all the data values into one pot and sort them. If the groups are similar, the values from each group should be evenly distributed through the ranks. If one group has more of the higher ranks, that looks like the groups are different.

To test the hypothesis that the medians of two groups are equivalent, after the data points are ranked, the ranks are separated back into their respective groups and summed to obtain two group sums. To get the test statistics, each sum is subtracted from the product of the two group sizes plus half the product of the respective group size and that group size plus 1. In other words, if R1 and R2 are the sums of the ranks of the first and second groups, and n1 and n2 are the sizes of the groups, U = n1n2 + n1(n1+1)/2 - R1, and U’ = n1n2 + n2(n2+1)/2 - R2.

Also notice that U’=n1n2-U, and U=n1n2-U’, so you only have to calculate one of the test statistics. The other can be calculated from it. The U statistics follow a predictable distribution and can be evaluated directly. If there are more than 20 data points in either group, they can also be normalized, in which case, you can use a standard normal distribution to get a significance value.

To standardize the U statistics, calculate n1n2(n1+n2+1). Divide the last result by 12 and take the square root. Then divide U-n1n2/2 by the result. This is a z score.
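
Here is a rough Python sketch of the whole calculation: pooling, ranking, summing the ranks of the first group, and standardizing U. The function name mann_whitney and the sample values are made up. Giving tied values the average of their ranks is an assumption of this sketch (the text doesn't say how ties are handled), and no tie correction is applied to the standard error.

```python
from math import sqrt


def mann_whitney(group1, group2):
    """Return (U, U', z) for two independent groups of values."""
    n1, n2 = len(group1), len(group2)
    # Pool the values, remembering which group each came from, and sort
    pooled = sorted([(v, 1) for v in group1] + [(v, 2) for v in group2])
    # Assign ranks 1..N; tied values share the average of their ranks
    # (an assumption of this sketch)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    # Sum of the ranks belonging to the first group
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 1)
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u_prime = n1 * n2 - u
    # Normal approximation: (U - n1*n2/2) / sqrt(n1*n2*(n1+n2+1)/12)
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, u_prime, z


# Made-up data in which every value of group 1 is below every value of group 2
u, u_prime, z = mann_whitney([1, 2, 3], [4, 5, 6])
print(u, u_prime, round(z, 2))
```

With groups this small the z score is only for show, since the normal approximation is meant for larger groups, but it makes the arithmetic easy to check by hand.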

Although the intereruption time data for the two types of eruptions are normal enough to use the t test to check whether their means are significantly different or not, let’s look at the results of the Mann-Whitney test and see if they agree.

Although there can’t really be a p value of 0, these test statistics are so extreme that DANSYS can’t distinguish between the actual p value and 0. Obviously, we have to reject the null hypothesis.

Another nonparametric test – Wilcoxon

The Wilcoxon signed-rank test is a nonparametric test used to test whether two groups of related data values are statistically indistinguishable. Remember that related data values are paired because they are scores from the same individuals – test-retest, matched samples data, or multiple tests given to the same subjects. Regardless, each subject will have two scores and the goal is to see if these scores are the same.

To calculate the test statistic, the difference of each pair is calculated. It doesn’t matter which group is subtracted from which as long as the order is maintained – either the scores of the first group are subtracted from those of the second, or the second from the first throughout. Zero differences are removed. The remaining differences are then ranked by size, ignoring the signs.

Now, the ranks are separated into two groups, those of the positive differences and those of the negative differences, and the ranks are summed for each group. The smaller sum is the test statistic. The Wilcoxon statistic follows a predictable distribution (and, yes, there are tables and there are spreadsheet functions to calculate p values for Wilcoxon statistic values). The statistic value and the number of non-zero pairs give you the significance.
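
The steps above can be sketched in Python as follows. The function name wilcoxon_statistic and the scores are made up, and averaging the ranks of tied differences is an assumption of this sketch. It only produces the statistic and the number of non-zero pairs; the significance would still come from a table or from statistical software.

```python
def wilcoxon_statistic(first, second):
    """Return (W, number of non-zero pairs) for paired scores."""
    # Differences, always second minus first; zero differences are dropped
    diffs = [b - a for a, b in zip(first, second) if b != a]
    # Order the differences by size, ignoring the signs
    ordered = sorted(diffs, key=abs)
    # Assign ranks 1..n; ties share the average of their ranks
    # (an assumption of this sketch)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    # Sum the ranks of the positive and the negative differences separately
    positive = sum(r for r, d in zip(ranks, ordered) if d > 0)
    negative = sum(r for r, d in zip(ranks, ordered) if d < 0)
    # The smaller of the two sums is the test statistic
    return min(positive, negative), len(ordered)


# Made-up test-retest scores for five subjects
before = [10, 12, 9, 14, 11]
after = [12, 11, 13, 14, 16]
w, n = wilcoxon_statistic(before, after)
print(w, n)
```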

Both Wilcoxon and Mann-Whitney are easy to calculate and, with the Internet, tables of the values are fairly easy to find. And, of course, statistical software like DANSYS will do all the work for you.

Comparing histograms

There is a visual version of confidence intervals and I bet you could guess what it is – I mean, if I hadn’t already told you in the title of this section. Yes, you can just compare the histograms of two groups to see if the means of the groups fall close enough together to assume that any difference is just random error.

Here are the two histograms of the < 3 minute duration and >3 minute duration intereruption times of Old Faithful side by side, but if you look a little closer, you will see from the horizontal axes that they belong side by side.

They don’t overlap at all. If you locate the means and mark off one standard deviation on either side of each, you will see that even intervals of two standard deviations around the means will not overlap.

Of course, not all stacked histograms will be this obvious, but it’s usually possible to eyeball where the confidence intervals are and where they overlap.

Poisson data

So now you know the pattern. Standardize the sample statistic you want to test by dividing it by its standard error and check it against the t distribution it lives in (you’ll need to know the degrees of freedom) or the normal distribution if the sample is large. Or you can generate the confidence interval by taking the z score for your confidence level times the standard error, then adding that to and subtracting it from the sample statistic.
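
As a minimal sketch of that pattern in Python, assuming a sample large enough to use the normal distribution and a null hypothesis that the statistic is zero (the function name z_test and the numbers are made up for illustration):

```python
from statistics import NormalDist


def z_test(statistic, standard_error, confidence=0.95):
    """Return (z, two-tailed p value, (low, high) confidence interval)."""
    # Standardize: divide the statistic by its standard error
    z = statistic / standard_error
    # Two-tailed p value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical z for the chosen confidence level (about 1.96 for 95%)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * standard_error
    return z, p_value, (statistic - margin, statistic + margin)


# Made-up example: a difference of 2.0 with a standard error of 1.0
z, p_value, (low, high) = z_test(2.0, 1.0)
print(f"z = {z:.2f}, p = {p_value:.3f}, interval = ({low:.2f}, {high:.2f})")
```

For a small sample, the same steps apply, but the p value and the critical value would come from a t distribution with the right degrees of freedom instead.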

Since sampling statistics tend to follow normal distributions regardless of what kind of distribution the original data follows, this approach can be taken with just about any kind of data. For instance, Poisson processes arise occasionally.

Say that, in county A, 3 cows have been struck by lightning in the last 5 years, and in county B, 5 cows have been similarly killed in the last 10 years. Dead cows (at least those killed by lightning) would follow a Poisson distribution, so how would you test to see if the rates of electrocuted cows in the two counties are significantly different?

First, the null hypothesis is that the rate of cow deaths is the same for both counties. Because the counts cover different time periods, we have to put them on the same basis. We can do that by multiplying the hits for group 1 by t2/t1 or, equivalently, by multiplying the hits for group 2 by t1/t2, where t1 is the time period for group 1 and t2 is the time period for group 2. In other words, our null hypothesis is that cX-Y=0, where X is the number of hits for county A, Y is the number of hits for county B, and c is t2/t1. In our case, c is 10/5=2, so the difference is 2×3-5=1.

The standard error of the difference is the square root of c^2 times the hits from county A plus the hits from county B. That would be the square root of 4×3+5=17, or around 4.1. So the z score that you would test is the difference over the standard error, or about 0.24. (You can ignore the sign of a z score in a standard normal distribution because it’s symmetric around the mean, 0.) The one-tailed p value (from a table of z scores or a standard normal distribution function on a spreadsheet) is about 0.40. We can’t reject the null hypothesis on that basis!

If you wanted to construct a 95% confidence interval (an alpha of 0.05), the standard z score is 1.96 and the standard error is still around 4.1. The product is around 8.1, so the confidence interval for the difference is 1±8.1, or from -7.1 to 9.1. The value under the null hypothesis, 0, is right in there, so our confidence interval agrees with our p value.
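
The cow example can be checked with a short Python sketch. The function name poisson_rate_test is made up, and the sketch makes its own choice of scaling: it multiplies county A's count by c = t2/t1 so that both counts refer to county B's ten-year period, then tests cX-Y against zero. With counts this small the normal approximation is rough, so treat the output as an illustration of the arithmetic rather than a definitive analysis.

```python
from math import sqrt
from statistics import NormalDist


def poisson_rate_test(x, t1, y, t2, confidence=0.95):
    """Compare two Poisson counts observed over different time periods."""
    c = t2 / t1                  # scales the first count to the second period
    diff = c * x - y             # zero under the null hypothesis
    se = sqrt(c**2 * x + y)      # standard error of c*X - Y
    z = diff / se
    one_tailed_p = 1 - NormalDist().cdf(abs(z))
    # Confidence interval: diff plus or minus critical z times standard error
    margin = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se
    return diff, se, z, one_tailed_p, (diff - margin, diff + margin)


# County A: 3 cows in 5 years; county B: 5 cows in 10 years
diff, se, z, p, (low, high) = poisson_rate_test(3, 5, 5, 10)
print(f"difference = {diff:.2f}, standard error = {se:.2f}")
print(f"z = {z:.2f}, one-tailed p = {p:.2f}")
print(f"95% interval = ({low:.2f}, {high:.2f})")
```

If the interval that comes out contains 0, the two rates can't be told apart at that confidence level, which is the same conclusion a large p value gives.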

A brief look at nominal and ordinal data.

Most of the tests we’ve looked at above are for continuous data, but binomial and Poisson distributions are discrete distributions and are all about counts. Even though proportions are fractions, they are used to compare counts.

Nominal data are counts and ordinal data are rankings. Groups of nominal data are often compared using goodness of fit tests. Dependence between groups of nominal data is usually tested using contingency tables and statistics that accompany them, or in the case of more than three groups, special forms of regression analysis. Ordered data is often analyzed using rank order statistics, but many of those are associated with contingency tables.

In the next section, we will be looking at how to deal with nominal data, but we will also touch briefly on ordinal data, saving a later StatFile for a more detailed look.

There’s a lot more out there

You can generate a p value for just about any statistic. Alternatively, you can create a confidence limit for just about any statistic.

Sometimes, it’s not absolutely straightforward. You need a standard error value and, sometimes, an exact standard error doesn’t exist. For instance, for some measures of association for nominal data, there isn’t an actual standard error, but there is a value that the standard error of the statistic approaches, and that can be used for the calculation of p values and construction of confidence intervals. If you see the term “asymptotic standard error”, that’s what it means.

Don’t feel limited by the “traditional” statistics. There are a lot of procedures out there, and the more you know, the more you can do. Current statistical software is pretty sophisticated and you’re usually only limited by your knowledge of how to use it and how to interpret the results.

Also, don’t feel limited by your bank account. There are many free software packages (like DANSYS). Just do a web search for “free statistical software”.