What is correlation (and what isn’t it)?
The United Nations monitors several worldwide variables to track the effects of poverty, education, and disease. Here, we’ll be looking at some data collected in 1990. Data on eight variables were collected from 97 countries to determine the effects of poverty on populations. These eight variables are: birth rate, death rate, infant mortality, male life expectancy, female life expectancy, GNP, region, and country.
This data was published in the Journal of Statistics Education and I pulled it from Dr. John Rasp’s Statistics Website: Data Sets for Classroom Use.
http://www2.stetson.edu/~jrasp/data.htm
GNP is the abbreviation of “Gross National Product” and it is an indicator of how well a country is doing economically and, thus, indirectly of poverty in the country. Intuitively, we would expect peoples’ physical health to vary with their economic health, but intuition can be misleading. Is that really the case?
Statistics has a tool that can be used to explore this question.
Science has long been concerned with covariation. Newton found that, as the mass of two objects increases, the gravitational attraction between them also increases. That is called direct variation, or a direct relationship. He also found that gravitational force decreases as the square of the distance between the two objects increases. Force and the square of the separation distance are therefore indirectly, or inversely, proportional – they vary indirectly; they have an indirect relationship. There are also things that show no pattern of variation at all – they are not related statistically.
Correlation is a parameter or statistic (remember, statistics belong to samples – parameters belong to populations) that quantifies relationships.
What we want to know is if there is a relationship between poverty and things like birth and death rate and infant mortality, and, if there is, how strong are the relationships?
There is a very important thing that correlation is not – correlation is not causation. Correlation only shows that there is a pattern in variation between two variables. If there is, one of the variables may cause the other, but it is also possible that the observed variation is completely caused by another unobserved variable and, left to themselves, the two observed variables are completely unrelated. It is also possible that an observed relationship is completely coincidental.
Can we tease out these interwoven snarls of causality? Well, yes, but that is a long, involved tale of mystery and intrigue. Come with me now into the spine-tingling world of……correlation.
Looking at correlation – the scatterplot
You can see correlation in action in a graph called a scatterplot. It is simply a Cartesian graph. The values of one variable are plotted on one axis, and the values of the other variable on the other axis. Each point is a pair of data values. If the variables are distinguished as independent and dependent variables, the independent variable is conventionally plotted on the horizontal axis, and the dependent variable on the vertical axis.
Here are two scatterplots. What does the first one look like?

It looks like a cloud of aimless points, right? That’s what it is – random. I generated two columns of random numbers and graphed the pairs on a Cartesian graph. The two variables (columns) are not related because they are random. The correlation between the two is zero.
What about this one?

Straight line. It is a graph of the function y = 2x + 3. The two variables are completely related by a linear function and their graph is a straight line. The correlation between the two is 1. A correlation of -1 would also look like a straight line, but the slope would be in the opposite direction, a line running from the upper left to the lower right.
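These two extremes are easy to reproduce. Here is a quick sketch in Python with NumPy (the seed and sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two columns of random numbers -- unrelated, so r should land near zero
a = rng.random(1000)
b = rng.random(1000)
r_random = np.corrcoef(a, b)[0, 1]

# y is a perfect linear function of x, so r is exactly 1
x = np.arange(1, 101, dtype=float)
y = 2 * x + 3
r_line = np.corrcoef(x, y)[0, 1]

print(round(r_random, 3))  # some small value near 0
print(round(r_line, 3))    # 1.0
```

Rerunning with a different seed moves the first number around a little, but it stays close to zero; the second is always 1.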
Correlation varies from a perfect inverse relationship which looks like a straight line sloping from the upper left to the lower right; through no relationship, which looks like a cloud of random points; to a perfect direct relationship, which looks like a straight line sloping from the lower left to the upper right. In between are imperfect relationships. Here’s one.

This is a scatterplot of two of our variables – Male life expectancy and female life expectancy. The points are clustered around a straight line, but not exactly on it. Also notice the positive slope. As might be expected, male life expectancy varies with female life expectancy. The correlation will be between 0 and 1.
Being able to eyeball the relationship between two variables is all well and good, but we want something a little more precise – some number – a correlation coefficient. So how do we calculate such a number?
The classic – Pearson’s product-moment correlation coefficient
We looked at the Pearson product-moment correlation coefficient when we talked about descriptive and exploratory statistics, but let’s briefly go back over it because it’s quite important and the basis of many other correlation coefficients.
We have been talking about how variables covary – how they vary with each other. There is a statistic for this called the covariance. For two variables, subtract each variable’s mean from each of its values; then, for each observation, multiply the pair of deviations from the means and add up all the products. Divide the result by one less than the number of pairs.
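That recipe translates directly into code. A minimal sketch in plain Python (the numbers are made up for illustration):

```python
def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # multiply the paired deviations from the means, sum the products,
    # and divide by one less than the number of pairs
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(covariance(xs, ys))  # 1.5
```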
Now, think about how the result will look. Each difference of value from mean will be positive if the value is larger than the mean, or negative if the value is smaller than the mean. If both values in a pair are on the same side of their respective means, they are directly related and the product will be positive. If they are on opposite sides of their means, the product will be negative. If there are more positive products than negative – a generally positive trend – then the sum will be positive. If negative products predominate, then the sum will be negative and an inverse relationship will be indicated. The larger the number, the stronger the relationship. If the positive products completely balance the negative products, the sum will be zero and its average will also be zero. So for a covariance, zero indicates no relationship.
The problem is that there is no upper or lower limit to the size of the statistic so, although a covariance of 0 is clear, and it’s also clear what a positive or negative value means, it’s not so clear what a covariance of 37 or 152, or -33 means. We need a number that will fall between some definite values. We’ve already noted that a correlation of 1 indicates a perfect direct relationship and -1 indicates a perfect indirect relationship. That’s what we want.
The way we scale the unbounded covariance down to a value that varies between -1 and 1 is, we divide the covariance by the product of the standard deviations of the two variables.
The reason that works is that a variance (including a covariance) is the square of a standard deviation. We have a fraction now with the product of two standard deviations (a variance) in the denominator and a variance (covariance) in the numerator. The covariance can’t be larger than the product of the standard deviations (the “pure” variance of the two variables) and if they are equal in value (regardless of sign), the ratio will be equal to 1. The meaning of the sign and zero is still the same as with the covariance, so we now have our measure of covariation that varies between -1 and 1. It is called the correlation coefficient. Specifically, I have described the Pearson product-moment correlation coefficient.
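Putting the two pieces together, here is a sketch of Pearson’s r as the covariance over the product of the standard deviations, checked against NumPy’s built-in corrcoef (the data is the same made-up example as before):

```python
import numpy as np

def pearson_r(xs, ys):
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # covariance in the numerator...
    cov = ((xs - xs.mean()) * (ys - ys.mean())).sum() / (len(xs) - 1)
    # ...product of the standard deviations in the denominator
    return cov / (xs.std(ddof=1) * ys.std(ddof=1))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))          # 0.7746
print(round(np.corrcoef(xs, ys)[0, 1], 4))  # 0.7746 -- NumPy agrees
```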
There are several formulas for calculating Pearson’s coefficient. Today, it’s usually done by computer.
As a rule of thumb, a significant correlation (we’ll talk about significance below) of 0.3 or less indicates a mild relationship. Between 0.3 and 0.6 is a moderate value. Larger than 0.6 indicates a strong relationship. For a more precise interpretation, the square of a correlation coefficient represents the proportion of the variation in a dataset that is shared by the two variables. That is the strength of the relationship. (The square of a correlation coefficient is called the coefficient of determination.)
Other kinds of linear correlation
Pearson’s coefficient is specifically for continuous or dichotomous variables (when Pearson’s coefficient is calculated for two dichotomous variables, it is equal to the phi coefficient that we talked about in the page on analyzing nominal variables). It is also somewhat sensitive to departures from normality (we use “normal” statistics like arithmetic averages and standard deviations to calculate it). Other kinds of data need other kinds of correlation coefficients.
We talked about measures of association last time. Many of them, such as phi and Cramer’s V, are correlation coefficients, as they tend to return the same results that you would get if you just calculated a correlation coefficient from the raw data. Their advantage is that they are easy to calculate from the contingency table.
Ordinal data really needs something else to capture the information in the ordering. Several such coefficients are also considered in the Analyzing nominal data section. If you look back there, you will see that, instead of evaluating how many data pairs are on the same side of their respective means, they look at how many pairs agree or disagree in their ordering. Otherwise, they act like Pearson’s coefficient. The most popular coefficients for ordinal data are Spearman’s rank order correlation coefficient and Kendall’s tau.
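Spearman’s coefficient, for example, is just Pearson’s r computed on the ranks of the data. A sketch (assuming no tied values, which would need average ranks):

```python
import numpy as np

def ranks(values):
    # rank 1 for the smallest value, rank n for the largest (no ties assumed)
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

x = np.arange(1.0, 11.0)
y = x ** 3  # monotonic but decidedly nonlinear

spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
pearson = np.corrcoef(x, y)[0, 1]

print(round(spearman, 6))  # 1.0 -- the ordering agrees perfectly
print(pearson < 1.0)       # True -- the straight-line fit is imperfect
```

Because y = x³ preserves the ordering of x exactly, the rank correlation is perfect even though the linear (Pearson) correlation is not.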
There are several coefficients that take advantage of the special properties of dichotomous data. The point-biserial coefficient is useful when one variable is truly dichotomous and the other is continuous. The biserial correlation is for a continuous variable paired with an artificially dichotomized continuous variable, such as test results that are scored as “pass” or “fail”. The tetrachoric correlation is used when both variables are dichotomized, continuous variables.
Adding and subtracting correlation coefficients
Correlation coefficients are not linear functions of the relationship between two variables; therefore, they are not additive. You can add or subtract correlation coefficients, but the results won’t mean anything. And since the sum of correlation coefficients is meaningless, so is the average. But correlation coefficients can be transformed so that they are additive.
The coefficient of determination mentioned above – the square of the correlation coefficient – is additive. Normal scores are also additive and correlation coefficients can be transformed into normal scores using the Fisher transform. The Fisher transform is the arc hyperbolic tangent of the correlation coefficient or:
z = artanh(r) = (1/2) ln((1 + r) / (1 - r))

These transformations come in handy when we begin constructing statistics to say, compare correlation coefficients between groups of people.
The inverse of the Fisher transform is just the hyperbolic tangent of the z score.
A simpler z transformation can be accomplished by multiplying a correlation coefficient by the square root of one less than the number of observations. This is only good for 30 or more observations. For fewer, you can obtain a t score by multiplying the coefficient by the square root of the ratio of two less than the number of observations to one minus the square of the coefficient: t = r√((n - 2)/(1 - r²)).
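Those two transformations can be sketched as small functions. Here they are applied to r = -0.3 and n = 97, the figures from the GNP and death rate example:

```python
import math

def z_score(r, n):
    # quick z approximation, reasonable for n >= 30
    return r * math.sqrt(n - 1)

def t_score(r, n):
    # t statistic with n - 2 degrees of freedom
    return r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(z_score(-0.3, 97), 2))  # -2.94
print(round(t_score(-0.3, 97), 2))  # -3.07
```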
Generally, a correlation coefficient is a correlation coefficient is a correlation coefficient, and once you have calculated one, you can handle them all pretty much the same. The transformations work for all of them. The tests of significance we are about to look at work for all of them. And you can calculate partial, part, and multiple coefficients with all of them (we’ll talk about all that, too).
Sample size
Like most statistics, the larger your sample, the better off you are with correlation coefficients. Intuitively, it makes a lot of sense. Think about trying to do a correlation between two data points. You can always draw a straight line between the two points indicating a perfect linear relationship. Regardless of how strongly the two variables are actually related, you will always get a correlation coefficient of 1 with two points. Larger groups of unrelated points will look more and more “uncertain”.
But correlation coefficients are also affected by the central limit theorem, so the larger the sample is, the more normal the distribution of the sample correlation coefficients will be. Also, you might have noticed that the significance test for small sample coefficients is based on the t distribution. That shouldn’t surprise you after reading the section on hypothesis testing.
In general, with samples of 30 or less, you want to think about using a t distribution test to get an idea about the significance of a correlation coefficient. For samples of 50 or more, serious bias will be unlikely. You don’t have to worry at all about bias with samples of 100 or more.
The significance of a correlation
So, just like all the other statistics, you should ask whether a specific coefficient is significant. Let’s see if the correlation between a country’s GNP and its death rate is significant.
The Pearson correlation is -0.3, and, since we have more than 30 cases, we can easily find a z score. It’s -2.87. Z scores follow a standard normal distribution (a normal distribution with a mean of 0 and standard deviation of 1), so 95% of z scores fall within 1.96 points of the mean of zero. -2.87 is far beyond that, so we are justified in rejecting the null hypothesis (r = 0) and concluding that the relationship is, indeed, significant.
Standard errors and confidence intervals
Since the Fisher transformation gives a z score, it can be used to construct a confidence interval. The standard error is the inverse of the square root of the number of observations minus 3. So, the Fisher transform of our coefficient is artanh(-0.3), or about -0.31, and the confidence interval is that plus and minus 1.96 (for a 95% confidence interval) times the inverse of the square root of (97 - 3), which gives -0.51 to -0.11. These are Fisher transforms, so, to get back to correlation coefficients, we need to apply the Fisher inverse (the hyperbolic tangent) to both limits, which gives -0.47 to -0.11. Since 0 is not in that interval, we can be 95% sure that the correlation coefficient is not 0.
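A sketch of that interval calculation, using r = -0.3 and the 97 countries:

```python
import math

r, n = -0.3, 97
z = math.atanh(r)          # Fisher transform of r
se = 1 / math.sqrt(n - 3)  # standard error on the z scale
lo_z = z - 1.96 * se       # 95% interval on the z scale
hi_z = z + 1.96 * se
lo_r = math.tanh(lo_z)     # back to the correlation scale
hi_r = math.tanh(hi_z)

print(round(lo_r, 2), round(hi_r, 2))  # -0.47 -0.11
```

Both limits are below zero, so zero is outside the interval.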
How do outliers affect a correlation coefficient?
Here is some data. I’ve thrown an extreme point (2, 57) into the mix to show what happens to a correlation when an outlier is present.

Now, look at the scatterplots.

Remember that correlation is related to how far data points deviate from a best line of fit. It’s not surprising that a single extreme point would pull that line of fit off kilter, but how far?
The coefficient of determination tells how closely the data fit the line of best fit and, remember, it is equal to the correlation coefficient squared. The coefficient of determination for the data without the outlier is 0.88. With the outlier, it drops to 0.36. That’s quite a drop! The correlation coefficient goes from 0.94 to 0.60. So you can see that one “bad” point can easily spoil the whole bunch.
So an outlier here, as anywhere else, prompts us to make a decision. Is it a mistake, so we can just, conveniently, remove the case? Or do we have to include it? If we keep it, there are several ways to handle it. We could weight the data so that the outlier is downplayed. There are also robust procedures that do not react so strongly to extreme values, which can lead to a less sensitive measure of correlation. We will look at an instance of robust regression when we talk about regression analysis.
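The effect is easy to reproduce with made-up numbers (these are not the data shown above, just an illustration in the same spirit):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([3, 5, 6, 10, 11, 14, 15, 18, 20, 21], dtype=float)
r_clean = np.corrcoef(x, y)[0, 1]

# throw one extreme, off-trend point like (2, 57) into the mix
x_out = np.append(x, 2.0)
y_out = np.append(y, 57.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 2))  # a strong correlation
print(round(r_out, 2))    # much weaker with the outlier included
print(r_clean > r_out)    # True
```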
Comparing correlation coefficients
There are times when we want to compare correlation coefficients. For instance, we know that death rate is related to GNP, but the connections could be complicated and differ between regions of the world. Is the correlation between death rate and GNP in Africa the same as the relationship in South America?
The problem is that correlation coefficients aren’t comparable. You can’t add or subtract them and they are affected by the size of the samples. But we already understand that we can transform correlation coefficients into z scores and those are comparable, being from normal distributions with a mean of 0 and a standard deviation of 1.
The correlation coefficient for African countries is -0.53 and for South American countries it’s 0.05. That looks different – but with such small sample sizes, there’s still a question of significance. The Fisher z scores are, respectively, -0.6 and 0.05. The difference is 0.65. To standardize this difference, we must divide it by the square root of the sum of the reciprocals of the numbers of cases (countries) less 3. That value is the square root of 1/(n1 - 3) plus 1/(n2 - 3), or 0.39. The result is 1.65. This is less than 1.96, so we, surprisingly, can’t reject the null hypothesis. The confidence interval for the Africa coefficient is -0.99 to 0.88 and that for South America is -0.96 to 0.96 – almost exactly the same 95% confidence intervals.
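The mechanics of that comparison can be wrapped in a small function (a sketch; plug in the actual correlations and country counts):

```python
import math

def compare_correlations(r1, n1, r2, n2):
    # difference of the Fisher z scores, standardized by the pooled
    # standard error sqrt(1/(n1 - 3) + 1/(n2 - 3))
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# sanity check: identical correlations give a test statistic of 0
print(compare_correlations(0.5, 30, 0.5, 30))  # 0.0
```

If the result is beyond ±1.96, the two correlations differ significantly at the 95% level.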
Is the relationship of male life expectancy with GNP significantly different from that of female life expectancy with GNP?
For that, we need three correlation coefficients: Male Life Expectancy x GNP (0.64), Female Life Expectancy x GNP (0.65), and Male Life Expectancy x Female Life Expectancy (0.98). We can transform the difference between the two correlations of interest into a t score by comparing them through the common correlation. It’s a rather drawn out procedure, but many statistics packages, including DANSYS, will do all the work, so I can tell you that the t statistic value is 0.46 and, at 88 degrees of freedom, the probability that we might reject a good null hypothesis of “no difference” is 0.65. Briefly, there is no significant difference here.
How do we deal with more than two variables?
Our data set has 8 different variables and, up to now, we’ve only been able to work with two at a time. What if we want to know the relationships among all the variables? There are actually several options. The first is to look at the relationships between two variables at a time. We can literally look at the relationships using a graph matrix – that is, an array of scatterplots. Here’s what our data looks like in scatterplots.

There are a lot of interesting things to see here. First, notice that this is half of a symmetric matrix of charts. If it were complete, the top row would contain the same charts as the first column. The diagonal would all contain straight diagonal lines since they would present the relationships between the variables and themselves. I added the bottom row as a log transform of the GNP charts since the range is so broad that it’s hard to see what’s going on. Each variable has its own row and column, so there is a chart for each pair of variables.
You could spend days in this chart, but let me point out a few interesting points. Death rate and birth rate are certainly related, but the relationship is clearly not linear. As birth rate increases, the death rate initially decreases, but around a birth rate of 28, death rate begins to increase. That seems to suggest that there is an optimal birth rate around 28 live births per thousand of population per year.
All of the variables seem to be related. The charts in the life expectancy row look almost horizontal, but be sure to look at the differences in the scales of the axes. The slopes of the lines are respectably negative except for the relationship between male and female life expectancies.
The GNP variable produces a funny, L-shaped curve, but notice the range – from 0 to over 30000. Most of the countries are at the lower end of the range, but some have very large GNPs. That’s why I included that last line that emphasizes the structure of the points clustered down around zero.
The Region row also looks sorta funny, but that’s because Region is a categorical variable. These plots are actually what’s called “dot plots”, which show the spread within different categories of the variable Region. One thing that is clear: there are big differences between the regions in the variabilities of all the variables. Note that the regions correspond approximately to countries on different continents.
You can do this same thing with correlation coefficients. Below is a matrix of correlation coefficients between pairs of variables.

Here, we have two matrices. The top one contains the correlation coefficients; the bottom one, the significances of the correlations. Most of the correlations are significant, as we would expect from the charts. You should suspect the Region values, though. These are Pearson product-moment correlations and they are not designed for nominal variables like Region, but they do indicate that there might be a relationship that you could investigate using a more involved regression technique or perhaps clustering.
Notice that birth and death rates, while having a respectable relationship, are nowhere near as strongly related as the other variables. But remember that Pearson’s coefficient only indicates linear relationships, so, while there is a significant linear relationship there, it doesn’t tell the whole story.
Surprisingly, GNP and death rate do not relate that strongly, and neither do GNP and geographic location. There are some interesting causal structures here, but we can only ferret them out if we can get a grasp on how more than two variables interact. There are ways to do this.
Partial correlation
Any or all of our variables may affect any or all of the other variables. Although correlation does not equal causation, there are ways to tease out the causal structure among a set of variables, assuming you have all the variables covered that interact significantly. Let’s make that assumption with our country data and explore the causal structure of poverty and economy. Note that this kind of study tends to be large and I won’t pretend to cover it fully here, but I would like to give you a taste of how it works.
Our first tool is correlation and we’ve looked at it in some detail, but simple correlation will only tell you about relationships between pairs of variables and it won’t effectively unravel relationships among multiple variables. To begin doing that, we can look at what happens when two variables interact, holding a third variable constant to eliminate the effects of that third variable. That is called “partialling out” the third variable and the statistic used is the partial correlation coefficient. Although any correlation coefficient can be used, we’ll stick with our Pearson product-moment correlation coefficients.
Here is the formula for a partial correlation coefficient:
rab.c = (rab - rac·rbc) / √((1 - rac²)(1 - rbc²))

First a little on the symbols. rac is the correlation coefficient between the variables a and c. rab.c is the partial correlation coefficient between the variables a and b with the effects of c removed – “partialled out”. Intuitively, you might see that we are subtracting the correlation coefficients for a and c, and b and c from the correlation coefficient for a and b, but we said above that you cannot add or subtract correlation coefficients. What the denominator of this fraction does is transform the correlation coefficients into z scores, which can be added and subtracted, so we are, in effect, subtracting the correlation of c with a and b from the correlation between a and b.
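We can check that claim numerically: the formula gives the same answer as regressing c out of a and b and correlating the residuals. A sketch on synthetic data (the sample size and seed are arbitrary):

```python
import numpy as np

def partial_r(r_ab, r_ac, r_bc):
    # partial correlation of a and b with c partialled out
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

rng = np.random.default_rng(0)
c = rng.normal(size=500)
a = c + rng.normal(size=500)  # a and b are both partly driven by c
b = c + rng.normal(size=500)

R = np.corrcoef([a, b, c])
via_formula = partial_r(R[0, 1], R[0, 2], R[1, 2])

# the long way around: regress c out of a and b, correlate the residuals
resid_a = a - np.polyval(np.polyfit(c, a, 1), c)
resid_b = b - np.polyval(np.polyfit(c, b, 1), c)
via_residuals = np.corrcoef(resid_a, resid_b)[0, 1]

print(abs(via_formula - via_residuals) < 1e-10)  # True -- they agree
```

Here a and b correlate only because c drives both, so the partial correlation comes out much smaller than the raw correlation between them.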
DANSYS has a function that will create a matrix of partial correlations from a correlation matrix. The function is called PCorr and this is the result from our data.

Whereas the formula above is for three variables, it can be applied repeatedly to more variables until whatever partial you want is reached. This matrix displays the partial correlations of all pairs of variables with all other variables partialled out. Notice that, where all the variables seemed to be related in the correlation matrix, here some of the partial correlation coefficients are very small. For example, look at birth rate and infant mortality. If they are related at all, it looks like they are related through some other variable. What’s a good candidate? Well, they both seem to be moderately related to female life expectancy and region.
Female life expectancy could, indeed, explain birth rate and infant mortality. If women are living longer, there may be less pressure to bear children, and, certainly, the improved health of women indicated by longer life spans would predict improved health of their children.
You can wander around through this table and find all kinds of interesting relationships. To really work out the causal structure of this data, you would need to calculate all the partial correlations of all combinations of variables. That’s not an inaccessible goal since computers make it so easy to generate partial correlation coefficients.
Part correlation
Where partial correlation removes the effect of certain variables on all other variables, part correlation removes the effect from specific (usually the dependent) variable(s). For instance, where a partial correlation might look at the relationship between a and b while eliminating the effect of c, d, and e on a and b, a part correlation would look at the relationship of a and b while neutralizing the effect of c, d, and e on a only.
There is some debate as to which of partial or part correlation is more important; partial correlation seems to be the more popular form at present.
Multiple correlation
If you want to know the relationship that all the other variables have with a particular variable, you want to look at the multiple or total correlation. For instance, the multiple correlation for birth rate x all our other country variables is 0.92. If you square that, you get a coefficient of determination of 0.84. In other words, 84% of the variation seen in birth rate in these countries can be explained by the other variables. What about the other 16% of the variation? That could be explained by random variation and latent variables – that is, variables that we didn’t include in our data set. Still, 84% is respectable.
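One way to compute a multiple correlation is as the plain correlation between a variable and its best linear prediction from the others. A sketch on synthetic data (the coefficients and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # three "predictor" variables
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# least-squares fit of y on the predictors (with an intercept column)
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

R = np.corrcoef(y, fitted)[0, 1]  # the multiple correlation
print(0 <= R <= 1)                # True
print(round(R ** 2, 2))           # R squared: proportion of variation explained
```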
A first look at causal analysis
We’ve seen how correlation coefficients and partial, part, and multiple correlation coefficients can be used to work out causal relationships between variables. Of course, as the number of variables increase, the job of teasing out the relationships gets exponentially large. For that reason, a whole science of causal analysis has developed using rather complicated computer tools like latent variable analysis and other forms of regression to let the computer do most of the work. We’ll be looking at causal analysis methods as we continue.
What if the relationship is not linear – eta
Some of the relationships in the country data are not linear; for instance, birth and death rate do not have a linear relationship. Since there is a significant Pearson product-moment correlation coefficient, we can say that there is a linear component in the relationship, but it obviously doesn’t tell the whole story. There are various ways to measure the nonlinear component, including such procedures as nonlinear regression. A simple measure of the strength of the whole relationship is eta.
Eta is a common result in analyses of variance, but it can be run for its own sake. The eta coefficient for Birth Rate x Death Rate is 0.68. Given that the Pearson coefficient is 0.51, we can say that there is a nonlinear component also. How much variance is explained by the two components? Let’s look at the coefficients of determination. For the Pearson statistic, it’s 0.26. For eta it’s 0.46. So 26% of the variance is explained by the linear component and 46% is explained by both together, which means the nonlinear component explains 20% – almost half of the explained variability comes from the nonlinear relationship.
Eta squared is a ratio that compares the between-group sum of squares (the variation created by the relationship between the variables) to the total sum of squares of the data. In other words, it is the proportion of the total variation attributed to the relationship and, as you know now, that is just an R squared.
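That ratio is easy to sketch for a grouped predictor (for a continuous predictor like birth rate, the values would first be binned into groups; the tiny data sets here are made up to show the two extremes):

```python
import numpy as np

def eta_squared(values, groups):
    # between-group sum of squares over total sum of squares
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    ss_between = 0.0
    for g in np.unique(groups):
        v = values[groups == g]
        ss_between += len(v) * (v.mean() - grand_mean) ** 2
    return ss_between / ss_total

# identical group means -> 0; perfectly separated groups -> 1
print(eta_squared([1, 2, 3, 1, 2, 3], ["a", "a", "a", "b", "b", "b"]))  # 0.0
print(eta_squared([1, 1, 1, 5, 5, 5], ["a", "a", "a", "b", "b", "b"]))  # 1.0
```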
Another preview of regression analysis
Both correlation coefficients and regression coefficients are measures of strength of relationship. The correlation coefficient can give you an idea of how strong a relationship is and, if you square it to calculate the coefficient of determination, it can even tell you how much variance in a variable is explained by its relationship with another variable. But the regression coefficient goes much farther by telling you how much one variable changes with a specific amount of change in another variable. It is the slope of the line of best fit between variables.
I will be talking much more about the workhorse of statistics as time goes on and you will be so excited when I do!