Regression

Regression – the workhorse

Regress to what?

People used to undergo past-life regression – back to an earlier incarnation (do they still do that?). Or they have a nervous breakdown and go back to their childhood – infantile regression – to escape some trauma. Where do statisticians regress?

Usually to straight lines. Less commonly to some other simple curve.

You see, data is messy. If you graph it, you get points like this.

You can tell that there’s a trend but the graph is all jagged. Can it be simplified? You could draw a line through it to best describe the main direction of the graph – a “line of best fit”.

What about this one?

It looks pretty good, but how can we be sure that it's the best line of fit? What we want to do is make sure that the distances between all the points and the line are minimized. We have several choices for "distances". We could draw lines between the points and the trendline so that the connecting lines intersect the trendline at right angles. We could draw lines horizontally or vertically from the points to the trendline.

We could use any of the options, but vertical lines give an attractive property. Regression analysis attempts to clarify the relationship between a set of variables – independent variables, since they are factors out in the world that we aren't trying to control – and one (or, less commonly, more than one) variable that depends on the independent variables. Usually, the independent variables are set up as the x variables (on the graph), and the dependent variable is set up as the y variable. Since we're ultimately interested in minimizing the difference between the y variable and our predicted values of the y variable, the vertical distance, which is just that difference, is what we want to minimize.

Minimize, maximize, optimize….those are the realm of calculus and, indeed, you need to use calculus to find a formula that minimizes the red lines in the following graph.

And so I don’t scare the nonmathematical folks away, I’ll just show you the formula. What we need is the equation of that best line of fit, and you might remember from high school algebra that all you need to define a line on a graph is the slope and the y intercept. Here they are.

And the formula for the line is y = ax + b. Use the x data values to find the predicted y values. The differences between the values of the dependent variable and the predicted values from the formula for the line of best fit are called "residuals" and are the measure of error in the model. That makes the actual model

y = ax + b + e

where y is the actual (observed) value of the dependent variable, x is the value of the independent variable, a is the slope, b is the value of y when x is zero, and e is an error term (the residuals). Error in the model is assumed to be normally distributed with a mean error of zero. This is important because, when you add up all the residuals, you want them to cancel out. The standard normal distribution is symmetric, with negative values as likely as positive values.
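The least-squares slope and intercept have simple closed forms: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the line passes through the point of means. Here's a minimal Python sketch with invented data points (roughly following y = 2x), showing that the residuals do indeed cancel out:

```python
# Least-squares slope and intercept from their closed forms.
# The data points here are made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# slope a = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - a * x_bar          # the line passes through (x_bar, y_bar)

residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
# Least squares guarantees the residuals sum to (essentially) zero.
```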

Luckily, standard normal errors are very common in nature and if the errors in a particular set of data do not behave, there is usually a mathematical way (called a transformation) that will bring them in line. More about that later.

So, that’s why people use computers for regression analysis nowadays. I’m old enough to remember doing it by hand for very (!) simple problems. If you know much about linear algebra, there are matrix formulas that are a lot easier than the monster above.
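Those matrix formulas amount to solving the "normal equations": stack a column of ones (for the intercept) next to the x values to form a matrix X, then solve (XᵀX)·coef = Xᵀy. A quick numpy sketch, again with invented data:

```python
import numpy as np

# The matrix form of least squares.  Data is invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix: a column of 1s (intercept) next to the x values.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X'X) coef = X'y.
coef = np.linalg.solve(X.T @ X, X.T @ y)   # [intercept, slope]
```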

The regression philosophy

I’ve called regression a “workhorse” and it is the immediate go-to for many researchers. For instance, if you want to compare groups to see if they look statistically the same, you can use something like a t-test or ANOVA. But you can do an ANOVA with regression analysis, and you end up with a lot more juicy information, like how you can predict the values of one set of variables from the other variables in the model.

Regression is a different way of looking at data than hypothesis testing with p values or confidence limits. I’ve used the word “model” several times above and it is at the very center of regression philosophy.

When you use regression, you look at your data (exploratory statistics) and you come up with a model you think might explain what you see. In other words, you think about what things in the universe might work together to cause a variable of interest to behave the way it does. The things “out there” in the world are the independent variables, and the variable (or variables) of interest is the dependent variable(s).

At base, regression is a descriptive methodology. It is used to describe the relationship between variables. Remember the three kinds of relationships that can exist between two variables. If one variable consistently increases with another variable, they are said to have a positive (or direct) relationship. If one variable consistently decreases as another increases, they are said to have an inverse, or indirect, relationship. Otherwise, one variable can be totally out of sync with another variable and, in that case, there is no relationship. Regression describes these kinds of relationships.

The slopes I mentioned above are also called “regression coefficients” since they are the coefficients of the regression equation. Think about what they do. Any slope is “rise over run”. If x is an independent variable and y is a dependent variable, then the slope is change in y per change in x. In other words, the regression coefficients for x tells you how much y changes (on the average) with a change of one unit in x. The other coefficient, the “y intercept” tells you the value of y when x is zero.

These coefficients describe the strength and direction between two variables (a negative coefficient indicates an inverse relationship). You can also backslide a bit and ask if the relationship you are seeing is statistically significant. There are p values and confidence limits associated with regression coefficients and regression equations. They’re usually generated by regression software and they are interpreted in much the same way as p values and confidence limits in tests of hypotheses.

Simple linear regression

The simplest scenario in regression is one in which a single dependent variable is regressed on a single independent variable.

In the section about correlation, we looked at data that described several variables in countries across the world that related to health, and birth and death rates. Let’s look at death rate and see if we can use it to predict infant mortality rates.

How it works

First, let’s look at a scatterplot of death rates (the x – independent variable) and infant mortality rates (the y – dependent variable).

I’ve added a trendline, linear regression equation and R2 (coefficient of determination) value. The last number gives an idea about how strong the linear relationship between the two variables is. It ranges from 0 (no relationship) to 1 (perfect relationship) and 0.73 is not too shabby. It means that 73% of the linear variation in infant mortality rates is explained by death rates.
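The coefficient of determination is easy to compute once you have predictions: it's one minus the ratio of the residual variation to the total variation. A toy sketch (numbers invented, and the fitted line is assumed to be y = 2x just for illustration):

```python
# R^2 = 1 - (sum of squared residuals) / (total sum of squares).
# Illustrative numbers only; the fitted line y = 2x is assumed.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.8, 6.1, 8.0]
predicted = [2.0 * x for x in xs]

y_bar = sum(ys) / len(ys)
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1.0 - ss_res / ss_tot   # near 1 = strong linear relationship
```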

LibreOffice Calc has an array function called “linest” that will do a no-frills linear regression on data arranged in columns. It just presents the numbers, so the results need to be relocated with labels for presentation. Here are my results.

The equation on the chart and the slope and intercept in the table agree. The equation for the model is infant mortality rate = 2.9 (death rate) – 30. You can predict infant mortality by multiplying the death rate by 2.9 and subtracting 30. There will be some deviation from the results because real data is noisy.

But what are the other numbers?

We’ve talked about standard errors before. Regression coefficients have those, too, and they can be used to set up confidence limits. You want coefficients that are not zero. A zero coefficient means that that variable has no effect on the dependent variable. The 95% confidence limits for the slope are 1.96 times the standard error on either side of the observed slope – 2.9 – 1.96*0.18 to 2.9 + 1.96*0.18, or about 2.5 to 3.3. 0 isn’t included in that interval, so we’re safe there.
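That check is one line of arithmetic, using the slope and standard error from the table above:

```python
# 95% confidence interval for a regression slope: slope +/- 1.96 * SE.
# Numbers taken from the example above (slope 2.9, standard error 0.18).
slope, se = 2.9, 0.18
lower = slope - 1.96 * se
upper = slope + 1.96 * se
significant = not (lower <= 0.0 <= upper)   # does the interval exclude zero?
```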

The intercept can be zero if it wants, so I’ll just ignore that for now.

The F test provides a measure of statistical significance for the whole model. The resulting p value is 0.051, and, if we use the usual 0.05 cutoff, we are not justified in rejecting the null hypothesis that the one independent variable we have has no effect on the dependent variable (given the size of our sample). It is close, though, and regression analysis provides us with a lot more diagnostic information. The confidence interval looks good, as does the coefficient of determination. Let’s look at the residuals.

A residual is the difference between an observed value and what would be predicted by the model. It’s a measure of the error in the observed values. For instance, the observed infant mortality rate for Albania is 30.8. With an observed death rate of 5.7, given the regression equation, we would expect an infant mortality rate of 2.9*5.7 – 30, or about -13.5.

That’s not very close but remember that we’re doing a juggling act trying to find an equation which will reduce all the residuals at the same time. If you look back at the scatterplot, you’ll notice that many of the residuals, especially at the right side of the plot, need a lot of reducing.
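The Albania calculation, spelled out as code (using the fitted line from the table above):

```python
# Residual = observed - predicted, using the fitted line
# infant mortality = 2.9 * death_rate - 30 and Albania's numbers.
death_rate = 5.7
observed = 30.8
predicted = 2.9 * death_rate - 30.0   # about -13.5
residual = observed - predicted       # a large residual: a poor fit here
```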

Let’s plot the observed values to the predicted values.

There’s a fairly tight linearity here, but there are also outliers, just as there are outliers in the data. Also, these rates can’t be negative, yet negative values are predicted.

A chart of residuals graphed to observed values should just look like a shapeless blob. Residuals represent error and we want our error to be random and normal. Here, error increases as values of infant mortality increase. You can see the effect in the flaring of the data scatterplot.

Let’s look at the histogram of the residuals.

That’s not too bad. It looks roughly normal and symmetrical. These residuals might balance out around the mean.

Actually, this model could use some work. Not enough variance is explained by the one variable (death rate). There are some outliers that are pulling the regression line off, and the residuals are scattered unevenly.

Maybe if we add another variable, but can we do that?

Multivariate – no problem

Yes, of course we can add another variable. In fact, let’s add the gross national product to the mix. The GNP is a sum that takes into consideration what wealth a nation makes and ignores what other nations make off that country. It is seen as an indicator of a country’s economic health. Maybe a country’s economic health impacts the health of its infants.

Here are the results when we try to predict infant mortality using both death rate and GNP.

The suggested regression equation to predict infant mortality is:

Infant mortality = -0.002(GNP) + 5.39(death rate) + 11.7

It’s not that surprising that, as GNP increases, infant mortality decreases, but in this equation, GNP doesn’t seem all that influential. Still, with a standard error of 0.0003, its coefficient is significant (the 95% confidence interval runs from -0.002 to -0.001; there’s little chance that the regression coefficient is actually 0).

But look at the coefficient of determination: 0.63! It’s less than when we just included the death rate. What went wrong?

Let’s look at the scatterplot of GNP vs. infant mortality.

Grph! That’s certainly not linear. Linear regression works with linear relationships! This is a problem. Is there any way to solve it?

Nonlinear – no problem

Yes, of course there is.

Notice that a power function, 2275.8 x (GNP)^-0.56, nicely represents the infant mortality rate. Maybe, if we apply a logarithmic function to the GNP, it will straighten the relationship with infant mortality out into a linear relationship we can use. Let’s graph the natural log of the GNP to infant mortality.

A coefficient of determination of 0.81…not too bad! So, what happens if we regress infant mortality on both death rate and log(GNP)?

Nice. The regression equation is:

Infant mortality = 3.6(death rate) – 18.6(LogGNP) + 141

And the p value is far less than 0.05, indicating a significant result. The R square of 0.71 isn’t too shabby either… wait. The R square for predicting infant mortality with just the death rate is 0.73. This model isn’t even as good as the single-predictor model! What’s going on?

Problems – diagnostics and fixes

Linear regression makes several assumptions and there are several ways to test for and address violations of those assumptions.

Linearity is easy. Linear regression really is linear and variables really should have a linear relationship if you are going to use linear regression on them.

It’s easy to tell if variables are linearly related. Look at a scatterplot of each pair (we did that on the correlation page with a scatterplot matrix) and if the points cluster around a straight line and there’s a large, say 0.8, coefficient of determination, then the relationship is linear.

If it’s not linear, then it can be put into shape with an appropriate transformation, just like we did above with the curvy relationship between GNP and infant mortality. And as you’ll see later, if a relationship is nonlinear, you can switch to a nonlinear regression.
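Here's a small sketch of the idea. The data is invented to follow y = 3·ln(x) + 2 exactly, so regressing y on x gives a mediocre R², while regressing y on ln(x) gives a perfect one:

```python
import math

# Straightening a curved relationship with a transformation.
# Invented data following y = 3*ln(x) + 2 exactly.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * math.log(x) + 2.0 for x in xs]

def r_squared(xs, ys):
    """R^2 of the least-squares line of ys on xs."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    a = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
        / sum((x - xb) ** 2 for x in xs)
    b = yb - a * xb
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - yb) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

raw = r_squared(xs, ys)                             # curved: R^2 well below 1
logged = r_squared([math.log(x) for x in xs], ys)   # straightened: R^2 = 1
```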

Normality of residuals. I’ve seen it claimed too often, even in “scholarly” works, that the variables in a linear regression have to be normal. On the contrary, linear regression is very accepting of departures from normality. What it doesn’t like is errors that won’t balance out. As discussed above, the residuals in a regression analysis should be normally distributed (or, at least, come from a symmetric distribution) so that each error on one side of the regression line will cancel an error on the other side.

You won’t get measurements without error but if all your errors cancel out, that’s the second best thing.

We checked the model that predicts infant mortality using death rate for normality of residuals. It was easy. We just looked at the histogram of the residuals. We could have used a variety of tests of normality…Shapiro-Wilk, Kolmogorov-Smirnov, a rankits plot….
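If you'd rather have a number than an eyeball judgment, the Shapiro-Wilk test is one line with scipy. The "residuals" below are simulated normal noise, purely for illustration:

```python
import random
from scipy import stats

# A quick normality check on residuals with the Shapiro-Wilk test.
# These "residuals" are simulated normal noise, for illustration.
random.seed(42)
residuals = [random.gauss(0.0, 1.0) for _ in range(50)]

statistic, p_value = stats.shapiro(residuals)
# A large p value means no evidence against normality; a tiny one
# means the residuals probably don't come from a normal distribution.
```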

Homoscedasticity of errors: This is something else that can cause errors to not cancel out.

“Homoscedasticity” is a big word that means “same variance”. And that’s understandably confusing. The usual explanation is that each data point should have the same variance, but a data point doesn’t vary, does it?

The point is difficult.

Data points vary in relation to each other, in relation to time, and, especially in relation to that straight regression line.

Let me approach the same issue from a different direction.

Error is all the things going on in the world that have no direct bearing on what you’re measuring. Measure the length of a metal rod and come back in an hour and measure it again and it will be slightly different because….well, because the temperature changed, or the barometric pressure changed, or it oxidized a little, or a butterfly in South America flapped its wings….who knows! Some little something changed somewhere. That’s error.

Is it important? Why, yes. Of course it is. You might not know every reason why the segments of a bridge change shape, but you had better make allowances for it. And you’d better have at least a statistical understanding of how it changes shape over time.

By definition, error is the result of random things happening in the environment. Error should not be related to what you’re measuring. Error should not be correlated with what you’re measuring. In other words, error should be independent of any and all the variables in a regression model (except, see the comment on Deming regression below).

If an error factor is significantly related to, especially, the dependent variable, then it’s not error and it’s a variable that should be included in the model.

The variance of a data point’s error is measured in relation to the other variables.

The opposite of homoscedasticity is heteroscedasticity. How do you test for it?

The easiest way is to just look at it. The error is the residuals. What you don’t want them to be related to is a variable, especially the variable you’re trying to predict, so just graph the residuals to the dependent variable. There should be no discernible pattern. It should be a formless cloud around the y=0 line.

One of the more common patterns is a tapering cloud that indicates that, as the dependent variable increases, error diverges or converges on a line.

If your model does have a strong heteroscedastic trend, there are transformations, called “variance-stabilizing transformations”, that can be applied to your data to reduce the tendency.

Independence of observations: The technical term for its violation is “autocorrelation”. The values in a variable should not strongly influence each other, and the errors should most certainly not influence each other. It’s a pretty tall order in a universe where everything is connected, but you don’t want a strong relationship between observations; otherwise, that’s not error, it’s structure that should be another variable.

A way to catch this culprit is to number the cases and graph the numbers to the residuals. There shouldn’t be a discernible pattern.
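A standard numerical check is the Durbin-Watson statistic: values near 2 suggest independent errors, while values near 0 or 4 suggest positive or negative autocorrelation. A pure-Python sketch on toy residuals (which alternate in sign, so they show negative autocorrelation):

```python
# Durbin-Watson statistic: d = sum of squared successive differences
# of the residuals, divided by the sum of squared residuals.
# Toy residuals, deliberately alternating in sign.
residuals = [0.5, -0.3, 0.8, -0.6, 0.2, -0.4, 0.7, -0.5]

num = sum((residuals[i] - residuals[i - 1]) ** 2
          for i in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den
# dw is always between 0 and 4; alternating residuals push it above 2.
```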

Outliers can be trouble: like many statistics, extreme values pull the regression line off. And like most data, outliers are obvious on a scatterplot. You can deal with outliers by either deleting them from your data (my suggestion is to analyze your data with and without the outliers – they might be important) or by using one of the regression procedures designed for that problem, called “robust regression”.

No collinearity: Collinearity is when some of the explanatory variables are highly correlated with each other.

Think about it: regression analysis exists because independent variables say something important about dependent variables, and regression analysis uncovers what they are saying. If two independent variables are strongly correlated, they’re basically saying the same thing about the dependent variable. Saying the same thing twice doesn’t help. In fact, if one variable says the same thing better than another variable and you throw the worse variable into the mix, you’re just diluting what the better variable is saying. The second variable is just noise.

You can easily see how correlated independent variables are with a correlation matrix (like the one we looked at in the StatFile on Correlation for this very data.)
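A correlation matrix takes one line with numpy. The three "variables" below are invented; note how b, built as an almost exact copy of a, shows up as a near-1 correlation:

```python
import numpy as np

# Spotting collinearity with a correlation matrix.  Invented data:
# b is nearly a multiple of a, while c is unrelated to either.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2.0 * a + np.array([0.01, -0.02, 0.01, 0.0, -0.01])  # ~collinear with a
c = np.array([5.0, 1.0, 4.0, 2.0, 3.0])                  # unrelated

corr = np.corrcoef([a, b, c])   # 3x3 matrix of pairwise correlations
# corr[0, 1] is essentially 1: a and b are saying the same thing.
```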

The best way to handle this problem is to check different models using different variables and, if adding a variable reduces the effectiveness of the model (and there are several measures of the effectiveness of a model including R squared and f values), take that variable out of the model.

The problem with this data

Do the data meet the requirements for a good regression analysis? Let’s see.

Are the relationships between the variables linear? Some are, but, for instance, the relationship between gross national product and infant mortality is definitely not linear. That can be dealt with, and a transform, taking the log of the GNP, straightens it out nicely. Let’s give this data an A- on linearity.

Are the errors normal? It’s debatable but, all things considered, they seem to be normal enough to cancel out in the total, so that shouldn’t be a problem. Hmmm… let’s give it a B- for normal errors.

Homoscedasticity…I have a residual plot for the two-variable model. Here it is.

There’s a definite pattern. As death rate increases, the residuals increase in the negative direction. That’s a problem because it indicates that there is something that we are not measuring that should be in the model.

Are the individual data points affecting each other? Here’s a scatterplot of the residuals to the case numbers.

Again, there is a pattern, the classical flaring pattern. Residuals increase in a negative direction as case numbers increase and, since the list groups countries on continents, we can assume that geographic location ties countries together in relation to infant mortality or the recording of infant mortality. What’s at the bottom of the list? Africa. 

There are variance stabilizing transforms like logarithms and Box-Cox transforms that can be used to reduce this effect, but let’s check the last two assumptions and see if they would actually help.

Outliers…There are a couple of obvious outliers but we found those to be fairly innocuous and decided to leave them in the data set.

Collinearity….I saved the best until last. Here’s that correlation matrix for our data.

Just about the only variables that show a weak correlation with all the other variables are geographic region, GNP, and death rate (which is mildly related to birth rate). But by now we know that the weak relationships only look weak when checking for linear relationships. When we look at nonlinear relationships, even those variables are strongly related to everything else.

This data simply does not give a statistician much to work with for developing a predictive model. About the best they can do is use one predictor. When predicting infant mortality using death rate as the predictor, we can account for around 73% of the variation in infant mortality by looking at the overall death rate of a population.

That’s really not too bad, but it means that there are other factors that account for the other 27%.

If you use the log of the GNP, you improve the prediction to 81%.

Special regressions

There are many alternative regression procedures to deal with special data. For instance, some computer programs will look at each variable in a data set to see if it adds anything to your model. If it doesn’t, they will remove it for you. That’s called stepwise regression.

There is actually a way to deal with the kind of collinearity present in our Countries data above. Ridge regression reduces the variance of the coefficient estimates by adding a little bias to them. Bias is generally a bad thing, but here it’s a trade-off…variance vs. bias.

LASSO (least absolute shrinkage and selection operator) is another regression strategy, related to ridge regression with the addition of variable selection, that can be used to increase precision, for instance, in the case of collinearity.
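The ridge idea is a one-line change to the normal equations: add a penalty λ to the diagonal of X'X. A minimal numpy sketch on invented, deliberately collinear data (the intercept is omitted for simplicity):

```python
import numpy as np

# Bare-bones ridge regression: the normal equations with a penalty
# lam added to the diagonal, trading a little bias for a reduction
# in variance.  Invented data; x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = x1 + rng.normal(scale=0.01, size=30)          # nearly collinear
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=30)

X = np.column_stack([x1, x2])
lam = 1.0
coef = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# Ridge splits the shared signal between x1 and x2 instead of
# producing huge, unstable, offsetting coefficients.
```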

Generally, you need around 20 cases per predictor variable to do a regression analysis. There are ways to deal with small samples. Partial least squares regression reduces the number of variables using something like Principal Component Analysis and performs the regression on the components.

If errors vary significantly between the predictor variables in a model, the individual errors can be taken into account using Deming regression.

There are several versions of robust regression that can be used when there are troublesome outliers involved. Often, robust regression uses an iterative technique to identify outliers and measure how much they pull the regression line off, then they can be weighted to reduce their influence.

Linear regression can handle most kinds of independent variables – counts, rankings, categories, measurements, or even independent variables that have nonlinear relationships with the dependent variable(s) (and, yes, there can be more than one dependent variable in a linear regression model). Complications arise when the dependent variable is not continuous or not normal. For those cases, there are more special regression procedures.

If you know how the dependent variable is distributed (Poisson, binomial, negative binomial), there is likely a well developed regression technique out there, tied up neatly in a software package, for you.

If the dependent variable is nominal or ordinal, logistic regression can be used. There are procedures for binary (two-valued), nominal (categorical), and ordinal (ranked) dependent variables.

I haven’t even scratched the surface. The take-away is that, if you have data that doesn’t quite fit the usual regression procedure, don’t despair, there is a regression technique out there for you.

Regression and ANOVA – the connect

We talked a little about ANOVA back when we were talking about hypothesis testing. ANOVA stands for “analysis of variance” and that is exactly what it is. It determines if groups are different by seeing if their variances are significantly different. Variances, in this case, are measured by sums of squared deviations of data points from their means.

We’ve talked about a lot of deviation in this discussion. That’s what a residual is: the deviation of an observed score from a predicted score. If you look at the formula for a regression coefficient back at the top of this page, you will see a lot that looks like sums of squared deviations.

So it shouldn’t be too surprising that many programs that do regression analyses also give you ANOVA statistics. Once you have generated all the statistics for regression, it’s nothing to shuffle them around to get an ANOVA table. You get two for the price of one.

But wait! There’s more!

When you’re using matrices to calculate the statistics, regression is a lot easier than traditional ANOVA. As an extra bonus, whereas traditional ANOVA techniques require categorical predictors for continuous dependent variables, you can set up regression models for just about any kind of data.
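The regression-ANOVA connection can be demonstrated directly: dummy-code group membership, run the regression, and the model's F statistic is the same number a one-way ANOVA gives. A sketch with two made-up groups:

```python
import numpy as np
from scipy import stats

# ANOVA as regression: dummy-code the groups, regress, and compare
# the regression F statistic to the one-way ANOVA F.  Made-up samples.
g1 = np.array([4.1, 5.0, 5.9, 4.8])
g2 = np.array([7.2, 6.8, 8.1, 7.5])

y = np.concatenate([g1, g2])
d = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # dummy for group 2

X = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [grand..., group difference]
fitted = X @ coef
ss_res = np.sum((y - fitted) ** 2)            # = ANOVA's within-group SS
ss_tot = np.sum((y - y.mean()) ** 2)
f_reg = (ss_tot - ss_res) / (ss_res / (len(y) - 2))

f_anova, _ = stats.f_oneway(g1, g2)
# f_reg and f_anova are the same number: two for the price of one.
```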

General linear models – just to let you know

Matrix methods to calculate regression models, ANOVA, and ANCOVA are the same so it’s very convenient to just lump them all together and call them “general linear models” or GLM.

ANCOVA stands for “analysis of covariance” and includes corrections for highly correlated variables. It’s a way to control for variables that you don’t want affecting your results. Perhaps you know that gender has an effect on bronchitis, but you want to check the effectiveness of a drug without the “noise” of gender, or maybe you want to know both. You can use ANCOVA to cancel out the effects of gender.

Generalized linear models

So, while we’re generalizing why don’t we just add non-normal variables and some Bayesian methods. After all, the normal distribution is a member of the exponential family of distributions, so we could just expand to an exponential regression procedure. Here, things get very complicated but once someone packages it all nicely into a computer program, the user doesn’t have to worry about the complexity, they just have to know how to interpret the results.

But with generalized linear models I’m just about as general as I want to go with this general introduction.

Approaching nonlinearity as nonlinear

We’ve been talking about linear regression. What if the relationships among the variables are not linear and you can’t find a transformation to make them linear?

You might remember a point in algebra where you flipped the question, “Here is a function. What does it look like on a graph?” around and asked, “Here are some data points. What does the function that generated them look like?”

That’s called curve fitting and you can do the same thing in statistics with nonlinear data to describe relationships. The result is nonlinear regression.

If you have a spreadsheet at your disposal, like, maybe, Calc, you might have noticed that you can put trendlines on scatterplots. You might have also noticed that you have options other than “linear” for those trendlines. Those options are nonlinear regression.
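Behind the scenes, a nonlinear trendline is just curve fitting. Here's a sketch using scipy to fit a power function directly, the way a spreadsheet "power" trendline does. The data is generated (noise-free, purely for illustration) from the same shape as the GNP curve above, c = 2275.8 and k = -0.56:

```python
import numpy as np
from scipy.optimize import curve_fit

# Nonlinear regression: fit y = c * x^k directly to the data points.
def power(x, c, k):
    return c * np.power(x, k)

# Noise-free data generated from c=2275.8, k=-0.56, for illustration.
x = np.array([100.0, 500.0, 1000.0, 5000.0, 20000.0])
y = power(x, 2275.8, -0.56)

# curve_fit iteratively adjusts (c, k) to minimize squared residuals,
# starting from the guess p0.
(c_hat, k_hat), _ = curve_fit(power, x, y, p0=(1000.0, -0.5))
```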

Now, one last honorable mention before we leave regression, sort of….

Causal analysis

Correlation isn’t causation (for the thousandth time), but neither is regression, and yet, you can tease out causal structures using partial correlation coefficients.

There are such things as partial regression coefficients, and you can do the same sort of thing with them. In fact, there are some mindbendingly complicated computer programs, like structural equation modeling software (LISREL was a famous early package), that do just that.

So this is just a taste of the vast potential of regression and why it is the workhorse of statistics.

