Analyzing nominal data

What is nominal data, again?

There are two kinds of data – continuous and discrete. If, for any two values, you can find a value between them, then the data is continuous. Otherwise, it’s discrete.

Again, there are two kinds of data – qualitative and quantitative. Quantitative data are measurements – temperature, length, wavelength, concentration. Quantitative data tends to be continuous. Qualitative data are attributes – gender, color, texture, composition (as in paintings and photographs). Qualitative data tends to be analyzed by counting instances in different categories. Counting produces discrete data.

There are two kinds of discrete data. If there is some relevant order to the data: smaller to larger, better to worse, alphabetical order, spectral order – then the data is ordinal. Otherwise, it’s nominal.

Nominal data is discrete, qualitative data without any relevant order. Nominal data is purely counts. The only thing nominal data tells you is how many individuals occur in a number of specified categories. Yet, nominal data provides useful information and there are many statistical procedures that are designed specifically to work with nominal data.

Here is a simple example that I will use to illustrate several important principles of nominal statistics.

I made a there-and-back tour of a local trail called Bear Creek Trail. Going, I greeted people without smiling. On the way back, I smiled at them. Are smiles infectious?

Counting stuff

I recorded the number of people I met on the trail and I recorded whether they smiled at me or not. There were two variables. One was whether I smiled or not. The other was whether the people I met smiled or not. Both variables were dichotomous – there were two categories for each.

I counted by recording tally marks in a notebook.

There are actually several important counting methods and there are several instruments that can be used for recording counts.

A hand clicker is a common instrument for psychometrists, psychologists, and quality control technicians. It’s a simple tool with one button that can be pressed to advance a count by one. There’s only one other control, a knob that can be turned to return the count to zero.

Another common, simple instrument is a timer. Start it and it begins to tick off the seconds. It can also be reset to zero. Sometimes, it can be set to a specific time interval and it will count down to zero, at which time it sets off an alarm.

Most calculators can be used as counters. Enter “1+1” for the first count and “=” for the second. Then, every time you press “=”, the count will advance by 1.

There are counter applications that can be loaded onto a smart phone that work just like the hand clicker.

I have created a spreadsheet document to serve as a repository for tools that I program (aptly named ToolBook). It will record both simple and timed counts in a spreadsheet format. You can download it for free from the LabBooks page of the Therian Timeline.

Watches today have a variety of functions. Most watches with timer functions will count up or down and will take a series of times (or “laps”). Typically, they will give you an auditory notification at a particular time or when a countdown is finished.

A count can be as simple as… well, counting. How many males and how many females do you have in a group? Just count them.

A little more involved are time interval or frequency counts. For those, you set a timer to let you know when a specific time interval has elapsed, say, 5 minutes, and you record counts within the 5 minute intervals for a set number of intervals. You could check traffic flow at a particular place by counting the number of cars that pass during 5 minute intervals for an hour.
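The traffic-flow example can be sketched in a few lines of Python. This is my own throwaway function, not part of any package, and the car timestamps are made up for illustration:

```python
# A sketch of a frequency count: bin event times (in seconds) into
# consecutive 5-minute intervals over one hour.
def interval_counts(timestamps, interval=300, total_time=3600):
    """Count events in each consecutive interval of length `interval`."""
    counts = [0] * (total_time // interval)
    for t in timestamps:
        if 0 <= t < total_time:
            counts[t // interval] += 1
    return counts

# hypothetical times (in seconds) at which cars passed during one hour
cars = [12, 75, 130, 300, 310, 627, 1500, 1501, 1502, 3599]
counts = interval_counts(cars)   # 12 five-minute bins
```

A spreadsheet column of timer readings from ToolBook could be fed through the same kind of binning.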

Interval counts time how long it takes for a set number of occurrences of an event to happen and record the time or series of times.

Time sampling isn’t about how many occurrences happen during an interval but how many occurrences happen at particular times. This is useful when you want to get an idea how often a behavior occurs but you don’t want to watch constantly – perhaps you have to be doing other things while you’re counting. For that, you arrange to observe at the end of every time period, say, every half hour, and count how many people are engaged in the activity or behavior. Begin by counting the number of subjects involved and record the count at the end of every time period.

This is also called a Planned Activity Check (or Placheck).

A timer spreadsheet like those in ToolBook gives you the advantage of saving repeated counts or times in a column of cells so that you can use them for further analysis.

Visualizing a nominal variable

Frequency counts are counts, so it shouldn’t be any surprise that the most popular ways to visualize counts are bar charts, column charts, and line graphs.

In general, if you are graphing a series of counts over time, a column chart emphasizes the time element more than a bar chart. A bar chart is more useful when you have a lot of counts to display.

Here is a column chart display of the smiles data.

It certainly looks like smiling is infectious but we have the same kind of question about significance that we had when we were looking at continuous data. Is the strength of the relationship strong enough to assume that the apparent difference isn’t just caused by random error?

There are other things suggested by this picture. There seems to be a greater discrepancy in people’s behavior when I don’t smile at them than when I do.

Nominal descriptive statistics

If you aren’t already pretty intimate with statistics, you might be surprised at what you can get out of mere counts. Whole books are written on nominal statistics. But since counts are all you have, there isn’t much to work with, and descriptive statistics don’t tend to be terribly interesting. In our data, when I smiled at people, 48 smiled back and 29 didn’t. More smiled back than didn’t, but what does that mean – that more people had a good morning that day? Then you notice that, when I didn’t smile at people, only 20 smiled back and 60 didn’t. There definitely seems to be a pattern there, but what does it mean? What can you do with counts?

An arithmetic average isn’t very useful. When I smile at people, 48 smile back and 29 don’t. The average is 38.5…what? Smiles? Well….that’s half of the people sampled. So what?

Actually, with counts, one of the other measures of central tendency is much more appropriate than an arithmetic average. The mode of the group I smiled at is 48. That is meaningful. It is the typical category. The largest group of people in that sample is the group that smiled back. In the other sample, the mode was 60 – the people that didn’t smile back. What was the overall mode? The 60 people who I didn’t smile at and who didn’t smile back at me.

One would be tempted to say that, by nature, people need prompting in order not to be grouches, but wait. The group I didn’t smile at was larger than the group I did smile at. Remember, I smiled at people on one leg of the walk and not on the other, and I had to take what I got, so the samples were of different sizes. To even the groups out, we can look at the proportions.

Of the group I smiled at, 62% smiled back. Of the people I didn’t smile at, 75% didn’t smile back. In other words, if there had been 100 people in each sample, this data indicates that, of the group that I smiled at 62 would have smiled back and 38 would have been grouches. Of the group that I didn’t smile at, 25 would have smiled at me anyway, and 75 would have followed suit.
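The proportions are just the counts from the text divided by their group totals. Here is the arithmetic as a quick Python sketch (the dictionary names are mine):

```python
# Counts from the smiles data
smiled_at = {"smiled_back": 48, "no_smile_back": 29}       # 77 people I smiled at
not_smiled_at = {"smiled_back": 20, "no_smile_back": 60}   # 80 people I didn't

# Proportion who smiled back when I smiled,
# and proportion who didn't smile back when I didn't smile
p_back_when_smile = smiled_at["smiled_back"] / sum(smiled_at.values())
p_none_when_none = not_smiled_at["no_smile_back"] / sum(not_smiled_at.values())
```

These come out to 0.62 and 0.75, the 62% and 75% in the text.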

That doesn’t look as earthshaking. Maybe it was just random influences. We’ll be testing that later on but you can see how nominal descriptive statistics: counts, modes, proportions, maximums and minimums, ranges – can be interesting.

Two nominal variables – proportions revisited

On the last page, we talked about a test for the difference between two proportions. Did you consider the idea that that test could be used to test the proposition that smiling is contagious? Let’s phrase the null hypothesis like this: “There is no difference between the proportion of people who smile when I smile at them and the proportion of people who do not smile when I do not smile.” The proportion of “I smile and they smile back” among all the people I smiled at is 62%. The proportion of “I don’t smile and they don’t smile back” among all the people I didn’t smile at is 75%.

I obtain a p value of 0.09 which is too high for me to confidently reject the null hypothesis.

Two nominal variables – association

One problem is that this test doesn’t capture the complexity of the situation. It treats the two groups “I smile”/”I don’t smile” as two different unrelated processes, which, of course, they aren’t. We need a test that looks at the whole situation. There are two variables, each having two states.

In fact, if you test the difference between the proportions for “They smile when I smile” (which is 0.62) and “they smile when I don’t smile” (0.25), you get a very different story. The p value is 0.0000024. That would be significant in anybody’s book.
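Both p values can be reproduced with the standard two-proportion z test. Here is a sketch using only the Python standard library (the function is my own, not from a statistics package):

```python
from math import sqrt, erfc

def two_proportion_p(x1, n1, x2, n2):
    """Two-sided z test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))   # two-sided p value for a standard normal z

# "They smile when I smile" (48/77) vs "they don't smile when I don't" (60/80)
p_same_behavior = two_proportion_p(48, 77, 60, 80)   # about 0.09
# "They smile when I smile" (48/77) vs "they smile when I don't" (20/80)
p_smile_back = two_proportion_p(48, 77, 20, 80)      # about 0.0000024
```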

Are the two variables linked – the term is “association” – are the two variables associated?

Visualizing two nominal variables

There is another kind of graph that can give us a better idea of what’s going on here – the 3D column chart.

Here, the “contrary motion” is pretty obvious. When I smile, they are likely to smile back, and when I don’t smile, they are more likely to not smile back. How can we quantify this relationship?

Let’s look at a table of these values.

Contingency table – joint frequency and marginals

The four northwestern cells in the table collect the counts. “Cell 1,1” means the cell in the first row and the first column. That cell collects the number of cases in which I smile and they smile back. Since we’re talking about one category from both variables, this is a joint frequency.

A joint frequency is the count of overlapping categories.

The cells along the east and south sides of the table are totals. For example, cell 1,3 contains the number of people that I smiled at. These are called marginals (row and column marginals) and they tell how many cases are in each of the categories. 77 of the cases are in the category “I smile”.

Notice that this table is set up so that the independent variable – the variable that I’m manipulating in the experiment – takes the rows. The columns contain the category count for the response (or dependent) variable.

The cell in the southeast corner of the table is the grand total. It is the number of subjects that took part in the experiment.

A table set up like this is called a “crosstabulation” or a “contingency table”. It is a visual representation of how the categories of the variables are contingent on each other.

This is the beginning of the analysis of any two nominal variables.
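The table and its marginals are easy to hold in plain Python. The counts below are the ones from the text:

```python
# The smiles crosstabulation.
# Rows are the independent variable: I smile / I don't smile.
# Columns are the dependent variable: they smile back / they don't.
table = [
    [48, 29],   # I smile
    [20, 60],   # I don't smile
]
row_marginals = [sum(row) for row in table]        # people I smiled at / didn't
col_marginals = [sum(col) for col in zip(*table)]  # people who smiled back / didn't
grand_total = sum(row_marginals)                   # everyone in the experiment
```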

Pivot tables

In most studies of this kind, a record book will have a column for each variable and the categories are coded.

Column 1: “I smile”=1, “I don’t smile”=0
Column 2: “They smile back”=1, “They don’t smile back”=0

The researcher ends up with a column of case numbers and two columns of 1s and 0s.

Case   Stimulus   Response
1      1          0
2      1          1
3      0          0
…

An easy way to do that is to just take a laptop, tablet, or smart phone with a spreadsheet app with you and record the observations as you go. But then you have to transform the raw data into a contingency table.

Luckily, most modern spreadsheets have a utility called a pivot table. If your app doesn’t have a pivot table, you can transfer the raw data to a spreadsheet that does. 

You can expect some small differences between pivot tables but the general idea is that you feed it the raw data and it does all the counting for you and arranges the numbers into the contingency table form I show above.
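If your raw data ends up in Python rather than a spreadsheet, pandas does the same job. A sketch, with a few hypothetical cases coded 1/0 as in the record book above:

```python
import pandas as pd

# Hypothetical raw record: one row per case
raw = pd.DataFrame({
    "Stimulus": [1, 1, 0, 0, 1, 0],   # I smile = 1
    "Response": [0, 1, 0, 1, 1, 0],   # they smile back = 1
})

# crosstab does all the counting; margins=True adds the marginals
table = pd.crosstab(raw["Stimulus"], raw["Response"], margins=True)
```

The result is the same contingency-table layout a spreadsheet pivot table would produce.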

Pivot tables can also work with continuous data, give you summary statistics and group multiple variables pretty much however you want.

Joint probabilities – percentages and odds (difference between I smile and I don’t smile)

Counts are not the only things you can get from a contingency table (not even close!). By dividing cell counts by marginals, you can describe all the probabilities, but as I hope you noticed in our look at proportions above, there are several kinds of probabilities and you have to be careful which you are using.

Each row of a contingency table displays the counts in one category of the independent variable (if the table is set up according to convention). Each cell contains the count for one category of the dependent variable within the independent variable. The row marginal contains the total count for the category of the independent variable. If you divide a cell count by that row total you have a fraction with a particular part of a category for the numerator and the total count for the denominator. That should be a familiar setup – it’s a probability.

Here are the row percentages for our smiles data.

Again, the independent variable is whether I smile or not. It is a dichotomous variable – it has two categories: I smile and I don’t smile. If you put into one bag all the observations I collected by smiling at people and recording their responses, and drew one out at random, what is the probability that it would be one of the people who smiled back?

There were 48 people who smiled back out of a total of 77, so the probability is 0.62 or 62%. And, since this is a dichotomous variable – they either smiled back or they didn’t – the probability that they did not smile back is 1-0.62=0.38. (A probability of 1 indicates certainty, so, since it’s absolutely certain that they did one or the other, the sum of the two probabilities has to be 1.)

The row probabilities are the probabilities that, given that one of the categories of the independent variable is true, one of the categories of the dependent variable is also true.

The column percentages are the probabilities that, given that one of the categories of the dependent variable is true, one of the categories of the independent variable is true. For instance, of the people who didn’t smile back at me, what is the probability that I smiled at them? That would be cell 1,2. If a person did not smile back at me, there’s a 33% chance that I did smile at them.

The total percentages are the probabilities that a single observation is in a particular cell of the contingency table. They are cell counts divided by the total number of observations. So, the probability that I didn’t smile at a person but they smiled at me anyway is 0.13 or 13%.
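All three kinds of percentages fall out of the same table with three different denominators. A sketch in plain Python, using the counts from the text:

```python
table = [[48, 29], [20, 60]]          # rows: I smile / I don't smile
n = sum(sum(row) for row in table)    # grand total

# Row percentages: cell / row total -- P(response | stimulus)
row_pct = [[c / sum(row) for c in row] for row in table]

# Column percentages: cell / column total -- P(stimulus | response)
col_totals = [sum(col) for col in zip(*table)]
col_pct = [[c / col_totals[j] for j, c in enumerate(row)] for row in table]

# Total percentages: cell / grand total -- joint probability of one cell
tot_pct = [[c / n for c in row] for row in table]
```

Row 1, column 1 of row_pct is the 62%; row 1, column 2 of col_pct is the 33%; row 2, column 1 of tot_pct is the 13%.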

In some fields, particularly in medical research, cell probabilities are called risks. You can see why if you think of the independent variable being exposure to a virus and the dependent variable being coming down with the disease. The risk is the chance that, being exposed to the virus, you will come down with the disease.

Sometimes, statisticians talk about the odds of something happening. It’s similar to the gambling concept of odds but not quite the same thing. In gambling “odds” compares stakes. In statistics, they compare counts.

Odds are used almost exclusively for describing dichotomous data. Where a probability uses a fraction to compare a part to the whole, odds compare one part to the other part.

So the odds of someone smiling back at me when I smile are 48 to 29, or about 1.66.

Notice that, if two events are equally likely, the odds of either of them happening is 1. If the probability of the most likely event is in the numerator, the odds will be more than one. If the probability of the least likely event is in the numerator, the odds will be less than one.
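Odds and probabilities are two views of the same counts, related by odds = p/(1-p). A quick sketch:

```python
# Odds compare one part to the other part, not a part to the whole.
smiled_back, did_not = 48, 29                 # responses when I smile

odds_smile_back = smiled_back / did_not       # about 1.66
p = smiled_back / (smiled_back + did_not)     # the matching probability, 0.62

# The two descriptions are interchangeable: odds = p / (1 - p)
assert abs(odds_smile_back - p / (1 - p)) < 1e-12
```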

We will be looking at odds and risks later, partly because of their importance in medical statistics, but also because odds form a basis for regression analysis for nominal data.

Residuals

The distance of an observed value from an expected value is called a residual. Residuals are usually thought to be measures of random error. The philosophy behind this is that science assumes relationships to be simple. In a relationship without error, if you graph the data points, you will see a line or curve. What you actually see is a cloud of points that more or less follow a line or curve.

You’ve seen several formulas containing the sums of the differences between data points and the group mean. Remember that the group mean is also the expected value for that group. So these differences are residuals.

What are the expected values in a contingency table?

We want the values for the cells that we would expect if the two variables were entirely independent, given the same marginal values. We can use the marginals to find the probability that a random observation will be in the row of a particular cell by dividing the row total by the grand total. We can also find the probability that an observation will be in the column of the cell by dividing the column total by the grand total. To find the probability that an observation will be in a particular row and a particular column (i.e., a particular cell), we just multiply these two probabilities. That means that the probability that an observation is in a particular cell can be calculated by:

(Row total)(Column total)/Total squared

But we don’t want a cell probability. We want a cell count, so, to get what we want, we just multiply the probability by the total. One factor of the total cancels from the numerator and the denominator, and the formula for the expected value of a cell count becomes:

(Row total)(Column total)/Total

If we go through our contingency table and calculate all the expected cell counts, it will look like this.

The residuals will be the difference between the observed cell counts and these expected cell counts.

Oops. All the residuals canceled out. That won’t tell us much, but we’ve seen this before and we know how to prevent it – either take the absolute values of the residuals or their squares. It turns out that the square “behaves better” statistically, so here are the squared residuals.

This is a symmetric table: the northwest and southeast cells are large and the northeast and southwest cells are small, so it really shouldn’t surprise you that the diagonal and antidiagonal cell residuals cancel out.

Residuals and squared residuals can’t be compared between contingency tables because, although they tend to follow normal distributions, the means and standard deviations tend to be different. We’ve also seen this before and what we’ve done with other statistics is standardize them by dividing the residuals by the standard deviations of the differences – that is the square root of the expected counts. 

For large samples, standardized residuals are z scores. All standardized residuals have means of 0 and standard deviations of 1 so they can be compared. In addition, they represent the strength of the relationship between the observed and expected cell values. They make it easy to see which cells are contributing most to the variance in the data. 

As a general rule for interpreting standardized residuals, if the standardized residual is less than -2, the cell’s observed frequency is significantly less than the expected frequency (because it’s more than two standard deviations below it). If it is more than 2, then the observed frequency is significantly larger than the expected frequency.

Looks like we have some significance, here.

There is also an adjusted standardized residual. Contingency table statistics are often affected by the size and shape of the table. Your research questions have nothing to do with the size and shape of your contingency tables, so these hypersensitive statistics are often calmed down with “adjustments”. Here, the expected values in the denominator are “re-proportioned” by multiplying them by one minus the row proportion of the total and one minus the column proportion of the total.

Here are the adjusted residuals of our smiles data.
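The whole chain – expected counts, residuals, standardized residuals, adjusted residuals – can be sketched in a dozen lines of Python. The counts are the ones from the text; the variable names are mine:

```python
from math import sqrt

table = [[48, 29], [20, 60]]            # the smiles data
rows = [sum(r) for r in table]          # row marginals
cols = [sum(c) for c in zip(*table)]    # column marginals
n = sum(rows)                           # grand total

# Expected cell count: (row total)(column total) / grand total
expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]
residual = [[table[i][j] - expected[i][j] for j in range(2)] for i in range(2)]

# Standardized residual: residual / sqrt(expected count)
std_res = [[residual[i][j] / sqrt(expected[i][j])
            for j in range(2)] for i in range(2)]

# Adjusted residual: the denominator is re-proportioned by
# (1 - row proportion) and (1 - column proportion)
adj_res = [[residual[i][j] / sqrt(expected[i][j]
            * (1 - rows[i] / n) * (1 - cols[j] / n))
            for j in range(2)] for i in range(2)]
```

The raw residuals sum to zero, as the text warns, and every standardized residual in this table is beyond the ±2 rule of thumb.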

What’s a chi square?

Another way to standardize the squared residuals is to just divide them by the expected value. Here are the chi squares of our smiles data.

Here, terminology is a bit scrambled: these values are called “chi square” because they are used to calculate a chi square test statistic, not because they follow a chi square distribution. In fact, they tend toward normality. But they do indicate the relative contribution to the variance.

Their importance lies in the test they are used to perform.

The chi square tests

A chi square distribution arises when several squared standard normal deviates are summed. A standard normal deviate is a random sample from a standard normal distribution. For example, a cell chi square is approximately a squared standard normal deviate. The parameter that specifies a chi square distribution is its degrees of freedom, which is also its mean. The degrees of freedom is just the number of deviates being added together.

There are several important statistical tests based on the chi square distribution. One is the single sample chi square test used to test the difference between observed values in a sample and their expected values. This provides a popular test of goodness of fit, described below.

Another is a test of association of variables in a contingency table. If you add the cell chi square values together (that would be the grand total in our chi square table above), you end up with a test statistic that follows a chi square distribution with (R-1)(C-1) degrees of freedom, where R is the number of rows in the table, and C is the number of columns. Here, the chi square value is 22.28 and the degrees of freedom is (2-1)(2-1) or 1. The null hypothesis is that there is no difference between the expected and observed counts in the contingency table. A p value can be found in a table of chi square values or with a chi square distribution function on a spreadsheet. Here, we get a p value of 0.0000024. That’s much smaller than we really need. The null hypothesis can be safely rejected.
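The test is short enough to do by hand. Here is a standard-library sketch; for one degree of freedom, the chi square tail probability reduces to a complementary error function, which is how the p value is computed below:

```python
from math import sqrt, erfc

table = [[48, 29], [20, 60]]
rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
n = sum(rows)

# Sum of (observed - expected)^2 / expected over all four cells
chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
           / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))   # about 22.28

# With (2-1)(2-1) = 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
p = erfc(sqrt(chi2 / 2))                          # about 0.0000024
```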

You might see other chi square test results. A common one is a maximum likelihood chi square that uses information theory to come up with a statistic. Although the two chi squares are usually a little different, the outcome (significance) will always be the same.

What does the chi square test tell you? If there is a significant relationship indicated, it only tells you that there is at least one cell in which the observed count is significantly different from the count that would be expected by pure chance. It doesn’t tell you how many or which cells have a significant difference, just that one exists. Ferreting out the actual culprits takes a little more work.

Fisher exact test and Yates’ correction

The chi square test is inappropriate if any cell of a 2×2 contingency table has fewer than 5 observations in it. For such cases, the Fisher Exact Test works well. It is called that because it was invented by Ronald Fisher and it gives an exact p value calculated by combinatorial methods instead of an estimate from a chi square distribution.

The Fisher test can be used for any contingency table but it is a real pain to calculate by hand and exponentially more so as the size of the table increases. Luckily, the chi square test is fine for tables larger than 2×2.

For a 2×2 table, the formula

p = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!)

will give the exact hypergeometric probability of seeing this particular arrangement of counts. a is the count in the northwest cell, b is the count in the northeast cell, c is the count in the southwest cell, d is the count in the southeast cell, and n is the grand total.
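With those cell labels, the arrangement probability can be computed with Python’s math.comb. The binomial-coefficient form below is algebraically the same as the factorial formula, and the function name is my own:

```python
from math import comb

def fisher_table_probability(a, b, c, d):
    """Exact hypergeometric probability of one particular 2x2 arrangement.
    Equivalent to (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!)."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

# Tiny sanity check: two cases, one "success", split evenly
half = fisher_table_probability(1, 0, 0, 1)   # 0.5

# The probability of the observed smiles arrangement (a very small number)
p_table = fisher_table_probability(48, 29, 20, 60)
```

The exact test’s p value then sums these probabilities over every table with the same marginals that is at least as extreme as the observed one, which is why it gets painful by hand as tables grow.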

Many statistics packages will calculate this test statistic and its p value, but most will only do it for a 2×2 table where at least one cell count is 5 or less. A correction for this situation was developed by the English statistician Frank Yates. It subtracts 0.5 from the absolute value of each cell residual before squaring, reducing the overall chi square value and increasing the p value. The problem is that it sometimes overcompensates.

What’s special about dichotomous data?

There are lots of statistics that are designed specifically for dichotomous data – alone and in combination with other kinds of data – so the question arises, “What’s so special about dichotomous data?”

A dichotomous variable has only two values. That makes them very easy to work with.

Given the total number of observations, the frequency distribution of a dichotomous variable is described by a single count: if you know one value, you also know the other. If you consider the variable SEX as having two values, male and female (for illustration purposes only), then, if you have a study group of 100 subjects and you know that 25 of them are male, you already know that 100-25, or 75, of them are female.

Binomial data are dichotomous since the two values of a trial are hit/miss.

Dichotomous variables share qualities with both discrete and continuous data. In regression analysis (which we haven’t talked much about yet, but just stick with me – we will) predictor variables (independent variables in other situations) must be either continuous measurements or dichotomous. To perform a regression analysis on a discrete predictor variable, it must be converted to a series of dichotomous “dummy variables”, one to take care of each of the variable values. Dichotomous variables are, in fact, in a data scale all their own.

Goodness of fit, as promised

I’ve mentioned goodness of fit tests several times and have promised to explain them. Well, this is a good time.

Also called “single sample” tests, a goodness of fit test compares observed frequencies to expected frequencies to see if they differ significantly. For instance, the random number function used by LibreOffice, RAND, is supposed to generate numbers between 0 and 1 from a uniform distribution, meaning that every value has the same probability of being generated. Here are frequencies of numbers generated by RAND, scaled up to the range 0 to 9, and truncated to integers.

Here, we have observed frequencies and expected frequencies. Hmmm…I wonder if we can use a chi square test to compare them.

Of course we can! Using the same (observed-expected)^2/expected formula, I obtain the following chi square statistic: 3.2

The statistic has degrees of freedom equal to the number of categories (10) minus 1.

At 9 degrees of freedom, 3.2 has a probability of 0.96. The null hypothesis is that there is no difference between the observed and expected frequencies, and the probability that by rejecting that we will make a wrong decision is 0.96. We’re quite safe in assuming that there is no significant difference and that LibreOffice’s random number generator does a very good job.
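The same experiment can be sketched end to end in Python. I don’t have the LibreOffice frequencies here, so Python’s own random generator stands in for RAND; the mechanics of the goodness of fit statistic are identical:

```python
import random

random.seed(1)   # reproducible stand-in for a spreadsheet's RAND
# 1000 random numbers scaled to the range 0-9 and truncated to integers
digits = [int(random.random() * 10) for _ in range(1000)]

observed = [digits.count(d) for d in range(10)]
expected = 1000 / 10   # a uniform generator makes every digit equally likely

chi2 = sum((o - expected) ** 2 / expected for o in observed)
df = 10 - 1            # number of categories minus 1
```

For a healthy generator, chi2 should land near its mean of 9 and nowhere near the far right tail.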

There are other goodness of fit tests. I’ve mentioned the Kolmogorov-Smirnov test. Its test statistic looks at the differences between the observed and expected cumulative frequencies and selects the largest. This value follows a predictable pattern and tables are published for the distribution, so the probability of any such value can be found. Using the DANSYS function KSGOF, the resulting probability that there might be a difference is practically 0.

There are a lot(!) of goodness of fit tests. Three of the most popular are the two mentioned above and the Shapiro-Wilk test. The square of the correlation coefficient (also called the coefficient of determination) will also tell you how close a regression equation fits observed data – but we haven’t talked about regression analysis yet, so let’s move on.

Association – what do I do with all these numbers?!

Association is a measure of the strength of relationship between two discrete variables. There are many association statistics and the worst thing is that they won’t agree. The reason is that they measure association in different ways and they mean different things. Further, some of them are appropriate for dichotomous data, some for nominal data, some for ordinal data, and some for data on different scales of measurement.

Notice that the chi square test will tell you if there is a relationship between two (or more) discrete variables, but it won’t tell you the strength of the relationship.

So let’s look at a few of these measures of association, their strengths and their weaknesses.

Of the really popular measures of association, there are four types. The simplest ones attempt to modify chi square into something that will give an idea of the strength of relationship between variables. Others provide an indication of how much you can learn about one variable by knowing the other variable. Those are called proportional reduction of error (or PRE) measures. A similar score is provided by information theory. And other measures work out the relationship directly using probabilities; they tend to function much like correlation measures.

The phi coefficient is commonly used as an association score for two dichotomous variables. It is the square root of the chi square statistic divided by the number of observations. By now, that pattern should be so familiar to you as to be boring – standardize a test statistic by some version of the number of cases or the standard error. Phi yields the same value as the Pearson correlation coefficient estimated for two dichotomous variables.

For our data, we get a phi value of 0.36 and since it varies from 0 when two variables are independent, to 1 for perfect dependence, that is not a bad score. I would call it a mild relationship.

The problem is that phi doesn’t reach a maximum of 1 when one or more of the variables have more than two categories, making it hard to interpret.

The contingency coefficient tries to moderate the problem with phi. It is the square root of the quantity chi square divided by the sum of chi square and the sample size. The maximum value will always be less than one but will only approach 1 for large sample sizes.

Our data returns a contingency coefficient of 0.34.

Cramer’s V is the most popular measure related to chi square because it will always attain a maximum of 1. In our case (and any time there are two dichotomous variables), Cramer’s V is the same as phi. It is the square root of the quantity chi square divided by the product of the sample size and whichever is smaller: one less than the number of rows, or one less than the number of columns.

If you wonder about the significance of one of these measures of association, they share the p value with the chi square statistic.
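All three chi-square-based measures are one-liners once you have chi square. A sketch, using the uncorrected chi square of 22.28 from earlier (values in the second decimal will differ a little if a continuity correction was applied to chi square first):

```python
from math import sqrt

chi2, n = 22.28, 157   # chi square and sample size for the smiles table
r = c = 2              # a 2x2 table

phi = sqrt(chi2 / n)                               # about 0.38
contingency = sqrt(chi2 / (chi2 + n))              # about 0.35
cramers_v = sqrt(chi2 / (n * min(r - 1, c - 1)))   # equals phi for a 2x2 table
```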

Lambda is a PRE measure of association. In the case of the smiles data, it is 0.28. Lambda ranges from 0 in cases of complete independence between two variables to 1 for perfect association. Where it’s not clear what Cramer’s V and the other measures of association developed from chi square actually mean, lambda has a definite interpretation. It is a percentage. It tells how much we can reduce the error in predicting the value of the dependent variable (whether someone will smile back at me) if we know the independent variable (whether I smile at them).

Lambda is (A-B)/A, where A is the number of mistakes we can make in predicting the value of the dependent variable without knowing the value of the independent variable, and B is the number of mistakes we can make if we do consider the independent variable.

In the smiles data, if we were to predict each person’s response without knowing anything else, our best guess would be the modal category – “did not smile back” – for all 157 people, but we would have made 68 mistakes (that’s the number of people who did smile back), so A is 68.

Now, knowing whether I smiled at each person, we could guess that they would make the modal choice for their group (I smile – they smile; I don’t smile – they don’t smile). In that case, we would only make 49 mistakes, so the proportional reduction in error would be (68-49)/68, or 0.28. That’s our lambda.

So, we could say that, by knowing how many times I smiled at people, we could reduce the error of predicting whether they would smile back by 28%.
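The mistake-counting recipe above translates directly into code. A sketch over the smiles table:

```python
table = [[48, 29], [20, 60]]   # rows: I smile / I don't; cols: smile back / don't

col_totals = [sum(col) for col in zip(*table)]
n = sum(col_totals)

# A: errors from guessing the overall modal response for everyone
errors_without = n - max(col_totals)                     # 157 - 89

# B: errors from guessing each row's own modal response
errors_with = sum(sum(row) - max(row) for row in table)  # 29 + 20

lam = (errors_without - errors_with) / errors_without    # lambda y|x
```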

Another characteristic of lambda is that it’s an asymmetric measure of association. DANSYS gives three different values of lambda. The one above is called “Lambda y|x” which means “lambda of y given x”. It is the reduction in error of predicting the dependent variable (y) given knowledge of the independent variable (x). The other two values are the lambda of x given y – 0.36, and a symmetric version which is a mean value between the other two – 0.32.

In addition, DANSYS gives two other numbers for each version of lambda. They are labeled ASE, which stands for asymptotic standard error. There is no exact standard error for lambda, but there is a formula for the value the standard error approaches as the sample grows large. ASE0 is the asymptotic standard error assuming that the null hypothesis is true. ASE1 is the standard error assuming that the alternate hypothesis is true.

We have good reason to believe that the alternate hypothesis is true, so we can go with 0.07 (the actual values given are ASE0=0.0759892 and ASE1=0.07223151). What can we do with the ASE? We can construct a confidence interval. Remember that, for 95% confidence, we multiply the standard error by 1.96 and add the result to, and subtract it from, the statistic value to get the confidence limits. 1.96 × 0.07223151 = 0.14, so the true value of lambda will be between 0.14 and 0.42. With 95% confidence, our lambda is not 0, so we have a significant relationship between our two variables.

The uncertainty coefficient (also called the proficiency, the entropy coefficient, or Theil’s U) is an information theoretic measure of the association between nominal variables. It can be interpreted as the proportion of the entropy (uncertainty, measured in bits) of a dependent variable that can be predicted given the values of an independent variable. The calculation is rather forbidding so I won’t go into it here, but several statistics packages (including DANSYS) automatically display the value when analyzing nominal data.

Again, this is an asymmetric measure of association. For U Y|X, DANSYS returns 0.11. U X|Y is only a tiny bit different at the third decimal place. And the symmetric value is also approximately 0.11.

I will say that it is calculated by subtracting the conditional entropy of y given x from the entropy of y and dividing the difference by the entropy of y. If that looks like the formula for lambda, it’s because they work the same way. The uncertainty coefficient, then, is a percentage that tells how much better we can predict values of the dependent variable by knowing the values of the independent variable. The reason lambda and the uncertainty coefficient differ is that they measure different things – the uncertainty coefficient counts bits of information rather than prediction errors.
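The calculation is less forbidding in code than in notation. Here is a sketch, again using cell counts assumed to be consistent with the totals quoted in the text; with them, U for Y given X comes out to about 0.11.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical counts, assumed from the totals quoted in the text.
table = [[49, 30],   # I smile:       they smile back, they don't
         [19, 59]]   # I don't smile
n = sum(map(sum, table))

# H(Y): entropy of the column (dependent) variable
col_totals = [sum(col) for col in zip(*table)]
h_y = entropy([c / n for c in col_totals])

# H(Y|X): row entropies weighted by the probability of each row
h_y_given_x = sum(
    (sum(row) / n) * entropy([c / sum(row) for c in row]) for row in table
)

u = (h_y - h_y_given_x) / h_y     # Theil's U, Y given X
print(round(u, 2))   # 0.11
```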

The uncertainty coefficient also yields asymptotic standard errors. For U Y|X, the ASE1 is 0.04 giving a 95% confidence interval of 0.02 to 0.19. That’s not huge but there’s no zero there so we can be certain at the 95% alpha level that the Uncertainty coefficient is not 0 and that there is a mild relationship between “I smile” and “They smile back”.

All these statistics measure linear relationships. In other words, they detect relationships between the independent and dependent variables that can be described by an equation like:

(number of people who smile back)=a(number of people I smile at)+b

a and b being constants. On a graph, the relationship will look like a straight line. But what if the relationship is not linear? What if it must be described by a quadratic equation, or a quartic equation? The scores we have looked at would not pick that up. Lambda could be flat zero and there could still be a strong relationship.

There is a common statistic that can tell you if there is a higher order (nonlinear) relationship. It’s eta. Normally, you get eta from an analysis of variance, but it can be calculated for data in contingency tables, and for our data, it is 0.38.

That is very close to our phi, lambda, and correlation values so we can assume that most or all the relationship we see between our variables is linear.

So, correlation….

We’ve talked about correlation (and we will discuss it in much more detail in the next section). It describes the way that two variables change together. If a correlation is negative, it means that when one variable increases, the other decreases. When a correlation is positive, it means that, when one variable increases, the other does also. If the correlation is zero, there is no linear relationship between the two variables.

Pearson’s product-moment correlation coefficient can be used for both continuous and nominal variables. It should not be used for ordinal variables for reasons that we will discuss a little later. We’ve seen that phi is the same as Pearson’s correlation for 2×2 contingency table data. Here, it is 0.38. It could be negative but, for nominal data, the sign doesn’t matter. If you switch the position of the columns in the table, the sign will flip, but it doesn’t mean anything because, for nominal data, order isn’t important (that is not the case for ordinal data).

So, correlation coefficients also measure the strength of association between nominal variables.

To apply a correlation routine to nominal data, categories labeled by text should be recoded as numbers. For example, if a variable is the state of residence and the values are the names of the states, they could be recoded as the numbers 1 to 50.
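As a sketch of that recoding idea, the hypothetical 2×2 table below (cells assumed from the totals in the text) can be expanded into paired 0/1-coded observations, and Pearson’s r computed on them reproduces the phi coefficient.

```python
import math

# Hypothetical 2x2 table; the cells are assumed, chosen to be
# consistent with the totals quoted in the text.
table = [[49, 30],   # x=1 (I smile):       y=1 (smile back), y=0
         [19, 59]]   # x=0 (I don't smile)

# Expand counts into individual 0/1-coded observations.
xs, ys = [], []
for i, row in enumerate(table):
    for j, count in enumerate(row):
        xs += [1 - i] * count    # row 0 -> x=1, row 1 -> x=0
        ys += [1 - j] * count    # col 0 -> y=1, col 1 -> y=0

# Pearson's product-moment correlation, computed directly.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)

r = cov / (sx * sy)
print(round(r, 2))   # identical to phi: 0.38
```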

Risk ratios and confidence limits

As I said above, risks are just row or column percentages. Usually row percentages are used, since the rows represent the independent variable. In medical statistics, the risk is that, having been exposed to a condition (pathogen, treatment, etc.), a subject will show a specific result (disease, cure, etc.).

Ratios are used to compare values. If the numerator and the denominator of a ratio are equal, the ratio is 1. Ratios greater than 1 indicate that the numerator is larger and ratios between 1 and 0 indicate that the numerator is smaller.

A risk ratio (or relative risk) compares two probabilities. You have to think out what the ratio means from the way the table is set up and how the ratio is calculated.

For our data, the independent variable, “I smile/I don’t smile,” is in the rows, so the risk for “I smile” is compared to the risk for “I don’t smile.” The risk is that, given what I do, they will smile back. The probabilities, again, are 62% that they will smile back if I smile and 25% that they will smile back if I don’t. The risk ratio is 0.62/0.25, or 2.48. That means that people exposed to my smile have a 2.48 times greater risk of smiling back.

Of course, you can look at either the relative risk that they smile back, or the relative risk that they don’t. For the latter case, the relative risk is 0.5.
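Here is a sketch of both directions of the calculation. The cell counts are assumed (the text only quotes the rounded percentages, so the ratio here comes out near, but not exactly at, 2.48).

```python
# Hypothetical counts, assumed to be consistent with the text's totals.
table = [[49, 30],   # I smile:       they smile back, they don't
         [19, 59]]   # I don't smile

risk_smile = table[0][0] / sum(table[0])       # P(smile back | I smile)
risk_no_smile = table[1][0] / sum(table[1])    # P(smile back | I don't)

# Relative risk of smiling back, and of NOT smiling back.
rr_smile_back = risk_smile / risk_no_smile
rr_no_smile_back = (1 - risk_smile) / (1 - risk_no_smile)

print(round(rr_smile_back, 2), round(rr_no_smile_back, 2))
```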

These are relative risks for a cohort study. In a cohort study, a group of people is followed over time to see if, given that they were exposed to a condition, they developed a particular outcome. 

There are practical problems with cohort studies. First, if outcomes are rare events, you might have to sample a large number of subjects to find just a few that show the outcome. Also, there is a waiting period before you complete the study. You have to follow a large group of people, the meter is running the whole time, and you will probably lose subjects along the way.

A less expensive alternative is the case-control study. A sample is drawn from the population and classified as to whether they were exposed or not. This kind of study compares the relative sizes of the exposed and nonexposed components. The problem is that you are no longer working with a population, but with a sample, and you don’t know the row marginals (since they would be the population sizes). The relative risk for this kind of study – better known as the odds ratio – is the ratio of the product of the diagonal cells (“I smile and they smile back” times “I don’t smile and they don’t smile back”) to the product of the anti-diagonal cells (“I smile and they don’t smile back” times “I don’t smile but they do smile back”). In this case, our ratio is 5. The alternate hypothesis is that they follow suit, so we can say that the odds that they follow suit are 5 times greater than the odds that they do not.

That’s pretty convincing, but is it significant? We can actually construct confidence limits for our risk ratios (and most stat packages do). The standard error for the case-control study data (taken on the log scale) is the square root of the sum of the inverses of the cell counts. For the cohort study, it is somewhat more complicated.

We are particularly interested in the case-control study here. The 95% confidence interval for that relative risk is 4.4 to 5.6. A ratio of 1 is nowhere near that, so we can be very confident that these two risks are not the same and, since the ratio is greater than 1, we can be pretty sure that the chances that people follow suit are greater than the chances that they do not.
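As a sketch, here is the conventional calculation, which works on the log scale (the standard error described above is the standard error of the log of the ratio). The cell counts are assumed, and because the interval is built on the log scale and then exponentiated, it comes out asymmetric and wider than the one quoted above – treat both the cells and the result as illustrative.

```python
import math

# Hypothetical cells, assumed to be consistent with the text's totals.
a, b = 49, 30    # I smile:       smile back, don't
c, d = 19, 59    # I don't smile: smile back, don't

odds_ratio = (a * d) / (b * c)

# SE of ln(odds ratio): square root of the summed inverse cell counts.
se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# 95% limits on the log scale, then back-transformed.
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log)

print(round(odds_ratio, 1), round(lo, 1), round(hi, 1))
```

Because the lower limit stays above 1, the conclusion is the same: the two groups’ odds are convincingly different.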

Odds can also be compared using odds ratios and statistics packages will give you confidence limits and p values for the comparisons. You just have to be very clear about the questions you are asking and the answers your statistics package is giving. There are four row probabilities and four column probabilities for each 2×2 contingency table. The same goes for odds. And statistics packages do not give you all the combinations. You get what you want by feeding the contingency data to the software in the order it needs to give you the correct results.

Dissecting contingency tables

Of course, contingency tables can have any number of rows and columns, and it’s even practical to form tables with rows, columns, and one more dimension – layers – to accommodate three variables. But most of these statistics will only tell you if there is a relationship and how strong that relationship is. It’s possible that, in a 4×5 table, only 4 cells are related. How do you tease out where the relationships actually are?

You can take a complex table apart and look at the different sections, but there are specific rules for doing so. Here they are.

There are two kinds of frequencies in a contingency table:

A frequencies are either cell frequencies or the grand total.

B frequencies are the marginals.

Any frequency in the original table must occur once and only once in one of the subtables.

Each frequency in a subtable that is not in the original table (for instance, two columns collapsed into one, or a new marginal) must appear as a frequency of the other type in another subtable.

In this way, you can look at different parts of a complex table, perform all the same tests of association on them, and see which parts show the strongest relationships.

A simpler way to dissect a contingency table is just to look at the individual cell chi squares. These are like the residuals in a regression analysis: they show which cells diverge most from what would be expected by pure chance.

Looking at the chi square table above for our data, the highest values are in the “They smile back” column. Whether I smile or not affects whether they smile back much more than whether they don’t smile back. In other words, “They don’t smile back” seems to be the default state. Keep in mind that, if the observed count is the same as the expected count, the cell chi square will be 0, so all the cells here diverge considerably from the expected values.
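The per-cell contributions are easy to compute directly. With cell counts assumed to be consistent with the text’s totals, the “They smile back” column does indeed carry the larger contributions.

```python
# Per-cell chi square contributions for a hypothetical table
# (cells assumed from the totals quoted in the text).
table = [[49, 30],   # I smile:       they smile back, they don't
         [19, 59]]   # I don't smile
n = sum(map(sum, table))
row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]

# (observed - expected)^2 / expected, cell by cell
cell_chi = [
    [(obs - row_totals[i] * col_totals[j] / n) ** 2
     / (row_totals[i] * col_totals[j] / n)
     for j, obs in enumerate(row)]
    for i, row in enumerate(table)]

for row in cell_chi:
    print([round(v, 2) for v in row])   # column 1 ("they smile back")
                                        # holds the largest values
```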

Ordinal numbers give you a little extra

Pearson’s product-moment correlation coefficient, technically, is designed to explore continuous data that has normally distributed residuals, and it detects linear relationships. Technically, it is not designed for use with nominal data, and association measures are more appropriate, but as you have seen, the Pearson statistic agrees well with the phi coefficient for 2×2 tables and it will agree well with Cramer’s V for larger tables. It should not be used for ordinal data.

Nominal data gives you counts. Ordinal data gives you a little more. It’s still counts of individuals that display particular attributes, but those attributes – the categories of the variable – also have order. For ordinal data, you can rank the categories. Usually, the only order you can provide for U.S. states, for example, is alphabetical order of their names, which is irrelevant for most statistical purposes. In that case, state is a nominal variable, but you might want to rank states by population size, and then it becomes an ordinal variable.

Ordinal data are usually recorded by rank number, typically making the largest, the most important, the top category number 1. The largest state by area is Alaska, so it would get a numerical value of 1, unless you are ordering states by population, in which case, for 2017, California would get number 1.

There are several statistics that can be applied to contingency table data that make use of the extra information packed into ordinal data. They tend to be asymmetric, and they may or may not pay attention to tied values. It’s a good idea to look at the table when deciding which statistic to use and see how many cells contain the same counts.

Maurice Kendall came up with the tau rank-order correlation coefficient, and it has three different forms. These statistics compare the numbers of concordant and discordant pairs of observations. For a concordant pair of observations, one observation’s ranks on both the independent and dependent variables are larger than the other’s (or both are smaller). For a discordant pair, one observation’s independent variable rank is larger than the other’s while its dependent variable rank is smaller, or vice versa. Subtract the number of discordant pairs from the number of concordant pairs and normalize the result by dividing by the total number of pairs (half the number of observations multiplied by the number of observations minus 1). You can see why the result will be negative if the number of discordant pairs is larger than the number of concordant pairs – if the ranks are inversely related. The normalization scales the difference to a range between -1 and 1.
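The pair counting can be sketched directly from a contingency table whose rows and columns are both ordered. The cells below are assumed (the text does not print them), so the resulting coefficients will not necessarily match the figures quoted in the text – the point is the mechanics, not the numbers.

```python
# Concordant and discordant pair counts from a 2x2 contingency table
# whose rows and columns are both ordered better-to-worse.
# Cells are assumed for illustration.
table = [[49, 30],
         [19, 59]]

concordant = 0
discordant = 0
rows, cols = len(table), len(table[0])
for i in range(rows):
    for j in range(cols):
        for k in range(rows):
            for m in range(cols):
                if k > i and m > j:       # ranks move the same way
                    concordant += table[i][j] * table[k][m]
                elif k > i and m < j:     # ranks move opposite ways
                    discordant += table[i][j] * table[k][m]

n = sum(map(sum, table))
tau_a = (concordant - discordant) / (n * (n - 1) / 2)   # ignores ties
gamma = (concordant - discordant) / (concordant + discordant)
print(round(tau_a, 2), round(gamma, 2))
```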

In contingency tables of nominal data, the order of the categories of the independent variable (the rows) and those of the dependent variable (the columns) does not matter. For ordinal data, the orders matter very much. You can order them any way you wish, but you have to interpret the results with the orders in mind. Just for the sake of illustration, let’s assume that smiles are better than no-smiles. Our contingency table orders the independent variable from better to worse (from top to bottom rows) and the dependent variable from better to worse (from left to right columns). By convention, this is the expected order for a spreadsheet.

The three forms are tau-a, tau-b, and tau-c. They are designed to apply to raw data, but procedures have been developed to derive them from contingency table counts. Tau-a is a measure of association of joint counts (contingency table data), but it doesn’t take ties into consideration. Ties are observation pairs where the independent and dependent variable ranks are the same. In a contingency table set up so that rows are ranked from top to bottom and columns are ranked from left to right, the diagonal cells are ties. Clearly, our contingency table has lots of ties (108 of them), so tau-a would not be the best tau coefficient for us to use for this data. Tau-a Y|X works out to be 0.14. Its ASE is 0.33.

Tau-b does not adjust for the shape of the contingency table, so it’s best used if the contingency table has the same number of rows and columns. Tau-c does adjust for that, so it’s the better statistic to use if the table is not square and has zeros on the diagonal.

Tau-b is our best bet. Its value for Y|X is 0.42, which is respectable, and the ASE1 (since we assume that our alternate hypothesis, that tau-b does not equal 0, is true) is about 0.062. We can set up a 95% confidence interval to see if the statistic value might be zero given random error, and it comes out to 0.30 to 0.55; zero is nowhere near that. The tau value is positive, indicating that these variables are positively related. If I smile, there is a strong tendency for the other person to smile back, and vice versa.

Goodman and Kruskal’s gamma coefficient ignores ties. It’s the difference between concordant and discordant pairs divided by their sum. A statistic called “Yule’s Q” is a special case of gamma for 2×2 tables. For our data, gamma is 0.82 and the ASE, assuming we reject the null hypothesis, is 0.12. Our alternate hypothesis is still safe.

Somers’ d divides the difference between the number of concordant and discordant pairs by the number of pairs with unequal independent variable values. As such, it’s a modification of tau. It is a very stable statistic and is one of the most widely used measures of ordinal association. For our data, it is 0.43 with an ASE of 0.06, yielding a 95% confidence interval of 0.30 to 0.55. Again, 0 is nowhere near our confidence interval.

Another correlation coefficient used for ordinal data is Spearman’s rank-order correlation coefficient. We will be talking much more about it in the next section, so, for now, I will just say that it gives us a correlation of −0.59 for our data, which is a moderate correlation.

Note that all these correlation coefficients – the ones that range from -1 to 1 – can be used in the same way. They can be used to calculate part, partial, and total correlation coefficients, for example.

Throwing an interval variable into the mix

For mixed level data, there are several options. If one variable is continuous and the other discrete, perhaps the easiest path is to either discretize the continuous data or treat the discrete data as a continuous variable. Both are problematic.

You discretize a continuous variable when you create a frequency table: divide the range of the data into a number of equal intervals and count the number of data values that fall into each interval. The problem is that, by discretizing a continuous variable, you lose the “real number properties” of the measured values, including any real zero that allows you to use ratios to compare values. You lose extreme values that might be important, and you lose precision.
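The binning step itself is simple; the information loss happens the moment each measurement is reduced to a bin membership. Here is a minimal sketch with made-up temperature readings.

```python
# Discretizing a continuous variable: divide its range into equal-width
# intervals and count how many values land in each one.  Once binned,
# only the counts survive - the measured values themselves are gone.
def discretize(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins    # assumes the values are not all equal
    counts = [0] * n_bins
    for v in values:
        # clamp so the maximum value falls in the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical temperature readings, for illustration only.
temperatures = [12.1, 14.7, 15.2, 18.3, 19.9, 21.4, 22.0, 25.6]
print(discretize(temperatures, 4))   # [3, 1, 3, 1]
```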

By treating ordinal or nominal data as continuous, you are including false assumptions about the nature of the data that might lead you to erroneous conclusions. If you use Pearson’s correlation to work with ordinal values, you are throwing away the information contained in the order.

These are cheap and easy solutions, but they only yield approximations – and not very good approximations at that – so it’s best to reserve them for exploratory techniques.

There are some slightly more labor intensive methods designed for mixed level data. For instance, discriminant function analysis relates continuous data to categories of a discrete variable.

The overall workhorse of statistics is regression analysis. If you know how to specify the model, it doesn’t really matter what the data is like; you can build an equation that describes the relationship between any number of variables. We will be looking at linear regression in a couple of sessions, and at other forms further along.

What if I have three discrete variables?

There’s nothing that says that you have to limit yourself to only two nominal variables. If you throw in another variable, you could put the categories of one variable into rows, those of another into columns, and the categories of the final variable into layers.

For instance, what if I had noted whether each of the people I met on Bear Creek was male or female? I would have two regular contingency tables – one for the males and one for the females – and then I would stack them to form a three dimensional contingency table.

Stub-and-banner tables

Of course a table with volume is hard to grasp visually. That’s why we would use a stub-and-banner table to display the data. Here’s a stub-and-banner table for some smiles data that I made up.

We’ve talked about stub-and-banner tables before and you already know that there are usually several ways to organize them. This one shows the female and male data side by side.

The variables seem to move in the same directions for both layers but the proportions seem to be different. For instance, about two-thirds of the females smiled back at me when I smiled at them, but only about half the males did. Is this difference in patterns between the layers large enough to be significant? Another way to ask the question is, “Does the sex of the respondent make any difference for how often a person returns my smile?”

There are three kinds of variables here. The independent variable is still “I smile/I don’t smile,” and the dependent variable is still “They smile back/they don’t smile back.” But the third variable, sex, is a conditional or intervening variable. It comes between a stimulus (“I smile”) and a response (“They smile back”) to modify the response. (Or, maybe it doesn’t.)

There are a lot of ways to test this. First, we can do a chi square test including all three variables to see if there is any relationship among them.

The expected and chi square cell values are calculated for each layer separately, and then all the chi square cell values are summed. The degrees of freedom is the product of one less than the numbers of rows, columns, and layers – in the case of this 2×2×2 table, 1.

This table yields a chi square of 27.46 and a very tiny p value of 0.00000016, indicating that, yes, there is a relationship somewhere.

It is not unreasonable at this point to run a chi square test on each layer of the contingency table. The female layer yields a Fisher’s Exact Test probability of close to 1, indicating no relationship. The male layer also shows a tiny chi square (0.0002) and phi (0.002), indicating no relationship – so where is the relationship shown in the overall chi square?

Is it a strong relationship? A contingency coefficient of 0.39 indicates a moderately strong relationship. But there are some special statistics for contingency tables with conditional variables that can give us more details.

Common odds ratio

We can look at the ratio of the odds ratios. The odds ratio of the female layer is 3.78 and that for the male layer is 15.58. They certainly look different. Comparing them yields an estimated common odds ratio of 5.46. What is the 95% confidence interval? 5.10 to 5.82. A ratio of 1 does not occur there. There seems, by all reports, to be a real relationship somewhere.
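One standard way to pool layer odds ratios is the Mantel-Haenszel common odds ratio, which weights each layer’s diagonal products by the layer size. The layer tables below are made up for illustration (the stub-and-banner data isn’t reproduced in this excerpt), so the numbers won’t match those above.

```python
# Mantel-Haenszel common odds ratio across the layers of a 2x2xK table.
# Layer counts are hypothetical, for illustration only.
layers = {
    "female": [[20, 10], [5, 10]],   # layer odds ratio = 4.0
    "male":   [[25, 5], [5, 15]],    # layer odds ratio = 15.0
}

num = 0.0
den = 0.0
for (a, b), (c, d) in layers.values():
    n = a + b + c + d
    num += a * d / n     # concordant diagonal, weighted by layer size
    den += b * c / n     # discordant diagonal

common_or = num / den
print(round(common_or, 2))   # 7.41, between the two layer values
```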

Conditional independence

If, once you know the value of one variable, knowing something about a second variable doesn’t give you any extra information about a third, the second and third variables are conditionally independent. Otherwise, the two variables are dependent under the conditions of the first variable.

Whoa! That even confuses me. Let’s look at the data at hand.

Okay, you know whether a subject is male or female. If, knowing that, “I smile” and “They smile back” still look unrelated, those two variables are conditionally independent. On the other hand, we have seen that “I smile” and “They smile back” yield strong measures of association. They seem to be related. But what if they are only related through some other variable? If, when that other variable is controlled for – when only one sex is considered – the relationship goes away, then “I smile” and “They smile back” are said to be conditionally dependent, or conditionally associated.

The three dimensional smile data is the same as the two dimensional data that I actually collected on Bear Creek Trail, except that I have split the groups into two further groups by sex – and I have done that fictitiously.

Can we tell if smiles are really contagious, or if they only seem that way because males and females respond differently to smiles?

Of course we can. There are statistics that do that.

The common odds ratio indicates that there is a discrepancy between the way males and females react to smiles, and since neither layer on its own shows a strong relationship, we can assume that the relationship between the two variables runs through the third variable, sex.

Cochran’s statistic tests for conditional independence. The null hypothesis tested is that the odds ratios for each of the layers are equal to 1. You only have to look at one cell to test that – Cochran looks at the northwest cell (1,1). If the null hypothesis is false, we would expect that, for layers with an odds ratio greater than one, the difference between the observed and expected frequency in that cell would be greater than zero – that the odds ratio and the residual would be in agreement. Likewise, if the odds ratio is less than one, the residual will be less than zero. And if the odds ratio is equal to one, the residual should be close to zero.

The Cochran statistic simply compares the sum of the residuals for the 1,1 cell with the standard error (look familiar?).

There is also a Mantel-Haenszel statistic that includes a correction for continuity. Both of these statistics follow a chi square distribution, so we can determine p values. For our data, the statistics are both close to 25, which is quite large for a chi square statistic with one degree of freedom. The p value for Cochran is 0.0000006 and for Mantel-Haenszel 0.0000016, so we can safely reject the null hypothesis of conditional independence.
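The Mantel-Haenszel version can be sketched as follows: accumulate, over the layers, the observed (1,1) counts, their expected values, and their variances, then form a chi square with a continuity correction. The layer counts here are made up for illustration; they are not the text’s data.

```python
# Mantel-Haenszel test of conditional independence across strata,
# with continuity correction.  Layer counts are hypothetical.
layers = [
    [[20, 10], [5, 10]],
    [[25, 5], [5, 15]],
]

sum_a = sum_e = sum_var = 0.0
for (a, b), (c, d) in layers:
    n = a + b + c + d
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    sum_a += a                     # observed count in the (1,1) cell
    sum_e += r1 * c1 / n           # its expected count under independence
    sum_var += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))

# Chi square with the 0.5 continuity correction (df = 1)
chi_sq_mh = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var
print(round(chi_sq_mh, 2))
```

With these made-up layers, the statistic is well past the df = 1 critical value of 3.84, so conditional independence would be rejected.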

Homogeneity

Where Cochran and Mantel-Haenszel test whether the odds ratios are all equal to 1, homogeneity of odds ratios indicates a weaker version of conditional independence – whether the odds ratios are merely the same.

Breslow-Day’s statistic tests for this state of affairs, and Tarone’s statistic adds a correction to Breslow-Day, but it doesn’t yield enough of a difference to make a difference. Here, both p values are well above 0.05, so we cannot reject the null hypothesis that the odds ratios are homogeneous.

Breslow-Day: p = 0.420
Tarone: p = 0.422

Nominal regression – a preview

I’ve said it before – regression analysis is the workhorse of statistics. You can determine the relationship between any kind and any number of variables using it; you just have to know how to set up the model.

Linear regression is for continuous, normally distributed variables that you expect to have a linear relationship (one that can be described graphically as a straight line, or algebraically by a formula like y=ax+b), but there are also nonlinear regression procedures, nonparametric regression, and robust regression (ordinary linear regression is sensitive to outliers).

I mentioned above that mixed level variables require some thought to analyze. If the dependent variable is a dichotomous nominal variable and the independent variables are continuous, you can use logistic regression. If the dependent variable is nominal with several categories and there are several continuous independent variables, there is multinomial logistic regression. You can modify logistic regression a little for ordinal dependent variables. Log-linear regression can be used to analyze nominal predictors, and log-log regression can be used when both independent and dependent variables are nominal. Ordinal regression, Poisson regression….

I hope you get the idea – there’s a special regression procedure for just about any statistical model. We’ll be looking at many of these regression models, starting two pages down the line with ordinary linear regression.

First we need to look at a simpler cousin in more detail (we’ve talked a little about correlation as an exploratory strategy), and we need to see how to quantify similarity between variables.

Causal analysis – another preview

We’ve talked about correlation and we will most certainly talk about it more in the future. It explains how variables vary with each other. But the most famous phrase about correlation isn’t about what it is or what it can do, but about what it is not: “Correlation is not causation.” That is because it is so easy for researchers to interpret correlation as causation.

It’s very tempting, when you see a strong pattern of change between two variables, to say that change in one causes a change in the other, but that’s dangerous. For instance, both variables may be affected by some other variable in such a way that it just looks like they’re directly connected.

Take the story of how Pontiac discovered vapor lock. A family would go to the local grocery store to buy ice cream. If they bought vanilla ice cream, their car would not start when they left the grocery store. No other flavor would do it. So vanilla ice cream caused the car not to start, right?

The family sent a complaint to the Pontiac corporation and, to their credit, they took it seriously and sent people to investigate. It turned out that the grocery store had placed the vanilla ice cream in the freezers at the back of the store to encourage people to buy other flavors, which they placed near the front of the store. Consequently, it took longer to buy vanilla ice cream – just long enough for the car to develop a vapor lock.

A chi square analysis would have shown a strong association between purchases of vanilla ice cream and cars that wouldn’t start.

But although correlation coefficients and regression analysis do not, by themselves, indicate causality, they provide vital hints that can be put together like a puzzle to work out causal connections. They form the core of methods called causal analysis and we will be talking about that in future sections of the StatFiles.

