Descriptive and exploratory statistics


Where do you fit in? Where do you stand? What I mean is – how tall are you? Not just the distance from your feet to your head, but how you compare in size to everyone else in the world (let’s restrict this to hominids). How do you go about getting a handle on this?

What a statistician would do to start is simply look at the data. In most studies, statisticians and other scientists begin with very little to go on. If they’ve already collected the data, it tends to be a messy, incomprehensible collection of numbers.

Here are some numbers.

The tallest person in modern history was Robert Pershing Wadlow of Illinois at 272 cm. The shortest on record was Chandra Bahadur Dangi of Nepal at 54.6 cm. Ignoring the vast variation between geographic locations and even in the same person from year to year and over 24 hours, the male average height is around 177.8 cm with a standard deviation of about 10.2 cm. The average female height is about 165.1 cm with a standard deviation of about 8.9 cm. And here is a sample of human heights.

Heights of 33 humans in centimeters 
   146.88   
   153.31
   156.42 
   157.29
   158.32
   160
   161.88
   162.86
   163.44
   163.65
   163.8
   165.48
   166.28
   166.43
   168.09
   168.41
   169.46
   169.91
   171.28
   173.36
   173.4
   174.11
   174.23
   175.84
   176.33
   176.93
   177.86
   183.1
   183.8
   185.74
   186.83
   193.41
   196.41 

So, now, you can throw your height into the mix. Put your back against the wall and, using a straight edge across the top of your head, make a pencil mark on the wall even with the top of your head. Then, using a long ruler or tape measure, measure the distance from the floor to the mark. If your measurement is in some units other than centimeters, convert the distance to centimeters.

There are two purposes for these preliminary statistics. Descriptive statistics just give you an idea, off the cuff, of what the data are doing. Exploratory statistics, on the other hand, answer specific questions about how you should proceed with your analysis of the data.

What does data look like?

It looks like that stuff above on this page – it’s numbers, sometimes words (as in content analysis), or it might be photographs or paintings if you’re in the United States Geological Survey or are a museum curator. It could be anything.

Order is established first by tabling the data. There is a standard way of organizing data in a table. You might want to download DANSYS or DANSYSX and the DANSYS User Guide and look at the section about entering data into a spreadsheet. Here’s the drill.

Each case has its own row. Each variable has its own column. There might be row labels to specify which case goes in each row, but there should be column labels to tell what the numbers in a column mean. There is only one column in the table above because only one thing has been measured, and the column label tells it all – Heights of 33 humans in centimeters. In the paragraph above the table, I included some more data. Those are worldwide statistics for human height.

You could add more order in the table by sorting the values, and I have done that. Can you make any sense out of the numbers? Well, one thing you can see right off is that we’re dealing with 33 subjects. That’s not a huge sample considering the 7.6 billion people in the world (at the time of this writing, give or take a few thousand or million), but it’s good enough for us to play around with so I can illustrate what I’m talking about.

You can also see right off that our subjects’ heights range from 146.88 cm to 196.41 cm (Woof! Tall guy!). You might also be able to see that heights tend to be closer to the middle values in the table (168.41 cm-169.46 cm) than farther away. There’s a jump of 6.43 cm from the shortest person to the next shortest, and a jump of 3 cm from the tallest to the next tallest. 3 cm differences are not uncommon, but that first entry is surprisingly short and may be what we refer to as an “outlier”. Any time a statistician sees an outlier, they have to decide whether it was a mistake in measurement or just a surprising value, and then they have to figure out what to do with it (take it out, leave it in, transform all the data, what?) You might be able to see other patterns, but it’s not easy to make a lot of sense out of raw data. That’s where exploratory and descriptive statistics come in.

There are other standard table formats that might be used on a spreadsheet like contingency tables and stub-and-banner tables, but we will look at those later. They’re normally used for nominal and ordinal data (and I’ll explain that later, too.)

When working with a spreadsheet program like Excel, or LibreOffice Calc, or DANSYS, you should make it a habit to enter data into tables in the form that the spreadsheet expects, and all the spreadsheets I can think of use the standard order I described above. Computers are easily confused.

Summaries

Most statistics applications will automate data profiles. A data profile displays several common statistics and all you have to do is select your data and feed it to the procedure. For instance, LibreOffice Calc has a Descriptive Statistics command in the Data>Statistics menu. All you have to do is tell it where your data is (you can even select your data before opening the command dialog and the procedure will automatically go to your data) and where to put the results. Here is the display for our height data.

I’ll tell you what all this means below. DANSYS makes several profiles available and you can choose the one you want according to how much information you need. The simplest function, Profile8, gives you the number of data values in the table, the mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum. The big boy, Profile13, gives you the count, mean, standard error of the mean, variance, standard deviation, minimum, maximum, sum of values, skewness and standard error of the skewness, kurtosis and standard error of the kurtosis, median, mode, interquartile range, and entropy.
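If you want to see what goes into such a profile, here is a rough sketch in Python. The function name mirrors DANSYS’s Profile8, but the implementation is my own approximation using the standard library; note that quartiles are computed by several different conventions, so the numbers may differ slightly from DANSYS or Calc.

```python
import statistics

def profile8(data):
    """A Profile8-style summary: count, mean, standard deviation,
    minimum, 1st quartile, median, 3rd quartile, and maximum."""
    xs = sorted(data)
    q1, q2, q3 = statistics.quantiles(xs, n=4)  # default "exclusive" method
    return {
        "count": len(xs),
        "mean": statistics.mean(xs),
        "stdev": statistics.stdev(xs),  # sample standard deviation
        "min": xs[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": xs[-1],
    }

heights = [146.88, 153.31, 156.42, 157.29, 158.32, 160, 161.88, 162.86,
           163.44, 163.65, 163.8, 165.48, 166.28, 166.43, 168.09, 168.41,
           169.46, 169.91, 171.28, 173.36, 173.4, 174.11, 174.23, 175.84,
           176.33, 176.93, 177.86, 183.1, 183.8, 185.74, 186.83, 193.41, 196.41]
print(profile8(heights))
```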

Pretty pictures

Numbers are nice, but pictures are better. Our supreme instruments for picking up patterns in data are our eyes. Since our data are already sorted, let’s just graph them and see where they stand in relation to each other.

This graph is a straight line and doesn’t give much of an idea of the variability of the data. Let’s remove the space above and below the curve to emphasize the vertical dimension.

That’s better. You can see that both ends of the data are rather extreme.

We’ve looked at histograms in other StatFiles, so you know what a histogram is. Let’s look at one for this data.

This graph is a little too lopsided to be a normal distribution, but that is probably only due to the small number of subjects. Skewness (in the table above) is what we look at to see if a distribution is too lopsided to be considered normal. A skewness of 0 implies a symmetrical distribution. A skewness of 0.3 is a pretty serious departure from symmetry and we would do well to keep that in mind (you’ll see why below.)

There are some things to think about when creating statistical graphs because pictures can lie.
Proportion is very important. Here are a couple of graphs.

These are scatter charts. Values of one variable are charted on the horizontal (x) axis and values of another are charted on the vertical (y) axis. If the two variables are related, the resulting dots should show some kind of pattern. For instance, in the right scatter chart, as the variable charted on the y axis increases, the one charted on the x axis increases hardly at all or may decrease slowly. The values of the two variables in the left chart don’t seem to have anything at all to do with each other…yet, they’re the same data.

If you notice the x axes of the two charts, you’ll see that the x axis of the right chart is about three times as long as that of the left chart. That means that the data points in the right chart are artificially crushed into a narrow band.
Now, sometimes you want to do things like that but, to be honest, you certainly need to explain how you modified the graph and why. Scatter charts should have similar axes, if possible.

You want some charts to be elongated. If you are dealing with a time series, a graph showing a sequence of values over an interval of time, stretching the horizontal axis can emphasize the sequential nature of time data.

Anything you can do to help get the accurate points across (pun intended) should be done. The second consideration is aesthetics. If you can, you should use color coding to clarify important parts of the graph, and you should try to make it pleasing to view. Most graphing utilities, such as the one in LibreOffice Calc, try to do just that. If you graph several data series on the same graph, Calc will place them in different colors with different shaped data points, unless you tell it to do otherwise. Often, graphing utilities will overdo it and make things look gaudy and confused.

There are a lot of different kinds of statistical graphs that can give you a jump on figuring out what data is doing before your serious analysis even starts. We’ll be looking at many of them as we go along. For instance, we’ll look much more closely at scatter charts when we talk about correlation. Different kinds of charts emphasize different aspects of data and some charts have special uses.

Levels of measurement, again

We talked about nominal, ordinal, interval and ratio measurements and data way back on the first page of this series. Here they are again.

Before you even start working with the data that you’ve collected, you need to make sure you understand the level of the data because different levels of data are handled differently.

Remember, when you’re measuring things, measure them at the highest level possible and practical. The higher the level of your measurement procedures, the more information you can pack into your measurements, but the more it costs, so there is a trade off to consider.

There are also good reasons to handle data as though it were at a different level. You can discretize continuous data to form ordinal or nominal data, for instance, by categorizing each value as to the interval it falls within, like you do when you create a frequency table, but never, ever let go of the higher level data. That is your best data. In essence, you discretize (turn continuous data into discrete data) when you create a histogram, and you do that to see what the distribution the data came from looks like.
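Discretizing is easy to do by hand or in code. Here is a minimal sketch (the function name and the 10 cm bin width are just illustrative choices):

```python
def discretize(data, width):
    """Bin continuous values into intervals of the given width; keys are
    the lower bounds of the intervals, values are frequencies."""
    freq = {}
    for x in data:
        lower = (x // width) * width  # lower bound of the interval x falls in
        freq[lower] = freq.get(lower, 0) + 1
    return dict(sorted(freq.items()))

# The first five heights from the table above, in 10 cm classes.
print(discretize([146.88, 153.31, 156.42, 157.29, 158.32], 10))
```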

You can also treat ordinal or nominal data like continuous data just to see what happens, because there are many more procedures out there for continuous data than there are for discrete data, but keep in mind that you’re actually cheating, and say what you did if you report any of those results. Such results imply that you have more information content in your data than you actually do have. It’s inappropriate to use them to infer conclusions, although they may be used to get ideas for further studies.

The center always holds

If you had a data set and you chopped it up so that each value was on its own slip of paper, and you put all the slips in a bag and shook it up really well, and drew one slip out without looking at it; what would be your best guess as to what value was on the slip?

The answer is, “the average.” That’s why the average is often called the “expected value.”

It’s not really that simple because there are actually many kinds of averages. “Average” and “mean” are the same thing. If someone says, “the average” or “the mean” without any qualification, they are usually talking about the arithmetic average. And different kinds of means do measure “the middle” in different ways.

The arithmetic average is calculated by adding all the data values in a set and dividing the sum by the number of data values. For a symmetric distribution, like the normal distribution (and that’s why the arithmetic mean is the most common mean – because it is appropriate for use with the most common distribution), the arithmetic mean splits the distribution into mirror image halves.

Consider summing a bunch of data values as lining up line segments of the various lengths of the data values. You end up with one long line that has a length that is the sum of all the data values. Then, when you divide by the number of data values, you are chopping the line up into equal lengths but the number is still the same as the original number of data values. 

It’s like Procrustes in Greek mythology, who would invite travelers to stay overnight and, if they were too tall for the bed, he would chop off extremities until they were the right size, and if they were too short, he would put them on the rack and stretch them out.

It’s like rescaling all the data values until they are all the same length without making the sum line longer or shorter. The arithmetic average is the length of each of those equal segments.

A mean is also said to be a “measure of central tendency” because it is usually a good measure of the center of a distribution – the value that all the other values center around. Think of a party, and Bill Murray shows up. Suddenly the most crowded part of the room is where Bill Murray is. That’s the mean.

The arithmetic mean is the dead center of a symmetric distribution. It’s the best guess for what a data value is going to be if you know nothing else about the data. But remember the graph of inter-eruption times for Old Faithful. Let’s look at that again.

The average is 71.01. Now, would that be a good guess for a randomly drawn data point from this distribution? There are, maybe, 19 data values in the same frequency class out of 222.

The problem here is that this distribution is far from symmetric. Is there a better choice?

Special means

Here’s another shot of the same histogram but a light blue line has been included for a value called the median. That’s another kind of mean.

If you sort all the data values in a data set and select the one in the exact middle – the one that divides the physical data set exactly in half with 50% of the values on one side and 50% on the other side – that value is the median. It is an ordinal mean since it has nothing to do with the size of the data values and everything to do with their order.
The median, 75, is right in the middle of the largest frequency class of the histogram – about 48 observations are there, 22% of all the observations. That looks like a much better choice for our mysterious data point.

But what if there isn’t a middle point in a data set? That happens when there is an even number of data points. In that case, you take the middle two points and average them.

The big problem here with the arithmetic mean is that it is very sensitive to outliers (extremely large or small data values). What is the average of the series: 1,1,1,1,1,1,1,1,1,1? It’s, of course, 1. If you had to guess what a randomly selected value from this series was, what would you pick? Well, 1, of course. Now let’s throw in a really big number: 91,1,1,1,1,1,1,1,1,1. Now what’s the average? 100/10=10. Is 10 your best choice? No, your best guess is still 1.

But the median is still 1, and is the best guess for this data set.
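The 1s-and-91 example above is easy to check with Python’s standard library:

```python
import statistics

plain = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
spiked = [91, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # one outlier swapped in

print(statistics.mean(plain))     # the mean is 1
print(statistics.mean(spiked))    # (91 + 9) / 10 = 10, dragged by the outlier
print(statistics.median(spiked))  # still 1, unmoved by the outlier
```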

Another way to deal with outliers is to just chop off a certain number or proportion of values at the top and bottom of the ordered list of data values. The resulting arithmetic mean is called a “trimmed mean”. This is a “robust statistic” because it takes a lot to faze it. There are a lot of robust statistics available, but they tend to be somewhat involved, so I will wait until a later discussion so that I can give them due consideration.
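A minimal sketch of a trimmed mean (the function name and the 10% trimming proportion are just illustrative):

```python
def trimmed_mean(data, proportion=0.1):
    """Arithmetic mean after chopping `proportion` of the sorted values
    off each end of the data set."""
    xs = sorted(data)
    k = int(len(xs) * proportion)            # how many to drop from each tail
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)

spiked = [91, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(trimmed_mean(spiked, 0.1))   # the 91 is trimmed away, leaving a mean of 1
```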

The modal class for the Old Faithful data is centered on 75.13, very close to the median.

I said “modal class” and not “mode” because you don’t usually talk about the mode of continuous data. The mode is the most frequent value. If you have continuous data, all the data points are assumed to be distinct. You might have two people that are 165 centimeters tall in a group, but that’s because you didn’t measure them to the millimeter, or to the nanometer. At some level, they will have a different height. No two people are exactly the same height (or, to put it a different way, the probability of two values from a  continuous distribution having the exact same value is very, very close to 0.00).

But in nominal data, made up of counts, the values are exact and you can have a “most frequent data value”. That is why the mode is the most appropriate measure of central tendency for nominal data. Like the median, it tends to be pretty insensitive to extreme values.

There are other ways to find a middle value. Usually it involves combining data values with some mathematical operation other than addition, then undoing the operation to get back to some kind of equal interval scaling. If you use multiplication instead of addition, you find the geometric mean.

To calculate a geometric mean, multiply all the data values together and then take the nth root of the result (where n is the number of data values.) 

This mean is a convenient way to get a handle on where the middle is in a collection of extreme values expressed in scientific notation, because multiplication and exponentiation are easier when dealing with exponents. An electron is around 10⁻¹⁸ meters across (sorta – ignoring all the quantum weirdness). The known universe is about 8.8×10²⁶ meters across. What’s in between? The geometric mean of the two – the square root of their product, √(8.8×10⁸) – which is about 3×10⁴ meters, or 30 kilometers.

The geometric mean is more appropriate for averaging growth rates (which are usually exponential or proportional). Let’s say that a city, starting at 35 square kilometers, grows to 42 square kilometers in a year, 50 square kilometers the next year, and 77 square kilometers the next year. What is a good average for the growth rate? The first year’s growth rate was 20%. The next year it was 19%, and the third year it was 54%. If we use the arithmetic mean to find a mean rate, we get 31%. At 31% a year, we would expect the city to be 45.8 square kilometers the first year, 60.1 square kilometers the second year, and 78.7 square kilometers the third year. Since the city was actually 77 square kilometers in the third year, that overstates the size by 1.7 square kilometers.

The geometric mean – the cube root of 1.20 × 1.19 × 1.54 – is a more modest 1.30, a 30% rate. That would predict growth to 45.5 square kilometers the first year, 59.2 square kilometers the second year, and 77 square kilometers the third year. The geometric mean is clearly the better choice here.
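You can verify the city example with a few lines of Python (the numbers are the ones from the paragraphs above):

```python
import math

sizes = [35, 42, 50, 77]                             # km^2, year by year
factors = [b / a for a, b in zip(sizes, sizes[1:])]  # ~1.20, ~1.19, ~1.54

arith = sum(factors) / len(factors)                  # arithmetic mean of rates
geo = math.prod(factors) ** (1 / len(factors))       # geometric mean of rates

# Compounding the geometric mean reproduces the actual final size;
# compounding the arithmetic mean overshoots it.
print(35 * geo ** 3)     # ~77
print(35 * arith ** 3)   # ~78.7
```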

Ratios such as average aspect ratios in multimedia are also better represented by geometric means.

Harmonic means are formed by adding the reciprocals of data values and dividing the number of data points by the result. They are appropriate for averaging rates. For instance, if a car travels at 50 kilometers per hour for 10 kilometers, 75 kilometers per hour for the next 10 kilometers, and 60 kilometers per hour for the last 10 kilometers, it will have been traveling for 0.5 hours. The arithmetic average of the speeds is 61.7 kilometers per hour. At a half hour, that works out to 30.8 kilometers, but the car only traveled 30 kilometers. The harmonic mean, on the other hand, is exactly 60 kilometers per hour, which correctly indicates that the car traveled 30 kilometers in a half hour.
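The standard library has a harmonic mean built in, so the car example checks out directly:

```python
import statistics

speeds = [50, 75, 60]  # km/h over three equal 10 km legs

arith = statistics.mean(speeds)          # ~61.7 km/h, overstates the distance
harm = statistics.harmonic_mean(speeds)  # 60 km/h

print(30 / harm)  # hours for the 30 km trip -- the half hour actually observed
```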

Another kind of mean is the root mean square (RMS). It is commonly used in electronics to analyze alternating currents and other cycling waveforms. For alternating current, the RMS is the same as direct current that would produce the same average power dissipation in a resistance. It is the square root of the sum of the squares of the data values divided by the number of data values. For continuous functions like electric waveforms, you have to get into integral calculus, but it’s not too difficult. Electrical engineers usually have formulas they can use that bypass the calculus, or they just let their RMS averaging meters do the calculations for them.

The RMS of the differences between paired data points is often used in statistics as a measure of error, and you will often see terms in statistical formulas that look like √(Σ(xᵢ − yᵢ)²/n).
This statistic measures the variation of the error, not around the mean, but around 0. If you look at the standard deviation formula below, you will see a big resemblance to a root mean square average.
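Both uses of the RMS are easy to sketch in Python. The sine-wave check below illustrates the AC case: the RMS of a unit-peak sine wave comes out near 1/√2 ≈ 0.707, the familiar peak-to-RMS conversion (the paired predicted/observed values are made-up numbers for illustration):

```python
import math

def rms(values):
    """Root mean square: the square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# One full cycle of a sine wave with peak 1, sampled 1000 times.
wave = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]
print(rms(wave))  # close to 1 / sqrt(2)

# RMS of the differences between paired values -- a common error measure.
predicted = [71, 75, 80]
observed = [70, 76, 78]
print(rms([p - o for p, o in zip(predicted, observed)]))
```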

There are many other kinds of means, but the ones discussed above are the most commonly used.

And, since I mentioned standard deviations……

How does it spread?

Once you know where the center of the data is, you will want to know how data is distributed around the mean. This is called “spread” or “variability”. It’s best not to call it “variance” because that’s the name of a specific measure of spread.

When Abraham Lincoln was asked how long a man’s legs should be, he replied, “Long enough to touch the ground.” So, how far does data spread? From the smallest value to the largest! The range is the simplest measure of spread. It is found by subtracting the smallest data value from the largest. It is the distance between the two extremes of a data set but, as you will see, much more information can be packed into a single number.

Still, range is about all you can expect for a measure of spread in nominal data.

There are actually several kinds of range that can be used with ordered data, and they include a little more information – information about order.

The most popular measurement of spread for ordinal data is the interquartile range. If you sort a set of ordinal data and divide the resulting table into four equal parts, the smallest values will be divided from the next group by the first quartile. The second group will be separated from the third by the second quartile, also called the median (the same measurement of central tendency discussed above.) And the third group will be separated from the group of largest values by the third quartile. The interquartile range is the distance (the absolute difference) between the first and third quartiles.

The interquartile range includes the middle 50% of all the data points in a data set. It is a trimmed measure, like the trimmed mean, and it removes the extreme values from both ends of a data set, so it is also a robust measure.

The interquartile range is often used to find outliers. If a value in a data set is less than the first quartile minus 1.5 × the interquartile range, or more than the third quartile plus 1.5 × the interquartile range, it is usually identified as an outlier.
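Here is that outlier rule applied to the height data from the top of this page. Keep in mind that quartiles are computed by several different conventions, so other software may place the fences slightly differently; with this sample and Python’s default method, neither the short nor the tall extreme actually falls outside the fences.

```python
import statistics

heights = [146.88, 153.31, 156.42, 157.29, 158.32, 160, 161.88, 162.86,
           163.44, 163.65, 163.8, 165.48, 166.28, 166.43, 168.09, 168.41,
           169.46, 169.91, 171.28, 173.36, 173.4, 174.11, 174.23, 175.84,
           176.33, 176.93, 177.86, 183.1, 183.8, 185.74, 186.83, 193.41, 196.41]

q1, _, q3 = statistics.quantiles(heights, n=4)  # first and third quartiles
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [h for h in heights if h < low_fence or h > high_fence]
print(low_fence, high_fence, outliers)
```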

The semi-interquartile range is half the interquartile range. The mid-quartile range is the value midway between the first and third quartiles – the arithmetic average of the first and third quartiles, (Q3+Q1)/2.

Where quartiles divide a data set into fourths, percentiles divide it into 100ths. Ranges analogous to the quartile ranges described above can be defined for various percentiles. For instance, the range between the 10th and 90th percentiles is sometimes used. It can be divided by two to find a semi-percentile range (10-90%). And the mid-percentile range is actually a trimmed mean. This kind of setup is useful if the statistician wants to remove a specified amount of the data off the top and bottom to “interactively” deal with extreme values.

These range values capture the distance between two distinct values in a set of data – usually a large and a small value – to get an idea of how “big” the set is, but what if you could have one number that will give you an idea of how all the data points behave around the central value? Wouldn’t that be cool?

Well, of course, there are several statistics that do just that.

You could add up the distances of all the data values from their mean but, in a symmetric distribution like a normal distribution, all the positive differences (called “deviations” from the mean, or “residuals”) would cancel out the negative deviations. In a normal distribution, you would always get 0, so that isn’t very useful (except, maybe, for a test of the symmetry of distributions.) There are two ways to eliminate that problem – use the absolute deviations (the distances from the mean, ignoring the signs) or the squares of the distances from the mean. And keep in mind that you do not always have to use the arithmetic mean.
The sum of absolute deviations (a SAD statistic) is sometimes used as a measure of spread and is more often used as a measure of similarity between multiple data sets by calculating differences between paired values instead of the distance between values and their mean. More often, the mean absolute deviation (or MAD) is used, which is the SAD divided by the number of data points. It tells you, on the average, how far data points are from the mean value.
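A minimal sketch of the MAD (the function name and data values are just illustrative):

```python
import statistics

def mean_abs_deviation(data):
    """SAD divided by n: the average distance of the values from their mean."""
    m = statistics.mean(data)
    return sum(abs(x - m) for x in data) / len(data)

print(mean_abs_deviation([2, 2, 3, 4, 14]))  # mean is 5; average distance is 3.6
```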

There are two big problems with MAD. First, it’s not limited. It just tells you the average deviation – it doesn’t tell you how the deviation behaves across the entire data set. Second, it is a biased estimate of the variability. If you averaged all the sample MADs from a population, the average wouldn’t approach the population MAD.

What about the mean of the squared deviations from the mean (MSDM)? This statistic is often used in other statistical procedures, like ANOVA and linear regression. For a normal distribution, it is the basis of the best unbiased estimator of the population variance (with n − 1 rather than n in the divisor when working from a sample), but that would not be true for many non-normal distributions.
The mean of the squared deviations from the arithmetic mean is also called the variance and it has one last problem that needs to be fixed. It has units that are squares of the things being measured. Of course the fix is to take the square root and that gives us the very popular standard deviation.

Standard deviation

Why is the standard deviation popular?

Well, if you’re working with a normal distribution, then there is a good chance that any particular value in that distribution will be within one standard deviation of the mean (on one side or the other). There’s a very good chance that it will be within two standard deviations. There’s a very, very good chance that it will be within three standard deviations.

Actually, there’s a 68.27% chance that the value will be within one standard deviation of the mean, 95.45% chance that it will be within two standard deviations, and 99.73% chance that it will be within three standard deviations.
Neat, huh?
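Those three percentages fall straight out of the normal distribution’s cumulative distribution function, which you can evaluate with the error function in Python’s math module:

```python
import math

def within_k_sd(k):
    """Probability that a normal value falls within k standard deviations
    of its mean: Phi(k) - Phi(-k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * within_k_sd(k), 2))  # 68.27, 95.45, 99.73
```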

Actually, the shape of a normal distribution is completely defined by the mean and standard deviation. Those are the two parameters that define the shape of the distribution.

Of course, if the distribution isn’t normal – all bets are off. Well, at least you need to have another plan. That’s why, along with the measure of central location (mean) and spread, you need to know the shape of the distribution you’re working with.

But, before that, there are a few other measures of spread that you will want to know about.

The coefficient of variation compares the standard deviation to the mean. A coefficient of variation of 20% tells you that the standard deviation is 20% of the mean. It is a common statistic in quality assurance where someone is tracking the number of defects in a production process. It is, in fact, only relevant for data at a ratio level.

A more robust version based on ordinal measures is the quartile coefficient of dispersion, which is the semi-interquartile range divided by the mid-quartile range. There is also a version that uses the median as the measure of location – the median-centered coefficient of variation.
The coefficient of variation has the advantage of being dimensionless (since it is a ratio, the units cancel out), so it is independent of scale. It can be used (where the standard deviation cannot) when you are comparing data with different units.

Unfortunately, think of what happens when the mean of the data is zero or close to zero – the ratio approaches infinity and becomes more and more sensitive to tiny changes.

While the standard deviation is the best measure of spread for a normally distributed variable, in an exponential distribution the standard deviation is equal to the mean, so the coefficient of variation is 1. That makes the coefficient of variation extremely attractive for situations in which the exponential distribution appears often, such as reliability theory, simulation, and queueing theory.

Most of the above deals with the traditional system of statistics. There are others. Traditional statistics look at the place of the individual data value in a distribution. A more holistic approach looks at the information contained in a data set. The information-equivalent measure of spread is entropy.

Entropy is a measure of order in a set of data or, seen another way, it is the amount  of uncertainty in a set of data. How good would your best guess be about a random value drawn from a collection of data?

Take a fair coin. Before tossing the coin, your best guess might be that it will come up heads. You would do just as well to say “tails”. Both have an equal chance of happening and your best guess is just as good as no guess at all. The result is entirely unpredictable.

Entropy is calculated as the negative sum of the probability of each value times the logarithm of the probability. The logarithm can have any base, but the most common are 2 (and the result is in units called “bits”), e (Euler’s number, and the result is in nats), and 10 (the result being in bans, also called hartleys.) For the coin, there are two possible states that occur with equal probability, so the information content is log₂ 2, or 1 bit. An entropy of 1 bit indicates a completely unpredictable result. On the other hand, an entropy of 0 bits indicates complete predictability, since log₂ 1 = 0. A process with three equiprobable outcomes would have an entropy of log₂ 3, or about 1.58 bits.
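The coin and three-outcome examples are easy to reproduce (the function name is just illustrative):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the sum of -p * log2(p) over all outcomes
    (zero-probability outcomes contribute nothing and are skipped)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))        # fair coin: 1 bit
print(entropy_bits([1.0]))             # a certainty: 0 bits
print(entropy_bits([1/3, 1/3, 1/3]))   # three equiprobable outcomes: ~1.58 bits
```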

The more probable an event is, the less information it carries. Information can never be negative. Events that always occur do not communicate information. Events that never occur do not communicate information either. Information from independent events is additive.

Looking at the distribution – histograms, ogives, and stem-and-leaf plots.

If you know the center of a distribution and the way data values distribute around it, there’s one more major thing you need to know before you have a good grasp of that data set – the shape of that distribution. That’s what the “Distributions” page is all about, so you might want to look back at that. But just to drive home the fact that shape is different from mean and variance, remember this.

Here’s another distribution with the same mean and variance.

They look a lot alike, don’t they? Of course they don’t. One distributes its data around the mean of 71 in a bimodal fashion, and the other distributes 68% of the data within 12.8 units of the mean in a normal distribution.

We know how to calculate the mean and standard deviation, but how do we “calculate” the shape of a distribution? Well, there are statistics that we can use, but we won’t look at that now. What we want is a quick and dirty way to get a “lay of the land” and that’s a perfect analogy. When you want a lay of the land, you use a map, and when you first meet a set of data, you will want to look at its map – graphs.

It’s easy to see in the histograms above that the two distributions are different. It’s easy to get an approximate idea of where the center lies and about what the spread is like. Furthermore, it’s easy to see that the first set of data (from Old Faithful) is bimodal and to guess that it hides two separate distributions (bimodality suggests that) and the other seems to be normal, because normal distributions look so….well, normal.

It would be nice to know how normal the second distribution is, so we can add an overlay graph of what the distribution would look like if it were perfectly normal. That would just be a frequency polygon of a perfectly normal distribution with the same mean and standard deviation.

The red line is a frequency polygon. It is made exactly like a histogram except that a histogram is a bar chart whose bars center on the class midpoints, and a frequency polygon is a line graph whose lines tie together points that are situated at the class midpoints. Either shows the frequency of data values in specific intervals. The histogram emphasizes the frequency in each class interval, and the polygon emphasizes the shape of the distribution; in this case, the classical “bell curve”.

What else can we do? We can look at the graph of the cumulative frequencies – each frequency added to the sum of all the other frequencies before it. Often this will highlight deviations from normality better than a simple histogram. This kind of graph is called an ogive (pronounced oh-jive). Here’s an ogive of the normal data.

This is actually a very nice, normal ogive. You would need the histogram with the reference curve to really see any departure from normal.
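If you’d like to see the arithmetic behind an ogive, here’s a minimal Python sketch. The bin counts are invented for illustration – they are not the geyser data:

```python
# Cumulative frequencies: each class's count added to all the counts
# before it. An ogive is just a plot of these running totals.
from itertools import accumulate

bin_counts = [2, 5, 9, 12, 7, 3]           # frequency in each class interval
cumulative = list(accumulate(bin_counts))  # what an ogive plots

print(cumulative)  # [2, 7, 16, 28, 35, 38]
```

The last cumulative value always equals the total number of observations, which makes a quick sanity check.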

Another useful graphic for exploring data is a stem-and-leaf plot. Here’s one for the Old Faithful duration times in minutes.

Notice that all the values are represented here. You can tell at a glance that there are 32 data points between 1 and 2 minutes, and you can count that 7 of them are 1.7 (from the line beginning 321). The first column tells you the number of observations in the class, the second tells you the class (lower interval limit), and the third lines up each value into a horizontal histogram. Notice that the class containing the median is marked with an asterisk (the median being 4). And it’s obvious that this is a bimodal distribution.
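You can build a rough stem-and-leaf plot yourself in a few lines of Python. The values here are invented (they are not the geyser durations), with whole minutes as stems and tenths as leaves; a real statistics package would add the counts and median marker:

```python
# Group each value by its integer part (the stem); the leaf is the
# first decimal digit. Sorting first keeps each row of leaves ordered.
from collections import defaultdict

data = [1.7, 1.8, 1.7, 2.0, 3.9, 4.1, 4.3, 4.3, 4.6]
plot = defaultdict(list)
for x in sorted(data):
    stem, leaf = int(x), int(round(10 * x)) % 10
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", "".join(str(d) for d in plot[stem]))
```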

Quantiles and ntiles

Statistics are sometimes divided into parametric and nonparametric statistics. Parametric statistics deal with specific distributions, usually normal distributions, and they are called “parametric” because the distributions they deal with are described by parameters. For instance, any normal distribution can be completely specified by two parameters – mean and variance. Nonparametric statistics, on the other hand, are independent of distribution.

There are two kinds of quantiles – they can be parametric or nonparametric, too. For parametric statistics, quantiles come from the quantile function, which is the inverse of the cumulative distribution function – often called an inverse function in spreadsheets. For instance, most spreadsheets have a NORM or NORMAL function which will give you the value below which a particular percentage of the distribution lies. For a normal distribution with a mean of 10 and standard deviation of 2, the value below which 21% of possible data values fall is about 8.4. The cumulative distribution function itself works the other way – it gives you the percentage of data values that are smaller than a particular data value.
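If you don’t have a spreadsheet handy, Python’s standard library can do the same job. This sketch uses the normal distribution from the example above (mean 10, standard deviation 2):

```python
# The cdf gives the fraction of the distribution below a value; its
# inverse (inv_cdf, the quantile function) goes the other way.
from statistics import NormalDist

d = NormalDist(mu=10, sigma=2)

q21 = d.inv_cdf(0.21)  # value below which 21% of the distribution lies
p8 = d.cdf(8)          # fraction of the distribution below 8

print(round(q21, 2))   # about 8.39
print(round(p8, 2))    # about 0.16
```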

The other kind of quantile is nonparametric. Sometimes called ntiles, they work by sorting all the data values and then dividing the sorted set into n parts. The ntiles are the values where the divisions occur. The two most popular quantiles are quartiles (which divide a data set into four parts) and percentiles (which divide a data set into 100 parts). Percentiles have more resolution, so they are more often used for placing particular data values, but for checking to see what a distribution looks like, quartiles are more popular.

The first quartile divides a data set into the smallest 25% of the observed values and the top 75%. The second quartile is what we have been calling the median – it divides a data set into two equal-sized pieces. The third quartile divides the data into the smallest 75% of the data points and the top 25%.
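Here’s a quick Python sketch of nonparametric quartiles, using the standard library’s quantiles function on an invented data set:

```python
# quantiles() with n=4 returns the three cut points that divide the
# sorted data into four parts: first quartile, median, third quartile.
from statistics import median, quantiles

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15]
q1, q2, q3 = quantiles(data, n=4)  # default "exclusive" method

print(q1, q2, q3)  # 4.0 8.0 12.0
```

The second quartile agrees with median(data), as it should.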

Traditional statistics are not much good for the Old Faithful data because they are appropriate for normal distributions, and it’s obvious that the Old Faithful data are not normal. But it’s also obvious that there are two different distributions hidden in the data, so what happens when you split the intereruption times into two groups – one for eruptions of less than 3 minutes and one for longer eruptions?

Here are the descriptive statistics for the two groups:

For the <3 minute group, the distance from the 1st quartile to the median is 9 minutes and from the median to the 3rd quartile is 2.5 minutes. For the >3 minute group, the distance from the first quartile to the median is 4 minutes and from the median to the third quartile is 4.75 minutes. The second group is fairly symmetric and might be normal, but not the <3 minute group, so a nonparametric approach seems best, since you can use it with either normal or nonnormal data.

The visual version of a quartile lineup is called a box plot and it is surprisingly informative for a graph.

The box plot

Here is a box plot for the two groups.

This box plot was generated by a software package called PAST, developed by Oyvind Hammer and D. A. T. Harper and intended as a statistics package for paleontological research. Maybe so, but it’s a great general purpose package with a spreadsheet-like interface. You can find it here:

https://folk.uio.no/ohammer/past

The left plot is for the <3 minute data, and the right plot is for the >3 minute data. The bottom and top edges of the boxes are at the first and third quartiles. The line inside a box is the median. The whiskers (lines) extend down to the minimum data values and up to the maximum data values. Some box plots isolate outliers as individual points. This one does not, but it’s apparent that the <3 minute group has at least one large value that is a borderline outlier (2.5 × the semi-interquartile range), and the >3 minute group has at least one extremely small outlier.

The interquartile ranges are contained within the boxes.

Neither distribution is symmetric – the median would be in the center of the box if it were. Also, those outliers would knock any parametric statistics way off. We will be talking about nonparametric statistics off and on in the future, but it’s pretty obvious from the box plot that they’re the way to go.

Also, it’s pretty obvious that there are two completely different groups of data here. If the bimodal histogram didn’t clinch it, the box plot surely does. The interquartile ranges (the boxes) do not overlap at all.

Looking at two variables together

Here is a very simple equation.

y=x+3

The way it works is, you plug values into the variable x, and that tells you what the values of y are. If x is 1, then y is 4. If x is 5, then y is 8. y is said to be dependent on x.

Sometimes, a data variable will be dependent on another data variable. When you make a statement like that, you are implying causation. The value of x causes the value of y. X comes before y. You know what y is going to do if you know what x is going to do.

Is body weight related to carbohydrate intake? If you eat more carbohydrates, will your weight increase? If so, eating carbohydrates causes weight gain and weight is dependent on carb intake.

Sometimes you can’t establish order. Having more eggs certainly leads to more chickens, but more chickens also produce more eggs. When the order of causality can’t be established between two related variables, they’re said to be codependent.

If, when one variable increases, the other also increases, they’re said to be “directly related,” or to “vary directly.” If, when one variable increases, the other decreases, they’re said to be “indirectly related,” or to “vary indirectly or inversely.”

The force pulling two magnets together is directly related to the strength of the magnets, but is inversely related to the distance between the magnets.

Univariate statistics look at only one variable at a time – say, height. If you want to look at how height relates to weight, you will want to use bivariate statistics – statistics used to analyze the relationship between two variables at a time. A large chunk of statistics is devoted to bivariate statistics. If you want to analyze the interaction between more than two variables, you use multivariate statistics.
The most common bivariate statistics are contingency tables and their associated measures of association, and correlation coefficients. There will be whole sections on those later. The most popular statistical procedure for multivariate analysis is regression analysis.

But, for now, let’s just look at how to get a preliminary idea of how data interacts.

Crosstabulations, stub-and-banner tables, and pivot tables
A frequency table tells you how many times each value in a data set (or values in set intervals) appears in that data set. We’ve seen that before in developing histograms. But let’s say that we have 6-, 8-, and 12-sided dice (if you’re an RPG gamer, you know that these exist), and these dice can be red, blue, or green. Joint frequencies would tell us how many red 6-sided dice there are, how many red 8-sided dice, how many red 12-sided dice, blue 6-sided dice, and so on. If you set up these counts in a table, it might look like:

Not only does this show the joint frequencies, but the row and column sums (also called “marginals”) tell how many red, blue, and green dice there are, and how many 6-, 8-, and 12-sided. And the grand total can also be calculated.

If you drew a die out of the collection at random, what would the probability be that it would be a green, 12-sided die? There are 4 green 12-sided dice in the batch and, with 23 dice in all, that works out to 4/23, or about 17%. All the other probabilities could also be calculated.

This is called a joint frequency, or contingency table, or crosstabulation.

You can figure out what the frequencies would be, given the same marginals, if the numbers were distributed at random: for each cell, multiply the cell’s row marginal by its column marginal and divide the result by the grand total. Here are the expected values.

Most of the expected values are pretty close to the actual values. Maybe the “collection” is actually a single purchase where the dice were randomly poured into a bag by a machine.
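Here’s the expected-count calculation as a Python sketch. The dice counts are invented, but they are chosen to agree with the totals mentioned in the text (23 dice in all, 4 of them green and 12-sided):

```python
# Expected count under independence for each cell: row marginal times
# column marginal, divided by the grand total.
counts = {                        # rows: color, columns: sides
    "red":   {6: 3, 8: 2, 12: 3},
    "blue":  {6: 2, 8: 3, 12: 2},
    "green": {6: 2, 8: 2, 12: 4},
}

row_totals = {color: sum(row.values()) for color, row in counts.items()}
col_totals = {s: sum(counts[color][s] for color in counts) for s in (6, 8, 12)}
grand = sum(row_totals.values())

expected = {color: {s: row_totals[color] * col_totals[s] / grand
                    for s in (6, 8, 12)}
            for color in counts}

print(grand)                            # 23 dice in all
print(round(expected["green"][12], 2))  # about 3.13, versus 4 observed
```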

What if we had one more attribute to deal with, say, rough and smooth sides? Three variables would be rather unwieldy for a crosstabulation, but it could be expanded by nesting attributes like this:

This is called a “stub-and-banner table” and you can see a lot about your collection here. For instance, all the 6-sided dice are smooth.

There are special statistics for dealing with contingency and stub-and-banner tables – chi-square tests, measures of association, and such – but we’re looking at tools for preliminary evaluation here, so we’ll save those for later.

Covariation

Contingency table analysis is especially useful for exploring discrete data. For continuous data, covariation is more appropriate. It takes advantage of the continuous nature of the data to pack more information into the statistics.
The variance of a variable is the average squared distance of the data values from the mean. To calculate it, you subtract the mean from each of the data values. That gives you the distances. Then you square each distance, add up the squares, and divide the sum by the number of data points (actually, if you are working with a sample, you should divide by one less than the number of data points). That average of the squared distances is the variance.
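In code, the sample variance looks like this (invented data):

```python
# Sample variance: the average *squared* distance from the mean,
# dividing by n - 1 because this is a sample.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n
var = sum((x - mean) ** 2 for x in data) / (n - 1)

print(mean)           # 5.0
print(round(var, 3))  # 4.571
```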

Covariance measures how two variables vary together. Instead of the average of the distances of data values in one variable, it is the average of the products of the distances of matched data points from their respective means.  Subtract the respective means of each variable from all the data values, then multiply each pair of matching distances. Sum all the products and divide the sum by the number of matched pairs (or one less than the number of data pairs in the case of a sample).

If the data points come from normal (or, at least, symmetric) distributions and they are not related, then the distances will cancel in the sum and the average will be 0; a covariance of 0 indicates no linear relationship. If more of the products are negative, then the result will be negative, indicating an indirect relationship. If more of the products are positive, a direct relationship is indicated.

All this might be familiar from our discussion of measures of spread above. The variance is the square of the standard deviation. The covariance can be standardized by dividing it by the product of the standard deviations of the two variables, and you have a nicely behaved measure of covariation that is always between -1 and 1. It is called the correlation coefficient and it is a very popular statistic for both description and inference with bivariate data.
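Here’s a small Python sketch of covariance and the correlation coefficient, again with invented data:

```python
# Covariance: average product of paired distances from the means.
# Correlation: covariance divided by the product of the two standard
# deviations, which pins it between -1 and 1.
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 10.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
r = cov / (sx * sy)

print(round(cov, 2))  # 4.0
print(round(r, 2))    # 0.84 - a fairly strong direct relationship
```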

The visual version of the correlation coefficient is called a “scatterplot”.

The ubiquitous scatterplot

A scatterplot is a very simple graph that can shed surprising light on bivariate data. All it is is a Cartesian (rectangular, x-y) chart on which the values of one variable are graphed on one axis and the corresponding values of the other variable are charted on the other. What do the intereruption times and their corresponding eruption times look like on a scatterplot?

Here, you can clearly see that there are two groups of data, and you can see that they divide naturally at the 3-minute eruption duration. There is very little overlap, and you can see a linear trend, especially in the >3 minute eruptions. The whole collection is very linear. Also, the isolated points scattered around the two main clouds are the outliers. If you were a geologist, you might want to look a little closer at these values to see what makes them different.

Explore your statistics

There’s not a lot of difference between descriptive and exploratory statistics. The difference is more philosophical than mechanical. You use descriptive statistics when you want to tell others what your data looks like. Exploratory statistics are for when you’re trying to get a feeling for your data and you’re looking for ideas about how to proceed with your analysis.

We’ve been looking at several different, popular descriptive and exploratory statistics. Let’s switch gears a little and look at a few technical issues.

Robust statistics

I showed you why the arithmetic average is not always the best mean to use. Remember the case with the outlier? A few extreme values can drive many statistics all catawampus. Robustness is how well a statistic can resist the effects of extreme values; the flip side – how strongly extreme values affect a statistic – is its “sensitivity.”
Some robust statistics are very simple. If you want a mean that ignores extreme values, just sort the data and chop off the extreme values. That’s called a “trimmed mean.” Usually, statistical procedures that return trimmed means let you chop off a percentage of the top and/or bottom values of a sorted data set.
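A trimmed mean takes only a few lines of Python. This sketch (invented data, 10% trim) shows one outlier dragging the ordinary mean while the trimmed mean shrugs it off:

```python
# Sort, drop a fraction of values from each end, average the rest.
def trimmed_mean(data, trim=0.10):
    s = sorted(data)
    k = int(len(s) * trim)                # how many to drop from each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

data = [1, 2, 3, 3, 4, 4, 5, 5, 6, 99]    # 99 is an outlier
print(sum(data) / len(data))              # ordinary mean: 13.2
print(trimmed_mean(data, 0.10))           # trimmed mean: 4.0
```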

But you should only do that if you have a good reason to believe that the extreme values don’t belong – that they’re erroneous measurements or, perhaps, really from some other group of data and just sorta snuck into your data. Statistics based on quartiles have the same problem. They just arbitrarily excise data points. The outliers in the geyser data were not errors, and they obviously do belong there. You wouldn’t want to ignore them completely.

Many robust statistics have been developed that attempt to weight extreme values in meaningful ways. A class of statistics called M-estimators tries to minimize certain functions of the data to provide alternative measures of central location and spread. Often, data far from a central measure are weighted less (perhaps even at zero) than data near the center. The actual derivation of such estimators may involve calculus (you know from our discussion of The Basics that the derivative of a function is zero at a minimum point), but the effect is usually quite understandable.

For instance, a robust estimator of the mean called Huber’s M-estimator gradually decreases the weight of data points as they diverge from the center of the data. Another, called Tukey’s biweight, sharply decreases the effect of data points as they get further from the mean and, at some point, just starts dropping them. Some robust measures of centrality are the trimmed mean, the Winsorized mean, Andrew’s wave, Hampel’s M-estimator, Huber’s M-estimator, and Tukey’s biweight.
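To make the idea concrete, here’s a Python sketch of the weight functions behind two of these estimators. The tuning constants (1.345 for Huber, 4.685 for Tukey) are conventional defaults, not values taken from this text:

```python
# Huber: full weight near the center, then weight falls off as k/|u|.
def huber_weight(u, k=1.345):
    return 1.0 if abs(u) <= k else k / abs(u)

# Tukey's biweight: smoothly decreasing weight, exactly zero past c.
def tukey_biweight(u, c=4.685):
    return (1 - (u / c) ** 2) ** 2 if abs(u) < c else 0.0

# u is a standardized distance from the center of the data.
print(huber_weight(0.5), round(huber_weight(5.0), 3))  # 1.0 0.269
print(tukey_biweight(0.0), tukey_biweight(6.0))        # 1.0 0.0
```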

These statistics are useful, but they are not universally popular. In the first place, they’re difficult to calculate by hand (not so much if you have the right software) and to understand. In the second place, to choose the “right” one, you need a solid understanding of the specific estimator, and you need to decide how brutal you want to be to the extreme values. Further, there are parameters you must supply to the functions to say where they will “break down” and start throwing out values. Unfortunately, there are no hard-and-fast rules as to how brutal you should be or how you should specify the parameters, so you have to rely on your intuition as a statistician.

Perhaps, as with many decisions in statistics, the best tack is to look at the results of several estimators and see which one makes the best sense.

Here are the values of robust means of intereruption times for Old Faithful following eruptions of less and more than three minutes.

Keep in mind (from the box plot) that the group with shorter eruptions had large outliers, which would tend to pull the mean up, and the group with long eruptions had small outliers, which would pull the mean down. The robust measures have moved the mean in the right direction. Andrew’s wave seems to be the most stringent measure, where Hampel’s seems to be the most conservative. There is some variance as to whether the measures want the mean to move up or down, especially in the >3 minute category, where the outlier is more severe.

There are also robust measures of spread and even of covariation and regression.

Statistics from grouped data

I’ve told you that you should measure your data at the highest level (nominal, ordinal, interval, ratio) possible given your resources, and you should hold onto your dataset for dear life. If you create a frequency table and then lose your raw data, you’ve lost valuable information and, what do you do? Go back and collect all that data again? Ask all those people the same questions (they won’t have so much fun the second time around)?

Well, it’s not a perfect solution, but there are methods of extracting information from things like frequency tables. This kind of data – discretized continuous data – is called “grouped data” and the idea is to use interpolation and extrapolation methods to approximate statistics. Since statistics, by their very nature, are estimations of population parameters, what you end up with is approximations of estimations but, if that’s all you have….

You’ve seen a frequency table on the distribution page, and you can probably predict what’s coming, but let me outline the procedure for finding an estimate of the mean. But let me point out that, if you have a stem-and-leaf chart instead of a frequency table, you can rebuild your raw data.

First, you have the frequency and midpoint of each bin. To find the approximate sum of the data values in a bin, all you have to do is multiply the frequency by the midpoint value. The sum of the frequencies is the actual number of raw data points, and the sum of the bin totals is an approximation of the sum of the data values. Dividing that sum by the sum of the frequencies will approximate the mean.
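Here’s that procedure as a Python sketch, with invented bins and counts:

```python
# Each bin's midpoint stands in for every value in that bin.
midpoints = [155, 165, 175, 185, 195]  # class midpoints (say, heights in cm)
freqs = [4, 9, 12, 6, 2]               # observations in each bin

n = sum(freqs)
approx_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

print(n)                      # 33 observations
print(round(approx_mean, 2))  # about 172.88
```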
Why “approximate”? Mainly because each data value in a bin has been replaced by the bin’s midpoint – the individual values are gone, so the best you can do is a good guess. Round-off error adds to that. You probably used a spreadsheet to generate the frequency table, and computers aren’t exact in their calculations. DANSYS only offers 15 to 16 decimal digits of precision, and that sounds really (really!) good, but it’s not perfect and, with repeated sequential calculations, those round-off errors really add up. The more intermediate calculations you throw into the mix, the more likely you are to see some significant error.

What do you do if all you have is summary values – maybe from a box plot? You can get a pretty good idea of where the mean is by adding the minimum value to the maximum value plus 4 times the mode (or the midpoint of the modal class), and dividing the result by 6. The standard deviation can be estimated by subtracting the minimum value from the maximum value and dividing the result by 6.
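In code, those two quick estimates look like this (the summary numbers are invented):

```python
# mean ~ (min + 4*mode + max) / 6, sd ~ range / 6 - rough rules of thumb.
mn, mode, mx = 146.9, 169.5, 196.5  # minimum, modal value, maximum

est_mean = (mn + 4 * mode + mx) / 6
est_sd = (mx - mn) / 6

print(round(est_mean, 1))  # about 170.2
print(round(est_sd, 1))    # about 8.3
```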

What about estimating other statistics from grouped data? The median is easy – just count to the middle bin. If there are 50 data points, add the bin frequencies until you exceed 25, and that will be the median bin. How many data points come before that bin? Subtract that number from 25 to see how far into the bin the median is – say, 2. But what is that data value? You can’t know, but you can get an estimate by setting up a proportion: if there are 5 data values in the median bin, the median sits 2/5 of the way from the bin’s minimum to its maximum. This is called linear interpolation.
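Here’s the linear-interpolation estimate as a Python sketch, with invented bins:

```python
# Find the bin holding the middle observation, then assume the values
# inside that bin are evenly spread.
lowers = [150, 160, 170, 180]  # lower limit of each bin (width 10)
freqs = [4, 9, 12, 6]

n = sum(freqs)                 # 31 points, so the middle is at 15.5
half = n / 2
cum = 0
median_est = None
for lo, f in zip(lowers, freqs):
    if cum + f >= half:
        median_est = lo + (half - cum) / f * 10  # interpolate into the bin
        break
    cum += f

print(round(median_est, 2))  # about 172.08
```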

To find the standard deviation, you need the estimated mean and the midpoint of each bin. Add up the products of the bin frequencies and the squared midpoints. Subtract the number of data points times the squared mean from the resulting sum. Then divide it all by the number of data points minus one to get the variance. The standard deviation is just the square root of the variance.
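And the grouped-data standard deviation as a Python sketch, using the same kind of invented bins:

```python
# Variance from grouped data: (sum of f*m^2 - n*mean^2) / (n - 1),
# where m is a bin midpoint and f its frequency.
from math import sqrt

midpoints = [155, 165, 175, 185, 195]
freqs = [4, 9, 12, 6, 2]

n = sum(freqs)
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
var = (sum(f * m * m for f, m in zip(freqs, midpoints)) - n * mean ** 2) / (n - 1)
sd = sqrt(var)

print(round(sd, 2))  # about 10.83
```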

What do you want to know? (Questions and hypotheses)
Exploratory data analysis helps you to answer preliminary questions about a study.

The first thing you need is a topic. I grind my molars when I hear an instructor tell a class to find a topic they’re interested in. You don’t “find” a topic – you encounter a topic. The topic is already out there. If you walk around with your senses operating, interesting topics will assail you from all directions.

But a topic isn’t enough. You have to polish it into a research question. A research question restates a question of interest into a question about measurable quantities. “Does walking faster up a hill make me hotter?” becomes, “If I walk twice as fast up Mount Carbon, do I generate more body heat?” You’ve specified a measurable upgrade and your pace and body temperature are measurable and recordable.

Three of many procedures can help you figure out what variables you need to measure and how to measure them, working definitions for the concepts you need to work with, and other details you need to consider when designing a study.

You can find out what others have said about your topic. While the Internet in general is not a good source for this kind of information – anyone can publish on the Internet, and they don’t have to know what they’re talking about – Google Scholar is a free citation service that keeps track of peer-reviewed, academic-quality information. If you intend to publish your findings, make notes and keep bibliographic references. These will make up the first and last sections of your report.

If you can get a group of people who have personal experience with your topic, a panel is a great way to gather preliminary information. A panel group can be composed of academic or professional experts, and/or people that have relevant experiences in their everyday life. You don’t only need opinions – you need anecdotes, jargon, and folk knowledge.

A panel study is a moderated discussion with a set agenda about a subject. The moderator guides the talk through the information needed for the study and keeps the conversation on topic.

Case studies don’t make for good research. They usually rely on small, unrepresentative samples and provide little in the way of error control, but they can generate a wealth of ideas.

With a study question, you can then compose the two kinds of statements that drive your study – the hypotheses. Once you have an idea about what might be going on, you need to formulate that idea into a statement – once again, measurable – “If I walk up Mount Carbon twice – once at a measured pace of one mile per hour, and once at a measured pace of two and a half miles per hour, given a day between and similar weather conditions, my recorded body temperature will show warmer trends on the second ascent.” That would be what’s called your “alternative hypothesis”. It’s an alternative hypothesis because you will also need a null hypothesis to go with it – something like, “If I walk up Mount Carbon twice – once at a measured pace of one mile per hour, and once at a measured pace of two and a half miles per hour, given a day between and similar weather conditions, my recorded body temperature will show no difference in trends between the first and second ascents.”

The null hypothesis, that you can’t find the answer that you believe you will find, is the important one, because that’s the kind of hypothesis that modern research tests. The p value reported in research reports is the probability of getting results at least as extreme as yours if the null hypothesis were true.
It might seem strange that a researcher would go out of their way to prove their pet idea wrong, but there is more than one reason that that’s the way to go. In the first place, it keeps researchers honest. Science is the search for truth, and if you try your best to disprove that you are right and you fail, then there’s a good chance that you were right after all.

Then again, it’s easier to disprove something (all you need is one counter example) than it is to prove something. If you prove that something is true in one case, it doesn’t prove it in all cases.

And the p value reminds you that, in science, there’s always a chance that you’re wrong. Some computer programs might round 3.5×10⁻³⁷ to 0, but p is never 1 or 0.
You can have any number of hypotheses in a study, but for every alternative hypothesis that tells what you think is going on, there must be at least one null hypothesis that says that you’re wrong.

The questions a priori – things you have to ask first
There are a few issues you must address before you even think about any other design issues.

“Research should follow theory,” is a common tenet among research specialists. In other words, you should ask your questions, design your study to answer those questions, and report the results. If your study delivers a large p value, that is the result, and it’s important. It tells you that the answer you expected wasn’t forthcoming, and it follows with an important (and exciting) question: “Why?”

First, you want to determine how many subjects will be appropriate. We looked at that in the section on sampling and estimation. If you have a small sample (and, sometimes, that’s the best you can do), then at least use the appropriate statistics to get as much as you can from what you have and report everything you do honestly.

Do you need to make sure that there are certain kinds of people in your sample? You really need to control for error by using randomized samples and control groups, but if you know the makeup of your population, it doesn’t hurt to make the sample look as much like the population as you can – you still need to randomize within your strata. Again, we talked about that when we talked about samples and estimation.

We’ve mentioned the mysterious alpha, too, and we will be talking about it a lot in future episodes. That is something else you need to nail down before you start designing your study. Are you comfortable knowing that 5 times out of a hundred, with your data, you will make the mistake of rejecting a null hypothesis when you really should accept it? That 5 out of 100 is an alpha of 5% – equivalently, a confidence level of 95%. The alpha of a confidence interval is what sets how confident you want to be about your results.

Traditionally, an alpha of 5% (95% confidence) is good enough but, if you’re working on a drug that’s going to mean life or death, you might want a little more confidence. An alpha of 1% (99% confidence) is fairly common and is considerably more stringent.
Before you start a study, you will need a study design. Study designs can be as complicated as you want to make them, but they need to be put together with your research questions in mind. We’ll be talking about research designs in a later section.

Finally, you need to know which groups you are going to compare before you start. The difference between a priori and post hoc comparisons is obvious – a priori comparisons are planned prior to the study, and post hoc comparisons happen after you’ve dealt with your research questions. It’s not quite so clear what the big deal is about not using post hoc comparisons, though.

Post hoc studies are often called “data dredging” because the temptation is, if you don’t find what you’re looking for in the data, to start looking for anything that might support your pet theories. If you look hard enough, you can support anything. Philosophically, post hoc just looks bad. But there are some real, probabilistic problems, too.

If you look at the data once in a study, the p value you come up with is justified by probability. If you keep going back and testing the data, each extra test makes it more and more probable that you will find a particular outcome, whether it’s there or not. In essence, you’re sampling comparisons without replacement. If you use the standard statistical tests, you are not taking this repeated sampling into consideration, and the p value you obtain will be as though it came from a single test.

If you just have to look again, there are multiple comparison tests that can be used that do take this repeated sampling effect into consideration. Certainly, if you look at the results of a study and you see a surprising pattern that you want to follow up, look again, but be sure to use the right statistical procedures. Ideally, report your recommendations for further research and design a new study to test your new research hypotheses.
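The probabilistic problem, and the simplest of those adjustments (the Bonferroni correction), can be sketched in a few lines:

```python
# With m independent tests each at level alpha, the chance of at least
# one false rejection is 1 - (1 - alpha)^m. Bonferroni's fix: test each
# comparison at alpha / m instead.
alpha, m = 0.05, 10

familywise = 1 - (1 - alpha) ** m
bonferroni_alpha = alpha / m

print(round(familywise, 3))        # about 0.401 - a 40% chance!
print(round(bonferroni_alpha, 4))  # 0.005
```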

What do I do with all these numbers?

After you have collected your data, but before you have settled on your formal analysis, is when you can be very free with the data. You can and should look at it from every angle to find the best way to approach it. Make graphs, generate descriptive statistics, use different transformations on it to see if that will give you any leverage. But never, ever let go of your raw data. Always allow yourself to backtrack through everything you’ve done, back to the raw data.
Is your data nominal, ordinal, interval, or ratio? That will suggest the kinds of analysis you should use.

Is your sample large enough to make inferences to the population?

How big are the differences between the means of the groups you’re looking at?

Do the graphs show any discernible patterns? Does it look like there might be relationships between the groups?

Does this count? Nominal data

The kind of data you have suggests the statistical methods you should bring to bear on it. Nominal data consist of counts of categories. If your data can only be described as attributes – colors, genders, nationalities, and such – you’re working with nominal data.

The first things you should think of when dealing with two or more nominal variables are frequency tables, joint frequency tables (contingency tables), and association statistics, but don’t forget that there are several discrete probability distributions – binomial, Poisson, and others – that require a different approach. A histogram may suggest that you’re dealing with some specific kind of distribution, and you can be more accurate using goodness-of-fit tests to see if your data fit a particular distribution.

Bringing order to your life. Ordinal data

If your data consist of counts that can be ordered from largest to smallest (or alphabetically, or from more to less important, or in any other way), or of ratings, you are in the realm of ordinal data. While box plots are appropriate for exploring just about any kind of data, they are especially useful for ordinal data. Descriptive statistics based on quartiles and other quantiles are appropriate. And there are entire suites of statistics specifically designed for ordinal data, organized in contingency tables or lists of paired values.

(There is actually a version of the box plot based on the mean and standard deviation, instead of quartiles. It’s called a diamond plot and, although it is much less common than the box plot, it may be more appropriate for continuous data.)
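The quartile summary that underlies a box plot can be sketched with Python’s statistics.quantiles; the 1-to-10 ratings below are invented for the example:

```python
import statistics

# Illustrative ratings on a 1-10 ordinal scale (made-up values).
ratings = [2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9]

# n=4 asks for quartile cut points; "inclusive" treats the data as the
# whole population rather than a sample.
q1, q2, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
print("five-number summary:", min(ratings), q1, q2, q3, max(ratings))
print("IQR =", q3 - q1)
```

The five-number summary (minimum, Q1, median, Q3, maximum) is exactly what a box plot draws.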

Measurements. Interval and ratio data

Most traditional statistics are designed for normal data – measurements. Arithmetic means, t-tests and their more advanced versions (ANOVA and ANCOVA), Pearson correlation coefficients, regression analysis, and so on are meant to be applied to normal data unless modified to accommodate other distributions.

It is usually assumed that the same statistical procedures are equally appropriate for interval and ratio data. There are also nonparametric statistics that don’t care what the distribution looks like, and graphs are the ultimate nonparametric statistics, although with graphs you can’t pin down the values of statistics with the accuracy provided by numerical methods.

Working with ratios and proportions

Ratios and proportions are fractions, but they’re fractions intended for certain tasks. A ratio compares two values. In a fraction, if the numerator has the same value as the denominator, the fraction is equal to 1. If the numerator is larger than the denominator, the ratio is larger than 1. If the numerator is the smaller value, the ratio is between 0 and 1.

A proportion compares a component to its whole. “1 out of 6” or “1 in 6” describes the proportion of possibilities of getting a 1 (or any other single number) in a throw of a six-sided die.

Indexes are special ratios used as indicators, commonly in finance and economics.

Anytime you see a data set composed of ratios or proportions, you should think “binomial distribution.” It isn’t always the case, but it’s a really good place to start. Also, you’ve already seen that there are more appropriate means to use with ratios and proportions than the arithmetic mean: the geometric and harmonic means, for example.
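Both ideas can be sketched in a few lines of Python with made-up numbers: the binomial term uses math.comb, and the geometric and harmonic means come straight from the statistics module:

```python
import math
import statistics

# Binomial sketch: probability of exactly k ones in 10 rolls of a fair die.
n, p = 10, 1 / 6
for k in range(3):
    pmf = math.comb(n, k) * p**k * (1 - p) ** (n - k)
    print(f"P(exactly {k} ones in {n} rolls) = {pmf:.3f}")

# Means better suited to ratios than the arithmetic mean.
growth_ratios = [1.10, 0.95, 1.30]   # made-up year-over-year growth ratios
print("geometric mean:", round(statistics.geometric_mean(growth_ratios), 4))

speeds = [40, 60]                    # harmonic mean: speeds over equal distances
print("harmonic mean :", statistics.harmonic_mean(speeds))
```

The harmonic mean of 40 and 60 is 48, not 50 – which is the right answer for the average speed of a round trip driven at those two speeds.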

Weighted data

There are times when a statistician wants to emphasize certain data values more than others. You’ve already seen one instance: a way to deal with outliers is to give typical values greater weight – that’s what robust means do.
The way you weight data is to multiply each data value by some “weight constant”. Another reason to weight data is to make a sample look more like the population it’s drawn from.

Samples should be randomly drawn, so it’s entirely possible that a sample drawn from a population of about half and half males and females may have some other proportion of sexes, say 25% males and 75% females. You can equalize this tendency by multiplying all the males’ scores by 2, or the females’ scores by 0.66. In a sample of 100 subjects, that would give the males an influence of 25*2=50, and the females an influence of 75*0.66=49.5 – close enough.
And I suggest that, if you perform such an “artificial manipulation” on data, you do everything to both the raw data and the transformed (in this case, weighted) data and see what difference it makes.
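A weighted mean makes the reweighting example concrete; the two group scores below are hypothetical, and the weights are the ones from the text:

```python
# Weighted mean sketch for the reweighting example in the text:
# 25 males weighted by 2 and 75 females weighted by 0.66.
def weighted_mean(values, weights):
    # each value is multiplied by its weight, then normalized by total weight
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [80, 70]                # hypothetical group mean scores: males, females
weights = [25 * 2, 75 * 0.66]    # group sizes times the text's weight constants
print(weighted_mean(scores, weights))
```

With the weights applied, the two groups pull on the overall mean almost equally, which is the point of the manipulation.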

The calculus of “I don’t wanna” (I don’t wanna do calculus). Time and spatial series

A large subset of statistics, sometimes called econometrics or time series analysis, is involved with looking for patterns in variables that change over time. “Econometrics” is one of the names because so much of economics has to do with changes in value.

The phrase “change over time” is key, so it shouldn’t surprise you that time series analysis is calculus heavy. Technically, you have to do calculus if you are going to deal with change – practically, not so much.

Mind you, I advocate understanding your tools, which means understanding calculus if you are going to work with time series. But people with a very solid background in calculus have worked out a suite of tools that crunch time series data without the user having to deal with calculus at all. You just have to know how to interpret your computer readouts.

Since we will be going into time series in much greater detail later (even with some calculus), I won’t go into it in any more detail here, except to point out that a picture can go a long way. Look at this graph of inter-eruption intervals of Old Faithful.

This is a graph of the difference between each interval and the mean interval (on the vertical axis) in order of the date of recording (on the horizontal axis). Can you see the pattern – roughly long-short-long-short? This is what gave scientists the idea that the variation in inter-eruption times had to do with how Old Faithful’s underground chambers fill with water.
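You can get at the same pattern numerically by looking at the signs of the deviations from the mean; the intervals below are invented stand-ins for the real Old Faithful record:

```python
import statistics

# Made-up inter-eruption intervals (minutes) showing the long-short
# alternation the text describes; real data would be substituted here.
intervals = [78, 55, 80, 52, 76, 58, 81, 50, 79, 54]

mean = statistics.mean(intervals)
deviations = [x - mean for x in intervals]

# The signs of the deviations make the alternating pattern easy to see.
print("".join("+" if d > 0 else "-" for d in deviations))  # prints "+-+-+-+-+-"
```

A strictly alternating sign sequence like this is exactly the long-short-long-short pattern the graph shows.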

An honorable mention: Nonparametric statistics

Most commonly used statistics capitalize on knowledge of the probability distribution the data are drawn from to characterize and make inferences about those data. Sometimes, though, you don’t know what the distribution is like.

There is a class of statistics that do not rely on knowledge of the underlying distribution to characterize data. Such statistics are called nonparametric because they are not related to the parameters that describe specific distributions.

Nonparametric statistics tend to be more difficult to calculate than parametric statistics, but that’s not a problem if you have a computer that can do the work.
An example is the Mann-Whitney U test, used to compare two groups. It can be used in place of the parametric t-test, which is only appropriate for normal data. To use it, you rank all the data points together, then separate the ranks according to which group they belong to. In essence, it tests whether one group has most of the high ranks.
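A from-scratch sketch of the U statistic (ignoring tied values, and using made-up data) shows how little machinery the test needs:

```python
# Mann-Whitney U statistic sketch; assumes no tied values across the groups.
def mann_whitney_u(a, b):
    combined = sorted(a + b)
    # 1-based rank of each value in the pooled, sorted data
    rank = {v: i + 1 for i, v in enumerate(combined)}
    r_a = sum(rank[v] for v in a)           # rank sum for group a
    u_a = r_a - len(a) * (len(a) + 1) / 2   # U statistic for group a
    u_b = len(a) * len(b) - u_a             # U statistic for group b
    return min(u_a, u_b)                    # the smaller U is reported

group1 = [3.1, 4.2, 5.9, 6.5]   # invented measurements, group 1
group2 = [7.0, 8.3, 9.1, 10.4]  # invented measurements, group 2
print(mann_whitney_u(group1, group2))  # prints 0.0: the groups don't overlap
```

A U of 0 means every value in one group outranks every value in the other; the statistic would then be compared to a table or a normal approximation to get a p-value.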

Many nonparametric statistics are based on ranked data. Most graphic methods are nonparametric simply because they are based on examination. Resampling methods use Monte Carlo techniques (repeated sampling) to analyze data. Some methods test whether the distributions two groups are drawn from are different, without necessarily caring what those distributions look like. Scale is another characteristic of data sets that might be used to design statistical procedures.

Obviously, there is a wide variety of nonparametric tests that serve as alternatives to the more traditional tests. Many nonparametric tests tend to be more robust than their parametric counterparts.

Quartile based descriptive statistics are nonparametric, as are information theoretic statistics.

We will delve much more deeply into nonparametric statistics as we look into inferential statistics in the next section and, later, when we look specifically at nonparametric methods.

The statistical decision tree

If you’re not sure which methods to use to figure out what’s going on with your data, there are various guides you can turn to. There is a temptation to use them mechanically, bypassing the problem solving skills that are at the center of statistics, but used with creativity, they can give you a start.

There are statistical decision trees that can be used to plan an analysis. A Guide for Selecting Statistical Techniques for Analyzing Social Science Data by Frank M. Andrews, Laura Klem, Terrence N. Davidson, Patrick O’Malley, and Willard L. Rodgers may be the most popular ever created. It was copyrighted 1981 by the University of Michigan and has been turned into an electronic version for use with the MicrOsiris statistics package (https://www.microsiris.com/Statistical%20Decision%20Tree).

There is also one intended for use with the DANSYS spreadsheet (http://www.theriantimeline.com/statisticstree.ods).


Decision trees work pretty much like the identification keys you might have used in biology classes. You answer a question and that leads to a new question. At the end of a series of questions, you come to a dead end that gives you a method. 
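The mechanism is easy to mimic in code. Here is a toy key as nested Python dicts; the questions and endpoints are invented for illustration, not taken from the DANSYS tree:

```python
# A toy decision tree as nested dicts, mimicking an identification key.
# Questions and endpoints are hypothetical, not the DANSYS tree itself.
tree = {
    "question": "Is the data numeric?",
    "yes": {
        "question": "Are there two or more groups to compare?",
        "yes": "Consider ANOVA or a nonparametric alternative",
        "no": "Consider descriptive statistics and histograms",
    },
    "no": "Consider contingency tables and association statistics",
}

def walk(node, answers):
    # follow one yes/no answer per question until a leaf (a string) is reached
    for answer in answers:
        node = node[answer]
        if isinstance(node, str):
            return node
    return node

print(walk(tree, ["yes", "yes"]))  # prints "Consider ANOVA or a nonparametric alternative"
```

Each answer prunes away every branch that no longer applies, which is exactly what an identification key does.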

Say you have five schools, numbered 1 through 5 (that would be a nominal variable), and you have mathematics achievement test results collected from them (those would be continuous data), and you want to know if student performance differs between the schools. How would you test that?

Using the DANSYS tree, my first question is “Data type.” The data are numeric, which leads to the question, “How many variables does the problem involve?” I answer two variables – one is Schools and the other is Math scores. For “Is time a variable?” I answer “No.”

Now it asks, “How do you want to treat the variables with respect to scale of measurement?” I have one interval variable and one nominal variable. That leads me to, “Is the interval variable dependent?” I could actually go either way with this, seeing whether the score will predict which school a student is from, or whether the school will predict a student’s performance. I think I will answer “No,” and see where that leads me.

When it asks, “Is the nominal variable a two-point variable?” (it’s not, since there are five schools), I answer “No” and am led to Linear Discriminant Analysis – which I figure is a reasonable result. If I had decided to make the nominal variable dependent, I would have gone to “What do you want to measure?” (significance of difference between the groups), then “Are you willing to assume that the intervally scaled variable is normally distributed in the population?” Since this kind of thing is usually normally distributed (if there is doubt, I could always look at a histogram), I answer “Yes.” Then, “Do you want to test the equality of means or of variances of the dependent variable for different categories of the independent variable?” I want to compare the schools’ average performance, so I answer “Means” and am led to… Analysis of Variance, which I consider a reasonable decision.

But, again, I warn that mindlessly following a mechanistic tool like a decision tree goes against the spirit of statistics, which is a creative, problem solving discipline, and will, at times, flat-out lead you wrong. What if I had assumed the math scores to be distributed normally and they were not? The answer to my study question could have been wrong.

Going out on a limb – analytic problems that aren’t usually called “statistics”

Remember, a statistic is a value that characterizes a sample. Not all data are values drawn from samples and not all research has to do with characterizing samples and populations. Here are a few (not by any means “all”) other kinds of analyses that are not usually encountered in statistics courses.

In the social sciences, not all subjects can easily be quantified and, even if there are quantitative methods that can aid in a study, they might not be able to capture the full content. For instance, a field ethnographer trying to capture the content of a new language can take notes on how individuals use certain words, and can use card sorting techniques to more accurately grasp the meanings of new words. Certain mathematical techniques will help organize the cards, but nonmathematical pattern recognition will be the primary tool of the ethnographer. These nonmathematical techniques make up the qualitative analyses of the social researcher.

What is the best way, the quickest, the least expensive, the maximum, the minimum, the optimal? These are the kinds of questions addressed by operations research and the mathematical procedures used are those of optimization – rich in matrix methods and calculus.

Demography seeks to characterize populations – size of populations, growth and decay, mobility, constitution. Although demography is similar to statistics and, indeed, uses many of the same methods, there are also very different methods used by demographers. The national census is an exercise in demography.

In manufacturing, there have to be methods to keep up with the quality of the product produced, the defects that occur in production, and conformance to design. These are the subjects of quality control.

Decision theory investigates the best ways to make decisions – especially business decisions, but also goal directed behavior in general. Decision analysis evaluates past decision making and seeks to optimize future decisions.

Public utterances are not always easy to decipher. Logical analysis attempts to uncover the difference between what people say and what they mean. Statements are evaluated according to their validity and persuasiveness. Rhetorical analysis also looks at delivery, emotional content, and context.
 
Often it is instructive to study, not an object or phenomenon, but a model or simulation. You can’t bring a river or wolf pack into your laboratory and, where field research yields a wealth of knowledge, there are things that can be learned by feeding what you know into a computer and letting it take over.

Since the middle of the 20th century, a solid goal of science has been to create a thinking machine. One motivation is to understand the older, more organic version – the brain. That would be another form of simulation. But a second goal would be the creation of a synthetic helper capable of reasoning.

Huge amounts of data flow through modern computers and the most massive body of information is, of course, the Internet. Keeping up with this incredible volume of information is a monumental task but one that modern society has come to depend on. There is a constant search for faster and more efficient methods of data storage and handling – cryptography and data security, data structures, data searches, and data mining are all of paramount importance in modern analytical techniques.

And as data flow increases, speed and the prevention of data loss in transmission also becomes vitally important, and that is the subject of information theory and signal analysis.

Where mathematicians can use their skill at rearranging mathematical concepts to manipulate formulas into useful forms, computers rely on brute force number crunching to produce answers. But numerical methods must be developed that computers can use, and that is the subject of numerical analysis.

For better or worse, modernity is driven by value and money and a body of analytical knowledge has formed into the twin subjects of economics and finance.

The world is not a laboratory and the processes that drive it are complex and chaotic and given to sudden, irreversible catastrophes. Our ability to predict much of what is around us is necessarily limited – what those limitations are and our ability to work within them are the subjects of chaos, complexity, and catastrophe analysis. Our need to predict as best we can processes from earthquakes to heart attacks drive these disciplines.

Game theory addresses behavior in business and gambling – any situation that entails risk – and seeks to elucidate what that behavior is and what it most optimally should be.

Sociometry and ethnography deal with describing cultures and social hierarchies. Analytical methods have been brought to bear on these studies.

In science and industry, the fine points of measurement are of supreme importance. Measurement theory and dimensional analysis are at the back of much of the scientific knowledge we have today – from Galileo to Newton to the Higgs Boson, you have to measure it to understand it.

So, the world of analysis is far bigger than just statistics.

