Stat Files

One of my real joys in life is statistics. It’s one of my hobbies. I’m programming routines for my OpenOffice spreadsheet. If I live long enough, I’ll be able to do analyses of variance, advanced linear analysis, content analysis and all kinds of stuff. I also collect data for the Were community.

Professor Patrick Allen (The Learning Company – The Art of Teaching), in giving an example of a class that (of course) students don’t like, came up with “statistics.” That sorta depresses me. You see, I think everyone should enjoy statistics.

I think I understand why the one class everyone dreads is statistics (and in college, everyone gets to take it, right?) For one thing, instructors in statistics classes typically either know statistics but have a problem getting the ideas across to students, or they’re really good teachers but don’t quite understand their subject. Also, people have the wrong idea about statistics. They think it’s a mathematics subject, but it’s not. Mathematics is just the tools that statisticians use. Statistics is a conceptual subject about getting to the bottom of things.

Everybody should understand statistics. If you’re going to understand the world, you need to understand statistics. We live in a statistical world. Things are pretty straightforward in the chemistry and physics lab. Machines generally behave pretty well. But just think about your own body.

Each cell is a bag of chemicals just thrown together. It’s not like you have a bunch of test tubes with a few orderly reactions going on. You rely on bags of chemicals to live, because that’s what life is – chemical reactions going on, all in the same bag. Try to sustain any reaction for long in a ziplock bag, especially anything like the hundreds of different reactions going on in a cell at the same time in the same place.

I used to collect minerals; I might again if I get the chance. If you pay attention to the rocks around you, you will soon be amazed at the beauty beneath your feet. Try to make azurite in your kitchen. You can, but it isn’t easy. Just try to cook up a quartz crystal – maybe an amethyst. But those things fall together naturally in the ground.

These are complex phenomena. When you start trying to figure out how they take place, you quickly lose track of all the things that have to happen.

When you leave the laboratory, things get chaotic. People have the wrong idea about chaos. Things run on chaos. We’re not talking about purely random havoc here. Chaos is the finely tuned mechanisms of the universe, so finely tuned that a tiny difference in conditions now (so tiny that they can’t be measured – ever – they’re way beyond anything that can be resolved) can make huge differences just a little down the road. That’s why your weather man is so hard put to get it right. It’s not his fault; he’s just having to deal with highly chaotic stuff.

So how do you deal with complex, chaotic stuff? – Statistics. You deal with the probabilities.

You read articles about scientific discoveries all the time – do you really understand that stuff? It’s making a lot of the news, so you should. But scientific journalism is full of all kinds of holes. Not long ago, I heard about some scientists who had found a faster than light particle. One eyebrow cocked up skeptically. I was amazed at how many people were taking them seriously. You see, there are very few barriers quite so unbreachable as the speed of light. If someone does break the light barrier, they have a lot of work to do to prove they’ve done it. Of course, they didn’t.

But science journalism is not where you learn about scientific progress – you do that by reading the primary research papers and right smack dab in the middle of all that is – you got it, statistics. You will never grasp anything about research and you will be taken in by all the freaky hoaxes unless you know how to read research papers and decipher statistics.

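You’re going to have a 50% chance of rain tomorrow. What does that mean? Does it mean that, if you go outside, you can flip a coin to see if you’re going to get wet or not? Not exactly. Forecasters get that number by combining how confident they are that it will rain at all with how much of the forecast area they expect the rain to cover. If they’re sure it will be raining somewhere, 50% means that about half of the area around you will get wet. You just have to figure out which half it is.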

I follow the Doonesbury website. The strips from back in the 70s make me nostalgic, but I especially like the “Say What?” column. The things that come out of the mouths of people who really should know better. The other day, I read this little gem:

“This was the sixth warmest December-January period on record and the warmest since 2012…Most of the contiguous U.S. was warmer than average for the two-month period…No state had December-January temperatures that were below the 20th century average.”
— National Climatic Data Center

Okay – think about it. If you don’t see the hilarity, you will if you stick with this column. (hint: averages are somewhere in the middle)

Again, statistics are communication. In order to get many points across, you’re going to have to say it with statistics. There’s no chance you’ll avoid it – well, the chances are very slim.

And, honestly, statistics are beautiful. All those colorful graphs chock-full of information and, if you don’t understand statistics, all that cool information will be forever beyond you. Wow! Doesn’t that make you feel all angsty inside? Well, let’s do something about that. Come on and find out why groovy people like me find statistics so enthralling!

Data and statistics
So, let’s start with the absolute basics. What do statisticians work with? – Data!

Data is simply abstracted information. As soon as you take a thought or an observation and record it, it becomes data. Here are some examples of data:

1,2,3,4,5,…
“It was the best of times, it was the worst of times,…”
Charles Dickens, A Tale of Two Cities.
3,1,4,1,5,9,2,6,5,3,5,9
2730.69,2830.69,2934.65,2889.36
Four Dow Jones averages from 1991

The first, third and fourth examples are obvious and they are what most people think of when they hear the word “data”: numbers.

But what has Charles Dickens to do with data? This passage from A Tale of Two Cities is certainly abstracted information, but could it ever be of interest to a statistician? It could and it most likely has. The statistical field of content analysis takes textual materials, such as the works of Charles Dickens, and subjects them to mathematical analyses. The text turns into things like word counts and measures of how often specific words are associated with other specific words.
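To make the idea of turning text into numbers concrete, here is a minimal Python sketch of a word count. The tokenizing rule (lowercase everything, keep only letters and apostrophes) is a simplifying assumption made for illustration, not a standard of the field.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# The opening words of A Tale of Two Cities.
passage = "It was the best of times, it was the worst of times"
counts = word_counts(passage)

# Show the words in descending order of frequency.
for word, count in counts.most_common():
    print(word, count)
```

Once the text is a table of counts, it can be treated like any other data set.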

Content analysis can also be applied to visual media like the photograph.

Literally any form of information may come under the scrutiny of statistical analyses.

There are two kinds of data: raw data and statistics. (By the way, it used to be that “datum” was the singular noun and “data” was the plural. “Datum” has dropped out of favor now and we have “a data point” and “data in a data set.” And you say “day-tah” and I’ll say “dat-ah” and we’ll work the whole thing out.)

Raw data is the individual observations, measurements, values, texts, pictures, sound files as they are recorded. Once the data are modified to try to make sense out of them, they become statistics. Sums, averages, squares and square roots are all statistics.

There are two kinds of statistics. Statistics which serve to describe data (either raw or processed) are called descriptive statistics. Statistics that are used to draw information about other groups of data than the data you have are called inferential statistics. It’s common for people to try to make sweeping statements about a large group by looking at a smaller group drawn from it, and there are ways of doing that so that those sweeping statements are warranted. People also look at groups together in order to find out more information than could be derived from just one of the groups alone. In these cases statisticians are making inferences about one group from what they see in another group.

Descriptive statistics are certain. If you calculate the average of a group of numbers today and you calculate the average of the same group tomorrow, it will be the same (assuming, of course, that you did it right both times.) The average is unchangeably the average. But inferential statistics always carry the caveat: “This result is probable at the xyz level of probability.” A commonly asked question in statistics is whether two groups are from a single population or from different populations. It can never be said with absolute certainty whether two groups are the same or different, but we can very often assign a probability to whether they are different or not.
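As a small illustration of that certainty, the Python sketch below computes a few descriptive statistics for the four Dow Jones values quoted earlier and then does it again. The function name `describe` is made up for this example.

```python
from statistics import mean

# The four 1991 Dow Jones values listed earlier as an example of data.
dow_1991 = [2730.69, 2830.69, 2934.65, 2889.36]

def describe(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "sum": sum(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

today = describe(dow_1991)
tomorrow = describe(dow_1991)

print(today)
print(today == tomorrow)  # same data, same description
```

Nothing in the calculation involves chance, so the two runs cannot disagree. An inferential statement about, say, the whole year's market based on these four values would be a different matter.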

Data tell stories and statistics are the stories they tell. One of my favorite books on statistics is A Casebook for a First Course in Statistics and Data Analysis, by Samprit Chatterjee, Mark S. Handcock, and Jeffrey S. Simonoff. It contains a lot of stories about Old Faithful, and exploding space shuttles, vineyards and discrimination (or maybe not – that one is a mystery). Or look at this excellent website http://lib.stat.cmu.edu/DASL/. Data and stories. The stories are statistics.

Reading, writing, and arithmetic – the why.
Public schools have taught these subjects – the Three R’s – for a long time. The usual justification is to create well rounded citizens. But why these three subjects particularly? There are certainly other very foundational subjects – history, health, geography, government – so why are the Three R’s always mentioned as the core of education? And why should statisticians value these subjects?

I have an answer.

Humans are, at base, social animals and, if their education is to reflect this very center of human nature, it needs to very fully prepare them to be good communicators; therefore, reading and writing for literacy, and arithmetic for numeracy. For the human race to progress, people need to be able to effectively communicate their ideas.

I have said that statistics are communication. The job of statisticians is to mine the world for information, elucidate the connections between the information, make sense of them, and communicate what they find to their employers, clients, and/or the world. To do that, it is critical that they be able to effectively write reports and present their information verbally. Of course, much of what they have to communicate is mathematical but there is one more point for mathematics.

Statisticians are very centrally problem solvers. There is no better problem solving workshop than mathematics. Word problems provide students with a vast variety of problems to solve and give them many opportunities to see the many modes of attack on problems.

But a statistician’s education really needs to be much more well rounded than that. They really need to know a little about everything because they never know what the content of their next project will be. The Casebook by Chatterjee, Handcock, and Simonoff mentioned above goes from geology to policy to finance to the entertainment industry to manufacturing and back to policy, just in the first chapter. Can you think of a more exciting field? The whole world would be your workshop!

The ethics of statistics – Lies, Damn Lies, and Statistics.
“Figures often beguile me,” he wrote, “particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’” That was Twain, but there’s no evidence that Disraeli ever said it. Still, when Twain wrote about lies, he included himself in the list of the perpetrators. Statistics help us to lie to ourselves. For instance, before 1997, it was commonly believed that stomach ulcers are caused by stress, even though it had been fairly well proven in 1982 that the primary cause was Helicobacter pylori bacteria. There were reasons for believing the lie.

There were experiments in the 1950s with monkeys and shocks that seemed to definitively point toward stress as the culprit. The results were very significant – statistically.

The moral – you can have statistics, but you still have to interpret them.

Chatterjee, Handcock, and Simonoff (in A Casebook for a First Course in Statistics and Data Analysis) tell a story. (Remember, statistics are stories.)

Since the 1950s, the United States has been captivated by the plight of minorities. Civil rights is still a big issue in Washington. School integration is still a big issue. So, in 1994, statistics were applied to the Nassau County School system (Long Island, New York) to see if their schools were racially balanced – if the school populations reflected the racial mix of the communities. The data was treated as a binomial distribution (white student=p, minority student=q). And it was found that the distribution didn’t look very biased. The distribution had a long tail with four school districts showing as outliers (the white student populations were far below where they “should have been”).

Had the story stopped there, the government would have urged the schools to correct the situation. But wait! (as the late Paul Harvey would say.) Here’s the rest of the story.

When the researchers looked at how the school districts had been set up, they found that a slight shift in geographic boundaries would greatly reduce the imbalances. They had originally tried to model the racial constituents of the school systems as a binomial distribution and it turned out that the distribution was more complicated and, when they took the complexity into consideration, there was little racial imbalance. If they had stopped with the first run of statistics, the government would have been trying to force Nassau County schools to address issues that did not exist.

Statisticians are professionals. The well being of real people is contingent on their expertise. Their errors can be destructive. Ethics, in other words, is as important an issue with statisticians as it is with doctors, politicians, lawyers, or any other professionals.

And, again, it is important that everyone be a statistician, at least at the basic level where they can rationally interpret statistics for themselves. That way, statistics won’t be in the super-category of lies.

Sampling

In 1936, the Literary Digest carried out a poll of its readership to determine who would win the upcoming presidential election. They surmised from the results that the Kansas governor, Alfred Landon, would win by a landslide. Do you remember a president named Alfred Landon? No? Well, that would be because Franklin Delano Roosevelt carried 46 states in that election. It was a disastrous defeat for Alfred Landon and it was a disastrous defeat for the Literary Digest. The magazine was discredited and folded within two years.

What happened? Many of the Literary Digest’s subscribers were from the only two states that Landon carried – Maine and Vermont. Even though the sample contained about 2.4 million respondents, they were mostly well educated people with incomes well above the national average (they could afford a literary magazine subscription), and the sample members that were not subscribers to the Literary Digest were from two other convenient sources – automobile registrations and telephone lists. Not everyone back then could afford automobiles or telephones.

George Gallup’s American Institute of Public Opinion poll used a sample of only 50,000 people and correctly called the election for Roosevelt. So Literary Digest was down and the Gallup Poll was up (I’m sure you’ve heard of the Gallup Poll – this was where it really got its start.)

So statistics can make or break a company. What was right about Gallup’s Poll? He made sure that his sample was representative of the voting population. He studied what the typical voting American looked like and made sure that his sample looked like that.

Look at the research that has been done to date concerning the Were community. Most of the samples have been from online forums. So the results can be generalized to people who use the Internet for social interaction and who identify as Were/Therian (not, mind you, people who necessarily are Were/Therian). Can such a sample possibly be representative of the Were community, which also includes a vast body of people who do not care to use the Internet for social interaction?

Are the researchers wrong? Not necessarily. They’re doing the best that they can at this time with what they have available and, as long as they keep in mind the weaknesses of their results, they’re learning how to deal with their particular research problems and finding productive avenues for future research.

There are several issues involved with sampling. Representativeness is only one. To achieve representativeness, you need to be able to draw a random sample from the whole population, meaning that every member of the population has the same chance of being picked for the sample. Barring that (and much less preferably), a sample can be constructed so that the individuals drawn look as much as possible like the typical population member. If you can manage to construct such a model person, you can make sure that people are drawn that look like that model person. For instance, if you think that gender is going to be an issue, you can make sure that the proportion of different genders in the sample is the same as in the population – this is called matching and, for it to work, you have to be reasonably sure that your model actually is representative of the population.

Another issue is size. I can generate random numbers with my OpenOffice Calc spreadsheet that come from a normal process. We’ll be talking about random distributions later but, for now, I’ll say that the bell curve (that you may have been graded on in school) is a graph of a normal distribution, that it is a very common pattern of probability in nature, and that most traditional statistics assume that the data they process are from normal distributions. If I generate 10 normal deviates (that’s what a random number from a particular distribution is called), the resulting distribution will not look normal at all. Data tends to cluster around a central point and, if there are only a few values, they are much more likely to be close to the central value than further away, so the tails of the bell barely show up. Also, error and random variation can greatly alter a small set of data. If I generate 100 normal deviates, their distribution will look a little more normal. If I generate 1000 data values, they will probably look normal.
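The same experiment can be tried outside a spreadsheet. This is a rough Python sketch, assuming a standard normal distribution (mean 0, standard deviation 1) and a fixed seed so the run can be repeated; the function names and the six-bin layout are choices made for this example only.

```python
import random
from statistics import mean, stdev

def normal_sample(n, seed=1):
    """Draw n deviates from a standard normal distribution."""
    rng = random.Random(seed)  # seeded so the run can be repeated
    return [rng.gauss(0, 1) for _ in range(n)]

def bin_counts(values, low=-3.0, high=3.0, bins=6):
    """Count values per equal-width bin; values outside [low, high) are skipped."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v < high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

for n in (10, 100, 1000):
    sample = normal_sample(n)
    print(n, round(mean(sample), 2), round(stdev(sample), 2), bin_counts(sample))
```

With 10 values the bin counts are usually lopsided and the outer bins are often empty; by 1000 the counts tend to rise toward the middle and fall away on both sides, which is the bell shape.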

In short, a sample needs to be a certain size before it can show what the distribution it was drawn from looks like, and the name of the game is figuring out what a huge population looks like by looking at a smaller sample taken from it. There are formulas that are available to determine the size sample that is needed to study a particular sized population with a certain distribution pattern.
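One widely used formula of this kind is Cochran's formula for estimating a proportion, with a correction for a finite population. The text does not say which formulas are meant, so take this Python sketch as one example under stated assumptions: a 95% confidence level (z = 1.96), a 5% margin of error, and the most cautious guess of p = 0.5 for the proportion being estimated.

```python
import math

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample size for a proportion, with finite population correction.

    population: number of individuals in the population (None for "very large")
    margin:     acceptable margin of error, as a fraction
    z:          z-score for the confidence level (1.96 is roughly 95%)
    p:          assumed proportion; 0.5 gives the largest, safest answer
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for size in (500, 10_000, 1_000_000):
    print(size, sample_size(size))
```

Notice how slowly the required sample grows once the population gets large; that is why a national poll does not need millions of respondents, provided the sample is representative.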

Another issue is whether a sample is actually from one homogeneous population or whether the population is composed of more than one subgroup. One of the problems the National Park Service had in trying to predict the times of the Old Faithful Geyser in Yellowstone National Park was the fact that there were actually two kinds of eruption and, when they started looking at the data, it was obvious that there were two kinds of data present. When they separated the two kinds of data, they began to get a much more accurate picture of how Old Faithful works.

Losing information

Ironically, although statisticians make a living collecting data, everything they do to data causes them to lose information.

Reification is a constant temptation for researchers and statisticians alike. I’ve read many good books on research methods and a constant reminder is present, “Never forget, the data is not the thing.” You can’t capture the entire event. You have to pick and choose what to record and hope that you got most of the relevant data. To forget that is to misapprehend the complexity of reality.

Data is messy, but as soon as you table it, you’ve forced an unnatural order onto it. Real life information doesn’t come neatly packaged in tables, so you’ve lost the natural ordering right up front.

When you create a frequency table, you only retain the midpoints of intervals of data. From those, you can retrieve fair estimates of summary statistics like means and medians and standard deviations but one thing you’ve lost is the actual minimum and maximum. You have lost the endpoints of your data set.

Another issue is level of measurement. Qualitative measurements are measurements that are recorded as words – red, green, fast, slow, big, small – and the way they are usually quantified is by counting them – there are 5 red flowers and 3 green ones. So there’s a lot of overlap between qualitative and discrete data (data that can only take certain values within an interval), but they’re not necessarily the same thing. If you are measuring the size of particles of sand using a series of grids, your measurements will be discrete since you only have a few sieve sizes.
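Here is a small Python sketch of that loss. The ten raw values are made up for illustration, and the interval width of 10 is an arbitrary choice.

```python
from statistics import mean

# Made-up raw values, for illustration only.
raw = [12, 15, 17, 21, 22, 24, 28, 31, 35, 39]

def frequency_table(values, start, width, bins):
    """Return (midpoint, count) pairs for equal-width intervals."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return [(start + width * (i + 0.5), c) for i, c in enumerate(counts)]

def grouped_mean(table):
    """Estimate the mean from midpoints and counts alone."""
    total = sum(count for _, count in table)
    return sum(mid * count for mid, count in table) / total

table = frequency_table(raw, start=10, width=10, bins=3)
print(table)                 # midpoints and counts: all that the table keeps
print(mean(raw))             # mean of the raw values
print(grouped_mean(table))   # estimate recovered from the table
print(min(raw), max(raw))    # endpoints that the table no longer contains
```

The estimated mean lands close to the real one, but nothing in the table can tell you that the smallest value was 12 and the largest 39.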

The other kind of data is quantitative – the measurements are in the form of numbers with units – five feet and 6 inches tall, 175 pounds, 36 miles per hour. And quantitative data is usually (but not always) continuous. With continuous data, you cannot name two different numbers that are so close together that you can’t fit another number between them.

Traditionally, there is a kind of hierarchy of data. There is nothing sacred about the structure and it has been challenged recently, but it’s still useful as a way of conceptualizing kinds of data.

At the bottom, there are two classes of qualitative data – nominal and ordinal. Nominal data is “nominal” because it is collected as words – names. Male/female, red/green/blue, tall/short. There are forms of qualitative analysis that never use numbers, but most analytical procedures for nominal data use counts. “5 out of 15 men responded poorly to a new drug but none of the 15 women had adverse reactions.”

Ordinal data, on the other hand, shows some order (hence, the name). Regardless of past attitudes, you can’t really assign a precedence for gender. Men are not more than women or vice versa. But you can assign an order to birth order – the first child came first, the second came second, and so on. The classical example of ordinal data is the Mohs hardness scale for minerals. Gypsum (2) is a little harder than talc (1) but diamond (10) is far harder than corundum (9). In fact, although corundum is twice as hard as topaz (8), diamond is four times as hard as corundum. The scale, in other words, does not have even intervals. Most famous scales are like this – most of the scales psychologists use, pain scales, the Lenowill shifting scale, etc.

Some scales, though, are very mathematical; for instance, the pH scale in chemistry is determined by the concentration of free protons in a solution. The “top” two kinds of data, interval and ratio, are definitely quantitative and continuous. The difference between interval and ratio data is that, although they both have defined intervals, ratio measurements are referred to an absolute zero. That’s why scientists have moved away from the metric Centigrade scale for temperature in favor of the Kelvin scale.

On the Centigrade scale, 40 degrees is not twice as hot as 20 degrees. In order to make ratio comparisons like that you have to have a “bottom point” on the scale that all the other measurements relate to. There is no absolute zero point on the Centigrade scale. But there is on the Kelvin scale. When Kelvin says “absolute zero”, he means it. Since temperature is molecular motion, when all molecular motion stops, there is zero temperature and the zero point on the Kelvin scale is where all molecular motion stops. That’s why you can say that 40 degrees on the Kelvin scale is twice as hot as 20 degrees.
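The arithmetic is easy to check. This short Python sketch converts the two Centigrade readings to the Kelvin scale (by adding 273.15) and compares the ratios.

```python
def celsius_to_kelvin(celsius):
    """Convert a Centigrade (Celsius) temperature to the Kelvin scale."""
    return celsius + 273.15

# Ratio taken naively on the Centigrade scale, which has no true zero.
ratio_celsius = 40 / 20

# Ratio of the same two temperatures measured from absolute zero.
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(ratio_celsius)
print(round(ratio_kelvin, 3))
```

Measured from absolute zero, 40 degrees Centigrade is only about 7% hotter than 20 degrees Centigrade, nowhere near double.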

The point is that, as you move downward through the kinds of data, you get less and less information. At the top, you retain an absolute lower limit, equal (or mathematically definable) intervals, order, and distinction between categories. At the interval level, you lose your zero point. At the ordinal level you lose the definable intervals, and at the bottom, all you have left is the names of the categories.

It is a tenet in research and statistics that, if you can plausibly measure what you are observing at an interval level, you don’t measure it at the ordinal level – you use as high a level of measurement as you can.

Sometimes, you have to measure a quantity as discrete data when you are actually recording continuous values. For instance, depression is continuous. If you observe two depressed people, one is always going to be more depressed than the other and you can always find someone that is less depressed than one but more than the other, but if you have seen the way that people measure depression, you will know that it’s almost always scales and is always ordinal or nominal. Why? Because psychologists have to rely on subjective, self-reported data, which will never be precise enough to allow for interval data.

It is also sometimes convenient to have both the highest level data you can get and then convert it to a lower level – but never, ever throw away the higher level data. You never know when you will need it. You can always measure anthropometric data on an interval or ratio scale – it’s all inches and centimeters. But sometimes, especially when you’re just trying to get a feel for the data, you may want to just divide your subjects into “tall”, “medium”, and “short”. Sometimes there are statistics designed for a lower level of measurement that would shed a different, useful light on your data or a procedure that would just be fun to use.
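In practice that means adding the lower-level category alongside the measurement instead of replacing it. A minimal Python sketch, with made-up heights and arbitrary cutoffs chosen only for illustration:

```python
# Made-up heights in centimeters, for illustration only.
heights_cm = [152.0, 160.5, 171.2, 178.9, 185.4, 193.0]

def height_category(cm, short_below=165.0, tall_from=180.0):
    """Convert a ratio-level height into an ordinal label (cutoffs are arbitrary)."""
    if cm < short_below:
        return "short"
    if cm >= tall_from:
        return "tall"
    return "medium"

# Keep the original measurement next to its label; nothing is thrown away.
records = [(cm, height_category(cm)) for cm in heights_cm]

for cm, label in records:
    print(cm, label)
```

Going from centimeters to labels is easy. Going back from “medium” to 171.2 is impossible, which is why the raw column stays.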

I guess what I’m saying is that, in collecting data you always lose information – that’s unavoidable – but you never give ground.

Randomization

Rarely will anyone study a whole population directly. Usually, they draw a smaller sample from the population and study the sample, and they will see if it makes sense to generalize from the sample to the population. Whether they can do that usually depends on the size of the sample and whether the sample looks like the population it was drawn from. A big factor in whether the sample will look like the population is the way it is selected.

First, there’s the population, such as the whole population of a big city. Then, there’s a sampling frame – something that the sample can be physically drawn from, such as a telephone book of the city. Notice that the telephone book might miss some people: homeless people, people living temporarily with relatives, poor people who can’t afford a telephone, etc. Finally, there is the sample – the individuals who are actually studied.

And how are they drawn? The best way is randomly. The hope is that if individuals are drawn randomly, all the relevant factors will be accounted for and all the irrelevant factors will cancel out – by chance. If the sample is large enough and is drawn so as to be a cross section of the population, that is usually a reasonable assumption.

So, what is random? By definition, a random sample is a sample drawn so that every individual in the population has the same chance of being picked. In reality, we try to get as close to that as possible. To draw a random sample, each individual in the sampling frame is assigned a different number, 1 through whatever. Then random numbers are generated in the range of 1 to the largest number in the list of individuals. As they are generated, the individuals with those numbers are selected and taken out of the pool (that is called sampling without replacement – occasionally individuals are chosen with replacement – they have the opportunity to be selected more than once). Drawing stops when the planned number of individuals are drawn.
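The procedure just described can be sketched in a few lines of Python. The sampling frame here is simply the numbers 1 through 100 standing in for numbered individuals, and the function name `draw_sample` is made up for this example.

```python
import random

def draw_sample(frame, n, replace=False, seed=None):
    """Draw n individuals from a sampling frame, each with an equal chance.

    replace=False: a drawn individual leaves the pool (without replacement).
    replace=True:  a drawn individual may be picked again (with replacement).
    """
    rng = random.Random(seed)
    if replace:
        return [rng.choice(frame) for _ in range(n)]
    return rng.sample(frame, n)

# Individuals numbered 1 through 100 stand in for a real sampling frame.
frame = list(range(1, 101))

without_replacement = draw_sample(frame, 10, seed=42)
with_replacement = draw_sample(frame, 10, replace=True, seed=42)

print(without_replacement)
print(with_replacement)
```

Without replacement, the ten numbers are always distinct. With replacement, repeats are allowed, although in a frame of 100 they will not show up in every run.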

Now, there may not really be such a thing as a perfectly random number. Everything in the universe may be determined by complex causes and effects and, if so, everything has a different chance of occurring; therefore, even what we call “random numbers” may have different probabilities of occurring, but there are near random processes that give something like random numbers – actually “pseudo-random” numbers.

But… but, is there such a thing as randomness? At an atomic level, yes. Quantum physics is not deterministic; it’s statistical. It is entirely possible that every air molecule where you are may find itself at one side of the room you’re in, in which case, you’re in trouble, but don’t worry, the chances of that happening are very, very tiny. Radioactive isotopes decay by splitting off pieces. We have a pretty good idea why they do so but we have no idea why they do so when they do. We do know that, in 20.334 minutes a piece of matter with 15 grams of carbon-11 will only have 7.5 grams of carbon-11. 7.5 grams will have decayed into some other element. But there is absolutely no way of knowing which atom is going to pop. If there is some determinate mechanism for the decay of a radioactive atom, it’s nothing we will ever be able to detect.

When I was in Pharmacy school, we had to buy a book that was nothing but a list of the digits of pi. Even now, I have a book of 1000 pages, each page containing 10,000 digits of pi. The way you use such a book is, flip to a “random” page, then put your finger down somewhere on the page. That gives you a block of digits – 10 by 5 digits. You can, then, drop a pencil point into the block to get your number (or the start of a number if you need a multi-digit number) or you can throw a 10- and then a 6-sided die (yes, we had RPGs when I was in college). Tedious, tedious. Today we have computers that can easily throw you a random number of as many digits as you want. The RAND() or RND() or RAN() functions of computer programs or spreadsheets such as Microsoft Excel or OpenOffice Calc will usually return a random number between 0 and 1. Just multiply that by 10 and take the integer portion of the result and you have a random whole number between 0 and 9. Over the long run, these random number generators do not play favorites – each number has about the same chance as any other number of appearing.
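One way to see what “pseudo” means is to start a generator twice from the same seed. This Python sketch uses the standard library's generator; the seed values are arbitrary.

```python
import random

def pseudo_random_run(seed, count=5):
    """Return the first few numbers a generator produces from a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

first = pseudo_random_run(seed=2024)
second = pseudo_random_run(seed=2024)   # same seed as the first run
third = pseudo_random_run(seed=2025)    # different seed

print(first == second)  # the whole sequence repeats exactly
print(first == third)
```

The numbers behave like random numbers for most statistical purposes, but they are completely determined by the starting seed.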
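The predictable part, how much is left after a given time, follows the usual half-life formula: amount remaining = starting amount × (1/2)^(elapsed time / half-life). A short Python sketch using the carbon-11 figures quoted above:

```python
HALF_LIFE_MIN = 20.334  # carbon-11 half-life in minutes, as quoted in the text

def remaining_grams(initial_grams, minutes, half_life=HALF_LIFE_MIN):
    """Amount of the isotope left after a given number of minutes."""
    return initial_grams * 0.5 ** (minutes / half_life)

for minutes in (0, 20.334, 40.668, 60):
    print(minutes, round(remaining_grams(15, minutes), 3))
```

The formula says how much of the sample will be gone, and says nothing at all about which atoms go. That is the statistical part.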

Dice, coins, wheel spinners – they’ve all been used.

So, how good is Calc’s RAND() function? If it does the job that it purports to do well, then every value it spits out has the same probability of being spit out as any other value. A histogram graphs the number of values within each interval from the smallest to the largest value. If, in fact, each value has the same chance of appearing, the resulting graph should be flat on top. You can do the experiment yourself by placing =RAND()*10 (it works for Excel, too) in a column of 1000 cells and using the FREQUENCY function to sort out how many values are between 0 and 1, 1 and 2, and so on. Then make a bar graph of the frequencies. Start with the frequencies for the first 10 values. Then make another bar graph for the first 100 values, then the first 500, and, finally, all 1000. If you don’t want to do the work, here are the results I got:

The distribution you’re after is called a “uniform continuous distribution” because any real number is possible. For a uniform discrete distribution, you could use the spreadsheet formula =INT(RAND()*10). How do the histograms look? Are they flat on top? What happens as the sample sizes get larger? Does this simple experiment give you an idea of why larger samples are better than small samples?
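If you would rather not build the spreadsheet, the same experiment can be approximated in Python. This sketch uses Python's own generator, not Calc's RAND(), so it says nothing about Calc itself; the seed is arbitrary and only there so the run can be repeated.

```python
import random

def bin_counts(values, bins=10):
    """Count how many values fall in [0,1), [1,2), ... up to [bins-1, bins)."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v), bins - 1)] += 1
    return counts

rng = random.Random(7)  # arbitrary seed so the run can be repeated
values = [rng.random() * 10 for _ in range(1000)]  # like =RAND()*10 in 1000 cells

for n in (10, 100, 500, 1000):
    counts = bin_counts(values[:n])
    print(n, counts)
    # A crude bar for each interval, scaled so the longest bar is 40 characters.
    for i, c in enumerate(counts):
        print(f"  {i}-{i + 1}: " + "#" * round(40 * c / max(counts)))
```

With 10 values the bars are ragged and some intervals may be empty; as more values are included the bars tend to even out.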

There was an ongoing discussion as to why OpenOffice doesn’t improve it’s random number generation. If there ever was a problem with RAND, it seems to have been resolved. Samples of a reasonable size seems to be quite acceptably uniform.

The world is real and, therefore, measurements of the world are real – real numbers, including numbers that can’t be represented exactly with binary numbers (which is what computers think with). Use a complex routine with lots of iterations and or divisions and let the spreadsheet round the results a few times and you’re bound to lose values, most noticeably at the end of intervals.
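A classic demonstration, sketched in Python: one tenth cannot be written exactly in binary, so adding it up ten times does not quite reach 1.

```python
import math
from decimal import Decimal

# Add one tenth to itself ten times.
total = sum([0.1] * 10)

print(total == 1.0)             # the sum misses 1 by a hair
print(Decimal(0.1))             # the value the computer actually stores for 0.1
print(math.floor(total))        # the interval the sum falls into
print(math.isclose(total, 1.0)) # comparing with a tolerance instead
```

A value that ought to sit exactly on the boundary between two intervals ends up just below it and gets counted in the lower one. That is the kind of loss at the ends of intervals mentioned above.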

So, is a random number generator possible (not a pseudorandom number generator – a random number generator)? At an atomic level – sure. All you have to do is sense the decay of each atom of a radioactive element as it pops – or sense the hit from background cosmic radiation – or use an unstable semiconductor device. I just looked one up on Amazon and it costs less than $50.

The moral is that, if you’re doing serious statistical work with serious consequences – know the weaknesses of your equipment and, if you wonder about where inaccuracies might crop up, test the routines by feeding in known data to see if your results match the known results.

Simple sampling techniques

The simplest form of sampling takes each member of a sampling frame in turn and draws it if it meets some random criterion, continuing until a predetermined number of individuals has been picked for the sample. There are several ways to do that.

I programmed a routine to perform this procedure for me in OpenOffice. It takes randomly selected individuals from a sampling frame stored on a spreadsheet called “ToSample” and transfers them to a spreadsheet called “Data”. It will sample with or without replacement. Here it is.

Sub Draw
'Draw selects n2 individuals out of n1 with equal probability
'Inclusion probability of ith element is n2/n1 if w/o replacement
'1-(1-1/n1)^n2 for w/replacement
'Sampling weight of ith element is n1/n2.
'Sample from ToSample to Data
'f is sample fraction if <=1; otherwise, number drawn.
Dim f as double, n1 as integer, n2 as integer, w as double
Dim aDoc as object, Sheet1 as object, Sheet2 as object
Dim oCell1 as object, oCell2 as object, oView as object
Dim o as integer, p as integer, rFlag as boolean
Dim Trans(), i as integer, j as integer, k as integer
Dim U as double, t as variant, Hit() as integer

f=Val(InputBox("Enter sample fraction or number of individuals to draw","Draw"))

'Select and identify the end cells of the database.
'(CurrentRange and RangeAddr are defined outside this listing.)
CurrentRange(False)
Redim Trans(1,RangeAddr(1,3)-RangeAddr(1,1))

'Set up initial point for loading the sample set on Data sheet
'starting at A1
aDoc = ThisComponent
oView=ThisComponent.getCurrentController()
Sheet1=oView.getActiveSheet()
Sheet2=aDoc.getSheets.getByName("Data")
'oCell1=Sheet1.getCellByPosition(0,0)
'oCell2=Sheet2.getCellByPosition(0,0)

'Keep track of next position with (o,p)
'Initialize o,p
o=0
p=0

rFlag=Val(InputBox("If sampling is to be with replacement, enter 1; otherwise, enter 0","Draw"))
    'If f<=1 then n=round(f*N), if f>1 then n=f
    n1=RangeAddr(1,4)-RangeAddr(1,2)+1
    if f<=0 then
        msgbox "Sample fraction must be between 0 and 1; number of individuals must be an integer greater than 1"
        exit sub
    elseif f<=1 then
        n2=int((INT(f*100)/100)*n1)
    else
        n2=int(f)
    end if

'If without replacement...
if rFlag=0 then

    'Set i=0 and start data scan
    i=0
    'For k=1 to N
    for k=1 to n1

'Get the kth population unit. If no more
'population units, then terminate

        'Test if kth unit should be drawn. Generate a uniform
        'random number (0,1) U
        U=rnd()
        'if (n-i)/(N-k+1)>U select kth unit and set i=i+1
        if (n2-i)/(n1-k+1)>U then

            'Read the kth row into Trans
            for j=0 to ubound(Trans,2)-1
                oCell1=Sheet1.getCellByPosition(j,k)
                Select Case oCell1.Type
                    Case com.sun.star.table.CellContentType.EMPTY
                           Trans(1,j+1)=0
                    Case com.sun.star.table.CellContentType.VALUE
                           Trans(1,j+1)=oCell1.Value
                    Case com.sun.star.table.CellContentType.TEXT
                           Trans(1,j+1)=oCell1.String
                End Select
            next j
            'Write Trans to the next free row of the Data sheet
            for j=0 to ubound(Trans,2)-1
                oCell2=Sheet2.getCellByPosition(j,i)
                if IsNumeric(Trans(1,j+1)) then
                    oCell2.Value=Trans(1,j+1)
                else
                    oCell2.String=Trans(1,j+1)
                end if
            next j
            i=i+1
        end if

        'if i=n, terminate. Otherwise...
        if i=n2 then exit sub
    'next k
    Next k
'If with replacement then...
else
    'Initialize all hit counts to zero

    'Hit count matrix is 1xN
    Redim Hit(1,n1)
    for j=1 to n1
        Hit(1,j)=0
    next j
    'For i=1 to n (one pass per draw)
    For i=1 to n2
        'Generate an integer k between 1 and N uniformly
        k=int(Rnd()*n1)+1
        'Increase hit count of kth population unit by 1
        Hit(1,k)=Hit(1,k)+1
    'next i
    Next i
    'select all units with hit counts greater than zero.
    t=0
    For i=1 to n1
        if Hit(1,i)>0 then
            for j=0 to ubound(Trans,2)-1
                oCell1=Sheet1.getCellByPosition(j,i)
                Select Case oCell1.Type
                    Case com.sun.star.table.CellContentType.EMPTY
                           Trans(1,j+1)=0
                    Case com.sun.star.table.CellContentType.VALUE
                           Trans(1,j+1)=oCell1.Value
                    Case com.sun.star.table.CellContentType.TEXT
                           Trans(1,j+1)=oCell1.String
                End Select
            next j
            for j=0 to ubound(Trans,2)-1
                oCell2=Sheet2.getCellByPosition(j,t)
                if IsNumeric(Trans(1,j+1)) then
                    oCell2.Value=Trans(1,j+1)
                else
                    oCell2.String=Trans(1,j+1)
                end if
            next j
            t=t+1
        end if
    Next i
end if
End sub

When sampling without replacement, the routine looks at each individual in turn and generates a random number for it. It then takes the number of individuals still needed to fill the sample, divides it by the number of items left in the sampling frame (counting the one being looked at), and compares that ratio to the random number – if the ratio is larger, that individual is selected and copied to the “Data” spreadsheet. It keeps going like that until all the individuals needed for the sample are picked.
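The same scan can be sketched in a few lines of Python. This is an illustration of the idea rather than a translation of the macro: it works on a plain list instead of spreadsheet rows, and the function name is made up for the example.

```python
import random

def draw_without_replacement(frame, n, rng=random):
    """Scan the frame once, keeping each item with probability
    (still needed) / (still to be looked at, including this one)."""
    sample = []
    total = len(frame)
    for k, item in enumerate(frame):        # k items have already been examined
        needed = n - len(sample)
        remaining = total - k
        if needed / remaining > rng.random():
            sample.append(item)
        if len(sample) == n:
            break
    return sample

print(draw_without_replacement(list(range(1, 21)), 5))
```

When the number still needed equals the number remaining, the ratio is 1 and every remaining item is taken, which is why the scan always ends with a full sample.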

When you start the routine, it will ask you for a number to tell it how many individuals to pick. (If you install it into your OpenOffice spreadsheet, you’ll need to assign it to a toolbar button or menu command so you can start it up.) If the number you give it is 1 or less, it will return a sample which is that fraction of the sampling frame. If the number is greater than 1, it will return a sample with that number of individuals.

If you sample with replacement, the routine generates a random number between 0 and the number of items in the sampling frame. The integer portion of the random number picks out an item, and that item gets a point. The routine repeats that once for each individual the sample calls for, then transfers all the items with at least one point over to the “Data” page. Because an item can be hit more than once, the page may end up holding fewer distinct items than the number you asked for.
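A rough Python sketch of the hit-count idea follows. Unlike the macro, it returns each item together with its hit count so that repeated draws stay visible; that extra detail, and the function name, are assumptions made for the illustration.

```python
import random

def draw_with_replacement(frame, n, rng=random):
    """Make n independent draws; return (item, hits) for every item hit at least once."""
    hits = [0] * len(frame)
    for _ in range(n):
        k = int(rng.random() * len(frame))   # uniform index 0 .. len(frame)-1
        hits[k] += 1
    return [(item, h) for item, h in zip(frame, hits) if h > 0]

print(draw_with_replacement(list("abcdefghij"), 6))
```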

In both cases, each item in the sampling frame has the same chance of being picked as any other item. It’s very democratic. What you end up with is one group of items selected from the sampling frame. You’d better hope that your sampling frame really is representative of the population it is meant to represent.

There are more complex sampling procedures that give you more control over your sample.

Complex samples

Beyond just picking random items out of a sampling frame, you have many more options. They increase the complexity of the sample and, therefore, what you have to do to analyze the sample, but they also give you much more control over your sample and allow you to look into the finer detail of the population that you’re studying.

One thing you can do is take a sample of a sample you took from the sampling frame. You can do that as many times as you want. You might want to do that if you’ve taken more than one group of data from a huge sampling frame and you don’t want to go back into the sampling frame to get more samples. For instance, you may have sampled several cities to take a survey. You may not want to survey the entire cities (New York, for instance). You can divide the city up into regions and sample those regions, or you may even want to divide the regions into blocks and take a sample of blocks, and so on.

Multi-stage sampling is not usually as accurate as simple random sampling, but the small loss in accuracy may be worth it because it is much more convenient, quicker, and less costly, and it typically improves the accuracy of another form of probability sampling – cluster sampling.

Cluster and stratified sampling are similar. The difference is that, in stratified sampling, you draw individuals from all the different layers you’ve divided the sampling frame into; whereas in cluster sampling, you only look at the particular clumps you’ve taken out of the sampling frame.

When cluster sampling, the population is divided into convenient, but pretty much homogeneous sections, such as cities or congressional districts, and a random sample of these smaller groups is selected; then individuals are randomly selected from these clusters. Cluster sampling is a way of making very large populations more manageable. Still, cluster sampling tends to cause more error in the observations than just about any other kind and it makes bias more likely. You still have to make sure that the clusters you select for sampling provide an ultimately representative sample of subjects for the study.

When using stratified samples, a population is divided into subgroups (strata) and samples are taken from all the subgroups. Stratification can be proportionate or disproportionate. In proportionate stratification, the sample size from a stratum is proportionate to the population in that stratum, so to find the size of a sample drawn from a stratum, a researcher will divide the number of individuals in the stratum by the number in the whole population being studied and multiply that by the total sample size.
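The proportionate rule is simple enough to sketch in Python. The strata and their sizes below are invented for illustration, and the function name is hypothetical.

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Sample size per stratum = stratum size / population size * total sample size."""
    population = sum(stratum_sizes.values())
    return {name: round(size / population * total_sample)
            for name, size in stratum_sizes.items()}

# Hypothetical strata, invented for illustration only
strata = {"urban": 6000, "suburban": 3000, "rural": 1000}
print(proportionate_allocation(strata, 200))
# {'urban': 120, 'suburban': 60, 'rural': 20}
```

With less tidy numbers, rounding each stratum separately can leave the pieces adding up to one more or one fewer than the total sample size, so the result is worth checking.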

Using special formulas, it’s possible to design disproportionate samples that optimize precision either for a specified cost or for a specified sample size.

So, how can splitting up your sampling frame give you a better sample? It’s complicated but I’ll see if I can give you part of the answer. Say, you take a very large population – the whole population of the United States, for example – and you take a random sample of that. Every person in the U.S. has an equal chance of being represented. That means that every person in Lickskillet, Tennessee has the same chance of being picked as anyone in New York, New York. That’s very democratic, but does it lead you to a representative sample of the population of the United States? Well, no, because a representative sample of the United States would weight people in New York more heavily than people in Lickskillet. You want people from Lickskillet, but, if you want the sample to look like the United States, you’ll have more people from New York. So you can break the U.S. up into regions and sample from each region so that the size of each sample is proportionate to the size of the population in that region compared to the population of the United States.

One kind of sampling that people often have problems grasping is probability-proportional-to-size sampling (or PPS sampling). It’s often used in business where sampled individuals (businesses) have very different influences on the population as a whole. In PPS sampling, a researcher can take the individual’s influence into account when sampling by assigning a “size” weight (that would be the size of the influence, not the size of the individual) to each individual in the sampling frame and then drawing samples proportionately based on the size weights. Perhaps it’s easier to see what’s going on if the influence that groups exert on the population is due to actual size, although it could just as easily be volume of sales or some measure of popularity. To figure out the probability of selection for each individual, the size measure of that individual is divided by the sum of the size measures of all the individuals.
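Here is a short Python sketch of that calculation. The sales figures are made up, the function names are hypothetical, and the draw shown is the simplest variant (with replacement), which is only one of several ways PPS samples are drawn in practice.

```python
import random

def pps_probabilities(sizes):
    """Selection probability of each unit = its size measure / sum of all size measures."""
    total = sum(sizes.values())
    return {unit: size / total for unit, size in sizes.items()}

def pps_draw(sizes, n, rng=random):
    """Draw n units with replacement, with chances proportional to size."""
    units = list(sizes)
    return rng.choices(units, weights=[sizes[u] for u in units], k=n)

# Hypothetical annual sales figures, invented for illustration
sales = {"A": 500, "B": 300, "C": 150, "D": 50}
print(pps_probabilities(sales))   # A gets 0.5, B 0.3, C 0.15, D 0.05
print(pps_draw(sales, 5))
```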

PPS sampling can be used with all kinds of complex sampling schemes to select strata, clusters, and individuals.

Tabling data

First stop as information becomes data is almost always a table. You observe a value, and you write it down or type it into a spreadsheet – it’s tabled. Used to, numbers in a table were almost unintelligible unless they were some sort of lookup table like a multiplication table, a logarithm table, or a table for unit conversion, so the first thing a statistician wanted to do was graph the values.

With the advent of the spreadsheet, suddenly, there are a lot of things that can be done to raw data in a table.

Continuous data gets tabled a little differently than discrete data. This time I’ll talk about the continuous data and save the discrete data (counts) for the article after next.

If there are holes in the data (missing values), the holes can be plugged up in various ways. Many statistics programs will do that almost automatically. The trick is to find the method that will work with the statistical techniques you want to use. There are three basic kinds of data imputation (that’s what the “plugging up the holes” is called) that might be used.

Sometimes you can simply remove cases (or variables) that have missing data, but more often than not, you can’t just say, “those cases aren’t important, anyway.” Usually, you record a case because it seemed important to record and just because a number gets smeared or a corner was torn from a lab book doesn’t diminish the importance of that data point.

Another strategy is to place (you can’t really say that you replace something that wasn’t there to start with) a number in the hole that looks like the other cases – something that won’t knock the analysis too far off base. That might be an average of some kind.

A more advanced strategy is to look at all the other data and try to figure out what would have been there had the measurement not been lost. I’m sure you remember interpolating logarithms and values of trig functions in high school. Interpolation, extrapolation and prediction can be used to figure out, given the data that is there, about what a missing value should be.
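The second and third strategies can be sketched in Python as follows. The readings are invented, missing values are marked with None, and the interpolation assumes the values were recorded in order at evenly spaced steps; all of those are assumptions for the example.

```python
def fill_with_mean(values):
    """Replace each None with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def fill_by_interpolation(values):
    """Replace each interior None with a straight-line estimate
    between its nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

readings = [10.0, None, 14.0, 15.0, None, None, 21.0]   # hypothetical measurements
print(fill_with_mean(readings))          # [10.0, 15.0, 14.0, 15.0, 15.0, 15.0, 21.0]
print(fill_by_interpolation(readings))   # [10.0, 12.0, 14.0, 15.0, 17.0, 19.0, 21.0]
```

The two methods give different answers for the same holes, which is the reason the choice has to suit the analysis that follows.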

Another thing you can do to tabled data is sort it. Once values are sorted you can immediately see the dispersion and center. The dispersion is how far the data varies and, in a sorted table, all you have to do is look at the maximum and minimum values (the first and the last in the table) to see that. The center is the value that all the other values orbit around. When we look at frequency distributions and histograms below, “center” will become much clearer. In a sorted table, the value right in the middle of the list is called the “median” and that is a common measure of centrality. In a sense, it anchors the data set.

Have you ever been in a group of people and noticed one individual that stuck out like a sore thumb – perhaps a person that was a head taller than everyone else? It sorta draws your eyes, doesn’t it? You start thinking things like, “Wow! I wonder if they play basketball.” Well, outliers do similar things to a set of data. Exceptional values – values that look like they don’t belong in a data set – can do some pretty drastic things to the outcome of statistical analyses. You have to figure out how you want to approach them because they may actually belong there. They may not just be a decimal point that was misplaced.

Again, if an exceptional value looks like an error, it may be appropriate to just drop that case or replace it with a more likely figure using one of the data imputation methods mentioned above. But, regardless, you certainly need to identify them at the start and keep track of them. It’s easy enough to do that with a spreadsheet. Most spreadsheets have what’s called “conditional formatting” and you can tell the program to let you know if a value goes beyond certain limits. Usually, you look at all the other values to figure out what a typical value looks like (later, you’ll learn about measures of dispersion like standard deviations and quantile ranges that give you an idea of how “well-behaved” data acts).
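One common way to set such limits, sketched here in Python, is to flag anything more than 1.5 interquartile ranges beyond the quartiles. That particular rule is a choice made for this example, not something the spreadsheet does for you, and the heights are invented.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

heights_cm = [160, 162, 165, 167, 168, 170, 171, 173, 175, 205]  # hypothetical
print(flag_outliers(heights_cm))   # [205]
```

Flagging is only the first step; whether the flagged value is an error or a genuine basketball player still has to be tracked down.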

Often, a statistician will analyze a set of data with and without outliers to see how the results are affected. They also track down the cases that produced the outliers to see why they are there.

One of the very powerful (and very complicated) features of modern spreadsheets is the pivot table. I don’t know why they are called “pivot tables” (Microsoft trademarked the single word “PivotTable”) but I suspect that it has something to do with the pivot methods used in linear algebra to work with matrices. Once you get used to working with your spreadsheet’s pivot table utility, you can easily restructure a data table to sort, filter, group, and summarize data by using your mouse to drag and drop whole collections of data from one place to another.

In brief, statisticians today can play with data before it’s even out of the table.

Counting data

In essence, all numbers are counts and all mathematics is about counting. If you’re not counting items, you’re counting units (as in measurements). But, in its purest sense (one, two, three, four, etc.), perhaps the simplest statistic is the count. It’s also a very important statistic, since counting is just about the only way to measure things at the nominal level, so let’s look at counting in some detail.

You would think that counting is as simple as going through a pile of things and saying, “one, two, three, four,…” as you come to separate items, but there are actually many counting procedures used.

I was once an observer for a study in which I had to estimate the age and weight of children in fast food restaurants and then I counted the numbers of bites of food and sips of drinks they took in 5 minute intervals (that’s much harder than you might think since children have such a penchant for blowing bubbles in their drinks). I had to do all that surreptitiously because, if their parents realized that their kids were being watched, they would have complained to the managers of the restaurants and that would have been the end of the study. The counting technique I used is called frequency counting or frequency recording.

In frequency counts, a tally is kept of some event during set time intervals. For short time periods, a sheet of paper ruled in a grid can be used – each cell of the grid is used for tallies in one time period. For longer time periods, a form in a table format might be appropriate.

In frequency counts, the time periods remain constant; in rate counts, the time period varies but the number of tallies remains constant. For instance, you might decide to see how long it takes for 10 vehicles to drive by your house. You start timing when one vehicle passes and count until the tenth vehicle passes; then you note the time. Then you might time several counts of 10. The time intervals, then, would be your recorded data.
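To make the contrast concrete, here is a Python sketch that turns one list of event times into both kinds of record. The times are invented, the function names are hypothetical, and the rate count assumes the clock starts at the first event of each batch and stops at the last.

```python
def frequency_counts(event_times, interval, total_time):
    """Tally events in each fixed-length interval (time fixed, count varies)."""
    n_intervals = int(total_time // interval)
    counts = [0] * n_intervals
    for t in event_times:
        i = int(t // interval)
        if i < n_intervals:
            counts[i] += 1
    return counts

def rate_times(event_times, batch):
    """Time taken for each successive batch of events (count fixed, time varies)."""
    times = sorted(event_times)
    return [times[i + batch - 1] - times[i]
            for i in range(0, len(times) - batch + 1, batch)]

# Hypothetical times (in seconds) at which vehicles passed
passes = [1, 4, 6, 11, 12, 19, 23, 27, 28, 35]
print(frequency_counts(passes, 10, 40))   # [3, 3, 3, 1]
print(rate_times(passes, 5))              # [11, 16]
```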

Duration recording measures the length of time it takes to complete a task or how long an event takes. That kind of recording is pretty common in sports where, for instance, runners’ times are often recorded.

Interval recording measures whether an event takes place within a set time interval. The total observation time is divided into equal time intervals and one count is made for any time interval in which the event occurred (regardless of the number of times it occurred during the time interval).

Interval recording requires the observer’s undivided attention; time sampling does not. For time sampling, the observation period is divided into equal intervals, but the event is only counted if it is occurring at the end of an interval. This technique is most accurate for events of long duration since there is a greater chance of “catching” the event each time it happens.
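The difference between the two methods shows up clearly when both are applied to the same observations. In this Python sketch the episodes are invented (start, end) times in seconds and the function names are hypothetical.

```python
def interval_recording(episodes, interval, total_time):
    """Score 1 for every interval in which the event occurred at any point."""
    scores = []
    for start in range(0, total_time, interval):
        end = start + interval
        hit = any(s < end and e > start for s, e in episodes)
        scores.append(1 if hit else 0)
    return scores

def time_sampling(episodes, interval, total_time):
    """Score 1 only if the event is under way at the instant an interval ends."""
    scores = []
    for end in range(interval, total_time + 1, interval):
        hit = any(s <= end <= e for s, e in episodes)
        scores.append(1 if hit else 0)
    return scores

# Hypothetical (start, end) times in seconds for each occurrence of the event
episodes = [(3, 5), (18, 31), (52, 53)]
print(interval_recording(episodes, 10, 60))   # [1, 1, 1, 1, 0, 1]
print(time_sampling(episodes, 10, 60))        # [0, 1, 1, 0, 0, 0]
```

Time sampling catches the long episode twice and misses both short ones, which is the sense in which it suits events of long duration.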

Anecdotal records or case observations can include various kinds of counts. Here, an observer simply observes an event and takes down a written account that contains everything that seems to be important.

Counts per time duration are ratios. They can easily be changed into percentages by multiplying by 100. So, if it rains 20 minutes out of an hour (20/60=0.333…), it rains 33.3 percent of the hour.

Ratios and percents are just parts of a whole, so, if 13 out of a hundred surveys were filled out, you have a ratio of 13/100 or 13%. The nice thing about percents is that they are comparable. If one student answers 15 out of 20 questions on a test correctly and another answers 17 correctly, the difference isn’t immediately obvious – 15/20 versus 17/20. But 75% versus 85% is much easier to understand at a glance.

Averages are also counts. When you read that there are 6,390 deaths per hour in the world, it doesn’t mean that every hour, 6,390 people die. It means that, if you had to make a guess as to how many people will die in a given hour – that would be your best guess. We’ll most certainly be talking more about averages later.

If you are interested in how long it takes something to happen (for instance, how long it takes you to get your hamburger and fries at the drive-in window of your favorite fast food restaurant), then you will want the latency time. So you drive up to the window and start your stopwatch. When the checker hands you your bag of food, record the time and that’s the latency time. (You might also check in the bag to make sure you got what you ordered but that’s not part of the latency data.)

Another special type of count recording is placheck (short for “planned activity check”). If you use placheck to see how many people are doing what they’ve been asked to do, you would set a time interval and, at the end of that interval, you would count the number of people busy on the activity and record the tally with the total number of people in the group. You might set up a series of equal time intervals and do a placheck at the end of each interval; that way, you can keep up with how long it takes to complete the activity.

Like all measurements, counting requires equipment. The most accessible equipment is probably your fingers. You have ten of them but you’d be surprised at how far you can count with just those ten fingers. If you let the fingers on your right hand have values of 1, the right thumb could be 5, the fingers of the left hand could be 10 and the left thumb could be 50, you could easily count to 99 on your two hands. If you count in binary (right little finger=1, next finger on the right hand=2, next finger=4, next finger=8, and so on), you could count to 1023, one less than 2 to the 10th power! It’s surprisingly easy to learn to do.
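Working both finger schemes through in Python shows where the limits come from. The function name is made up for the example.

```python
# Tally-style scheme: four fingers and a thumb on each hand
right_fingers, right_thumb = 4 * 1, 5
left_fingers, left_thumb = 4 * 10, 50
print(right_fingers + right_thumb + left_fingers + left_thumb)   # 99

def binary_fingers(raised):
    """Value shown by ten fingers, each worth a power of two.
    raised[0] is the right little finger (worth 1)."""
    return sum(2 ** i for i, up in enumerate(raised) if up)

print(binary_fingers([True] * 10))                          # 1023
print(binary_fingers([True, False, True] + [False] * 7))    # 5
```

Ten binary fingers give 1024 different patterns, but one of them is zero, so the largest count is 1023.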

But, you might be embarrassed to be caught counting on your fingers (school does that to some people). If so, there are mechanical and electronic counters out there. Just punch a button and the count advances by one. There are even counters that have several buttons and you can use each button to keep track of a different event.

Then there are programs you can install on your computer to count things. I have a single button counter installed on my computer. When I was a vocational evaluator, I had a fancy counter that would count several things at once and it kept track of all the counts in a spreadsheet that was compatible with Microsoft Excel. I had to pay for that one.

I’ve even programmed a counter into my copy of OpenOffice Calc. If you know how to program time loops, it wouldn’t be hard to design a counter that let you do frequency and rate counts in a spreadsheet.

So, let’s say that you want to count two things at once. How about counting the number of balls a tennis player hits compared to the number of times they move the racket from one hand to the other during a match? Could you do that? Probably not in your head but if you are comfortable counting to 99 on your fingers, you could easily count the number of hits in your head and the number of hand changes on your hands. Once you fill up your hands, you could stop and you would have the number of hits per 99 hand changes. As little as I know about tennis, I don’t know how much use that ratio would be, but it sounds like a research paper to me.

Cross tabling data

Much of my statistics has had to do with the Were community and that was a source of frustration in the early years of the Were community because so much of the data that had been collected were tabled using the simple tabulation described a couple of sections above. For instance, I had a theory that, as Weres aged, they came to have fewer theriotypes, which would mean that there would be fewer polyweres in the older age groups. There were counts of Weres in various age groups and there were counts of theriotypes. In the second AHWW poll, counts of different age groups of the membership were listed, as were counts of the various theriotypes. It was even stated that there were 1.45 theriotypes per person on the average, but numbers of polyweres per age group were not listed and could not be derived from the simple tabulation. When I came into possession of the records of the Werelist, I could finally count how many polyweres were in each age group of their membership. It worked out as follows:

    Age      Single theriotype    Polyweres
    24-28           29               15
    29-33           14                5
    34-38            5                0
    39-43            3                0
    44-46            1                1

From these numbers, I could even calculate that the probability was 36% that these differences could occur purely by chance (I’ll tell you how later, but if you’re impatient, it’s called the chi squared test). My theory was blown clear out of the water!
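For the impatient, here is a Python sketch of that test applied to the table above. It is a bare-bones version written for illustration: the function names are made up, and the tail probability uses a closed form that only works when the degrees of freedom are even (here there are 4). With so many small counts in the table, the chi-squared approximation is rough, so the result should be read as a ballpark figure.

```python
import math

def chi_squared_test(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable; valid for even df only."""
    if df % 2:
        raise ValueError("closed form needs an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Rows: age groups 24-28 through 44-46; columns: single theriotype, polyweres
table = [[29, 15], [14, 5], [5, 0], [3, 0], [1, 1]]
stat, df = chi_squared_test(table)
p = chi2_sf_even_df(stat, df)
print(df, round(p, 2))   # 4 0.36
```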

This kind of tabulation, breaking counts down into categories, is called “cross tabulation”. It lets statisticians see data sets in more detail.

Cross tabulation has become quite easy since spreadsheets have developed pivot table technology. As long as you have a table of individual cases, you can manipulate the structure of the data to give you whatever groupings and summary statistics you wish. The problem with the early Were studies is that individual data was not published (or even kept, as far as I know). The original Werelist allowed access to a table of the characteristics of the membership with anonymous coding.
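Outside a spreadsheet, the same idea takes only a few lines. This Python sketch counts cases for each pair of categories; the records and field names are invented for illustration and are not Werelist data.

```python
from collections import Counter

def cross_tabulate(cases, row_key, col_key):
    """Count cases for every (row category, column category) pair."""
    return Counter((case[row_key], case[col_key]) for case in cases)

# Hypothetical individual records, invented for illustration
cases = [
    {"age_group": "24-28", "kind": "single"},
    {"age_group": "24-28", "kind": "poly"},
    {"age_group": "24-28", "kind": "single"},
    {"age_group": "29-33", "kind": "single"},
    {"age_group": "29-33", "kind": "poly"},
]
table = cross_tabulate(cases, "age_group", "kind")
print(table[("24-28", "single")])   # 2
print(table[("29-33", "poly")])     # 1
```

The point made above still holds: this only works if the individual cases were kept, because the pairs cannot be rebuilt from separate totals.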

Frequencies

A third kind of tabulation – the frequency distribution table – does not display individual data points, it displays counts of values that occur within specified intervals. I’ll demonstrate it using the data from one of my favorite statistical stories, the Old Faithful study. I’ve mentioned it before.

Back in the 70s, the National Park Service began looking for better ways to predict when the famous geyser would erupt so they could post the information for tourists. Trying to predict from past observations didn’t seem to work well. In August of 1978 and 1979, measurements of the duration of eruptions and the time period between eruptions were collected and analyzed. The frequency distribution table immediately provided the answer for why a simple prediction scheme had not worked. Let’s look at the data.

To create a frequency distribution table, you first divide the range of values into a convenient number of equal intervals. The geyser intereruption data range from 42 to 95 minutes. The range is, then, 53. I’ve used my spreadsheet program to divide this range into 12 equal intervals: less than 45.5, 45.5 – 50, and so forth. Each interval is also called a “bin”. The first value in the data set is 78, so it gets thrown into the 77.3 to 80.9 bin. The next is 77, so it goes into the 73.8 to 77.3 bin. All the data values are sorted into their respective bins and, then, the number of values in each bin is counted. The result is as follows:

The first column holds the bin intervals; the values are the lowest and highest values in each bin. The second column holds the class marks – the midpoint value in each bin. The midpoint is just the average of the maximum and minimum value for each bin. The third column is the one of interest to us. Those are the counts of the data values in each bin. So there are 4 data values below 45.5 minutes, 6 between 45.5 and 50 minutes, and so on. These are called frequency counts. It’s easy to see that most of the values cluster somewhere around 75 to 85 minutes, but there is another smaller area where values cluster – somewhere around 49 to 63 minutes. Such a distribution, where data seems to bunch around two clumps of values, is called “bimodal”. Bimodal distributions are rather uncommon in nature but they do occur and, when they occur, it often means that there are two different kinds of data mixed into one data set.
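The binning procedure can be sketched in Python like this. The waiting times are invented (they are not the Old Faithful measurements), the function name is hypothetical, and the sketch assumes the values are not all identical.

```python
def frequency_table(values, n_bins):
    """Split the range of values into n_bins equal intervals and count values in each."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins            # assumes high > low
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - low) / width), n_bins - 1)   # the maximum goes in the last bin
        counts[i] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical waiting times in minutes (not the Old Faithful data)
waits = [42, 51, 54, 55, 60, 62, 75, 77, 78, 79, 80, 82, 83, 88, 95]
edges, counts = frequency_table(waits, 5)
print([round(e, 1) for e in edges])
print(counts)   # [2, 4, 0, 7, 2]
```

Even in this tiny made-up set, the empty middle bin between two fuller ones is the signature of a two-humped distribution.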

In the Old Faithful study, it turned out that the two kinds of data were generated by two kinds of eruption. In fact, short eruptions were followed by short intereruption periods and longer eruptions were followed by longer intereruption periods. When the two different kinds of eruptions were taken into consideration, a much more effective prediction rule was developed which saved tourists a lot of waiting around for the famous geyser to blow.

What about those other columns? The one called “Rel. Freq” presents relative frequencies. If you notice, this column adds up to 100, so each number is the percent of the whole data set contained in the particular bin. The first bin contains 1.8% of the whole data set. The column labeled “Cum. freq.” contains what is called cumulative frequencies. Each cumulative frequency is the sum of the frequency of that bin and the frequencies of all the bins above it. The last number, 221, is the number of all the data values. The last column, “Norm. ref.”, gives the frequency counts for a normal distribution with the same mean and standard deviation as the data set. We’ll be talking (a lot!) about normal distributions later. All of these columns can provide useful information to statisticians who are trying to make sense of data.
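Relative and cumulative frequencies are easy to compute from the counts. In this Python sketch the bin counts are made up; only the last line uses figures from the text (4 values in the first bin out of 221 in all).

```python
from itertools import accumulate

def relative_and_cumulative(counts):
    """Relative frequencies (percent of all values) and running totals of the counts."""
    total = sum(counts)
    relative = [100 * c / total for c in counts]
    cumulative = list(accumulate(counts))
    return relative, cumulative

counts = [2, 4, 0, 7, 2]                 # hypothetical bin counts
relative, cumulative = relative_and_cumulative(counts)
print([round(r, 1) for r in relative])
print(cumulative)                        # [2, 6, 6, 13, 15]
print(round(100 * 4 / 221, 1))           # 1.8, the first-bin figure quoted in the text
```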

The shape of the above frequency distribution – that two-humped shape – can be made much more obvious by turning it into a picture, so let’s talk histograms, now.

First look – the histogram

You can make a graph out of a frequency table and all kinds of things pop out. Usually a bar graph is made with counts on the vertical axis and the bins on the horizontal axis. That way, the length of each bar indicates the number of values in each bin. Such a bar graph is called a histogram. We’ll look at the histogram in much more detail when we talk about descriptive statistics and graphs later, but, now, here’s a histogram of the Old Faithful data.

The two humps are quite obvious. It’s even easy to get an idea of where the two groups meet and the middle values for each of them. You can, in fact, see pretty much everything you might want to know about these distributions – at least, approximately. Precision is missing and numerical analyses are wanted in order to answer questions with much certainty.

But this is the second step of statistical analysis (after collecting the data). You always want to look at the data first and there is a vast armory of exploratory techniques that allow you to preview data. What kind of questions can you answer with graphs and descriptive statistics?

What value does the data set cluster around?
How far does the data range and how much does it spread out?
What is the shape of the distribution? Does it follow a bell shaped curve? That’s an important question because many statistical procedures require that the data has a normal (bell shaped) distribution.
Is more than one group mixed into one data set?
If there are separate sets of data, how do their means and variances compare?
Many statistical procedures, especially when means of different groups are being compared, require that the variances of the populations the groups were drawn from be equal. There are ways to test that.
Are groups of data values related?

By checking out your data beforehand, you can see how the distributions are shaped and how they relate and you can knowledgeably plan how to approach a complete analysis of the data.

Let’s compare the ages of the membership of the Werelist circa 2002 with the ages of the respondents of the more recent 2014 survey conducted by White Wolf.

The averages seem to be close but there seems to be more spread and the later distribution seems to be more normal (shaped like a bell curve). The actual averages of the Werelist data and the White Wolf data are 29.2 and 22.9 respectively. The older 1997 AHWW poll showed an average age of 23.4. We would want to see if the variances are similar before deciding on a way to compare the means to see if the two groups are actually different or whether there had been no significant change in the age constitution between the two times. We’ll be getting into that later in much more detail.

What’s coming up….

Although, as I have said, statistics is not mathematics, many of the tools of statistics are mathematical (in much the same way that the primary tools of physics are mathematical). There are also important logical and problem solving tools. So, before we go any further in statistics, I want to give you a brief (comparatively) tour of mathematics, logic, and problem solving. Then I want to cover a few very basic ideas in statistics. The next section in the StatFiles will, therefore, be called “The Basics”.