Handling Data

What is data?

When you say “data”, most people think “numbers” but data can be many things… geographic regions, excerpts from classics, old family photographs, video clips, transcripts of folktales told by indigenous people, the holdings for a library museum…

Anything that can be analyzed is data. In fact, after statistics are generated from data, they can become data. Meta-analysis looks at groups of research results, combines them, and summarizes them to come up with overall results for a topic.

Quantitative research is generally limited to numerical data – counts and measurements – but qualitative research works with many kinds of data. Devin Proctor (2019), in his exploration of the Therian community, interviewed therians in many different venues, including telephone interviews, Second Life, and online forums, and drew conclusions from the information he collected. The resulting article had few numbers in it. Troy D. Allen’s study (Allen, Troy D. (2014). Other-Than-Human: A Qualitative Narrative Inquiry Into the Spiritual Development of Therians. Doctoral dissertation, Argosy University, Phoenix Campus, College of Behavioral Sciences) does not present a lot of numbers explicitly, but he used “tag clouds” (or “word clouds”), which infer the importance of a concept from the number of times it is mentioned in a passage and display the results in a graphic that scales each word to its importance. He also used a conceptual framework to help explore interviews in an organized fashion.

A survey may ask “What is your place of birth?”, count the frequencies of particular answers, and calculate the percentages of each different answer given by all the respondents.

You could determine how many of your photographs are of people and how many are of landscapes by simply counting each or, if you have an inventory of your photos in a spreadsheet with descriptions of each, you could just have the spreadsheet do the counting for you.
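If your inventory lives in a script rather than a spreadsheet, the counting is just as easy. Here’s a minimal sketch in Python, with a made-up list of category labels, one per photo:

```python
from collections import Counter

# Hypothetical inventory: one category label per photo
photo_categories = ["people", "landscape", "people", "people", "landscape"]

# Counter tallies how many photos fall in each category
counts = Counter(photo_categories)
print(counts["people"])     # photos of people
print(counts["landscape"])  # photos of landscapes
```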

Data is information in any form.

Records and fields

Language used to describe data varies according to discipline. A “field” in business is called a “variable” in physics. A “case” in a physics experiment would be a “subject” in a psychological experiment, or a “customer” or “client” in business.

In data processing, a record is all the information collected on an individual and a field is a particular item of information. A database may have a record for John which includes an entry in the address field. 

The record is a data vector. A vector in math, physics, statistics, or data processing is an ordered collection of data. The place a value occupies in the collection is as important as the value itself. 53,000 is meaningless until you know that it is in the “yearly income” place (field, variable) of a record.
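A quick sketch of that idea in Python, with made-up field names: a bare value like 53,000 only becomes information once it is paired with its place in the record.

```python
# Field names give meaning to positions in the record (the data vector)
fields = ("name", "age", "yearly_income")
record = ("John", 41, 53000)

# 53000 alone is meaningless; labeled by its field, it becomes information
labeled = dict(zip(fields, record))
print(labeled["yearly_income"])
```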

Standard table formats – spreadsheets

Data are usually stored in tables in databases and spreadsheets. Databases used to be rows of filing cabinets. Computer programs have largely replaced those. A room full of paper data can easily be stored on a disk drive smaller than an 8 ½ x 11 inch sheet of paper.

Some of the earliest records of human history are tables of data – inventories and invoices. And today’s standard format has been popular just as long. As humans, we can format our data any way we please but computers expect specific layouts. With the computer age came standards. 

There are two common formats for tables of data in spreadsheets. Only one is regularly used in databases. That one places cases in rows and fields (or variables) in columns. In a table of personal data, each person has their own row. There may be a column for telephone numbers. If you ask a database for Bob’s telephone number, the computer will look for Bob’s row and the telephone number column. Where they intersect, that’s where Bob’s telephone number resides.
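That row-and-column lookup can be sketched in a few lines of Python; the names and phone numbers here are invented for illustration:

```python
# Each person (case) gets a row; each field gets a column
table = [
    {"name": "Bob",   "phone": "555-0100", "city": "Durango"},
    {"name": "Alice", "phone": "555-0199", "city": "Denver"},
]

# Find Bob's row, then read off the phone column where they intersect
bobs_phone = next(row["phone"] for row in table if row["name"] == "Bob")
print(bobs_phone)
```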

The other common table format is the contingency table. It’s commonly used to table joint counts. Joint counts are counts of elements in overlapping classes.

For instance, when I did the study on smiling, I counted the number of people I smiled at vs. the number of people I didn’t smile at. That was the independent variable because I controlled it – I consciously chose who I smiled at. I also counted the number of people who smiled back at me vs. those that didn’t. Since I was asking if this was dependent on whether I smiled at them or not, it was my dependent variable. 

I tallied each subject (that is, each person I met on the trail) into each group so I could easily count, say, how many people I smiled at that smiled back. This is an overlap of the two groups, thus, a joint count.
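A sketch of the tally in Python, with invented observations: each encounter is a pair (did I smile, did they smile back), and a joint count is just the count of a particular pair.

```python
from collections import Counter

# Hypothetical observations: (I smiled, they smiled back) per passerby
encounters = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]

# Counting identical pairs gives the joint counts
joint = Counter(encounters)
print(joint[(True, True)])  # people I smiled at who smiled back
```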

A standard contingency table gives the independent variable the rows, and the dependent variable the columns. Here’s the smile data again.

It’s immediately obvious from this table that the subjects followed my lead.

Tables beg for more analysis, but they’re a great first step for previewing data and a clean, organized table can tell you a lot about your data from the very start.

Inputting data

To me, a spreadsheet is indispensable for data processing and analysis. In many of the older statistics programs, once you entered your data, it lived in the computer’s memory – out of sight.

Many of the older packages like SAS and SPSS have adopted a spreadsheet mode of data entry, but programming languages like R and Python are also popular for data analysis. For those, if you want to refer back to your original data set, you have to have it open in another program on your screen while you fiddle around with the program editor.

There is a reason, though, to use the languages. In a spreadsheet, each time an individual data value is accessed, the program has to go through all the rigamarole of parsing the objects (sheets, rows, columns, cells…) on the sheet to find it and move it to wherever it needs to go for processing. That takes time and some datasets are huge. Sometimes, they are so huge that a spreadsheet just can’t handle them.

But if the dataset is small enough, and most spreadsheets can easily work with many thousands of data points, a spreadsheet allows you to store the tabled data, graph it, and lay out a full analysis in the same document.

There are usually several ways to get data into a program. DANSYS works like a standard spreadsheet, in fact, it works like LibreOffice Calc with added functions. In any spreadsheet, your first option for entering data is very straightforward…type the data into the cells one at a time.

That quickly gets tedious and, if you already have the data in some form, say on a website or in a file stored on your computer, you begin wishing for a way to just move the data over to your spreadsheet or database en masse. 

Check your users guide because there probably is. One common method for moving data is cut and paste. Most modern computer software including Internet browsers have provisions for using a section of your computer’s memory as a “clipboard” for moving data around. Commonly, the menu will have an Edit>Copy command, or you can press something like Ctrl+C to move a selected text or table onto the clipboard and then use Edit>Paste or Ctrl+V to place it where you want it in the active document at your mouse cursor.

Spreadsheets and databases also have ways of bringing a csv file into a document. A csv file is just a text file with a .csv extension (like, my data.csv). Actually, if it’s a text file, it doesn’t even have to have the .csv extension. You just have to select the file you want in a browser.

“csv” stands for “comma separated values” and it implies that the file contains lists of numbers or words separated by commas on lines separated by carriage returns, but the way most import functions are set up, the separators can be just about anything as long as you can tell the program what they are. For instance, if you have the first passage of Charles Dickens’ Tale of Two Cities in a file and you want to use it as data, you can tell your spreadsheet to load the file into a sheet using spaces as data separators and periods as line separators. Individual words will go into individual cells and each sentence will go into its own row.
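Python’s standard csv module works the same way: you tell it what the separator is. A small sketch with made-up data and a semicolon as the delimiter:

```python
import csv
import io

# A "csv" file need not use commas: this one is semicolon-separated
raw = "name;phone\nBob;555-0100\nAlice;555-0199\n"

# Tell the reader what the delimiter is, and it parses rows and cells
reader = csv.reader(io.StringIO(raw), delimiter=";")
rows = list(reader)
print(rows[1])  # ['Bob', '555-0100']
```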

Most database, data entry, or data processing programs will give you lots of options for quick and easy data entry.

Pivot tables

One of the greatest advantages that spreadsheets offer for data processing is the pivot table. 

Since the term PivotTable was trademarked in 1994 by the Microsoft Corporation (it is now in the public domain), the generic term is “pivot table” (two words), and many software developers use their own names. The method was just too doggoned useful to be hogged by a single company, though, so now all the spreadsheets and many other data processing programs have it.

Different programs handle pivot tables a little differently, so check your user manual.

A pivot table is an interactive data table. You should always place your data in a standard format, but then, you should play with it. All work and no play not only makes statisticians dull, but it causes them to miss important facts that are right in front of them. Good tabling manners are a part of every complete data exploration!

Once you have a pivot table set up, you can move around what goes in rows and what goes in columns. Instead of individual data values in a raw table, you can count them and their joint frequencies in the raw data. In other words, using a pivot table, you can easily transform a two variable raw data table in which both variables have multiple values (numbers or words) into a contingency table.

One problem with a contingency table is that it can only practically accommodate two variables. You can, of course, create multiple contingency tables side by side if you have more than two variables, but you can also nest variables in the same table, giving you a stub and banner table. That makes it much easier to compare all the variables.

Remember this stub and banner table from Analyzing Nominal Data?

In this (fictitious) table, the number of people who smiled at me and those that didn’t are grouped by their genders.

In pivot tables, cases can be grouped according to values of specified variables and summary statistics (sums, counts, averages, etc.) can be inserted (just about anywhere) into the table.

Imagine a table of all your friends, their addresses, phone numbers, email addresses, and birth day dates, all grouped according to their genders and which states and countries they live in with how many are in each group.
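That kind of grouping-with-counts is easy to sketch in Python. The friends and states below are invented:

```python
from collections import defaultdict

# Hypothetical contact list
friends = [
    {"name": "Ann", "state": "CO"},
    {"name": "Ben", "state": "CO"},
    {"name": "Cal", "state": "NM"},
]

# Group cases by state, as a pivot table would
groups = defaultdict(list)
for f in friends:
    groups[f["state"]].append(f["name"])

# Summary statistic for each group: its count
sizes = {state: len(names) for state, names in groups.items()}
print(sizes)  # {'CO': 2, 'NM': 1}
```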

Multiple tables

When Ronald Fisher developed Analysis of Variance, he worked a lot with agricultural problems and the nomenclature and structuring of data still reflect that. Plots of land were set up with plants in blocks that were treated differently…different fertilizer, different genetics, different soils…what made a difference? The data could then be arranged in the same spatial relations as the plants, in matrices of tables.

That’s tidy for humans. Imagine a matrix of tables: each column of tables gets a different amount of light, and within each column, each table gets a different amount of water. Inside a table, rows are different individual clumps of plants by species; the columns are different plant foods. The recorded data are counts of flowers that open. Researchers can throw in as many variables as they want.

But you’ve seen how computers handle data. For instance, how would you structure data on a spreadsheet for a complex analysis of variance?

There are a couple of common solutions.

If you know the layout beforehand, you can have the program expect the data to be in blocks and sub-blocks as needed. The individual tables can either be flush to each other where the user can specify the sizes of the tables, or the data can be in multidimensional tables. The first option is easier to scan visually.

             A               B

       1   2   3   |   1   2   3
  C    4   5   6   |   4   5   6
       7   8   9   |   7   8   9

       1   2   3   |   1   2   3
  D    4   5   6   |   4   5   6
       7   8   9   |   7   8   9

These four subtables would be specified as having 3 rows and 3 columns.

DANSYS takes this tack. The problem is that each layout requires a different program. There are common layouts (simple ANOVA, three-way ANOVA, repeated measures, etc.), but this method won’t generalize to more complex layouts.

So, for a more general approach, extra columns can be added to a single table to show what categories each data value belongs in. The above tables can then be placed in a single table that looks like this:

  1    1    1    A    C
  2    1    2    A    C
  3    1    3    A    C
  ⋮    ⋮    ⋮    ⋮    ⋮
  6    2    3    B    C
  ⋮    ⋮    ⋮    ⋮    ⋮
  9    3    3    B    D

The first column holds the data values, the second the row within the subtable, the third the column within the subtable, the fourth the subtable’s column category, and the fifth the subtable’s row category.
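Here is a sketch in Python of turning blocked subtables into that single long table; the subtable values and the A/B, C/D labels mirror the example above:

```python
# Four 3x3 subtables, keyed by (column category, row category)
subtables = {
    ("A", "C"): [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    ("B", "C"): [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    ("A", "D"): [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    ("B", "D"): [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
}

# Flatten into one long table: value, row, column, column category, row category
long_table = []
for (col_cat, row_cat), block in subtables.items():
    for r, row in enumerate(block, start=1):
        for c, value in enumerate(row, start=1):
            long_table.append((value, r, c, col_cat, row_cat))

print(long_table[0])  # (1, 1, 1, 'A', 'C')
```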

Databases

Another strategy is to simply use a program designed to store complex sets of data…a database. Most modern spreadsheets are designed to be able to interface with one or more databases and pull out information as needed.

There are, broadly, two kinds of databases: simple and relational.

A simple database stores data in a table, much like the ANOVA example above, storing both data and categories for each case. Most spreadsheet applications let you build simple databases on sheets and provide special functions that will select cases according to criteria. For instance, a spreadsheet table that includes a column of ages for the cases can be searched for cases for people aged 25 to 50.
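A selection like that is just a filter over the rows. A sketch in Python with invented cases:

```python
# A simple database: one row (record) per case
cases = [
    {"name": "Ann", "age": 22},
    {"name": "Ben", "age": 37},
    {"name": "Cal", "age": 51},
]

# Select cases whose age falls between 25 and 50 inclusive
selected = [c["name"] for c in cases if 25 <= c["age"] <= 50]
print(selected)  # ['Ben']
```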

Relational databases allow for multiple tables that can be linked together in complex ways. For instance, a table of contacts may have physical and Internet addresses but may be linked to another table that holds personal information like birth dates and favorite foods for some or all of the people in the contact table. Both tables, then, can be searched for specific people and the results can be assembled into a query table.
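Python’s built-in sqlite3 module can sketch the idea: two linked tables, joined on a shared name column, assembled into one query result. All the names and values here are made up.

```python
import sqlite3

# Two linked tables: contacts, and personal details for some of them
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts (name TEXT, email TEXT)")
con.execute("CREATE TABLE details  (name TEXT, birthday TEXT)")
con.execute("INSERT INTO contacts VALUES ('Bob', 'bob@example.com')")
con.execute("INSERT INTO details  VALUES ('Bob', '1970-01-01')")

# A query searches both tables and assembles the results
row = con.execute(
    "SELECT c.email, d.birthday FROM contacts c "
    "JOIN details d ON c.name = d.name WHERE c.name = 'Bob'"
).fetchone()
print(row)
```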

Modern spreadsheets usually have utilities to interface with external databases.

Visualizing data

We’ve talked about some graphic techniques, for instance on the Descriptive and Exploratory page. They’re important because it’s hard to make sense of pure data in tables. It’s useful to see what data is doing at the outset so you can make decisions about how to proceed with an analysis.

It’s part of human nature that humans have expectations. Even in a study where a person is just curious about what a process “looks like”, they already have an idea about what the data will do before they even collect it. One of the funner elements of research is when data doesn’t look like you expect. Hopefully, the discrepancy is actually something interesting and not an outright error.

Outlines and summaries are common capabilities of both spreadsheets and databases. And we’ve talked about pivot tables. These utilities can help you cluster data and generate summary statistics like sums and averages of the categories and their levels so you can compare them. Do they line up with your expectations? If not, why?

But there are certain things that you should certainly look at that could indicate nonrandom error. The noise in data is usually random and averages out, but nonrandom error caused by systematic lack of instrument precision, shoddy measurement techniques, poor environmental control, and such can strongly affect the conclusions of research.

Data is messy and often needs to be cleaned up.

Cleaning data

Many things can happen between measurements and conclusions. You can make errors while taking measurements and/or counts. You can record the values wrong. Data can be lost or someone else can put their hands on your data. In the end, some of the values in your raw data tables can just be flat wrong…or are they just unusual?

Your job is to find all the wrong data and explain the unusual data, and it’s an art, not a science. The more exposure you have to data, the sharper your intuition of what is accurate and what is not will be.

Two of the most common types of problem data are missing data and outliers.

Missing data

Social desirability is an issue in survey research. Say you were asked to fill out a survey and in it you encountered items such as, “I have used illegal drugs,” or “I have indulged in deviant sex practices,” or “I have killed someone.” Even if the researchers did ensure anonymity, I suspect that you would be hesitant to respond, especially if a response is “yes”.

If you were the researcher, what would you do with all the empty items?

Again, data is messy and any data set may have empty or obviously erroneous values. The technical term is straightforward…”missing data.” Statistical software will offer multiple options for dealing with missing data. The simplest would be censoring.

With censoring, you just ignore the missing values. The complications are that you can ignore the value in a calculation, or you can take out the whole record (if you don’t have Jim’s answer about his sexual preference, just drop all his information from the study), or you can drop the whole variable. The thing is, both Jim and respondents’ sexual preferences may be important to the study, and by dropping information, you reduce the size of your sample. If one item is missing, the reduction may not be too bad, but if several items are missing, things like sample representativeness, sensitivity, reliability, and accuracy will be affected.

The practice of replacing a missing value with a 1 or 0 is generally a bad idea. The exception is when “1” or “0” are specified as codes for “missing value” and won’t be used in numerical computations.

Another common option is to use the mean of the variable with missing values (or, if it’s repeated measures, the mean of the scores within each case might be used). That allows a computation where missing values are recognized as such but a “safe” value is used to replace them.
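Mean substitution is a one-liner in most tools. A sketch in plain Python, with None standing in for a missing value:

```python
# Scores with one missing value, marked as None
scores = [4.0, None, 5.0, 3.0]

# Replace each missing value with the mean of the observed values
observed = [x for x in scores if x is not None]
mean = sum(observed) / len(observed)
imputed = [mean if x is None else x for x in scores]
print(imputed)  # [4.0, 4.0, 5.0, 3.0]
```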

A safer option is to predict what a value should be if it were not missing. Some form of regression would be used.

After missing values are filled in and sample sizes are adjusted, calculations can proceed as usual.

Outliers

The tallest person in modern history was Robert Pershing Wadlow at 272 cm. The average male height is 177.8 cm with a standard deviation of about 10.2 cm. That places Mr. Wadlow more than 9 standard deviations above the mean height of men in the modern world. He was definitely an outlier.

But 272 cm is not an erroneous measurement. Mr. Wadlow actually existed and he was demonstrably 272 cm tall.

On the other hand, if you were measuring the melting point of a substance from a homogeneous batch and all of your measurements fell within 2 degrees of 75° centigrade except one, which was 752° centigrade, you could reasonably assume that the outlier was simply recorded wrong.

But “outlier” is a pretty fuzzy concept, right? Where do you draw the line?

Well, it is fuzzy, but not because there isn’t a sharp line you can draw to determine whether a value is an outlier or not. It’s fuzzy because there are several lines you could reasonably draw.

First, you need to decide how far from the mean you want a data value to be before you consider it an outlier. In a normal distribution, you know that any value more than one standard deviation above or below the mean is a little unusual. About 32% of data values can be expected to lie outside that range. Only about 5% will be more than two standard deviations away from the mean, so that would certainly be unusual. Since only about 0.3% of the data is expected to deviate from the mean by more than three standard deviations, a value outside that range can pretty safely be called an outlier.

But what if your data isn’t from a normal distribution? If it’s from some other known distribution, you can use the deviation parameter for that distribution. If you don’t know the shape of the distribution, you can use nonparametric statistics. The most common for identifying outliers is the semi-interquartile range, which is half the distance between the first and third quartiles. Again, a value one semi-interquartile range away from the median is sorta unusual, one that is two semi-interquartile ranges away is rather unusual, and one that is three semi-interquartile ranges away would certainly be suspect.
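A sketch in Python of the semi-interquartile-range rule, using the melting-point scenario from earlier (the readings are invented, with one obviously suspect value):

```python
import statistics

# Melting-point readings near 75 degrees, plus one suspect reading
data = [73, 74, 74, 75, 75, 75, 76, 76, 77, 752]

# statistics.quantiles with n=4 returns the three quartiles
q1, median, q3 = statistics.quantiles(data, n=4)
siqr = (q3 - q1) / 2

# Flag values more than three semi-interquartile ranges from the median
outliers = [x for x in data if abs(x - median) > 3 * siqr]
print(outliers)  # [752]
```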

The nonparametric indicator of outliers can be used even if you know the shape of your distribution.

So, Michael Jordan is 6 feet and 6 inches tall. Is that height an outlier? Human male height is probably pretty close to normal and, since we already have the statistics, we can use the mean and standard deviation to find out.

The mean for human male heights is 178 centimeters and the standard deviation is 10 cm. One standard deviation above the mean is 188 cm, two standard deviations is 198 cm, and three is 208 cm. Michael Jordan is 6.5 feet, or about 198 centimeters – right at two standard deviations above the mean. In addition to being an outstanding person, Michael Jordan has a height that is certainly unusual and, by a lenient line, could well be considered an outlier.

So, what do you do with outliers after you identify them?

Well, Michael Jordan’s height is a matter of public record, so it can be verified, and it is well within the range between the heights of the tallest and shortest people on record, so we, probably, should not discard it as an error.

For data that are not obvious errors, I generally recommend that two analyses be run, both with and without outliers. If there are any questions about the validity of the extreme values, both sets of statistics can be reported with any suspicions that apply.

If some data values are obvious errors, make a copy of the original data, expunge the errors from the copy, and run your analyses on the modified copy, treating the deleted data values as missing data. Never…throw…out…original…data!

Coding data

You’ve probably noticed that there’s nothing inviolable about the names of variables. As long as you stay consistent throughout an analysis, you can name any variable anything you want. It just behooves you to name it something that will remind you what it is. 

It may not be so obvious, but data values can also be named and renamed with impunity. We’ve talked about data transforms. That’s basically changing the names of data values to something more convenient.

Another reason for changing the names of data values is to change their level of measurement. For instance, you can’t apply mathematical operations to words, but you can apply them to numbers. If I wanted to see how smiling at other people on Bear Creek Trail correlates with them smiling back, I could recode “I smiled at them” to 1, “I didn’t smile at them” to 0, “They smiled back” to 1, and “They didn’t smile back” to 0, and then calculate a Pearson product moment correlation coefficient for the data.
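The recode-then-correlate step can be sketched in plain Python; the observations below are invented, and the Pearson r is computed from its definition rather than a library call:

```python
# Recode the words as 1s and 0s so a correlation can be computed
code = {"smiled": 1, "did not smile": 0}
me   = ["smiled", "smiled", "did not smile", "did not smile", "smiled", "did not smile"]
them = ["smiled", "did not smile", "did not smile", "did not smile", "smiled", "smiled"]

x = [code[v] for v in me]
y = [code[v] for v in them]

# Pearson product moment r from its definition
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
print(round(r, 2))  # 0.33
```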

This “changing the names of data values” is called “coding” and “recoding” and is so common among statisticians that most statistics software will do it for you.

