Cleanliness in a dataset

This file is about features of a healthy R data frame, stored in a file with a .rda or .Rdata extension. Similar principles apply for all datasets, with corresponding shifts in syntax.

You may like to also look at other guidelines documents.

To fix intuition, I say:


$ R
> print(load("ownership_structure.rda"))
> str(ownership)

and after that, I continue to explore the dataset `ownership'. What are healthy features which should be present?

As with all computer programming, it is very important to have a consistent and sensible naming scheme for the column names. Use lower case only. Do not use underscores, use dots instead. Prefix the name of all booleans with "is." as is the convention in R. Whatever is done in choosing names, it should be a consistent and logical scheme that is applied throughout. The sloppy invention of each name on the fly, without thought and a consistent scheme, is not acceptable.

Be supremely careful about classification of observations into 0, NA and other values. Everything should fall into the correct category. Nothing should be flagged as NA that is not NA, and nothing should be flagged as non-NA when it is indeed NA. Sloppiness in databases includes falsely showing 0 for what should be NA, or showing NA for what should be 0, and so on.

Think carefully about the data type for every column. Often, you will have a numeric column where the data source had used a string such as "n.a." to denote NA, but you did not take this into account when you did read.table() or something of the sort. As a consequence, the whole column gets falsely classified as a factor (or, from R 4.0 onwards, where stringsAsFactors defaults to FALSE, as a character vector).

Do not classify something as a string when it should be a factor, and vice versa. Fully understand what a factor is and use factors as far as possible, but only when this is appropriate.

Understand factors and factor labels properly. Come up with a fresh set of consistently named factor labels when the original source does not have a high quality naming scheme.

Dates or datetime information should be converted to proper Date or date-time (POSIXct) types. It is sloppy to leave them as strings or factors.

A common feature of the nightmare of hand-managed spreadsheets is that a file which is supposed to have one row per date will be found to have a date occurring multiple times (the human being typed wrong). Look for such flaws and resolve them if found. Similarly, the hand-managed spreadsheet is supposed to have dates in sorted order, but this will not always be the case. Skeptically test what is going on in front of you.

Many input files have low quality where inside a numeric column, someone has placed one or more values like "20m" for 20 million. Be sure to identify these and fix them.

Many input sources have sloppy work where rigorous and careful coding of factors is not done. E.g. a column "type" may sometimes have a value "Venture capital" and at other times it may have a value "VC". Be sure to identify these and fold them correctly into a single correct organisation scheme. After this, recode the factor.

Do not duplicate data. Every single piece of information should be stored only once, unless you have a compelling reason such as massive compute power that is saved by storing a processed version.

Once you have done all this, make summary statistics of all columns, one by one. If a column is numeric, look at kernel density plots. Look at the number of NAs. Think hard about every column and ask yourself: "Does this make sense?".
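
To make these checks concrete, here is a minimal sketch that walks a small input through several of them. The data, the column names and the choice of factor labels are all invented for illustration; a real dataset will need its own rules, and the sketch only flags the duplicated date rather than deciding how to resolve it.

```r
# Hypothetical raw input, as it might arrive from a hand-managed spreadsheet.
raw.text <- "date,sales,type,listed
2014-01-03,120,Venture capital,yes
2014-01-01,n.a.,VC,no
2014-01-02,20m,Bank,yes
2014-01-02,95,VC,yes"

# Declare the source's NA marker at read time, so that "n.a." becomes NA.
d <- read.csv(text = raw.text, na.strings = c("NA", "n.a."),
              stringsAsFactors = FALSE, strip.white = TRUE)

# Find entries in a supposedly numeric column which are not numbers.
bad <- !is.na(d$sales) & is.na(suppressWarnings(as.numeric(d$sales)))
bad.values <- d$sales[bad]                     # here: "20m"

# Fix them (assumption: a trailing "m" means millions), then convert.
d$sales[bad] <- sub("m$", "e6", d$sales[bad])
d$sales <- as.numeric(d$sales)

# Fold inconsistent codes into one scheme, then recode as a factor
# with fresh, consistently named labels.
d$type[d$type == "VC"] <- "Venture capital"
d$type <- factor(d$type, levels = c("Bank", "Venture capital"),
                 labels = c("bank", "venture.capital"))

# Booleans are stored as logical and named with the "is." prefix.
d$is.listed <- d$listed == "yes"
d$listed <- NULL

# Dates become a proper Date type.
d$date <- as.Date(d$date, format = "%Y-%m-%d")

# Skeptical tests: repeated dates, and dates out of order.
dup.dates <- d$date[duplicated(d$date)]
out.of.order <- any(diff(as.numeric(d$date)) < 0)
d <- d[order(d$date), ]

# Finally, look at every column.
str(d)
summary(d)
colSums(is.na(d))                              # number of NAs per column
dens <- density(d$sales, na.rm = TRUE)         # plot(dens) draws the estimate
```

Each intermediate object (bad.values, dup.dates, out.of.order) is kept so that it can be inspected by hand; nothing here should be applied blindly to a real file.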

Documentation for a dataset and for the columns within it

Here's a demo of using the builtin comment() of R:


DF <- structure(list(Gender = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
                         .Label = c("Female", "Male"), class = "factor"),
                     Date = structure(c(15518, 15524, 15518, 15526, 15517, 15524),
                         class = "Date"),
                     Dose = c(15, 10, 11, 11, 12, 14),
                     Reaction = c(7.97755180189919, 11.7033586194156,
                         9.959784869289, 6.0170950790238,
                         1.92480908119655, 7.70265419443507)),
                .Names = c("Gender", "Date", "Dose", "Reaction"),
                row.names = c(NA, -6L), class = "data.frame")

comment(DF$Reaction) <- "Time to react to eye-dot test, in seconds, recorded electronically"

# From here on, the documentation is visible as --
str(DF)
comment(DF$Reaction)

In this fashion, careful documentation should be placed on data frames, columns of data frames, models, etc.
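
As a rough sketch of the same idea applied beyond a single column, the following attaches comments to a whole data frame and to a fitted model, and checks that they survive a save() and load() round trip. The data frame `trial`, its values and the comment texts are made up for this illustration.

```r
# Hypothetical data, invented only for this illustration.
trial <- data.frame(dose     = c(15, 10, 11, 11, 12, 14),
                    reaction = c(8.0, 11.7, 10.0, 6.0, 1.9, 7.7))

# Document the data frame as a whole, and one of its columns.
comment(trial) <- "Eye-dot reaction trial; one row per subject"
comment(trial$dose) <- "Dose administered (units are hypothetical here)"

# Document a model object in the same way.
m <- lm(reaction ~ dose, data = trial)
comment(m) <- "OLS of reaction time on dose, full sample"

# The comments are stored as attributes, so they travel with the objects
# when these are saved to an .rda file and loaded back.
f <- tempfile(fileext = ".rda")
save(trial, m, file = f)
rm(trial, m)
load(f)

comment(trial)
comment(trial$dose)
comment(m)
```

Note that print() does not display comments; they have to be asked for with comment(), or seen through str().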

If you are writing a package, then documentation can be done in style as is done with the built-in datasets of existing R packages. What I've shown here is for custom projects which are not making an R package.

Organisation of directories

It's good to have a DATA directory in which you bring in a host of .csv or .text files. Have one master readin.R program there which reads all these, does all sorts of manipulations, and makes a main.rda file. Have a README which documents each .csv or .text file - where did this come from, what was done to produce it. Have a Makefile which runs the readin.R.
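
One possible shape for readin.R is sketched below. The file names sales.csv and assets.csv, their columns, and the merge on `firm` are hypothetical; the first few lines only create a throwaway DATA directory so that the sketch can be run anywhere, and would not appear in a real readin.R.

```r
# Set-up for the illustration only: a temporary DATA directory holding
# two hypothetical input files. In a real project these already exist.
data.dir <- file.path(tempdir(), "DATA")
dir.create(data.dir, showWarnings = FALSE)
writeLines(c("firm,sales", "a,120", "b,n.a."),
           file.path(data.dir, "sales.csv"))
writeLines(c("firm,totalassets", "a,300", "b,250"),
           file.path(data.dir, "assets.csv"))

# What readin.R itself might contain: read every input file,
# clean it, combine the pieces, and write a single main.rda.
sales <- read.csv(file.path(data.dir, "sales.csv"),
                  na.strings = c("NA", "n.a."), stringsAsFactors = FALSE)
assets <- read.csv(file.path(data.dir, "assets.csv"),
                   stringsAsFactors = FALSE)

main <- merge(sales, assets, by = "firm", all = TRUE)
comment(main) <- "Built by readin.R from sales.csv and assets.csv"

save(main, file = file.path(data.dir, "main.rda"))
```

The Makefile then needs only one rule, which rebuilds main.rda (for instance by calling Rscript on readin.R) whenever readin.R or one of the input files changes.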

It is often useful to have multiple different definitions of the same variable. E.g. we may mainly like to work with size = log((sales + totalassets)/2). But we may simultaneously like to keep a few other measures around so we can check the robustness of our work to the precise definition. In this case name these as size.2 and size.3 and so on, with definitional details represented using comment() as shown above.
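
A small sketch of this convention follows. The data frame `firms` and its values are invented, and the two alternative definitions (log of total assets, log of sales) are only assumptions chosen for the example; the point is the naming and the use of comment().

```r
# Hypothetical firm-level data, invented for this illustration.
firms <- data.frame(sales       = c(120, 80, 400),
                    totalassets = c(300, 250, 900))

# The main definition.
firms$size <- log((firms$sales + firms$totalassets) / 2)
comment(firms$size) <- "log((sales + totalassets)/2); main size measure"

# Alternative definitions, kept for robustness checks.
firms$size.2 <- log(firms$totalassets)
comment(firms$size.2) <- "log(totalassets); robustness check"

firms$size.3 <- log(firms$sales)
comment(firms$size.3) <- "log(sales); robustness check"

# All definitions can be listed in one place.
defs <- sapply(firms[c("size", "size.2", "size.3")], comment)
print(defs)

# A quick look at how closely the measures agree.
cor(firms[c("size", "size.2", "size.3")])
```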

Ajay Shah, 2014