Cleanliness in a dataset


This file describes the features of a healthy R data frame, stored in a file with a .rda or .RData extension. Similar principles apply to all datasets, with corresponding changes in syntax.

You may also like to look at the other guidelines documents.

To fix intuition, I say:

$ R
> print(load("ownership_structure.rda"))
> str(ownership)

Here load() returns the names of the objects it restores, so printing its result shows what the file contained. After that, I continue to explore the dataset `ownership'. What healthy features should be present?

Documentation for a dataset and for the columns within it

Here is a demo of R's built-in comment() function:

# Small example data frame: a factor, a Date, and two numeric columns.
DF <- data.frame(
  Gender   = factor(c("Female", "Female", "Female", "Male", "Male", "Male")),
  Date     = structure(c(15518, 15524, 15518, 15526, 15517, 15524),
                       class = "Date"),
  Dose     = c(15, 10, 11, 11, 12, 14),
  Reaction = c(7.97755180189919, 11.7033586194156, 9.959784869289,
               6.0170950790238, 1.92480908119655, 7.70265419443507)
)

comment(DF$Reaction) <- "Time to react to eye-dot test, in seconds, recorded electronically"
# From here on, the documentation is visible through str() and comment():
str(DF)
comment(DF$Reaction)

In this fashion, careful documentation should be placed on data frames, columns of data frames, models, etc.
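
As a minimal sketch of the same idea applied to a whole data frame and all of its columns: the data values, the description strings and the helper name show_comments are made up for illustration and are not part of R or of any real dataset.

```r
# Hypothetical example data; values and descriptions are illustrative only.
DF <- data.frame(
  Dose     = c(15, 10, 11),
  Reaction = c(7.98, 11.70, 9.96)
)

# Document each column, then the data frame as a whole.
comment(DF$Dose) <- "Dose administered (units assumed for illustration)"
comment(DF$Reaction) <- "Time to react to eye-dot test, in seconds"
comment(DF) <- "Eye-dot reaction test, one row per subject"

# Hypothetical helper: collect the comment of every column into one
# named character vector, with NA where a column is undocumented.
show_comments <- function(d) {
  vapply(d, function(col) {
    txt <- comment(col)
    if (is.null(txt)) NA_character_ else paste(txt, collapse = " ")
  }, character(1))
}

comment(DF)
show_comments(DF)
```

A helper of this kind makes it easy to spot undocumented columns, since they show up as NA.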

If you are writing a package, then datasets can be documented properly, in the way the built-in datasets of existing R packages are documented. What I've shown here is for custom projects which are not building an R package.

Organisation of directories

It's good to have a DATA directory into which you bring all the raw .csv or .text files. Have one master readin.R program there which reads all of these, does the required manipulations, and makes a main.rda file. Have a README which documents each .csv or .text file: where it came from and what was done to produce it. Have a Makefile which runs readin.R.
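
A readin.R along these lines might look roughly as follows. This is only a sketch: the file names sales.csv and assets.csv, the column names, and the data values are hypothetical, and a temporary directory stands in for DATA so that the example runs anywhere.

```r
# Sketch of a readin.R; all file and column names here are hypothetical.
# A temporary directory stands in for the DATA directory.
data_dir <- file.path(tempdir(), "DATA")
dir.create(data_dir, showWarnings = FALSE)

# Stand-ins for the raw .csv files that would normally already be there.
write.csv(data.frame(firm = c("A", "B"), sales = c(120, 340)),
          file.path(data_dir, "sales.csv"), row.names = FALSE)
write.csv(data.frame(firm = c("A", "B"), totalassets = c(900, 1500)),
          file.path(data_dir, "assets.csv"), row.names = FALSE)

# Read every raw file, then combine them into one data frame.
sales  <- read.csv(file.path(data_dir, "sales.csv"), stringsAsFactors = FALSE)
assets <- read.csv(file.path(data_dir, "assets.csv"), stringsAsFactors = FALSE)
main <- merge(sales, assets, by = "firm")

# Document what was built before saving it.
comment(main) <- "Firm-level data built by readin.R from sales.csv and assets.csv"

# Write the single output file that the rest of the project loads.
save(main, file = file.path(data_dir, "main.rda"))

# Downstream code then starts from main.rda:
print(load(file.path(data_dir, "main.rda")))
str(main)
```

Because all manipulation lives in this one program, rebuilding main.rda from the raw files is a single step for the Makefile to run.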

It is often useful to have multiple definitions of the same variable. E.g. we may mainly like to work with size = log((sales + totalassets)/2), but we may also like to keep a few other measures around so that we can check the robustness of our work to the precise definition. In this case, name these size.2, size.3 and so on, with the definitional details recorded using comment() as shown above.
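
A small sketch of this naming scheme follows. The data values are made up, and the formulas used for size.2 and size.3 are assumptions chosen purely for illustration; only the main definition of size comes from the text above.

```r
# Hypothetical firm data; the values are made up.
firms <- data.frame(
  sales       = c(120, 340, 75),
  totalassets = c(900, 1500, 410)
)

# Main definition, as in the text.
firms$size <- log((firms$sales + firms$totalassets) / 2)
comment(firms$size) <- "log((sales + totalassets)/2); main size measure"

# Alternative definitions. These two formulas are assumptions chosen
# for illustration, not definitions taken from the text.
firms$size.2 <- log(firms$sales)
comment(firms$size.2) <- "log(sales); alternative size measure"

firms$size.3 <- log(firms$totalassets)
comment(firms$size.3) <- "log(totalassets); alternative size measure"

# List the definition attached to each size variable.
sapply(firms[c("size", "size.2", "size.3")], comment)
```

Anyone picking up the data later can then recover the exact definition of each variant from the data frame itself.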


Ajay Shah, 2014