Cleanliness in a dataset
This file is about features of a healthy R data frame, stored in a
file with a .rda or .Rdata extension. Similar principles apply for all
datasets, with corresponding shifts in syntax.
You may also like to look at the other
guidelines documents.
To fix intuition, I say:
$ R
> print(load("ownership_structure.rda"))
> str(ownership)
and after that, I continue to explore the dataset `ownership'. What
healthy features should be present?
- As with all computer programming, it is very important to have a
consistent and sensible naming scheme for the column names. Use
lower case only. Do not use underscores, use dots instead. Prefix
the name of all booleans with "is." as is the convention in
R. Whatever is done in choosing names, it should be a consistent and
logical scheme that is applied throughout. The sloppy invention of
each name on the fly, without thought and a consistent scheme, is
not acceptable.
- Be supremely careful about classification of observations into 0,
NA and other values. Everything should fall into the correct
category. Nothing should be flagged as NA that is not NA, and nothing
should be flagged as non-NA when it is indeed NA. Sloppiness in
databases includes falsely showing 0 for what should be NA, or
showing NA for what should be 0, and so on.
- Think carefully about the data type for every column. Often, you
will have a numeric column where the data source had used a string
such as "n.a." to denote NA, but you did not correctly take this
into account when you called read.table() or something of the sort. As
a consequence, the column gets falsely classified as a factor (or, in
R 4.0.0 and later, where stringsAsFactors defaults to FALSE, as a
character vector).
- Do not classify something as a string when it should be a factor,
and vice versa. Fully understand what a factor is, and use factors as
far as possible - but only when this is appropriate.
- Understand factors and factor labels properly. Come up with a
fresh set of consistently named factor labels when the original
source does not have a high quality naming scheme.
- Date or datetime information should be correctly converted into
Date or datetime (e.g. POSIXct) data types. It is sloppy to leave
them as strings or factors.
- A common feature of the nightmare of hand-managed spreadsheets is
that a file which is supposed to have one row per date will be found
to have a date occurring multiple times (the human being typed
wrong). Look for such flaws and resolve them if found. Similarly,
the hand-managed spreadsheet is supposed to have dates in sorted
order, but this will not always be the case. Skeptically test what
is going on in front of you.
- Many input files are of low quality: inside a numeric column,
someone has placed one or more values like "20m" for 20 million. Be
sure to identify these and fix them.
- Many input sources have sloppy work where rigorous and careful
coding of factors is not done. E.g. a column "type" may sometimes
have a value "Venture capital" and at other times it may have a
value "VC". Be sure to identify these and fold them correctly into a
single correct organisation scheme. After this, recode the factor.
- Do not duplicate data. Every single piece of information should
be stored only once, unless you have a compelling reason such as
massive compute power that is saved by storing a processed version.
- Once you have done all this, make summary statistics of all
columns, one by one. If it's numeric, look at kernel density
plots. Look at the number of NAs. Think hard about every column and
ask yourself: "Does this make sense?".
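The points above about NA markers, column types and stray values like "20m" can be sketched as follows. This is a minimal sketch, not a definitive recipe: the inline data, the column names and the helper parse.amount() are all hypothetical, made up for illustration.

```r
# Hypothetical raw input: "n.a." marks missing values, and one
# amount was typed as "20m" (meaning 20 million).
raw <- "firm,amount,is.listed
a,1500000,TRUE
b,n.a.,FALSE
c,20m,TRUE
d,0,FALSE"

# Tell R what the source uses for NA, and keep strings as strings
# so that nothing silently becomes a factor.
d <- read.csv(text = raw, na.strings = c("n.a.", "NA", ""),
              stringsAsFactors = FALSE)
str(d)   # amount is still character, because of "20m"

# Hypothetical helper: convert "20m" style entries to plain numbers.
parse.amount <- function(x) {
  x <- trimws(tolower(x))
  in.millions <- !is.na(x) & grepl("^[0-9.]+m$", x)
  x[in.millions] <- sub("m$", "", x[in.millions])
  value <- as.numeric(x)
  value[in.millions] <- value[in.millions] * 1e6
  value
}

# Look at the offending entries before converting them: these are
# non-NA values which do not parse as numbers.
bad <- is.na(suppressWarnings(as.numeric(d$amount))) & !is.na(d$amount)
print(d[bad, ])

d$amount <- parse.amount(d$amount)
str(d)   # amount is now numeric; true NA and true 0 are kept distinct
```

Note that the "n.a." entry ends up as NA and the 0 entry stays 0, which is the distinction the list above insists on.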
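The checks on dates and on inconsistent factor labels could look roughly like this. The data frame deals, its contents and the mapping in label.map are assumptions invented for this example; in real work the mapping has to come from studying the labels actually present in the source.

```r
# Hypothetical data as it might arrive from a hand-managed spreadsheet.
deals <- data.frame(
  date = c("2014-01-03", "2014-01-02", "2014-01-03", "2014-01-06"),
  type = c("Venture capital", "VC", "Private equity", "vc"),
  stringsAsFactors = FALSE
)

# Strings to Date. With an explicit format, a malformed entry
# becomes NA, so test for that.
deals$date <- as.Date(deals$date, format = "%Y-%m-%d")
stopifnot(!anyNA(deals$date))

# Is any date repeated? Are the dates in sorted order?
dup.dates <- deals$date[duplicated(deals$date)]
print(dup.dates)
print(is.unsorted(deals$date))

# Inspect the labels actually present before deciding on a scheme.
print(table(deals$type, useNA = "ifany"))

# Fold variant spellings into one label, then make the factor.
# This mapping is an assumption made for the example.
label.map <- c("venture capital" = "vc", "vc" = "vc",
               "private equity" = "pe")
key <- tolower(trimws(deals$type))
stopifnot(all(key %in% names(label.map)))   # no unmapped labels
deals$type <- factor(unname(label.map[key]), levels = c("vc", "pe"))
print(table(deals$type))
```

The code only detects the repeated date and the unsorted order. How to resolve a repeated date is a judgment call that needs a look at the source.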
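For the final step of the list, a column-by-column review might be sketched like this. The data frame d here is simulated and purely hypothetical; it only stands in for a cleaned dataset.

```r
# Hypothetical data frame standing in for the cleaned dataset.
set.seed(1)
d <- data.frame(
  size = c(rnorm(50), NA, NA),
  is.listed = rep(c(TRUE, FALSE), 26),
  type = factor(rep(c("vc", "pe"), each = 26))
)

# Class and number of NAs, column by column.
print(sapply(d, function(x) class(x)[1]))
na.count <- colSums(is.na(d))
print(na.count)

# Summary statistics, one column at a time.
for (nm in names(d)) {
  cat("\n", nm, "\n")
  print(summary(d[[nm]]))
}

# Kernel density plot for each numeric column.
for (nm in names(d)[sapply(d, is.numeric)]) {
  plot(density(d[[nm]], na.rm = TRUE), main = nm)
}
```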
Documentation for a dataset and for the columns within it
Here's a demo of using the builtin comment() of R:
DF <- structure(list(
    Gender = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
                       .Label = c("Female", "Male"), class = "factor"),
    Date = structure(c(15518, 15524, 15518, 15526, 15517, 15524),
                     class = "Date"),
    Dose = c(15, 10, 11, 11, 12, 14),
    Reaction = c(7.97755180189919, 11.7033586194156, 9.959784869289,
                 6.0170950790238, 1.92480908119655, 7.70265419443507)),
  .Names = c("Gender", "Date", "Dose", "Reaction"),
  row.names = c(NA, -6L), class = "data.frame")
comment(DF$Reaction) <- "Time to react to eye-dot test, in seconds, recorded electronically"
# From here on, the documentation is visible as --
str(DF)
comment(DF$Reaction)
In this fashion, careful documentation should be placed on
data frames, columns of data frames, models, etc.
If you are writing a package, then documentation can be done in the
same style as for the built-in datasets of existing R packages. What
I've shown here is for custom projects which are not making an R
package.
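The same idea extends to the data frame as a whole, and it is easy to check which columns are still undocumented. The following is a small sketch under stated assumptions: the data frame d, its descriptions and the helper column.docs() are hypothetical and not part of R.

```r
# Hypothetical small data frame.
d <- data.frame(dose = c(15, 10, 11), reaction = c(7.98, 11.70, 9.96))

# Document the data frame as a whole, and then each column.
comment(d) <- "Eye-dot test, hypothetical example data"
comment(d$dose) <- "Dose administered (hypothetical description)"
comment(d$reaction) <- "Time to react, in seconds"

# Hypothetical helper: list the documentation of every column,
# flagging columns that have none.
column.docs <- function(df) {
  sapply(df, function(x) {
    doc <- comment(x)
    if (is.null(doc)) "(undocumented)" else paste(doc, collapse = " ")
  })
}

print(comment(d))
print(column.docs(d))

# A column added later without a comment shows up as undocumented.
d$x <- 1:3
print(column.docs(d))
```

One caveat worth checking in your own workflow: attributes such as comment are generally dropped when a vector is subsetted, so the documentation may need to be reapplied after filtering rows.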
Organisation of directories
It's good to have a DATA directory in which you bring in a host of
.csv or .text files. Have one master readin.R program there which
reads all these, does all sorts of manipulations, and makes a main.rda
file. Have a README which documents each .csv or .text file - where
did this come from, what was done to produce it. Have a Makefile which
runs the readin.R.
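A readin.R along these lines might look roughly as follows. This is only a sketch: to keep it self-contained it first creates a temporary DATA directory with two tiny made-up .csv files, and the file names, column names and the merge step are all assumptions for illustration.

```r
# Demo setup: a temporary DATA directory with two hypothetical
# .csv files. In a real project these are the source files which
# the README documents.
data.dir <- file.path(tempdir(), "DATA")
dir.create(data.dir, showWarnings = FALSE)
write.csv(data.frame(firm = c("a", "b"), sales = c(10, 20)),
          file.path(data.dir, "sales.csv"), row.names = FALSE)
write.csv(data.frame(firm = c("a", "b"), totalassets = c(30, 40)),
          file.path(data.dir, "assets.csv"), row.names = FALSE)

# Read every .csv file in the directory into a named list.
files <- list.files(data.dir, pattern = "\\.csv$", full.names = TRUE)
inputs <- lapply(files, read.csv, na.strings = c("NA", "n.a.", ""),
                 stringsAsFactors = FALSE)
names(inputs) <- sub("\\.csv$", "", basename(files))

# Manipulations go here; this example only merges the two inputs.
main <- merge(inputs$sales, inputs$assets, by = "firm", all = TRUE)
comment(main) <- "Built by readin.R from sales.csv and assets.csv"

# Write the single cleaned dataset which all later programs load.
out.file <- file.path(data.dir, "main.rda")
save(main, file = out.file)
print(load(out.file))
```

Keeping all reading and cleaning in one program means that main.rda can always be rebuilt from the raw files by running that one program.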
It is often useful to have multiple different definitions of the
same variable. E.g. we may mainly like to work with size =
log((sales + totalassets)/2). But we may simultaneously like to
keep a few other measures around so we can check the robustness of our
work to the precise definition. In this case name these as size.2 and
size.3 and so on, with definitional details represented using
comment() as shown above.
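As a sketch of this naming scheme: the main definition below is the one given in the text, while the data and the two alternative formulas for size.2 and size.3 are assumptions chosen only for illustration.

```r
# Hypothetical firm-level data.
firms <- data.frame(sales = c(100, 250, 40),
                    totalassets = c(300, 150, 60))

# Main definition, as given in the text.
firms$size <- log((firms$sales + firms$totalassets) / 2)
comment(firms$size) <- "log((sales + totalassets)/2); main definition"

# Alternative definitions for robustness checks. These two
# formulas are assumptions made for this example.
firms$size.2 <- log(firms$sales)
comment(firms$size.2) <- "log(sales); alternative to size"
firms$size.3 <- log(firms$totalassets)
comment(firms$size.3) <- "log(totalassets); alternative to size"

# The definitions travel with the data.
print(sapply(firms[c("size", "size.2", "size.3")], comment))

# How closely do the alternative measures agree?
print(cor(firms[, c("size", "size.2", "size.3")]))
```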
Ajay Shah, 2014