Guidelines for projects based on R


I find these guidelines useful in my work. Your mileage will, of course, vary.

File system organisation

I find it useful to organise each project as a set of directories SRC DOC RESOURCES DATA:

SRC
This is where the source code of the project resides.
DOC
This is where the papers, slideshows, etc. reside.
DATA
This is where the data files reside. Typically, I would have some underlying raw material (e.g. as .csv files) and then I would have a makedata.R which writes a dataset.rda, which is then used by all the analytical programs.
RESOURCES
This is where I keep other resources such as useful documents, pointers into the web, a .bib file for the project, etc.
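As a sketch of the DATA workflow described above, a minimal makedata.R might look like this (the file names raw.csv and the clean-up steps are hypothetical stand-ins):

```r
## makedata.R : read the underlying raw material, write dataset.rda
raw <- read.csv("raw.csv")            # hypothetical raw material

## Clean up: here, drop incomplete rows
dataset <- raw[complete.cases(raw), ]

## Every analytical program then begins with load("dataset.rda")
save(dataset, file = "dataset.rda")
```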

Tying things together

It's very useful to have a Makefile in the SRC, DOC and DATA directories. Crucial elements of this should be:

  1. "make" should rerun all programs and recreate all the files in their full glory. E.g. in the DATA directory, "make" should create the dataset.rda file. In the DOC directory, "make" should make the .pdf file for the slideshow and the article.
  2. "make clean" should remove everything that's easily recreated.
  3. "make squeaky" should remove every single thing that's written by computer programs - i.e. in the SRC and DOC directories, when you say "make squeaky" you should go down to purely human-written files only.
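As an illustration, a Makefile in the DOC directory along these lines would deliver all three targets (the file names are placeholders, and recall that Makefile command lines must begin with a tab):

```make
all: article.pdf

article.pdf: article.tex
	pdflatex article
	pdflatex article

clean:
	rm -f *.aux *.log *.toc

squeaky: clean
	rm -f article.pdf
```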

It is often useful to do things using Sweave: this also reduces the clutter of separate R programs, the pdf files that they write, etc.

When taking tables from R into LaTeX, it's often best done by xtable, but what I find convenient is to write out a file x.gen, and then \input{x.gen} from within the LaTeX. This way, when you have rerun your .R program, you're guaranteed that a most-recent .gen file has been created. And, it's easier on the mind to just delete .gen files. If you used a filename with a .tex extension, you'd have to squint at it sometimes, wondering whether this was one of the files that you wrote.
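A sketch of this scheme, assuming the xtable package and a hypothetical data frame of results:

```r
library(xtable)

results <- data.frame(estimate = c(0.12, 0.47),
                      se       = c(0.03, 0.11),
                      row.names = c("alpha", "beta"))

## Write the LaTeX table into x.gen; the paper says \input{x.gen}
print(xtable(results, caption = "Estimation results"), file = "x.gen")
```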

Approach to code

Some people view code as a set of instructions they give the computer to get their job done.

I think the useful way to build code is to think of it as an essay which is written for the benefit of your co-author. A well written program is an engaging conversation which explains what is being done and why. A good program is not just one that works correctly; it is a feat of communication.

Why is this the right approach?

  1. Almost all work these days is a collaborative enterprise
  2. This invisible "co-author" lurking over your shoulder, the one you are communicating with, could well turn out to be yourself one year later. All too often, one has to go back to a project after a gap of a year. Under such circumstances, clean and well written code, which clearly talks about what is going on and why, makes a huge difference.
  3. Would you write better quality code if your co-author was actually present by your side when you were writing? The hypothetical co-author is an invisible check upon your writing crud. Harness this peer pressure as a motivational device to push you onwards to writing better stuff.
  4. In life, it is likely that some of your code will be read by others. At such times, you suffer serious reputational damage owing to having written crud. Put your best foot forward; go out in your Sunday clothes. If and only if your code is of high quality, you will have more confidence in putting some of it out as open source packages, which would be a big step forward for you.

Just as an essay is a group of paragraphs, each of which is a group of sentences, think of a program as a group of paragraphs. Each paragraph should do one coherent chunk of work. I like to announce that objective with a comment at the start of the code block.
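For instance (with a hypothetical data file), each paragraph opens with a comment announcing its objective:

```r
## Read in the raw data and establish correct types
d <- read.csv("firms.csv")
d$year <- as.integer(d$year)

## Restrict attention to firms with non-trivial sales
d <- subset(d, sales > 0)

## Summary statistics that will go into the paper
summary(d$sales)
```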

Writing functions

Novice programmers write repetitive code. The moment you see something being done again and again, generalise it into a function. Seeing the interesting functions latent in your code is a hallmark of growing maturity as a programmer.

A good function has the following features:

  1. It should be easy to describe: you should be able to communicate the idea of what the function does, in conversation, to someone else in no more than a minute. This purpose is assisted by having a name for the function that is a verb.
  2. It should have args that cover an array of cases; i.e. it should be fairly general.
  3. It should turn out to be useful in many, many situations, to you and to your colleagues, and ultimately even make it into a package on CRAN.

It is generally not a good idea to write functions which are just one function call. E.g.:

  draw.clever.plot <- function(x, ylab) {
    plot(x, ylab=ylab, xlab="", col="blue", lwd=2)
  }

With this, the reader has to struggle to learn one more function (draw.clever.plot()), but his gain is very low, since he could just focus on knowing plot() and write a one-liner out of that. This is not a useful function.
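In contrast, here is a sketch of a function that bundles a genuine, reusable idea: winsorising a vector, an operation that recurs in many projects. It is easy to describe in a minute, its name is a verb, and its args cover an array of cases:

```r
## Pull the tails of x in to the p and 1-p quantiles.
winsorise <- function(x, p = 0.01) {
  stopifnot(p > 0, p < 0.5)
  lo <- quantile(x, p,     na.rm = TRUE)
  hi <- quantile(x, 1 - p, na.rm = TRUE)
  pmin(pmax(x, lo), hi)
}
```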

Basic issues

Lines of code should never cross the 80th column. Long lines that spill beyond the 80th column are just sloppy.

The human mind can only fully comprehend the code that is visible on one screen. So don't waste vertical space: avoid empty lines, empty comment lines, etc. Treat vertical space as precious. If you can say more within one screen, that is better. (But of course, don't go to the other extreme, cramming so much content into one screen that it becomes unreadable.)

Do not have extra blank lines at the top or the bottom. Be very careful and disciplined about all these small things. A good programmer is a craftsman, bringing an inhuman perfectionism into his task.

Indentation

Indentation is terribly important. The simplest path is to use Emacs/ESS and hit TAB on every line. This will give you fair quality indentation for free.

Consistency

In all practical detail about how a program is written, it is important to wake up and have a style. The people who write crud are unaware of what they're writing, and every few lines something different is done. Wake up. Choose a style. And then remorselessly and perfectly roll it out across all your code.

If you will have a space before and after the "<-" then do this consistently everywhere. (I think it's a bit better with).

If you will have a blank after each comma, then do this consistently everywhere. (Here too, I think it's a bit better with.)

There are many possible styles; the Google R style guide is an excellent one. It is not important to squabble about what is a good style. It is absolutely essential to have a style, and then to be 100% consistent with it.

If you have a historical or inherited code base that is ugly, the formatR package will help you overcome the one-time cost of converting ugly to nice.
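A sketch of that one-time conversion, assuming a recent version of the formatR package:

```r
library(formatR)

## One-time conversion: reformat every .R file under SRC, in place
tidy_dir("SRC")
```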

Clarity and R

If you find yourself doing tedious hard work, you are most likely not using R properly. As an old fortune goes, "It is possible to write Fortran in any language." In similar fashion, it is possible to write painful code which does something step by step, as if hand-translated from Fortran or Gauss or some other statistical package. If you're doing this, there's no point in using R.

As an example, the R concept of `factors' and their consistent use in all standard functions implies that in R you don't need to create dummy variables and manually specify a list of dummy variable regressors in a model specification.
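A minimal illustration with simulated data: the factor expands into dummy variables inside lm() without any manual work (the variable names here are hypothetical):

```r
set.seed(1)
d <- data.frame(sales    = rnorm(120),
                industry = factor(sample(c("steel", "software", "retail"),
                                         120, replace = TRUE)))

## No hand-made dummy variables; lm() constructs them from the factor
m <- lm(sales ~ industry, data = d)
summary(m)
```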

On the other hand, there is no point in going out on a limb writing R code that you consider cute but everyone else considers incomprehensible. You are not in a contest where you show off your high IQ through the contortions that your code is capable of. The goal is to write correct, comprehensible, efficient, extensible code.

Consider using S3 with print and plot methods

In some situations, when you have written a function do.task() which returns a complex list at the end, it is useful to attach a class "task" to this returned object. After that, you can write a print.task() function and a plot.task() function, so that the recipient of your complex "task" structure has easy access to printing and plotting it.

If you have a few different interesting plots for a single data object, then put all this functionality into a single plot.task() function selectable by which as with plot.lm(). This will reduce the clutter of accumulating multiple plot functions, and increase the odds of having greater consistency in the code and in the look of the plots.
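A sketch of this pattern, with hypothetical contents for the "task" object:

```r
do.task <- function(x) {
  ans <- list(estimate = mean(x), n = length(x), data = x)
  class(ans) <- "task"
  ans
}

print.task <- function(x, ...) {
  cat("Estimate", x$estimate, "from", x$n, "observations\n")
  invisible(x)
}

plot.task <- function(x, which = 1, ...) {
  if (1 %in% which) hist(x$data, main = "Raw data", ...)
  if (2 %in% which) plot(density(x$data), main = "Density", ...)
  invisible(x)
}
```

Now r <- do.task(rnorm(50)) prints nicely when typed at the prompt, and plot(r, which = 2) selects the density plot, just as with plot.lm().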

Testing

You will notice that the first of the four adjectives above is: Correct. Most computer programs in the world are wrong. Most computer programs in the world fail when subject to rigorous testing. The most important single challenge that we face is that of building ever-more-complex edifices of code while delivering correct code.

In order to build a big edifice, you have to have sound foundations. It is useful to focus on a core set of functions which get the bulk of the functionality done, and hammer away at writing them well and testing them well. This is 90% of the job of any project. The rest is data plumbing and presentation.

The trust that we place in any one building block is based on how well tested it is. What is testing and how can we improve it?

Testing is the formal process of feeding in an input and verifying that the expected output is generated. Testing is not something done informally. It must manifest itself as code and data files. Here is a trivial example. Suppose we have a function which computes the cube of a number:

  cube <- function(x) {
    x*x*x
  }

How do we test this? We make a list of inputs and the expected output:

  x <- c(1,2,3,4)
  x.expected <- c(1,8,27,64)

and then we verify that the function behaves as expected:

  all.equal(cube(x), x.expected)

Suppose this is organised as two files:

  # ---------- the file cube.R :
  cube <- function(x) {
    x*x*x
  }
  
  # ---------- and the file cube_test.R :
  source("cube.R")
  x <- c(1,2,3,4)
  x.expected <- c(1,8,27,64)
  stopifnot(isTRUE(all.equal(cube(x), x.expected)))  # stop with an error if any test fails

Now you would be able to say:

  $ R --slave < cube_test.R

and verify that cube.R has passed all the test cases.

The key point is: Anyone can run all the test cases, anytime, with zero effort. That is the hallmark of formal testing. Testing should not be an occasional thing, it should be an everyday rigour in the development process.

Every time you touch cube.R, you risk introducing fresh bugs into it. Hence, every time you modify cube.R, you can and should rerun cube_test.R to verify that it still works. All the test cases in a project can be strapped together with a "make test" target, thus making regular checking for mistakes an everyday affair.

This brings us to the question of what is in the list of test cases. Testing is truly hard. There are four strategies through which test cases can be created:

Theory as a guide
Sometimes, theory guides us on what the answer should be from a function when a certain input is supplied. Ask yourself: Is there a simplified special case of what you are doing which collapses into a well known problem?
Replicating published results
Sometimes, a well trusted published result is visible which tells us the correct answer for a certain input.
Comparison against other well trusted implementations
Sometimes, we are able to take a given dataset to other software implementations, and then compare the resulting answers.
Estimation using a simulated dataset
For problems of estimation, it is always possible to write down a true parameter vector and simulate a dataset from the true model. When this simulated dataset is fed back to your estimation function, in large sample, it should recover the known true parameter vector.
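A minimal sketch of the fourth strategy, using OLS as the estimator:

```r
set.seed(1)
n <- 100000
beta <- c(2, -0.5)                    # the known true parameter vector
x <- rnorm(n)
y <- beta[1] + beta[2] * x + rnorm(n)

## In large sample, the estimator should recover the truth
beta.hat <- coef(lm(y ~ x))
stopifnot(max(abs(beta.hat - beta)) < 0.02)
```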

When you compare your results against standard packages like Stata or Gauss, here's a sad fact: if the answers agree exactly, that's good news; but if the answers disagree, it can be owing to mistakes in the Stata or Gauss implementations. Similarly, a disappointing number of published papers in economics contain data and/or computational mistakes. These mistakes, and the lack of interest in reproducible research, are the soft underbelly of the profession. Hence, a wise testing strategy is one which involves all four paths.

It is a good idea to go to unreasonable lengths on testing. Investigate all these four avenues, and try to amass as many test cases as possible into your `test plan'. The more the testing that is done, the more your code can be trusted. It is good to write a file TESTPLAN which gives a strategic picture of what are all the tests that you are doing, and how you have amassed these tests.

`Code coverage' is an issue. The full battery of test cases should end up exercising all aspects of the functionality. Suppose you have an f(x) and you are thoroughly testing for positive x, but you are not doing a single test for x=0 or for negative x. This is a situation where your test plan lacks comprehensive code coverage: if you are only testing for x>0, you cannot rule out the possibility of errors for x=0 and for x<0.

As a rule of thumb, suppose it takes a1 man-days and a2 lines of code to get up to a first `fully working' implementation. My thumb-rule is that it will take 3 to 4 times that effort to set up a high quality test plan and to solve the problems that it inevitably throws up. That is, you have to plan for something like 3a1 to 4a1 man-days, and a comparable multiple of a2 lines of code, in the testing process. By the time you get really good at this, you will start thinking about the test plan, and building it, from the early stages of any project; testing should not be an afterthought.

Endgame: A well loved CRAN package

If you do a really good job of coming up with functions that have the three features described above (easy to describe, fairly general, and useful in many situations), you could be on the road to a well loved CRAN package.

You are ripe for a package when you have a coherent group of inter-related functions, documentation for each of them, and a testing framework. It would make sense for you to get going with a version 0.01 of the package, purely for yourself and your immediate colleagues, and then down the line when it reaches a more mature 0.02 version you can put it out.


Ajay Shah, 2010