Getting started with the `boot' package in R for bootstrap inference


The package boot has elegant and powerful support for bootstrapping. In order to use it, you have to repackage your estimation function as follows.

R has very elegant and abstract notation in array indexes. Suppose there is an integer vector OBS containing the elements 2, 3, 7, i.e. that OBS <- c(2,3,7);. Suppose x is a vector. Then the notation x[OBS] is a vector containing elements x[2], x[3] and x[7]. This beautiful notation works for x as a dataset (data frame) also. Here are demos:

# For vectors --
> x = c(10,20,30,40,50)
> d = c(3,2,2)
> x[d]
[1] 30 20 20

# For data frames --
> D = data.frame(x=seq(10,50,10), y=seq(500,100,-100))
> t(D)
    1   2   3   4   5
x  10  20  30  40  50
y 500 400 300 200 100
> D[d,]
     x   y
3   30 300
2   20 400
2.1 20 400

Now for the key point: how does the R boot package work? The R package boot repeatedly calls your estimation function, and each time, the bootstrap sample is supplied using an integer vector of indexes like above. Let me show you two examples of how you would write estimation functions which are compatible with the package:

samplemean <- function(x, d) {
  return(mean(x[d]))
}

samplemedian <- function(x, d) {
  return(median(x[d]))
}

The estimation function (that you write) consumes data x and a vector of indices d. This function will be called many times, one for each bootstrap replication. Every time, the data `x' will be the same, and the bootstrap sample `d' will be different.

At each call, the boot package will supply a fresh set of indices d. The notation x[d] allows us to make a brand-new vector (the bootstrap sample), which is given to mean() or median(). This reflects sampling with replacement from the original data vector.

Once you have written a function like this, here is how you would obtain bootstrap estimates of the standard deviation of the distribution of the median:

    b = boot(x, samplemedian, R=1000)           # 1000 replications

The object `b' that is returned by boot() is interesting and useful. Say ?boot to learn about it. For example, after making b as shown above, you can say:

    print(sd(b$t[,1]))

Here, I'm using the fact that b$t is a matrix containing 1000 rows which holds all the results of estimation. The 1st column in it is the only thing being estimated by samplemedian(), which is the sample median.

The default plot() operator does nice things when fed with this object. Try it: say plot(b)

Dealing with data frames

Here is an example, which uses the bootstrap to report the ratio of two standard deviations:

library(boot)

sdratio <- function(D, d) {
  E=D[d,]
  return(sd(E$x)/sd(E$y))
}

x = runif(100)
y = 2*runif(100)
D = data.frame(x, y)

b = boot(D, sdratio, R=1000)
cat("Standard deviation of sdratio = ", sd(b$t[,1]), "\n")
ci = boot.ci(b, type="basic")
cat("95% CI from ", ci$basic[1,4], " - ", ci$basic[1,5], "\n")

Note the beautiful syntax E = D[d,] which gives you a data frame E using the rows out of data frame D that are specified by the integer vector d.

Sending more stuff to your estimation function

Many times, you want to send additional things to your estimation function. You're allowed to say whatever you want to boot(), after you have supplied the two mandatory things that he wants. Here's an example: the trimmed mean.

The R function mean() is general, and will also do a trimmed mean. If you say mean(x, 0.1), then it will remove the most extreme 10% of the data at both the top and the bottom, and report the mean of the middle 80%. Suppose you want to explore the sampling characteristics of the trimmed mean using boot(). You would write this:

trimmedmean <- function(x, d, trim=0) {
  return(mean(x[d], trim/length(x)))
}

Here, I'm defaulting trim to 0. And, I'll allowing the caller to talk in the units of observations, not fractions of the data. So the user would say "5" to trim off the most extreme 5 observations at the top and the bottom. I convert that into fractions before feeding this to mean().

Here's how you would call boot() using this:

    b = boot(x, trimmedmean, R=1000, trim=5)

This sends the extra argument trim=5 to boot, which sends it on to our trimmedmean() function.

Finding out more

The boot() function is very powerful. The above examples only scratch the surface. Among other things, it does things like the block bootstrap for time-series data, randomly censored data, etc. The manual can be accessed by saying:

library(boot)
?boot

but what you really need is the article Resampling Methods in R: The boot package by Angelo J. Canty, which appeared in the December 2002 issue of R News.

Also see the web appendix to An R and S-PLUS Companion to Applied Regression by John Fox [pdf], and a tutorial by Patrick Burns [html].


Return to R by example


Ajay Shah
ajayshah at mayin dot org