Things to be careful about in R


This file is about mistakes I made or still make: things that I found to be odd and surprising about R.

Handling of multi-line expressions / statements

In R, the use of semicolons between statements is optional, and most people don't bother. So when you say:

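   y = 2 + 3*x
       + 7*z

R treats the first line as a complete statement, i.e. as if you had said just y = 2 + 3*x. The second line is then evaluated as a separate expression, so y never gets the 7*z term, and there is no error or warning.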

The Unix use of backslash for long lines does NOT work: in R the backslash is an escape character inside strings, and at the end of a line of code it is a syntax error. (!!!)

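Solutions to this: (a) Use emacs, it helps you to detect some such errors, (b) Try to use semicolons, even though R doesn't need them, so that you get used to marking where each statement ends, (c) You have to signal to R that an expression is incomplete, e.g. put "y = 2 + 3*x +" on the 1st line, so R knows something is to come. You can also open a "(" on the 1st line and close it after the last line, so R keeps reading until the parenthesis is closed. A "{" does not do the same job: inside braces, a line that is complete on its own still ends the statement.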
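Here is a small sketch of the problem and of solution (c). The values of x and z are made up for illustration, and the names y_wrong, y_op and y_paren are mine.

```r
x <- 2
z <- 1

# Pitfall: the first line is already a complete statement, so the
# second line is evaluated on its own and never added to y_wrong.
y_wrong <- 2 + 3*x
           + 7*z

# Fix 1: end the first line with an operator so R expects more.
y_op <- 2 + 3*x +
        7*z

# Fix 2: wrap the whole expression in parentheses.
y_paren <- (2 + 3*x
            + 7*z)

print(c(y_wrong, y_op, y_paren))   # 8 15 15
```

Only the first version loses the 7*z term.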

min() and max() result in mistakes

When there are NA (not available) values in the data, min() and max() return NA instead of the smallest or largest observed value. You need to explicitly say hi = max(X, na.rm=TRUE) to get R to throw away NA data before computing the max.
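A quick illustration, using a made-up vector X with one missing value (the names hi_default, hi and lo are just for this sketch):

```r
X <- c(3, NA, 8, 1)

# Without na.rm, a single NA makes the answer NA.
hi_default <- max(X)
print(hi_default)            # NA

# With na.rm = TRUE the NA is dropped before computing.
hi <- max(X, na.rm = TRUE)
lo <- min(X, na.rm = TRUE)
print(c(lo, hi))             # 1 8
```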

Namespace

Consider this program:

demo <- function(t) {
   return(t+y);
}

y=7;
print(demo(3));

In a `normal' programming language, you'd think that y is uninitialised inside demo(), and you'd hope for an eloquent error message. In R, a name that is not found inside the function is looked up in the environment where the function was defined, which here is the global workspace. By the time demo(3) is called, y exists there, so it quietly propagates into the function as a global variable! So instead of getting an error message, you'll get the result 10! Be careful.
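The sketch below repeats the example and adds two things of my own for illustration: a caller() function, to show that a y local to the caller is NOT what demo() picks up, and demo_safe(), which takes y as an explicit argument and so cannot silently depend on a global.

```r
demo <- function(t) {
  t + y            # y is not defined in here; R goes looking outside
}

caller <- function() {
  y <- 100         # local to caller(); demo() does not see this one
  demo(3)
}

# Safer variant (hypothetical name): everything it needs is an argument.
demo_safe <- function(t, y) {
  t + y
}

y <- 7
print(demo(3))           # 10, using the global y
print(caller())          # still 10, not 103
print(demo_safe(3, 7))   # 10, with no hidden dependency
```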

try() and other variable names you shouldn't use

Lots of people, including myself, use "try" as a variable name. Be aware that try() is a function in the base package, so the name is already taken.

Be careful: don't use the variable names 'c', 'q', 'T', 'F', 't' since they all mean something in R: c() combines values, q() quits, t() transposes, and T and F are shorthand for TRUE and FALSE.
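To see why this matters, here is a sketch of what can go wrong with T. Unlike TRUE, T is an ordinary variable that can be reassigned. The vector X and the names bad and good are made up for this example.

```r
X <- c(3, NA, 8)

T <- FALSE                    # perfectly legal, and easy to do by accident

bad  <- max(X, na.rm = T)     # na.rm is now FALSE, so the NA wins
good <- max(X, na.rm = TRUE)  # TRUE is a reserved word and cannot be changed

print(bad)                    # NA
print(good)                   # 8

rm(T)                         # remove our variable; T means TRUE again
```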

Reading in data files

As Unix people, we are used to thinking that awk or perl will perfectly handle (say) pipe delimited data files. We say that the FS is pipe, and all is well from here on.

By default, this is NOT how R thinks. You can say sep="|", but if the data contains quotes or hashes, R gets unhappy: a quote character starts a quoted string, and a hash starts a comment. Read ?read.table carefully to understand the situation. There is also an R Data Import/Export manual. One way out is to say quote="" and comment.char="" as args to read.table(), which switches off both behaviours.
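As a sketch, here is a made-up pipe delimited file containing an apostrophe, a double quote and a hash, read with quoting and comment handling both switched off. The file contents and column names are invented for illustration.

```r
path <- tempfile(fileext = ".txt")
writeLines(c("id|name|note",
             "1|O'Brien|item #4",
             "2|Smith|5\" pipe"), path)

# sep = "|" alone is not enough for this file; quote = "" and
# comment.char = "" make R treat ' " and # as ordinary characters.
d <- read.table(path, sep = "|", header = TRUE,
                quote = "", comment.char = "",
                stringsAsFactors = FALSE)
print(d)
```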

Another problem that happens is like this. Suppose you do:

$ awk -F\| '{print $2}' file1 > col2

and then you try to eat the file col2 in R, saying something like x = read.table("col2"). If there are blank lines in col2, they will not be interpreted as missing data; they will be quietly skipped (read.table() defaults to blank.lines.skip=TRUE), so x ends up with fewer rows than file1 has lines.
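A small sketch of this, using a made-up one-column file with a blank third line. Reading it with readLines() instead is one possible way to keep the blank line in place; that is my suggestion, not the only fix.

```r
col2 <- tempfile()
writeLines(c("10", "20", "", "40"), col2)

# read.table() silently drops the blank line.
x <- read.table(col2)
print(nrow(x))           # 3, although the file has 4 lines

# readLines() keeps every line; the blank one becomes NA on conversion.
raw  <- readLines(col2)
vals <- as.numeric(raw)
print(vals)              # 10 20 NA 40
```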

When faced with oddities in reading stuff, write the file out after reading it, and compare with the original. That helps track things down.
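One way to do this round trip is sketched below, on a made-up pipe delimited file. The file contents and the names orig, copy and same are invented for this example; with real data you may need to adjust the write.table() arguments before the two files can match.

```r
orig <- tempfile(fileext = ".txt")
writeLines(c("id|name|note",
             "1|O'Brien|item #4",
             "2|Smith|plain"), orig)

d <- read.table(orig, sep = "|", header = TRUE,
                quote = "", comment.char = "",
                stringsAsFactors = FALSE)

# Write the data back out in the same layout as the input.
copy <- tempfile(fileext = ".txt")
write.table(d, copy, sep = "|", quote = FALSE,
            row.names = FALSE, col.names = TRUE)

# Compare line by line; FALSE means something was lost or changed.
same <- identical(readLines(orig), readLines(copy))
print(same)
```

If same comes out FALSE, looking at the two files side by side (e.g. with diff) usually shows which rows R read differently from what you expected.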

Printing inside a function

At the R commandline, if you say:

  m = lm(y ~ x)
  n = summary(m)
  n

what happens is that summary.lm() gets run on the object m, and it (in turn) returns the object n. When you say "n" (the 3rd statement), this is tantamount to saying "print(n)".

This can be collapsed as:

  m = lm(y ~ x)
  summary(m)

where you are running summary.lm() on m, and printing the result.

This does not work within a function. The following code will not behave as you think:

f <- function(x, y) {
  m = lm(y ~ x)
  summary(m)
  return(42)
}

The rules of the game inside a function are that unless you explicitly say print() no printing is done. To make f() here work, say:

f <- function(x, y) {
  m = lm(y ~ x)
  print(summary(m))
  return(42)
}
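To check the difference for yourself, the sketch below runs both versions on simulated data and uses capture.output() to collect whatever each one prints. The names f_silent and f_print and the simulated x and y are made up for this example.

```r
set.seed(1)
x <- rnorm(20)
y <- 1 + 2*x + rnorm(20)

f_silent <- function(x, y) {
  m = lm(y ~ x)
  summary(m)            # computed, then thrown away
  return(42)
}

f_print <- function(x, y) {
  m = lm(y ~ x)
  print(summary(m))     # explicitly printed
  return(42)
}

# capture.output() collects the lines each call prints.
out_silent <- capture.output(r1 <- f_silent(x, y))
out_print  <- capture.output(r2 <- f_print(x, y))

print(length(out_silent))     # 0: nothing was printed
print(length(out_print) > 0)  # TRUE: the summary was printed
```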



Ajay Shah
ajayshah at mayin dot org