R’s drop “gotcha” and the diag “curse”

This title is too dramatic. But it is justified. There really is a "drop gotcha" that has crushed my spirits, and the diag "curse" has caused me to lose sleep on many occasions.

In 2013, I noticed a problem in several of my functions. I call this problem the "drop gotcha." I've been meaning to write it up for the blog, but am lazy and was just sending everybody some emails about it. Now I'm collecting that up.

This one has some general educational value, I think, for R programming. I may put it into the Rchaeology vignette, it is by far the most common flaw in otherwise well written functions. Virtually everybody I know has had the "drop gotcha".

The problem is that R tries to simplify data structures, and sometimes we accidentally wander into the "simplify-able" range, and then ZAP!. We are often working with matrices, and we are going write code that chooses rows or columns. The program, or the user through the program, is supposed to choose pieces from a matrix, and along the way that matrix gets "damaged" an so we get crashes, but only sometimes. Sometimes we get WRONG answers, which is worse.

The first problem: the diag() function can give an unexpected answer, or fails,
when you least expect it. This is due to the fact that diag() is both an
extractor and a creator function. Those 2 purposes make the function appear
"handy" and economical, but it is sometimes a real frustration.

One purpose of diag() is to create a diagonal matrix with p elements, say p = 4

> X <- diag(4) 
> X
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

If the argument is a vector, all is well, it puts a vector along the diagonal of the properly sized matrix.

> X <- diag( c(1,2,3,4) ) 
> X
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    2    0    0
[3,]    0    0    3    0
[4,]    0    0    0    4

On the other hand, if p happens to be a matrix, then diag will extract the
diagonal, creating a vector.

diag(X) is

> X <- diag( c(1,2,3,4) ) 
> diag(X)
[1] 1 2 3 4

The problem is that diag does something with anything it finds, and so you often don't know you got the wrong thing until something turns out completely
incorrect.

Suppose you want to extract the diagonal of a sub-part from X. All is well if we have more than one element:

> Y <- X[3:4, 3:4] 
> is.matrix(Y)
[1] TRUE
> diag(Y)
[1] 3 4

Suppose there's some calculation that ends up just needing row 2, column 2. We
have many cases like this, if we are taking elements from a covariance matrix,
for example. Try to take the diagonal from that part.

> Y <- X[2,2] 
> Y
[1] 2

Appears correct, yes?

That will fail horribly when we try to get diag from Y. We want just the number
"2", since the diagonal is the same as the whole matrix Y

> Y <- X[2,2] 
> diag(Y)
     [,1] [,2]
[1,]    1    0
[2,]    0    1

This makes the horror more obvious:

> X <- matrix( rnorm(16), nrow=4) 
> X
           [,1]       [,2]        [,3]       [,4]
[1,]  0.1731106  1.3570314 -1.62255241 -1.7614656
[2,] -1.0023882 -0.2765341  0.17372976 -1.2202468
[3,] -1.0521568 -0.4828088 -0.22638791  0.8388842
[4,]  1.4163970  0.9988008  0.04774671 -0.3987813
> Y <- X[2:3, 2:3] 
> diag(Y)
[1] -0.2765341 -0.2263879
> Y <- X[2, 2] 
> diag(Y)
<0 x 0 matrix>

The diag() function does not know how to respond to this:

diag(-0.2765341)

And why does it get -02765341 instead if X[2, 2]?

Answer: the drop curse.

Run it like this instead:

> Y <- X[2, 2, drop = FALSE]
> Y
           [,1]
[1,] -0.2765341
> diag(Y)
[1] -0.2765341

If I were designing R, I wouldn't do this. But R has a common source of feature, which is that it tries to "simplify" and "downgrade" objects where possible.

X[2,2]

is not a matrix with 1 element, it becomes just a vector with 1 element.

Adding the drop = FALSE as a third argument tells it "keep this as a matrix". If you want to choose a single column out of a matrix, but you want the result to still be a matrix with one column, rather than an R vector, which is quite a different thing, run

X[ , 2, drop = FALSE]

In other words, I want column 2, and don't try to simplify this for me by dropping the dimension.

Sometimes you'll see people invent a "little dance", checking each individual calculation to see if R has employed the automatic coercion to a smaller dimension.Â Its not necessary, if all transactions with a matrix use the drop = FALSE argument.

I guess I haven't really explained why this comes up, and why it hits the rockchalk package so hard. I have a lot of instances where the flow is like this. A data frame X exists. There might be 20 columns, and some criteria are applied and a few columns are chosen. All is well if the calculation is like this:

Xsub <- X[ , c("A", "B","C")]

to choose 3 columns. However, suppose the column choose variable is created separately, and it is used like this

myMagicChooserVector <- unknownFunction()
Xsub <- X[ , myMagicChooserVector]

If myMagicChooserVector has 2 or more variable names, all is well. But if myMagicChooserVector is c("B"), for example, then the Xsub result here is NOT a matrix.

Xsub <- X[ , myMagicChooserVector]

In the middle of some complicated calculation, you might be thinking that thing is a matrix still, and you ask for row 1, mistakenly

Xsub[1, ]

But you've shot yourself in the foot. Xsub is a vector now, it does not have rows or columns anymore. Run this for yourself, it is easy to see.

> X <- matrix(rnorm(9), ncol = 3)
> Xsub &ly;- X[ ,3]
> Xsub[1, ]
Error in Xsub[1, ] : incorrect number of dimensions

You can stay out of trouble if you remember to add drop = FALSE

> Xsub <- X[ , 3, drop = FALSE]

I would like to have a global option for my programs and packages so that drop
always equals FALSE. I would rather manually shrink things from matrices to
vectors than have to "defensively code" everything to avoid accidental demotion.

R’s drop “gotcha” and the diag “curse”

About pauljohn

Leave a Reply Cancel reply

Archives

Meta

Meta