Like any language, R can be frustrating. I love R, but I have had my share of frustrating moments with it. When these frustrations are especially severe, it can feel like being trapped in The R Inferno, a title borrowed from Patrick Burns's book of the same name, itself a parody of Dante's Inferno.
This post is about one brief foray I had into a circle of The R Inferno, and how I got out. R experts may think of my problem as elementary, but I was stumped by it for a while, and I thought that by posting something about it I may help some other wandering soul.
I was working with a very large dataset, but I'll illustrate the problem with a tiny toy dataset. The task I was trying to perform was not, I think, very uncommon. I was using the plm package to predict an outcome variable using a factor variable. For purposes of illustration, look at this fake panel data. Here, "id" is the unit being studied over time (maybe a firm), "time" is either 1 or 2 for when the outcome was observed, "f" is the factor variable being used as a predictor, and "outcome" is the outcome variable:
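The original snippet isn't preserved here, but the data can be built with something like the following. The particular values are my own illustration; note the single "4" in f, which turns out to matter:

```r
# Hypothetical toy values: five units, each observed at two time points.
id      <- rep(1:5, each = 2)                        # unit being studied
time    <- rep(1:2, times = 5)                       # when outcome was observed
f       <- factor(c(1, 1, 2, 2, 3, 3, 1, 1, 2, 4))  # factor predictor
outcome <- c(3.1, 3.4, 2.8, 2.9, 4.0, 4.2, 3.3, 3.5, 2.7, 9.9)
```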
Looks simple enough, right? Now we combine them into a dataset:
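Assuming the four vectors above, a data.frame does the combining (d is just my name for it):

```r
d <- data.frame(id, time, f, outcome)
str(d)   # f is stored as a factor with four levels
```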
Now (I can see in retrospect) I took the innocent-seeming misstep that led to my condemnation to the R inferno. I shortened the dataset to remove one outlier:
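Something along these lines, assuming (as in my made-up data) that the outlier is the one row with outcome 9.9, which also happens to be the only row where f is "4":

```r
# Drop the one outlying observation. Crucially, this removes the only
# row where f == "4", but the factor still keeps "4" among its levels.
d.short <- d[d$outcome < 9, ]
nlevels(d.short$f)   # still 4, even though "4" no longer appears
```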
Finally, I ran a regression of the outcome variable on the factor variable in this shortened dataset:
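The call looked roughly like this (the model name and the index argument are my reconstruction; plm needs to know which columns identify the panel). On my setup this is the call that triggered the error below:

```r
library(plm)
# Regress the outcome on the factor, with (id, time) as the panel index.
mod <- plm(outcome ~ f, data = d.short, index = c("id", "time"))
```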
Here was the error I got:
```
Error in x[, !na.check] : (subscript) logical subscript too long
```
I was in the inferno. One challenge in debugging is that the errors come from commands you didn't think you called. In my case, I called plm, which runs a lot of code including, apparently, "x[, !na.check]". Another challenge is that the error messages aren't easily interpretable. "Logical subscript too long" didn't mean much to me at the time; it turns out it means a logical index vector was longer than the dimension it was indexing.
Since I don't have the storytelling skills of Dante, I won't tell you about my whole trip through the inferno. I'll just tell you the solution, or at least what I understand of the solution. It turns out that removing that one outlier was the cause of my problems. This is because I removed the only instance of the factor "4" in the data. R knew that f didn't have a "4" in it, but it also remembered the list of unique levels of f that were encoded when f was created. The mismatch between the number of unique levels that f was supposed to have, and the observed levels that it did have in the shortened dataset, was enough to cause this strange error to occur in plm. Beware, reader, that you don't make my mistake.
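Once I understood the cause, the way out was to re-encode the factor so the phantom level is dropped. Base R's droplevels (or simply wrapping the variable in factor() again) does this; a sketch, assuming the shortened data frame is called d.short:

```r
d.short$f <- droplevels(d.short$f)   # or: factor(d.short$f)
levels(d.short$f)                    # "4" is gone from the levels now
```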
In general, I have had several frustrating moments with factors in R. Another frustration I have is with (the lack of) casting. If I, for example, define the variable "f" as above, and then I try to do this:
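That is, with f defined as above, simply adding a number to it:

```r
f + 1   # arithmetic on a factor
```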
I get this output:
```
 [1] NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(f, 1) : ‘+’ not meaningful for factors
```
Fair enough, I suppose. Factors are different from numbers, and we can't treat them exactly the same. But why not? Would it be impossible for R to be smart enough to cast f as numeric, add 1 to it, and then cast it back to a factor? I am not enough of an expert on programming languages to know whether or why that would be infeasible for R to do. Maybe there are deeper reasons for the factor-related frustrations I've described in this post, and I just don't understand them. The bottom line is simply to beware when working with factor variables: they can act in strange ways and cause unforeseen errors. Though I love R, I think it's important to know and talk about these potential pitfalls.
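As a postscript on the casting point: the conversion can be done explicitly by hand, though it has a trap of its own. Calling as.numeric directly on a factor returns the internal level codes rather than the labels, so you have to go through as.character first. A sketch, assuming f as defined earlier:

```r
# Wrong: as.numeric(f) yields the level codes (1, 2, 3, ...), not the labels.
# Right: convert to character first, then to numeric, then add and re-factor.
factor(as.numeric(as.character(f)) + 1)
```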