I am working with a data set of 900,000 observations. There is a categorical variable x with 966 unique values that needs to be used as fixed effects. I include the fixed effects using factor(x) in the regression, which gives me this error:

Error: cannot allocate vector of size 6.9 Gb

How do I fix this error? Or do I need to handle the fixed effects differently in the regression?

Then, how do I run a regression like this:

rlm(y~x+ factor(fe), data=pd)
  • 966 columns times 9e5 rows is 869400000 cells. At 4 bytes per integer that is about 3.5 Gb, and R generally needs more memory on top of that for its internal operations. Commented Feb 23, 2020 at 20:06
  • Right! So how do people run regressions like this in R? Commented Feb 23, 2020 at 20:22
  • Apart from buying more memory? Here are some tips. Commented Feb 23, 2020 at 20:28

1 Answer

The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.

Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this, whatever regression machinery you're using has to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix plus a matrix-based fitter in the style of lm.fit, one that accepts sparse input, rather than being able to use lm).
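As a minimal sketch of the matrix-plus-response route for ordinary least squares, the normal equations can be solved directly on the sparse matrix with the Matrix package. The simulated data frame d, its column names, and the sizes are made up for illustration. This is neither lm.fit nor a robust fit, and normal equations can behave poorly on ill-conditioned designs.

```r
library(Matrix)

## Simulated stand-in data (hypothetical; not the asker's data)
set.seed(1)
n <- 1e4
d <- data.frame(x  = rnorm(n),
                fe = factor(sample(1:50, n, replace = TRUE)))
d$y <- 2 * d$x + as.numeric(d$fe) / 10 + rnorm(n)

## Sparse design matrix: intercept, x, and 49 dummy columns
X <- sparse.model.matrix(~ x + fe, data = d)

## Solve the normal equations (X'X) beta = X'y without densifying X
XtX  <- crossprod(X)        # small p x p matrix, still sparse
Xty  <- crossprod(X, d$y)   # p x 1
beta <- solve(XtX, Xty)

## Largest absolute difference from the coefficients of a dense lm() fit
max(abs(as.vector(beta) - coef(lm(y ~ x + fe, data = d))))
```

On a problem this small the dense lm() fit is cheap, so it serves as a sanity check before scaling up.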

In contrast to @RuiBarradas's estimate of 3.5Gb for a dense model matrix:

m <- Matrix::sparse.model.matrix(~x,
     data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
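For comparison, a back-of-the-envelope size for the dense version can be worked out by hand. R's model.matrix() returns a double-precision matrix, so this sketch assumes 8 bytes per cell; it ignores the intercept, any other predictors, and temporary copies. The variable names are illustrative.

```r
n_obs  <- 9e5   # rows, from the question
n_cols <- 966   # roughly one column per factor level
bytes_per_double <- 8

## Size of the dense matrix in Gb (1 Gb = 1024^3 bytes, as R reports it)
dense_gb <- n_obs * n_cols * bytes_per_double / 1024^3
round(dense_gb, 1)
## [1] 6.5
```

Under that assumption the dense matrix is of the same order as the 6.9 Gb allocation in the error message, and roughly double a 4-bytes-per-cell estimate.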

If you are using the rlm function from the MASS package, something like this should work:

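library(Matrix)
library(MASS)
## build the model matrix in sparse form (only nonzero cells are stored)
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
## matrix interface: pass the design matrix and response directly, no formula
rlm(y=pd$y, x=mm, ...)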

Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
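One way to probe this cheaply is to try the same call on a small simulated data set first. The sketch below is only a test harness: pd_small and its columns are hypothetical stand-ins for pd, and tryCatch() is used so that the script reports the outcome either way rather than assuming rlm() accepts the sparse matrix.

```r
library(Matrix)
library(MASS)

## Small simulated stand-in for pd (hypothetical data)
set.seed(1)
n <- 2000
pd_small <- data.frame(x = rnorm(n), fe = sample(1:20, n, replace = TRUE))
pd_small$y <- 1 + 2 * pd_small$x + pd_small$fe / 10 + rnorm(n)

mm <- sparse.model.matrix(~ x + factor(fe), data = pd_small)

## Capture an error instead of stopping, so both outcomes are visible
fit <- tryCatch(rlm(x = mm, y = pd_small$y), error = function(e) e)

if (inherits(fit, "error")) {
  message("rlm() rejected the sparse matrix: ", conditionMessage(fit))
} else {
  print(head(coef(fit)))
}
```

A successful fit on the small case does not show that the full-size call stays sparse internally; watching memory use while growing n is a more direct check.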

4 Comments

Thanks! But to run rlm(y~x+ factor(fe), data=pd), what should I do exactly? How do I write your above command for it?
I've answered this. Note that this is the kind of information that's really useful to include in your question in the first place ...
No! I got the same error, this time 5.9 Gb. I simplified my problem above: I have two sets of fixed effects, one with 966 unique values and the other with 170. Is that why I get the error?
Are both factors? That is, is your model y ~ factor(x1) + factor(x2)? As I said, it's possible that MASS::rlm does some internal computation at some point that coerces the model matrix to dense ...
