Memory issue when using factor for fixed effect regression

Question

I am working with a data set of 900,000 observations. It has a categorical variable x with 966 unique values that needs to be included as fixed effects. I include the fixed effects using factor(x) in the regression, which gives me this error:

Error: cannot allocate vector of size 6.9 Gb

How can I fix this error? Or do I need to handle the fixed effects differently in the regression?

Then, how do I run a regression like this:

rlm(y~x+ factor(fe), data=pd)

966 columns times 9e5 rows is 869,400,000 cells. At 4 bytes per integer that is about 3.5 Gb. And R generally needs more memory for its internal operations.
– Rui Barradas, Commented Feb 23, 2020 at 20:06

Ben Bolker · Accepted Answer · 2020-02-24 00:42:03Z


The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
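To make that structure concrete, here is a minimal sketch on a made-up five-level factor (the data and the names `f` and `X` are illustrative, not from the question). It drops the intercept with `~ 0 + f` so that every level gets its own indicator column; with an intercept and R's default treatment contrasts, the first level is absorbed into the intercept column instead.

```r
# Toy factor with 5 levels (hypothetical data, for illustration only)
f <- factor(c("a", "b", "c", "d", "e", "a", "c"))

# Dense indicator matrix with one column per level (no intercept)
X <- model.matrix(~ 0 + f)

dim(X)         # 7 rows, 5 columns
rowSums(X)     # every row sums to 1: exactly one indicator is set per row
mean(X == 0)   # share of zero entries (0.8 here; it grows with the number of levels)
```

With 966 levels the same layout stores 965 zeros for every 1, which is what a sparse representation avoids.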

Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix + lm.fit rather than being able to use lm).

In contrast to @RuiBarradas's estimate of 3.5Gb for a dense model matrix:

m <- Matrix::sparse.model.matrix(~x,
     data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
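The same comparison can be reproduced side by side on a scaled-down problem. This is a sketch on simulated data; the sizes `n` and `k` and the object names are assumptions chosen so that the dense matrix still fits comfortably in memory. The last line is a back-of-envelope figure for the full-size dense matrix: `model.matrix()` stores doubles, so each cell takes 8 bytes rather than 4, which gives roughly 6.5 GiB and is in the same range as the reported allocation error.

```r
library(Matrix)

set.seed(1)
# Scaled-down, simulated stand-in for the question's data (hypothetical sizes)
n <- 20000
k <- 100
d <- data.frame(x = factor(sample(1:k, size = n, replace = TRUE)))

sparse_mm <- sparse.model.matrix(~ x, data = d)  # stores only the nonzero cells
dense_mm  <- model.matrix(~ x, data = d)         # stores every cell as a double

# Same dimensions, very different memory footprints
dim(sparse_mm)
format(object.size(sparse_mm), units = "Mb")
format(object.size(dense_mm),  units = "Mb")

# Rough size of the full dense matrix from the question, in GiB:
# 9e5 rows x 966 columns x 8 bytes per double
9e5 * 966 * 8 / 1024^3
```

Any function that coerces the sparse matrix back to an ordinary matrix internally would pay the dense cost again.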

If you are using the rlm function from the MASS package, something like this should work:

library(Matrix)
library(MASS)
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
rlm(y=pd$y, x=mm, ...)

Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
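For ordinary (non-robust) least squares, the "model matrix + response vector" route can be sketched without ever forming a dense matrix. The example below is an illustration on simulated data, not the approach rlm() uses: it solves the normal equations with the Matrix package's sparse solver, which is generally less numerically stable than the QR decomposition behind lm(). The data frame `d`, its columns, and the sizes are all hypothetical.

```r
library(Matrix)

set.seed(1)
n <- 2000
# Simulated data (hypothetical): one numeric predictor and a 50-level factor
d <- data.frame(x  = rnorm(n),
                fe = factor(sample(1:50, size = n, replace = TRUE)))
d$y <- 1 + 2 * d$x + rnorm(n)

# Sparse design matrix: intercept, x, and the indicator columns for fe
X <- sparse.model.matrix(~ x + fe, data = d)

# Solve the normal equations (X'X) b = X'y while keeping X sparse
XtX  <- crossprod(X)         # sparse symmetric cross-product
Xty  <- crossprod(X, d$y)    # one-column matrix
beta <- drop(as.matrix(solve(XtX, Xty)))
names(beta) <- colnames(X)

# Cross-check against lm() on the same small data set;
# the two should agree up to numerical tolerance
fit <- lm(y ~ x + fe, data = d)
all.equal(unname(beta), unname(coef(fit)))
```

This only covers the least-squares case. A robust fit such as rlm() iterates weighted least-squares steps, and whether its implementation keeps the matrix sparse is exactly the open question noted above.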

edited Feb 24, 2020 at 0:42

answered Feb 23, 2020 at 20:38

Ben Bolker


4 Comments

D_B Over a year ago

Thanks! But to run rlm(y~x+ factor(fe), data=pd), what should I do exactly? How do I write your above command for it?

Ben Bolker Over a year ago

I've answered this. Note that this is the kind of information that's really useful to include in your question in the first place ...

D_B Over a year ago

No! I got the same error, this time 5.9 Gb. I simplified my problem above: I have two sets of fixed effects, one with 966 unique values and the other with 170. Is that why I get the error?

Ben Bolker Over a year ago

Are both factors? That is, is your model y ~ factor(x1) + factor(x2)? As I said, it's possible that MASS::rlm does some internal computation at some point that coerces the model matrix to dense ...
