
I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.

In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.

What would be the quickest OLS implementation for larger datasets? The statsmodels OLS implementation seems to use numpy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially fast implementation?

Or maybe an MCMC-based approach using PyMC would be quickest...

Update 1: It seems that the scikit-learn implementation of LinearRegression is a wrapper around the scipy implementation.

Update 2: SciPy OLS via scikit-learn's LinearRegression is twice as fast as statsmodels OLS in my (very limited) tests...
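For what it's worth, the two main direct approaches can be compared on synthetic data. This is a minimal sketch (data shapes and sizes are made up for illustration): a least-squares solve via `numpy.linalg.lstsq` (the same family of LAPACK routine that scipy, and hence scikit-learn's LinearRegression, relies on) versus solving the normal equations directly, which is often faster for tall, skinny matrices but numerically worse if the design matrix is ill-conditioned:

```python
import time
import numpy as np

# Synthetic tall-and-skinny data: many rows, few columns
# (shape chosen purely for illustration)
rng = np.random.RandomState(0)
n_rows, n_cols = 200_000, 50
X = rng.randn(n_rows, n_cols)
beta_true = rng.randn(n_cols)
y = X @ beta_true + 0.1 * rng.randn(n_rows)

# 1) Direct least-squares solve (LAPACK-backed, as in scipy/scikit-learn)
t0 = time.time()
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
t_lstsq = time.time() - t0

# 2) Normal equations: solve (X'X) beta = X'y. Only a small
#    n_cols x n_cols system, but less numerically robust.
t0 = time.time()
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
t_normal = time.time() - t0

print(f"lstsq: {t_lstsq:.3f}s, normal equations: {t_normal:.3f}s")
```

On well-conditioned data the two coefficient vectors agree to high precision; the normal-equations route is usually faster when rows vastly outnumber columns, which matches the data shape described in the comments.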

  • how many rows/observations and how many columns/explanatory variables do you have? Commented Jul 1, 2014 at 11:09
  • About a hundred explanatory variables, and rows in the millions Commented Jul 1, 2014 at 17:13

2 Answers


The scikit-learn SGDRegressor class is (if I recall correctly) the fastest, but it would probably be harder to tune than a simple LinearRegression.

I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data - if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.

Once you find a few candidate models, then you can try those on the whole dataset.
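A sketch of that workflow on synthetic data (all sizes and hyperparameters here are illustrative, not tuned): subsample a few thousand rows, fit an SGDRegressor with standardized features alongside a plain LinearRegression baseline, then score both on the full dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the large dataset (sizes are illustrative)
rng = np.random.RandomState(42)
X = rng.randn(100_000, 20)
y = X @ rng.randn(20) + 0.1 * rng.randn(100_000)

# Subsample a few thousand rows for fast model exploration
idx = rng.choice(len(X), size=5_000, replace=False)
X_sub, y_sub = X[idx], y[idx]

# SGD is sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_sub)
sgd = SGDRegressor(max_iter=1000, tol=1e-4, random_state=0)
sgd.fit(scaler.transform(X_sub), y_sub)

# Baseline: exact least squares on the same subsample
ols = LinearRegression().fit(X_sub, y_sub)

print("SGD R^2 on full data:", sgd.score(scaler.transform(X), y))
print("OLS R^2 on full data:", ols.score(X, y))
```

If the subsample really is representative, both models should score nearly as well on the full data as on the subsample, and only the surviving candidates need a final fit on everything.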


Stepwise methods are not a good way to perform model selection: they are entirely ad hoc, and the result depends heavily on the direction in which you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is reversible-jump MCMC (rjMCMC), which fits models over the entire model space, not just the parameter space of a particular model.

PyMC does not implement rjMCMC itself, but it can be implemented on top of it. Note also that PyMC3 makes it very easy to fit regression models using its new glm submodule.

1 Comment

Good point. I could use a number of other approaches as well, such as elastic net, but I have my reasons.
