
I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.

In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.

What would be the quickest OLS implementation for larger datasets? The statsmodels OLS implementation seems to use numpy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially fast implementation?

Or maybe an MCMC-based approach using PyMC would be quickest...

Update 1: It seems that the scikit-learn implementation of LinearRegression is a wrapper around the scipy implementation.

Update 2: SciPy OLS via scikit-learn's LinearRegression is twice as fast as statsmodels OLS in my (very limited) tests...
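For what it's worth, the two main direct approaches can be compared on synthetic data. This is a minimal sketch (data shapes and sizes are made up for illustration): a least-squares solve via `numpy.linalg.lstsq` (the same family of LAPACK routine that scipy, and hence scikit-learn's LinearRegression, relies on) versus solving the normal equations directly, which is often faster for tall, skinny matrices but numerically worse if the design matrix is ill-conditioned:

```python
import time
import numpy as np

# Synthetic tall-and-skinny data: many rows, few columns
# (shape chosen purely for illustration)
rng = np.random.RandomState(0)
n_rows, n_cols = 200_000, 50
X = rng.randn(n_rows, n_cols)
beta_true = rng.randn(n_cols)
y = X @ beta_true + 0.1 * rng.randn(n_rows)

# 1) Direct least-squares solve (LAPACK-backed, as in scipy/scikit-learn)
t0 = time.time()
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
t_lstsq = time.time() - t0

# 2) Normal equations: solve (X'X) beta = X'y. Only a small
#    n_cols x n_cols system, but less numerically robust.
t0 = time.time()
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
t_normal = time.time() - t0

print(f"lstsq: {t_lstsq:.3f}s, normal equations: {t_normal:.3f}s")
```

On well-conditioned data the two coefficient vectors agree to high precision; the normal-equations route is usually faster when rows vastly outnumber columns, which matches the data shape described in the comments.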

  • how many rows/observations and how many columns/explanatory variables do you have? Commented Jul 1, 2014 at 11:09
  • About a hundred explanatory variables, and rows in the millions Commented Jul 1, 2014 at 17:13

2 Answers


The scikit-learn SGDRegressor class is (if I recall correctly) the fastest, but it would probably be harder to tune than a simple LinearRegression.

I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data - if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.

Once you find a few candidate models, then you can try those on the whole dataset.
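A sketch of that workflow on synthetic data (all sizes and hyperparameters here are illustrative, not tuned): subsample a few thousand rows, fit an SGDRegressor with standardized features alongside a plain LinearRegression baseline, then score both on the full dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the large dataset (sizes are illustrative)
rng = np.random.RandomState(42)
X = rng.randn(100_000, 20)
y = X @ rng.randn(20) + 0.1 * rng.randn(100_000)

# Subsample a few thousand rows for fast model exploration
idx = rng.choice(len(X), size=5_000, replace=False)
X_sub, y_sub = X[idx], y[idx]

# SGD is sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_sub)
sgd = SGDRegressor(max_iter=1000, tol=1e-4, random_state=0)
sgd.fit(scaler.transform(X_sub), y_sub)

# Baseline: exact least squares on the same subsample
ols = LinearRegression().fit(X_sub, y_sub)

print("SGD R^2 on full data:", sgd.score(scaler.transform(X), y))
print("OLS R^2 on full data:", ols.score(X, y))
```

If the subsample really is representative, both models should score nearly as well on the full data as on the subsample, and only the surviving candidates need a final fit on everything.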


Stepwise methods are not a good way to perform model selection: they are entirely ad hoc, and the result depends heavily on the direction in which you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is reversible-jump MCMC (rjMCMC), which fits models over the entire model space, not just the parameter space of a particular model.

PyMC does not implement rjMCMC itself, but it can be implemented on top of it. Note also that PyMC3 makes it very easy to fit regression models using its new glm submodule.

1 Comment

Good point. I could use a number of other approaches as well, such as elastic net, but I have my reasons.
