
I have a model created with scikit-learn and a huge test dataset to predict. To speed up the prediction I want to implement multiprocessing, but I'm really unable to crack it and need help in this regard.

import pandas as pd
from sklearn.externals import joblib
dataset = pd.read_csv('testdata.csv')  # 8mln rows
feature_cols = ['col1', 'col2', 'col3']

#load model
model = joblib.load(model_saved_path)                # random-forest classifier

# predict function
def predict_func(model, data, feature_cols):
    return model.predict(data[feature_cols])

#Normal Execution
predict_vals = predict_func(model, dataset, feature_cols) #130 secs

Now I want to use multiprocessing for the prediction: chunk the dataset, run the predict function on each chunk on a separate core, then join the results back together.

But I am not able to do so.

I have tried

import multiprocessing as mp
def mp_handler():
    p = mp.Pool(3)                    # I think this starts 3 processes
    p.map(predict_func, dataset)      # How do I pass the other parameters?
mp_handler()

I have no idea if this is the way to do multiprocessing in Python (forgive my ignorance here). I read a few search results and came up with this.

If somebody can help with the code, that would be a great help, or a link to read up on multiprocessing would be fair enough. Thanks.

  • Consider using joblib. And also check whether your classifier/regressor isn't already parallelized for this (the shown code is incomplete to decide on this)! Commented Nov 20, 2017 at 11:09
  • @sascha - The above is what I have written so far. If you have any sample reference for "also check your classifier/regressor if it's not already parallelized for this", please post the link. Thanks. Commented Nov 20, 2017 at 11:12
  • No. You should probably post more information. And your comment adds nothing new. Yes, you wrote that and it does not work. I just mentioned an abstraction layer which is the core of all sklearn parallelizations (in terms of this kind of parallelization; ignoring SIMD or OpenMP). Commented Nov 20, 2017 at 11:13
  • Information like what? Commented Nov 20, 2017 at 11:13
  • Reread my first comment. Commented Nov 20, 2017 at 11:14

1 Answer


You used a RandomForest (which I would have guessed anyway from the slow prediction).

The takeaway message here is: it's already parallelized (at the ensemble level!), and all your attempts to parallelize at the outer level will slow things down!

How I split things into these levels is somewhat arbitrary, but what I mean is:

  • lowest level: the core algorithm is parallel
    • the decision tree is the core of RF; not parallel (in sklearn)!
    • affects single-prediction performance
  • medium level: the ensemble algorithm is parallel
    • RF = multiple decision trees: parallel (in sklearn)!
    • affects single-prediction performance
  • high level: the batch prediction is parallel
    • this is what you want to do, and it only makes sense if the lower levels don't already exploit your capacity!
    • does not affect single-prediction performance (as you already know)

The general rule is:

  • if you use the correct arguments (e.g. n_jobs=-1; not the default!):
    • RF will use min(number of cores, n_estimators) cores!
      • an additional speedup from outer-level parallelization can only be achieved if that number is lower than your number of cores!

So you should set the right n_jobs argument at training time to get parallelization. sklearn will use it as explained, and it can be seen here.
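For illustration only, an untested sketch of setting n_jobs at training time (train_data and train_labels are placeholders for your own training objects):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1: fit the trees on all available cores
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(train_data[feature_cols], train_labels)

# predict() reuses the same n_jobs setting for the ensemble
predict_vals = model.predict(dataset[feature_cols])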

If you already trained your classifier with n_jobs=1 (not parallel), things get more difficult. It might work to do:

# untested
model = joblib.load(model_saved_path)
#model.n_jobs = -1                     # unclear if -1 is substituted earlier
model.n_jobs = 4                       # more explicit usage

Keep in mind that using n_jobs > 1 uses more memory!

Take your favorite OS monitor, make sure you set up your classifier correctly (parallel -> n_jobs), and observe the CPU usage during raw prediction. This is not for evaluating the effect of parallelization, but for some indication that it is using parallelization!

If you still need parallelization, e.g. when you have 32 cores and use n_estimators=10, then use joblib, the multiprocessing wrapper by the sklearn people that is used a lot within sklearn. The basic examples should be ready to use!
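For example, a rough and untested sketch of such outer-level batch prediction with joblib (the chunk count is arbitrary; tune it to your number of cores):

import numpy as np
from joblib import Parallel, delayed   # in older sklearn versions also available as sklearn.externals.joblib

# split the rows into chunks and predict each chunk in its own worker
n_chunks = 4
chunks = np.array_split(dataset[feature_cols], n_chunks)

results = Parallel(n_jobs=n_chunks)(
    delayed(model.predict)(chunk) for chunk in chunks
)
predict_vals = np.concatenate(results)

Again: only do this if the forest itself does not already keep all your cores busy.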

Whether this speeds things up will depend on many, many things (IO and co).


1 Comment

Thank you for the detailed explanation @sascha. I have tried various combinations of n_jobs and understood your points clearly (I have a lot to learn about machine learning). Thanks again :).
