I have a model created with scikit-learn and a huge test dataset to predict on. To speed up the prediction I want to implement multiprocessing, but I am really unable to crack it and need help in this regard.
import pandas as pd
from sklearn.externals import joblib  # note: in newer scikit-learn versions this is plain `import joblib`

dataset = pd.read_csv('testdata.csv')  # 8 mln rows
feature_cols = ['col1', 'col2', 'col3']

# load model
model = joblib.load(model_saved_path)  # random-forest classifier

# predict function
def predict_func(model, data, feature_cols):
    return model.predict(data[feature_cols])

# normal execution
predict_vals = predict_func(model, dataset, feature_cols)  # 130 secs
Now I want to use multiprocessing for the prediction: chunk the dataset, run the predict function on each chunk on a separate core, and then join the results back together (see the sketch below). But I have not been able to get it working.
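To make the plan concrete, here is the sequential version of what I am trying to parallelise; np.array_split is my assumption for splitting a DataFrame into roughly equal pieces:

import numpy as np

chunks = np.array_split(dataset, 3)                             # 3 roughly equal DataFrame pieces
parts = [predict_func(model, c, feature_cols) for c in chunks]  # predict on each piece
predict_vals = np.concatenate(parts)                            # same result as the single call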
This is what I have tried:
import multiprocessing as mp

def mp_handler():
    p = mp.Pool(3)                   # I think this starts 3 processes
    p.map(predict_func, testData)    # but how do I pass the other parameters?

mp_handler()
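From the search results I read, functools.partial seems to be one way to fix the model and feature_cols arguments so that Pool.map only iterates over the data chunks. Below is my best-guess sketch (mp_predict is my own helper, with the chunk argument moved last so partial can bind the other two); I am not sure it is correct or idiomatic:

import multiprocessing as mp
from functools import partial
import numpy as np

def mp_predict(model, feature_cols, data):
    # chunk argument goes last so partial can fix the first two
    return model.predict(data[feature_cols])

def mp_handler(model, dataset, feature_cols, n_procs=3):
    chunks = np.array_split(dataset, n_procs)   # one chunk per worker process
    with mp.Pool(n_procs) as pool:              # (on Windows this needs an `if __name__ == '__main__':` guard)
        parts = pool.map(partial(mp_predict, model, feature_cols), chunks)
    return np.concatenate(parts)                # reassemble in the original row order

predict_vals = mp_handler(model, dataset, feature_cols)

I am also unsure whether sending a copy of the random-forest model to every worker process is a problem memory-wise.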
I have no idea if this is the way to do multiprocessing in Python (forgive my ignorance here); I read a few search results and came up with the above. If somebody can help with the code, that would be a great help, or a link to read up on multiprocessing would be fair enough. Thanks.