Is there a way to parallelize multiple model-building procedures in scikit-learn? I know that I can use the n_jobs argument in both GridSearchCV and cross_validate to parallelize within a single model-building procedure. However, I am running multiple model-building procedures in a for-loop with different input parameters and saving the results in a list. As an example, suppose I have 15 free CPUs and I am using n_jobs=5 in cross_validate. If I am not mistaken, that means a single model-building procedure uses 5 CPUs. Is there a way to start the next 2 model-building procedures in my for-loop at the same time, so that I am using all 15 CPUs? Here's a dummy example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate

# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)

# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']

# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []

for penalty_type in penalty_types:
    
    # create a random number generator
    rng = np.random.RandomState(42)

    # z-standardize features
    scaler = StandardScaler()
    
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng,penalty=penalty_type)
    
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1,stop=16,num=11,endpoint=True)
    p_grid = {'lr__C':lr_c}
    
    # create pipeline
    lr_pipe = Pipeline([
        ('scaler',scaler),
        ('lr',lr)
        ])
    
    # define cross validation strategy
    cv = KFold(shuffle=True,random_state=rng)
    
    # implement GridSearch (inner cross validation)
    grid = GridSearchCV(lr_pipe,param_grid=p_grid,cv=cv)
    
    # implement cross_validate (outer cross validation)
    nested_cv_scores = cross_validate(grid,X,y,cv=cv,n_jobs=5)

    # append result to list
    nested_cv_scores_list.append(nested_cv_scores)

Is there a way to parallelize this for-loop?

  • If you set n_jobs = -1, it will use all available CPUs. Commented Apr 24, 2021 at 0:25
  • I know, but this will only affect the parallelization for one of my model-building procedures within my for-loop (so in my example: use 5 CPUs for 'l1', then 5 CPUs for 'l2', and finally 5 CPUs for 'elastic'). But I would like to parallelize the model-building procedures themselves. A 'meta'-parallelization, if you want to call it that. Commented Apr 26, 2021 at 9:32
  • I think GridSearchCV is capable of carrying out all of these calculations in parallel without having to create a for-loop. You just need to pass'em as usual. Unless ya really have a specific reason for the for-loop (which I failed to see), in which case you can create your own GridSearchCV and use a multi-processing (i.e. pool) approach inside the for-loop. Commented Apr 26, 2021 at 16:17
  • Of course, the script above is just a simplified example. I have systematic/fixed differences between my model-building procedures (e.g. in my example I pretend to use three different penalty strategies), that I would like to compare. This is what the for-loop is for. I do not want to optimize those as hyperparameters as I want to make sure that certain parameters stay constant within my model-building procedures ("Run three nested-cross validations with fixed parameters a, b, c"). I am not sure how GridSearchCV can solve this? Commented Apr 28, 2021 at 14:03
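
The GridSearchCV suggestion from the comments could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original post: the penalties 'l1' and 'l2', the liblinear solver (picked because it accepts both), a shortened C grid, and the name scores_by_penalty are all made up for the example. Note that it treats the penalty as a tuned hyperparameter and reports inner cross-validation scores, which is not the same as running one nested cross-validation per fixed penalty.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # assumption: liblinear is used only because it supports 'l1' and 'l2'
    ("lr", LogisticRegression(solver="liblinear", random_state=42)),
])

# the penalty becomes one more grid dimension next to C
p_grid = {
    "lr__penalty": ["l1", "l2"],
    "lr__C": np.linspace(1, 16, num=4),
}

# n_jobs=2 lets GridSearchCV distribute the individual fits over two workers
grid = GridSearchCV(pipe, param_grid=p_grid, cv=5, n_jobs=2)
grid.fit(X, y)

# collect the mean inner-CV score of every candidate, grouped by penalty
scores_by_penalty = {}
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    scores_by_penalty.setdefault(params["lr__penalty"], []).append(score)

for penalty, scores in scores_by_penalty.items():
    print(penalty, np.round(scores, 3))
```

Grouping cv_results_ by penalty gives a per-penalty view of the search, but the penalty is still selected by the search rather than held fixed, which is the limitation raised in the last comment.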

1 Answer

joblib.parallel is made for this job! Just put your loop content in a function and call it using Parallel and delayed:

from joblib import Parallel, delayed
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate

# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)

# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']

# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []

# put rng-seed outside of loop so that not all results are the same
rng = np.random.RandomState(42)

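def run_as_job(penalty_type, X, y):

    # rng is the module-level random number generator defined above

    # z-standardize features
    scaler = StandardScaler()
    
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng,penalty=penalty_type)
    
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1,stop=16,num=11,endpoint=True)
    p_grid = {'lr__C':lr_c}

    # remaining steps, taken from the loop body in the question
    lr_pipe = Pipeline([
        ('scaler',scaler),
        ('lr',lr)
        ])
    cv = KFold(shuffle=True,random_state=rng)
    grid = GridSearchCV(lr_pipe,param_grid=p_grid,cv=cv)

    # inner parallelization: this n_jobs applies to each job started by Parallel
    res = cross_validate(grid,X,y,cv=cv,n_jobs=2)
    return res

if __name__ == '__main__':
    # outer parallelization: X and y must be passed along, since run_as_job expects them
    results = Parallel(n_jobs=2)(delayed(run_as_job)(penalty_type, X, y) for penalty_type in penalty_types)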

For more usage options, have a look at the joblib documentation: Embarrassingly parallel for loops.
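
Stripped of the scikit-learn parts, the Parallel/delayed pattern looks roughly like the sketch below. The function fake_model_run is a hypothetical stand-in for one model-building procedure, the penalty names are placeholders, and passing an integer seed per job is an assumption added here to make each job's randomness explicit. The threading backend is used only to keep the sketch lightweight.

```python
from joblib import Parallel, delayed


def fake_model_run(penalty_type, seed):
    """Hypothetical stand-in for one model-building procedure."""
    # a real job would build the pipeline and call cross_validate here
    return {"penalty": penalty_type, "seed": seed}


penalty_types = ["l1", "l2", "elasticnet"]

# delayed(f)(args) only records the call; Parallel executes the recorded calls.
# backend="threading" is an assumption for this toy example; CPU-heavy model
# fits would usually rely on joblib's default process-based backend.
results = Parallel(n_jobs=3, backend="threading")(
    delayed(fake_model_run)(penalty_type, seed)
    for seed, penalty_type in enumerate(penalty_types)
)

# the returned list follows the order of the inputs
for penalty_type, result in zip(penalty_types, results):
    print(penalty_type, result)
```

Because the results come back in input order, they can be paired with penalty_types directly instead of appending to a shared list inside the loop.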

5 Comments

I already thought about this but wasn't sure if it is possible. How does Parallel(n_jobs=2)... (the outermost loop) interact with cross_validate(...,n_jobs), given that both define the number of CPUs to use? For example, in your script you use Parallel(n_jobs=2)... If I understand the docs correctly, that means two iterations run in parallel, each of them occupying one CPU. Doesn't this automatically mean that everything nested within is restricted to a single CPU, which would contradict cross_validate(n_jobs=5)?
Parallel will spawn two processes, each running the function once. Within this function cross_validate(n_jobs=5) will spawn another 5 processes. So in total there will be a maximum of 10 processes running. You'll have to try out a bit how to utilize all cores optimally, sometimes 2x5 will run slower than 5x2, depending entirely on how well different sub-tasks parallelize, and what the overhead of function calls is. Parallelization in Python is sometimes a bit tricky, as threading only runs on a single CPU due to the GIL. Only spawning processes will utilize all CPUs.
This does exactly what GridSearchCV does with n_jobs=2!
@Yahya why do you think that? Parallel(n_jobs=5) spawns 5 processes, and GridSearchCV(X,y,n_jobs=2) spawns another 2 processes each, so it's more. Depending on how many folds you are doing this approach makes a lot of sense.
No, this is wrong. Firstly, everything in your code (and in the OP's example) is very lightweight. Hence, the only thing that would require multi-processing is trying different parameters, and that's what GridSearchCV does in a parallelized way without the need for the outer for-loop. Besides, if OP wants to check the meta results for every model with different parameters held fixed, they can do that via GridSearchCV too. Finally, the resources on any machine are limited; therefore, throwing all these processes at it will almost surely result in throttling of a high-performance computer's CPU or GPU.
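
The "2x5 versus 5x2" trade-off from the comments can be made concrete with a small helper that splits a CPU budget between the outer Parallel call and the inner cross_validate call. This is a hypothetical sketch: split_cpu_budget is not part of joblib or scikit-learn, and how the inner n_jobs is actually honoured inside a worker may depend on the joblib backend and version, so timing a few splits is still advisable.

```python
def split_cpu_budget(total_cpus, n_tasks):
    """Return (outer_jobs, inner_jobs) with outer_jobs * inner_jobs <= total_cpus.

    Hypothetical helper: as many tasks as possible run side by side, and
    the CPUs that remain are divided evenly among them.
    """
    if total_cpus < 1 or n_tasks < 1:
        raise ValueError("total_cpus and n_tasks must be positive")
    outer_jobs = min(n_tasks, total_cpus)
    inner_jobs = max(1, total_cpus // outer_jobs)
    return outer_jobs, inner_jobs


# the scenario from the question: 15 free CPUs, 3 model-building procedures
outer_jobs, inner_jobs = split_cpu_budget(total_cpus=15, n_tasks=3)
print(outer_jobs, inner_jobs)  # 3 5
```

The two values would then be passed as Parallel(n_jobs=outer_jobs) and cross_validate(..., n_jobs=inner_jobs).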
