I have a process built to run in parallel using multiprocessing.Process in Python 2.7; in theory it should run very fast on an EC2 instance with a lot of vCPUs, but it isn't scaling the way I expected. I'm running the code on a 96 vCPU machine (an m5.24xlarge instance). The function being parallelized finishes in ~45 minutes on a 4 vCPU machine when run on its own, but when I run 90 copies in parallel it takes 5+ hours for all of the sub-processes to finish.
I've considered switching to Pool to get away from the batching, but the function being called runs ~200 models that can run for a very long time (and sometimes get stuck in weird optimization loops), so I have an additional process running in the background that starts sending soft Ctrl+C signals to a sub-process once it has 3 hours of processor time behind it, repeating every 10 minutes, to ensure no individual subprocess runs on too long.
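For reference, the CPU-time watchdog described above works roughly like the sketch below. This is a Linux-only sketch that reads `/proc/<pid>/stat` instead of depending on a third-party library, and `busy` is a hypothetical stand-in for a model run stuck in an optimization loop (in the real pipeline the limit is 3 hours and the signal repeats every 10 minutes; here it fires once for brevity):

```python
import os
import signal
import time
from multiprocessing import Process

def busy():
    # hypothetical stand-in for a model run stuck in an optimization loop
    while True:
        pass

def child_cpu_seconds(pid):
    # Linux-only: utime and stime are fields 14 and 15 of /proc/<pid>/stat,
    # measured in clock ticks
    with open('/proc/%d/stat' % pid) as f:
        fields = f.read().rsplit(')', 1)[1].split()
    ticks = int(fields[11]) + int(fields[12])
    return ticks / float(os.sysconf('SC_CLK_TCK'))

def watchdog(proc, cpu_limit, poll=0.1):
    # send a soft Ctrl+C (SIGINT) once the child has burned cpu_limit
    # seconds of processor time
    while proc.is_alive():
        try:
            if child_cpu_seconds(proc.pid) > cpu_limit:
                os.kill(proc.pid, signal.SIGINT)
                return
        except (EnvironmentError, IndexError, ValueError):
            return  # child already exited
        time.sleep(poll)
```

SIGINT is delivered to the child as a KeyboardInterrupt, which is what lets the model code unwind "softly" rather than being killed outright.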
Each vCPU running a sub-process sits between 40% and 99% utilized until its subprocess completes. My question: why does multiprocessing not scale linearly when moving to a bigger instance? I keep 5 vCPUs free to run any background processes, so the machine isn't being bogged down there.
from multiprocessing import Process
import datetime
import Prod_Modeling_Pipeline as PMP
import boto3
import pandas
import time
import numpy
import os
#Define locations
bucketName = 'bucketgoeshere'
output_location = '/home/ec2-user/'
#Pull ATM Setter Over
client = boto3.client('s3')
transfer = boto3.s3.transfer.S3Transfer(client=client)
transfer.download_file(bucketName,'Root_Folder/Control_Files/'+'execution_file.csv', output_location+'execution_file.csv')
#Read-in id list
execution_data = pandas.read_csv(output_location+'execution_file.csv')
ids = execution_data['id']
ni = 90
#Batch the ids into groups of ni, padding the last batch with 'AAA'
id_row = [['AAA']*ni for _ in xrange(int(numpy.ceil(len(ids)/float(ni))))]
for i in xrange(len(ids)):
    id_row[i//ni][i%ni] = ids[i]
Date = datetime.date.today().strftime('%Y-%m-%d')
totalstart = time.time()
for q in xrange(len(id_row)):
    processes = []
    for m in xrange(len(id_row[q])):
        temp = id_row[q]
        try:
            p = Process(target=PMP.Model_Function, args=(temp[m],Date,'VALIDATION'))
            p.start()
            processes.append(p)
            time.sleep(1)
            #p.pid is the child's PID; os.getpid() would report the parent's
            print("Started "+temp[m]+" as "+str(p.pid))
        except Exception as e:
            print("Invalid Run: "+str(e))
    #Wait for the whole batch before starting the next one
    for p in processes:
        p.join()
    print(processes)
print(time.time() - totalstart)
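For comparison, the Pool-based version I've been considering would look roughly like the sketch below; `run_model` is a hypothetical stand-in for `PMP.Model_Function`, and the worker count and ids are placeholders:

```python
from multiprocessing import Pool

def run_model(model_id):
    # hypothetical stand-in for PMP.Model_Function(model_id, Date, 'VALIDATION')
    return model_id, model_id * 2

def run_all(ids, workers=4):
    # imap_unordered hands a worker its next id the moment it finishes,
    # so one slow model no longer stalls an entire batch of 90
    pool = Pool(processes=workers)
    try:
        return dict(pool.imap_unordered(run_model, ids))
    finally:
        pool.close()
        pool.join()
```

With the join-per-batch loop above, every batch runs as long as its slowest model; the Pool version keeps all workers busy continuously, at the cost of making the per-process watchdog harder to wire up.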