I have a process built to run in parallel using multiprocessing.Process in Python 2.7; in theory it should run very fast on an EC2 instance with a lot of vCPUs, but it isn't scaling the way I expected. I'm running the code on a 96 vCPU machine (an m5.24xlarge instance). The function being parallelized finishes in ~45 minutes on a 4 vCPU machine when run on its own, but when I run 90 copies in parallel it takes 5+ hours for all of the sub-processes to finish.
I've considered switching to Pool to get away from the batching, but the function being called runs ~200 models that can run for a very long time (and sometimes get stuck in weird optimization loops), so I have an additional process running in the background that starts sending soft Ctrl+C signals to a sub-process once it has 3 hours of processor time behind it, repeating every 10 minutes, to ensure no individual subprocess runs on too long.
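For reference, the CPU-time watchdog described above works roughly like the sketch below. This is a Linux-only sketch that reads `/proc/<pid>/stat` instead of depending on a third-party library, and `busy` is a hypothetical stand-in for a model run stuck in an optimization loop (in the real pipeline the limit is 3 hours and the signal repeats every 10 minutes; here it fires once for brevity):

```python
import os
import signal
import time
from multiprocessing import Process

def busy():
    # hypothetical stand-in for a model run stuck in an optimization loop
    while True:
        pass

def child_cpu_seconds(pid):
    # Linux-only: utime and stime are fields 14 and 15 of /proc/<pid>/stat,
    # measured in clock ticks
    with open('/proc/%d/stat' % pid) as f:
        fields = f.read().rsplit(')', 1)[1].split()
    ticks = int(fields[11]) + int(fields[12])
    return ticks / float(os.sysconf('SC_CLK_TCK'))

def watchdog(proc, cpu_limit, poll=0.1):
    # send a soft Ctrl+C (SIGINT) once the child has burned cpu_limit
    # seconds of processor time
    while proc.is_alive():
        try:
            if child_cpu_seconds(proc.pid) > cpu_limit:
                os.kill(proc.pid, signal.SIGINT)
                return
        except (EnvironmentError, IndexError, ValueError):
            return  # child already exited
        time.sleep(poll)
```

SIGINT is delivered to the child as a KeyboardInterrupt, which is what lets the model code unwind "softly" rather than being killed outright.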
Each vCPU running a sub-process sits between 40% and 99% utilized until its subprocess completes. My question: why does multiprocessing not scale linearly when moving to a bigger instance? I keep 5 vCPUs free to run any background processes, so the machine isn't being bogged down there.
from multiprocessing import Process
import datetime
import Prod_Modeling_Pipeline as PMP
import boto3
import pandas
import time
import numpy
import os
#Define locations
bucketName = 'bucketgoeshere'
output_location = '/home/ec2-user/'
#Pull ATM Setter Over
client = boto3.client('s3')
transfer = boto3.s3.transfer.S3Transfer(client=client)
transfer.download_file(bucketName,'Root_Folder/Control_Files/'+'execution_file.csv', output_location+'execution_file.csv')
#Read-in id list
execution_data = pandas.read_csv(output_location+'execution_file.csv')
ids = execution_data['id']
ni = 90
#Batch the ids into groups of ni, padding the last batch with 'AAA'
id_row = [['AAA']*ni for _ in xrange(int(numpy.ceil(len(ids)/float(ni))))]
for i in xrange(len(ids)):
    id_row[i//ni][i%ni] = ids[i]
Date = datetime.date.today().strftime('%Y-%m-%d')
totalstart = time.time()
for q in xrange(len(id_row)):
    processes = []
    for m in xrange(len(id_row[q])):
        temp = id_row[q]
        try:
            p = Process(target=PMP.Model_Function, args=(temp[m],Date,'VALIDATION'))
            p.start()
            processes.append(p)
            time.sleep(1)
            #p.pid is the child's PID; os.getpid() would report the parent's
            print("Started "+temp[m]+" as "+str(p.pid))
        except Exception as e:
            print("Invalid Run: "+str(e))
    #Wait for the whole batch before starting the next one
    for p in processes:
        p.join()
    print(processes)
print(time.time() - totalstart)
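For comparison, the Pool-based version I've been considering would look roughly like the sketch below; `run_model` is a hypothetical stand-in for `PMP.Model_Function`, and the worker count and ids are placeholders:

```python
from multiprocessing import Pool

def run_model(model_id):
    # hypothetical stand-in for PMP.Model_Function(model_id, Date, 'VALIDATION')
    return model_id, model_id * 2

def run_all(ids, workers=4):
    # imap_unordered hands a worker its next id the moment it finishes,
    # so one slow model no longer stalls an entire batch of 90
    pool = Pool(processes=workers)
    try:
        return dict(pool.imap_unordered(run_model, ids))
    finally:
        pool.close()
        pool.join()
```

With the join-per-batch loop above, every batch runs as long as its slowest model; the Pool version keeps all workers busy continuously, at the cost of making the per-process watchdog harder to wire up.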