python multiprocessing - single file multiple commands



I need to process a file that contains about 100 shell (bash) commands, one per line. I have to execute these commands in parallel (say 10 or 20 at a time; let the CPU decide how to parallelize them). I honestly don't know how to accomplish this, so I took some code from elsewhere on this site; here it is:

from subprocess import PIPE
import subprocess
import time


def submit_job_max_len(job_list, max_processes):
  sleep_time = 0.1
  processes = list()
  for command in job_list:
    print 'running process# {n}. Submitting {proc}.'.format(n=len(processes),
        proc=str(command))
    processes.append(subprocess.Popen(command, shell=False, stdout=None, stdin=PIPE))
    while len(processes) >= max_processes:
      time.sleep(sleep_time)
      processes = [proc for proc in processes if proc.poll() is None]
  while len(processes) > 0:
    time.sleep(sleep_time)
    processes = [proc for proc in processes if proc.poll() is None]


cmd = 'cat runCommands.sh'
job_list = ((cmd.format(n=i)).split() for i in range(5))
submit_job_max_len(job_list, max_processes=10)

I don't understand what the last 3 lines are actually doing. My test runs show that the number passed to range() determines how many times every line is executed. So if the number is 5, every line is executed 5 times. I don't want that. Can someone shed some light on this, please? And again, please excuse my ignorance here.
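A minimal sketch (Python 3) of what those last three lines evaluate to, and of one way the job list could be built from the file's own lines instead. `load_jobs` is a hypothetical helper, not part of the snippet above, and it assumes every line is a simple command without pipes or redirections, since those would need a shell.

```python
import shlex

cmd = 'cat runCommands.sh'

# cmd has no "{n}" placeholder, so format(n=i) returns the same string each time.
# The result is five identical jobs, each one running `cat runCommands.sh`.
job_list = [cmd.format(n=i).split() for i in range(5)]
print(job_list)


def load_jobs(path):
    """Hypothetical helper: return one argument list per non-empty line."""
    with open(path) as handle:
        return [shlex.split(line) for line in handle if line.strip()]
```

The list returned by `load_jobs('runCommands.sh')` could then be passed as `job_list` to `submit_job_max_len`, so that each line of the file is started once rather than the whole file being printed once per value in `range(5)`.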

edited Feb 18, 2018 at 22:40

Isaac


asked Feb 18, 2018 at 22:37

knowone


2 Answers

GNU Parallel is made for you:

cat the_file | parallel

By default it will run one job per CPU core. This can be adjusted with --jobs.

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Image: simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Image: GNU Parallel scheduling]
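The same scheduling idea (start the next job as soon as a slot frees up) can be sketched with Python's standard library alone, which may help where GNU Parallel cannot be installed. This is not how GNU Parallel itself is implemented; `run_command` and `run_all` are hypothetical names, and the sketch assumes a POSIX shell and that each command can be handed to it as a single string.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one shell command line and return its exit status."""
    return subprocess.run(command, shell=True).returncode


def run_all(commands, max_jobs=None):
    """Run commands with at most max_jobs active; a freed slot takes the next one."""
    max_jobs = max_jobs or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # map() yields results in input order, even if jobs finish out of order.
        return list(pool.map(run_command, commands))


if __name__ == "__main__":
    print(run_all(["true", "false", "exit 3"], max_jobs=2))
```

Threads are enough here because each thread only waits on a child process; the real work happens in the commands themselves.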

Installation

For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

edited Feb 19, 2018 at 1:16

answered Feb 19, 2018 at 0:43

Ole Tange


8 Comments

knowone Over a year ago

Thanks Ole. As appealing as this looks (GNU fan, btw), I won't be able to use this solution for the problem here. But thanks for bringing it up. Upvoting the answer anyway.

Ole Tange Over a year ago

Can you elaborate on why you will not be able to use the solution? Is it covered by oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html?

knowone Over a year ago

To use it, I'd need the GNU Parallel package on the cluster, and last time I checked it wasn't there. My administrator won't allow the installation unless a proven case is made to them.

Ole Tange Over a year ago

But you are allowed to run your own scripts? As mentioned on oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html you do not need root to do a personal installation. Also --embed (from 20180222) may be useful to you.

knowone Over a year ago

No mate, there's not even an Internet connection. But I did try it on my local machine. It does the job, but I couldn't really gauge the actual parallelism there.

What you need is a queue.

Use the multiprocessing package to start a set of processes. There are several examples which show how to do this.

One neat trick is to use a poison pill to ensure each of the processes exits once the queue is empty. Search Python Module of the Week (PyMOTW) for examples of this.
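A minimal sketch of the queue-plus-poison-pill pattern described here, assuming each command is a single shell line. The names `worker`, `run_commands`, and `POISON_PILL` are hypothetical, and the worker count is an arbitrary default.

```python
import multiprocessing
import subprocess

POISON_PILL = None  # sentinel that tells a worker to stop


def worker(tasks, results):
    """Take commands off the queue until the sentinel arrives."""
    while True:
        command = tasks.get()
        if command is POISON_PILL:
            break
        status = subprocess.call(command, shell=True)
        results.put((command, status))


def run_commands(commands, num_workers=4):
    """Run every command once and return (command, exit status) pairs."""
    commands = list(commands)
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for proc in workers:
        proc.start()
    for command in commands:
        tasks.put(command)
    for _ in workers:
        tasks.put(POISON_PILL)  # one pill per worker
    # Collect results before join() so no worker blocks on a full queue.
    finished = [results.get() for _ in commands]
    for proc in workers:
        proc.join()
    return finished


if __name__ == "__main__":
    print(run_commands(["true", "false"], num_workers=2))
```

Each worker pulls the next command as soon as it finishes the previous one, and the pills are queued after all commands, so a worker only stops once no work is left.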

Best of luck.

answered Feb 18, 2018 at 22:55

polarise

