
I asked a question very close to this before, but it wasn't answered, and since then I hope I have learned to ask it better.

I was curious how to run many serial jobs on a Cray XE6 machine. You usually qsub things with a ccmrun (for a serial job) or an aprun (instead of mpirun or mpiexec). I first wanted to use the Pool() function, but since the hardware isn't SMP-based it would be limited to 32 processors. Even an mpi4py pool-like application wouldn't work, because I am not giving the main program all of the processors: if I were to say aprun -n 64 mpipool.py, I would just be running that script 64 times, whereas something like aprun -n 1 -d 32 pool.py does work.

I've had a look at the https://wiki.python.org/moin/ParallelProcessing page and was wondering if anyone had experience running multiple serial jobs on a cluster computing machine with any of them. I did write an mpi4py code that basically had rank 0 doing all of the job selection and then handing jobs out to the other processors. It didn't want to play nice on the machine, since I needed to use subprocess to launch the large amount of C code. So, one last caveat: it would have to play nice with subprocess.
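
For reference, the shape of what I tried looks roughly like this (an untested sketch; the job list and the C executable name are placeholders):

    # Rank 0 hands out jobs; everyone else runs them through subprocess.
    from mpi4py import MPI
    import subprocess

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # Rank 0 selects the jobs and gives one to whichever worker asks next.
        jobs = [["./my_c_code", str(i)] for i in range(200)]  # placeholder commands
        n_workers = comm.Get_size() - 1
        status = MPI.Status()
        while n_workers > 0:
            comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
            worker = status.Get_source()
            if jobs:
                comm.send(jobs.pop(), dest=worker)
            else:
                comm.send(None, dest=worker)   # tell the worker to quit
                n_workers -= 1
    else:
        # Workers ask for work, run the C code via subprocess, and repeat.
        while True:
            comm.send(rank, dest=0)            # "ready for work"
            cmd = comm.recv(source=0)
            if cmd is None:
                break
            subprocess.call(cmd)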

I would like to have it look at the number of nodes chosen, and then basically do something along the lines of:

ccmrun jobscheduler.py & ccmrun jobrunner.py 63 & # given that I started the job with 64 processors. I may have to do a bash loop here, but that's no problem.

Once started, I would want them to be able to communicate with one another, but without MPI I'm not sure of an efficient way of doing this. Maybe dumping jobs as pickle files, locking them, and deleting them once a jobrunner picks one up. There might be a really simple way of doing this, but I'm very new to it. If anyone could get me started on the right path I would greatly appreciate it.
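
Something along these lines is what I have in mind, if it helps clarify (an untested sketch assuming a shared filesystem; the queue directory and file names are made up):

    import os
    import pickle

    QUEUE_DIR = "/lus/scratch/me/jobqueue"    # placeholder shared directory

    # jobscheduler.py side: publish each pending job as a pickle file.
    def publish_job(job_id, command):
        tmp = os.path.join(QUEUE_DIR, "%s.tmp" % job_id)
        final = os.path.join(QUEUE_DIR, "%s.job" % job_id)
        with open(tmp, "wb") as f:
            pickle.dump(command, f)
        os.rename(tmp, final)                 # rename is atomic on the same filesystem

    # jobrunner.py side: claim a job by renaming it before anyone else can.
    def claim_job(runner_id):
        for name in os.listdir(QUEUE_DIR):
            if not name.endswith(".job"):
                continue
            src = os.path.join(QUEUE_DIR, name)
            dst = src + ".claimed.%d" % runner_id
            try:
                os.rename(src, dst)           # only one runner wins this rename
            except OSError:
                continue                      # another runner got it first
            with open(dst, "rb") as f:
                return pickle.load(f)
        return None                           # nothing left to do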

Thanks!

1 Answer


I don't know anything about Cray machines but I'll take a stab at this anyway. I noticed you mentioned qsub which makes me think that the system is using PBS or Torque. Both seem to support Job Arrays which may be along the lines of what you are looking for.

Job Arrays would make the queue system responsible for job management. Each subjob would be assigned an array id out of a range you specify and would be assigned whatever resources you requested with -l. In Torque, '#PBS -l nodes=1' and '#PBS -t 1-64' would create 64 subjobs with indexes from 1 to 64, each being assigned a single node. Man pages and Google will be good resources, and from what I've seen Torque and PBS differ in syntax. If that doesn't work, you can look at using pbsdsh inside of a single, larger job.
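
As a rough illustration, and assuming Torque exports the subjob index as PBS_ARRAYID (newer versions use PBS_ARRAY_INDEX instead), each subjob could run a small Python wrapper along these lines; the job list file and the C executable it launches are placeholders:

    import os
    import subprocess

    # Torque job arrays export the subjob's index into the environment.
    index = int(os.environ["PBS_ARRAYID"])

    # One command line per row in a plain-text job list prepared before submission.
    with open("joblist.txt") as f:
        commands = [line.split() for line in f if line.strip()]

    # '#PBS -t 1-64' numbers subjobs from 1, so shift to a zero-based list index.
    subprocess.call(commands[index - 1])

The wrapper would sit at the bottom of the same PBS script that carries the '#PBS -l nodes=1' and '#PBS -t 1-64' directives, so each subjob picks out its own piece of work.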

Also, I want to mention that advice from strangers on the Internet will only take you so far. Your local admin may have limits or scheduling policies in place that may limit your options. You may also be able to get some advice from the admin about other, proven ways that you can solve your problem.


4 Comments

Thank you for your help. It is using a PBS scheduling system. The real problem is that the jobs I'm running are read from a database that can be deleted from and added to, so none of the jobs are locked down as solid until the moment they are launched. I just made a job-finding class, added __iter__ and __next__, and iterating through it works fine. The problem is finding a way to have the scripts communicate with one another. I know it is a rather broad question, and I really appreciate you taking the time to offer assistance. The ccmrun is like what you are describing.
Could you add a column to your database that would include the job id? You could then do a 'select * from data where jobid is null'.
That's what I was trying to do with the scheduler, but how do I then pass that on to the scripts waiting to run the job? I tried to include this in the scripts themselves instead of a scheduler, but ran into the problem that all of them were reading the database at the same time; even though it is write-protected except for one connection, they all still read that a job hasn't been run yet and then try to run the same jobs.
I was wondering if that might be a problem. You might need another table where new jobids can be stored as nodes check in. Then you would have a process doing 'job management' that runs every 10 seconds or so while your job is running. The node would do an "INSERT INTO JOBS (jobid) VALUES($PBS_JOBID)", sleep 60, then "SELECT * FROM DATA WHERE jobid='$PBS_JOBID'". Sometime during the "sleep 60", the 'job management' process would obtain a jobid from the JOBS table and delete it. It would then insert that same jobid in the jobid column of your DATA table.
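
A rough sketch of that check-in/hand-off, assuming a SQLite database on a shared filesystem and the JOBS and DATA tables described above (the path, table, and column names are only illustrative):

    import sqlite3
    import time

    DB_PATH = "/lus/scratch/me/jobs.db"    # placeholder path to the shared database

    def node_side(pbs_jobid):
        """Run on each compute node: check in, wait, then fetch the assigned work."""
        conn = sqlite3.connect(DB_PATH)
        conn.execute("INSERT INTO JOBS (jobid) VALUES (?)", (pbs_jobid,))
        conn.commit()
        time.sleep(60)                     # give the manager time to assign something
        rows = conn.execute("SELECT * FROM DATA WHERE jobid=?", (pbs_jobid,)).fetchall()
        conn.close()
        return rows

    def manager_side():
        """Run by the 'job management' process every few seconds."""
        conn = sqlite3.connect(DB_PATH)
        row = conn.execute("SELECT jobid FROM JOBS LIMIT 1").fetchone()
        if row is not None:
            jobid = row[0]
            conn.execute("DELETE FROM JOBS WHERE jobid=?", (jobid,))
            # Hand one unassigned piece of work to the node that checked in.
            conn.execute(
                "UPDATE DATA SET jobid=? WHERE rowid=("
                "SELECT rowid FROM DATA WHERE jobid IS NULL LIMIT 1)",
                (jobid,))
            conn.commit()
        conn.close()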
