My main FORTRAN MPI code reaches a point where all processes call a script. The code looks something like
write(syscommand,'(a13,1x,i3)') './vscript.csh', my_mpi_proc_num
rc=system(syscommand)
Now, this section of code loops through over a hundred times, and the script runs fine on all processes. Then, randomly as far as I can tell, some process will enter system and return an error code of 32512. A few other things then happen (sorry I can't show much more code. My employer would not be too happy.), then MPI_ABORT is called and all the processes die. I am told that 32512 is often the error code returned when a command cannot be found. This is unlikely because, as I have indicated, the script is found hundreds of times before this crash, and nothing is moving it around.
I seem to have found a stopgap measure:
write(syscommand,'(a13,1x,i3)') './vscript.csh', my_mpi_proc_num
rc=32512
num_attempts=0
do while (num_attempts<100 .and. rc==32512)
  num_attempts=num_attempts+1
  rc=system(syscommand)
enddo
That is, each process will try up to 100 times to get past the 32512 error. Although I am sure this is horrible code, it is working.
So, anyone have a clue why I am getting this error? A thought: If two processes try to run the same script near simultaneously, will one of them be kicked out and forced to return that 32512? Thanks.
fork(2) (which system(3) calls) is dangerous and not supported in most MPI implementations on Linux; the child process might segfault.
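As for the number itself: assuming rc is a raw POSIX wait status (which is what the C system(3) call returns on Linux; whether your compiler's system function passes it through unchanged is an assumption here), 32512 is 127*256, meaning the shell exited with code 127. Shells use 127 for "command not found", but system(3) reports the same value when the shell itself could not be executed in the child process, so the script need not actually be missing. A minimal sketch of the decoding, with variable names of my own choosing:

```fortran
program decode_status
  implicit none
  integer :: rc, exit_code, term_signal

  rc = 32512   ! value reported in the question

  ! Assumption: rc is a raw POSIX wait status, as returned by C system(3) on Linux.
  term_signal = iand(rc, 127)              ! low 7 bits: terminating signal, 0 if none
  exit_code   = iand(ishft(rc, -8), 255)   ! next 8 bits: exit code of the shell

  print '(a,i0)', 'exit code: ', exit_code
  print '(a,i0)', 'signal:    ', term_signal
end program decode_status
```

Logging both fields on each failed attempt, rather than only comparing rc against 32512, would show whether every failure is really exit code 127 or whether some children are being killed by a signal.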