I came across some weird behavior when using the wait command for running parallel jobs in a bash script. For the sake of simplicity I have reduced the problem to the following bash script:
#!/bin/bash
test_func() {
    echo "$(date +%M:%S:%N): start $1"
    sleep $1
    echo "$(date +%M:%S:%N): end $1"
}
i=0
for j in {5..9}; do
    test_func $j &
    ((i++))
    sleep 3
done
echo "$(date +%M:%S:%N): No new processes, waiting for all to finish"
while [ $(pgrep -c -P$$) -ge 1 ]; do
    echo "$(date +%M:%S:%N): $(pgrep -P$$ -d' ')"
    wait -n $(pgrep -P$$ -d' ')
    echo "$(date +%M:%S:%N): next $i"
    ((i++))
done
The above script spawns 5 parallel runs of the test_func function, each of which sleeps for j seconds. I've added timestamps to each line of output to show the timings. The output of running this script is as follows:
03:53:854843895: start 5
03:56:855729952: start 6
03:58:856136029: end 5
03:59:856388725: start 7
04:02:857016376: end 6
04:02:857508665: start 8
04:05:857895265: start 9
04:06:857738397: end 7
04:08:858666941: No new processes, waiting for all to finish
04:08:864528182: 3837265 3837297
04:08:875479745: next 5
04:08:881049792: 3837265 3837297
04:08:892058494: next 6
04:08:899310728: 3837265 3837297
04:08:910466324: next 7
04:08:916130505: 3837265 3837297
04:10:858746305: end 8
04:10:859380011: next 8
04:10:864975972: 3837297
04:14:859172632: end 9
04:14:859818377: next 9
As can be seen from the output above, the script spawns all 5 processes, of which 3 end before the end of the for loop (due to the sleep 3). At this point there are 2 processes still running, which the pgrep command correctly reports with IDs 3837265 and 3837297. However, the next three calls to wait in the while loop each return immediately (in under 0.1 seconds), even though no other process has finished (as the pgrep output shows) and even though wait was given the process IDs to wait for.
As far as I can tell (and from some experimentation) the wait command is immediately returning for each of the test_func calls that finished before it was first called (which in this case is three times), before actually waiting. What I don't understand is why this is the case, especially since I supply the process IDs to wait for.
I'm using Ubuntu 20.04.6 and GNU bash, version 5.0.17(1) for context.
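For comparison, here is a minimal sketch of a different way to wait for every job, assuming the goal is only to block until all jobs are done and to collect their exit statuses. It records each PID from $! at launch time and then calls plain wait on each PID, so it does not depend on what wait -n reports. The names job, pids and statuses are hypothetical and not part of the script above, and the sleeps are shortened so the sketch runs quickly.

```bash
#!/bin/bash
# Hypothetical stand-in for test_func: sleep for $1 seconds, then
# finish with exit status $2.
job() {
    sleep "$1"
    return "$2"
}

# Launch the jobs in the background and record each PID from $!.
pids=()
job 0.1 0 & pids+=("$!")
job 0.3 3 & pids+=("$!")
job 0.2 0 & pids+=("$!")

# Let some of the jobs finish before the first wait, as in the question.
sleep 0.2

# Wait for each recorded PID in launch order. bash remembers the exit
# status of a background child that has already terminated, so a wait
# on a finished PID returns that status instead of blocking.
statuses=()
for pid in "${pids[@]}"; do
    wait "$pid"
    statuses+=("$?")
done

echo "exit statuses: ${statuses[*]}"
```

This trades the "react to whichever job finishes first" behavior of wait -n for a fixed order, so it only fits cases where the order in which completions are handled does not matter.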
You call pgrep 3 times in each iteration of the loop. Are you sure it returns the same result every time?
Assigning the output of the pgrep command to a variable: pg="$(pgrep -P$$ -d' ')", which is updated for every iteration of the while loop and used in all 3 locations, gives the same result.
Why wait -n instead of wait?
[...] while loop) while 1 displayed the same results as your host (i.e., 5 passes through the while loop). 2 passes (expected): Ubuntu 22.04.1 / bash 5.1.16, Ubuntu 22.04.5 / bash 5.1.16. 5 passes (wrong): Ubuntu 20.04.06 / bash 5.0.17. At this point I'm guessing there's an issue with the older bash version; reviewing the bash release notes/changes may shine a light on this issue. Also, set -m did not make a difference for me.
If I use wait -n (no pid list) then both versions of bash (5.0.17, 5.1.16) show the same 'incorrect' behavior of 5 passes through the while loop, with the first 3 passes taking place in rapid succession (as in OP's case).