3

I came across some weird behavior when using the wait command for running parallel jobs in a bash script. For the sake of simplicity I have reduced the problem to the following bash script:

#!/bin/bash

test_func() {
  echo "$(date +%M:%S:%N): start $1"
  sleep $1
  echo "$(date +%M:%S:%N): end $1"
}

i=0
for j in {5..9}; do
  test_func $j &
  ((i++))
  sleep 3
done
echo "$(date +%M:%S:%N): No new processes, waiting for all to finish"
while [ $(pgrep -c -P$$) -ge 1 ]; do
  echo "$(date +%M:%S:%N): $(pgrep -P$$ -d' ')"
  wait -n $(pgrep -P$$ -d' ')
  echo "$(date +%M:%S:%N): next $i"
  ((i++))
done

The above script spawns 5 parallel runs of the test_func function, each of which sleeps for j seconds. I've added timestamps to each line of output to show the timings. The output of running this script is as follows:

03:53:854843895: start 5
03:56:855729952: start 6
03:58:856136029: end 5
03:59:856388725: start 7
04:02:857016376: end 6
04:02:857508665: start 8
04:05:857895265: start 9
04:06:857738397: end 7
04:08:858666941: No new processes, waiting for all to finish
04:08:864528182: 3837265 3837297
04:08:875479745: next 5
04:08:881049792: 3837265 3837297
04:08:892058494: next 6
04:08:899310728: 3837265 3837297
04:08:910466324: next 7
04:08:916130505: 3837265 3837297
04:10:858746305: end 8
04:10:859380011: next 8
04:10:864975972: 3837297
04:14:859172632: end 9
04:14:859818377: next 9

As can be seen from the output above, the script spawns all 5 processes, 3 of which end before the for loop finishes (due to the sleep 3). At that point 2 processes are still running, and pgrep reports them correctly with IDs 3837265 and 3837297. However, the wait command in the while loop then returns immediately (< 0.1 seconds) for its first three calls, without any other process finishing (as the pgrep output shows), even though I pass it the process IDs to wait for.

As far as I can tell (and from some experimentation) the wait command is immediately returning for each of the test_func calls that finished before it was first called (which in this case is three times), before actually waiting. What I don't understand is why this is the case, especially since I supply the process IDs to wait for.

I'm using Ubuntu 20.04.6 and GNU bash, version 5.0.17(1) for context.

10
  • You're running pgrep 3 times in each iteration of the loop. Are you sure it returns the same result every time? Commented Sep 20, 2024 at 13:23
  • @choroba Moving all calls to the pgrep command to a variable: pg="$(pgrep -P$$ -d' ')" which is updated for every iteration of the while loop and used in all 3 locations gives the same result. Commented Sep 20, 2024 at 13:34
  • Why are you using wait -n instead of wait? Commented Sep 20, 2024 at 13:43
  • I ran your script on 3 different hosts. Two displayed the expected results (i.e., 2 passes through the while loop), while one displayed the same results as your host (i.e., 5 passes through the while loop). 2 passes (expected): Ubuntu 22.04.1 / bash 5.1.16 and Ubuntu 22.04.5 / bash 5.1.16. 5 passes (wrong): Ubuntu 20.04.6 / bash 5.0.17. At this point I'm guessing there's an issue with the older bash version; reviewing the bash release notes/changes may shed light on it. Also, set -m did not make a difference for me. Commented Sep 20, 2024 at 13:49
  • Also of some interest: if you change the code to just wait -n (no pid list), then both versions of bash (5.0.17, 5.1.16) show the same 'incorrect' behavior of 5 passes through the while loop, with the first 3 passes taking place in rapid succession (as in OP's case). Commented Sep 20, 2024 at 14:19

2 Answers

4

The man page for Bash 5.0 says:

If the -n option is supplied, wait waits for any job to terminate and returns its exit status.

The man page for Bash 5.1 says:

If the -n option is supplied, wait waits for a single job from the list of ids or, if no ids are supplied, any job, to complete and returns its exit status.

And the CHANGES file in the source lists this change between bash-5.1-alpha and bash-5.1-beta:

  1. New Features in Bash

ll. wait -n now accepts a list of job specifications as arguments and will wait for the first one in the list to change state.

I'm reading an implication there that pre-5.1 wait -n ignored any arguments it was given, so that it would always just wait for any job to finish.

That matches what I can see with Bash 5.0 and Bash 5.1.

A slightly shorter test, using the exit statuses to see which job wait found:

$ cat bg.sh
#!/bin/bash

echo "BASH_VERSION=$BASH_VERSION"
echo 'starting jobs...'
(sleep 1; exit 1) &
(sleep 5; exit 2) &
echo "please hold..."
sleep 2
# by this point job 1 should have exited
# try to wait for job 2 specifically
wait -n %2
echo "wait found job $?"
echo "jobs running now:"
jobs

Bash 5.1 finds job 2 as asked:

$ bash bg.sh
BASH_VERSION=5.1.16(1)-release
starting jobs...
please hold...
wait found job 2
jobs running now:
[1]   Exit 1                  ( sleep 1; exit 1 )

Bash 5.0 finds the already exited job 1 (and job 2 is still running when the script finishes):

$ bash-5.0/bash bg.sh
BASH_VERSION=5.0.0(1)-release
starting jobs...
please hold...
wait found job 1
jobs running now:
[2]+  Running                 ( sleep 5; exit 2 ) &
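
Going by the CHANGES entry quoted above, a script that depends on wait -n honoring its arguments could check the Bash version first. This is only a sketch: the function name wait_n_honors_ids is made up for illustration, and the 5.1 cutoff is taken from the changelog entry rather than from testing every release.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if this Bash is 5.1 or newer, the release
# whose CHANGES entry says wait -n accepts a list of job specifications.
wait_n_honors_ids() {
  local major=${BASH_VERSINFO[0]} minor=${BASH_VERSINFO[1]}
  (( major > 5 || (major == 5 && minor >= 1) ))
}

if wait_n_honors_ids; then
  echo "wait -n should honor its id arguments here"
else
  echo "wait -n is expected to ignore its id arguments here"
fi
```

BASH_VERSINFO is a standard Bash array; index 0 holds the major version and index 1 the minor version.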

As symcbean points out in their answer, Bash's wait works similarly to the system call wait(), in that it returns immediately if an already-exited child exists. Part of the job of "waiting" is to pass the exit status of the child to the user program, and it would not be a good idea for the exit status to get lost just because the parent was too slow to start waiting before the child exited. (Especially since a terminating child process sends a SIGCHLD to the parent when it dies, and a parent reacting to that signal can only ever call wait() after the child's exit.)
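
If the goal is to wait for particular processes regardless of Bash version, one option is to record each PID from $! and call plain wait (without -n) on each one, since plain wait with a PID argument does not depend on the 5.1 change. A minimal sketch, where wait_each is a hypothetical helper name and the sleep durations and exit codes are arbitrary:

```bash
#!/bin/bash
# Hypothetical helper: wait for each given PID in turn and print the exit
# status that Bash recorded for it.
wait_each() {
  local pid
  for pid in "$@"; do
    wait "$pid"              # plain wait (no -n) targets exactly this PID
    echo "exit status $?"
  done
}

(sleep 0.2; exit 3) &
first=$!                     # $! is the PID of the most recent background job
(sleep 1; exit 4) &
second=$!

sleep 0.5                    # the first job has already exited by now
wait_each "$first" "$second"
```

Note that the helper has to run in the same shell that started the jobs. Calling it inside $(...) would run it in a subshell, where the PIDs are not children and wait cannot find them. This waits in start order rather than completion order, so it does not replace wait -n where the order of completion matters.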

1
  • Thanks for the details and reference to the changelog. The functionality of wait now makes sense but wasn't quite clear from just the man page. I'll have to use something different to just wait -n for my purposes, but at least I understand it now :) Commented Sep 23, 2024 at 6:53
3

the wait command is immediately returning for each of the test_func calls that finished before it was first called

Yes. That's expected.

When you run something in the background, it becomes a zombie process when it exits. The parent process (usually the thing that started it) needs to reap the pid. Bash does this almost immediately, but it also maintains its own list of the processes it has started (see jobs). The dead process has been removed from the kernel process list but is still present in Bash's job list until it is deleted by wait.
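
The difference between the kernel's view and Bash's own bookkeeping can be observed with a small experiment. This is a sketch under the assumption that Bash has reaped the child by the time the foreground sleep returns; probe_child is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: check whether PID $1 still exists as a process, then
# ask Bash for the exit status it recorded for that child.
probe_child() {
  local pid=$1 alive=no status
  kill -0 "$pid" 2>/dev/null && alive=yes   # signal 0 only tests for existence
  wait "$pid"
  status=$?
  echo "alive=$alive status=$status"
}

(exit 7) &
child=$!
sleep 1                      # give the child time to exit and be reaped
probe_child "$child"
```

On a typical Linux system this should report that the process no longer exists while wait still returns its exit status of 7, which is the same bookkeeping that lets wait -n return immediately for jobs that finished earlier.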

1
  • Thanks so much for this concise response. I think the bash manual is a bit misleading on what exactly wait will return for, but yours and ilkkachu's answers helped clear it up! Commented Sep 23, 2024 at 6:49
