GNU Parallel has changed over the past 7 years, so today it can do this:
This example shows that more blocks are given to processes 11 and 10 than to processes 4 and 5, because 4 and 5 read more slowly (each job's argument sets the rate at which pv -qL lets it read):
seq 100000001000000 |
parallel -j8 --tag --roundrobin --pipe --block 10001k 'pv -qL {}000000000 | wc' ::: 11 4 5 6 9 8 7 10
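
To see why the slow readers end up with fewer blocks, here is a toy model in Python. It is only a sketch of the idea, not GNU Parallel's actual implementation: it assumes each block goes to whichever reader becomes free first, and that a reader with rate r needs time proportional to 1/r per block. The function name distribute_blocks and the block count of 6000 are made up for illustration.

```python
import heapq
from math import prod


def distribute_blocks(rates, n_blocks):
    """Toy model: hand each block to the reader that is free earliest.

    `rates` are relative read speeds (assumed distinct, positive integers).
    Returns a dict mapping each rate to the number of blocks it received.
    """
    # Integer time scale divisible by every rate, so no float rounding occurs.
    unit = prod(rates)
    # Heap of (time at which the reader is next free, rate); all start idle.
    free_at = [(0, r) for r in rates]
    heapq.heapify(free_at)
    counts = {r: 0 for r in rates}
    for _ in range(n_blocks):
        t, r = heapq.heappop(free_at)  # reader that is ready first
        counts[r] += 1
        # A faster reader (larger r) is busy for a shorter time per block.
        heapq.heappush(free_at, (t + unit // r, r))
    return counts


# Same relative rates as the arguments after ::: in the command above.
counts = distribute_blocks([11, 4, 5, 6, 9, 8, 7, 10], 6000)
for rate, n in sorted(counts.items()):
    print(rate, n)
```

In this model the per-reader counts come out roughly proportional to the rates, which is the same pattern the wc output of the real command shows per tag.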