Pausing and then resuming a piped command

Question

I'm downloading a lot of data for my research. The download runs on one of my campus's supercomputers, but it is interrupted every hour. Each time the OS pauses the pipeline, I have to delete every line of the text file that corresponds to a file that has already been downloaded. That's not hard, but it is annoying and I would prefer not to do it. Here is how I am downloading everything:

cat subset.txt | tr -d '\r' | xargs -P 4 -n 1 curl -LJO -s -n --globoff -c ~/.urs_cookies -b ~/.urs_cookies

Each URL is passed to curl, and xargs gives me 4 parallel downloads. Is there a way to pause the entire pipeline and resume it later?

You can send SIGSTOP and SIGCONT to the cat process, if this is what you mean.
– Devolus, Commented May 11, 2021 at 14:08
That does not seem to work. I think I need to pause xargs instead, since all of the URLs have already been processed by the first 2 steps of the pipe.
– K. Shores, Commented May 11, 2021 at 14:17
What do you mean by "the OS pauses the pipeline"? Are the processes in the pipeline actually killed? If not, why are files re-downloaded (I assume this is the reason for you to remove lines from your text file)? And, if yes, how can suspending the pipeline help?
– fra-san, Commented May 11, 2021 at 15:50
There is a 60-minute CPU process time limit enforced by the OS, according to the documentation for this supercomputer put out by the IT department. I was under the impression that curl still attempts to download the file with --continue-at, but perhaps I am wrong. I was also unaware that curl would parallelize anything. I don't see anything in the man page about curl being parallel.
– K. Shores, Commented May 11, 2021 at 16:31

Eduardo Trápani · Accepted Answer · 2021-05-11 22:10:47Z


You could let curl do the parallel downloads itself with the -Z (--parallel) option. You need at least version 7.66.0 for that, and note that more parallel-related flags have been added in the versions after 7.66.

The bare bones command would be:

curl --config myconfig.txt -Z ...

where myconfig.txt contains the list of URLs in this format (you can add other flags there too, for example to rename the output or resume downloads):

url = "http://example.com/a"
url = "http://example.com/j"

You can find more information on the config file at their site.
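As a rough sketch of how the question's subset.txt could be turned into such a config file: the function name make_curl_config below is made up for illustration, and it assumes one URL per line, with no double quotes or backslashes inside the URLs (those would need escaping in a quoted config value). It strips the carriage returns the same way the original pipeline does and skips blank lines.

```shell
# Hypothetical helper (not part of curl): convert a list of URLs, one per
# line and possibly with CRLF line endings, into a curl config file that
# has one `url = "..."` entry per URL.
#   $1: input file with the URLs (e.g. subset.txt)
#   $2: output config file (e.g. myconfig.txt)
make_curl_config() {
    # tr removes carriage returns; awk skips empty lines (NF == 0)
    # and wraps every remaining line in url = "...".
    tr -d '\r' < "$1" | awk 'NF { printf "url = \"%s\"\n", $0 }' > "$2"
}
```

After running make_curl_config subset.txt myconfig.txt, something along the lines of curl --config myconfig.txt -Z --parallel-max 4 --remote-name-all, plus the cookie and authentication flags from the question, should roughly match the xargs -P 4 setup. The --parallel-max and --remote-name-all flags are my additions here, so check them against your curl version.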

