I'm downloading a lot of data for my research on one of my campus's supercomputers, but the downloads are interrupted every hour. Each time the OS pauses the pipeline, I have to delete from the text file all of the lines for files that have already been downloaded. It's not hard, but it is annoying and I would prefer not to do it. Here is how I am downloading everything:

cat subset.txt | tr -d '\r' | xargs -P 4 -n 1 curl -LJO -s -n --globoff -c ~/.urs_cookies -b ~/.urs_cookies

Each URL is passed to curl, and xargs gives me 4 parallel downloads. Is there a way to pause the entire pipeline and resume it later?

  • You can send SIGSTOP and SIGCONT to the cat process, if this is what you mean. Commented May 11, 2021 at 14:08
  • Hm. I shall try it out. Thanks Commented May 11, 2021 at 14:12
  • That does not seem to work. I think I need to pause xargs instead since all of the URLs have already been processed by the first 2 steps of the pipe Commented May 11, 2021 at 14:17
  • What do you mean by "the OS pauses the pipeline"? Are the processes in the pipeline actually killed? If not, why are files re-downloaded (I assume this is the reason for you to remove lines from your text file)? And, if yes, how can suspending the pipeline help? Commented May 11, 2021 at 15:50
  • There is a 60 minute CPU process time limit enforced by the OS, according to the documentation of this supercomputer put out by the IT department. I was under the impression that curl still attempts to download the file with --continue-at, but perhaps I am wrong. I was also unaware that curl would parallelize anything. I don't see anything in the man page about curl being parallel. Commented May 11, 2021 at 16:31

1 Answer

You could let curl do the parallel downloads itself with the -Z (--parallel) option. You need at least version 7.66.0 for that; note that more parallel-related flags were added in the versions after 7.66.

The bare bones command would be:

curl --config myconfig.txt -Z ...

where myconfig.txt lists the URLs in this format (you can add other flags, for example to rename the output, resume downloads, and so on):

url = "http://example.com/a"
url = "http://example.com/j"
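
If the URLs already sit in a plain list such as the subset.txt from the question, the config file can probably be generated rather than written by hand. The following is only a sketch: the function name make_curl_config is made up for illustration, and it assumes one URL per line with no double quotes or backslashes inside the URLs (those would need escaping in a curl config file).

```shell
# Hypothetical helper: turn a one-URL-per-line list into curl config lines.
# Assumes URLs contain no double quotes or backslashes.
make_curl_config() {
    # $1: path to the URL list (may have Windows line endings)
    # tr strips carriage returns; awk skips blank lines and wraps each URL.
    tr -d '\r' < "$1" | awk 'NF { printf "url = \"%s\"\n", $0 }'
}

# Example use (file names taken from the question and the answer):
#   make_curl_config subset.txt > myconfig.txt
#   curl --config myconfig.txt -Z ...
```

The tr step mirrors the one in the question's pipeline, so a list with Windows line endings should produce the same URLs as before.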

You can find more information on the config file at their site.
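
If you would rather keep the xargs pipeline from the question, a different approach (not the -Z one above) is to log every URL that finished and filter the list against that log on each restart, so nothing has to be deleted by hand. This is a rough sketch under assumptions: remaining_urls and done.txt are hypothetical names, and a curl exit status of 0 is treated as "downloaded", which without -f also covers HTTP error pages.

```shell
# Hypothetical helper: print the URLs from the list that are not yet logged.
remaining_urls() {
    # $1: full URL list, $2: log of finished URLs (created if missing)
    [ -f "$2" ] || : > "$2"
    # -F fixed strings, -x whole-line match, -v keep only the non-matches
    tr -d '\r' < "$1" | grep -vxFf "$2"
}

# Example use, reusing the curl flags from the question; each URL is
# appended to done.txt only after curl exits successfully:
#   remaining_urls subset.txt done.txt |
#     xargs -P 4 -n 1 sh -c 'curl -LJO -s -n --globoff -c ~/.urs_cookies -b ~/.urs_cookies "$1" && printf "%s\n" "$1" >> done.txt' sh
```

A download that is cut off mid-file never reaches the log, so it is simply fetched again from the start on the next run.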
