I'm downloading a lot of data for my research on one of my campus's supercomputers, but the downloads are interrupted every hour. When the OS pauses the pipeline, I have to delete from the text file all of the lines for files that have already been downloaded. Not hard, but annoying, and I would prefer not to do it. Here is how I am downloading everything:
cat subset.txt | tr -d '\r' | xargs -P 4 -n 1 curl -LJO -s -n --globoff -c ~/.urs_cookies -b ~/.urs_cookies
Each URL is passed to curl, and xargs gives me 4 parallel downloads. Is there a way to pause the entire pipeline and continue it later on?
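One approach (a sketch, not the only answer): SIGSTOP and SIGCONT pause and resume processes without killing them. In an interactive shell, Ctrl-Z suspends the whole pipeline job and fg resumes it; from a script or another terminal, kill -STOP / kill -CONT does the same, and signalling the negative process-group ID stops xargs together with its curl children. A minimal demo below uses a sleep process as a stand-in for the pipeline (the ps state checks are only for illustration):

```shell
#!/bin/sh
# Demo: pause and resume a process with SIGSTOP/SIGCONT.
# "sleep 30" stands in for: cat subset.txt | tr ... | xargs ... curl ...
# For a real pipeline, signal the process GROUP so xargs's curl children
# stop too, e.g.:  kill -STOP -- -"$(ps -o pgid= -p "$pid" | tr -d ' ')"
sleep 30 &
pid=$!

kill -STOP "$pid"     # pause: the kernel stops the process; no CPU, no new work
sleep 0.2             # give the scheduler a moment before inspecting state
state_stopped=$(ps -o stat= -p "$pid")

kill -CONT "$pid"     # resume exactly where it left off
sleep 0.2
state_running=$(ps -o stat= -p "$pid")

kill "$pid"           # clean up the stand-in process
echo "stopped=$state_stopped running=$state_running"
```

While stopped, the process shows state "T" in ps; after SIGCONT it goes back to a normal runnable/sleeping state. SIGSTOP cannot be caught or ignored, so this works regardless of what the pipeline is doing.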
I could pause the cat process, if this is what you mean, but I think I would need to pause xargs instead, since all of the URLs have already been processed by the first 2 steps of the pipe. My understanding is that curl still attempts to download the file even with --continue-at, but perhaps I am wrong. I was also unaware that curl would parallelize anything; I don't see anything in the man page about curl being parallel.