Save multiple URL targets to text files

Question

I have a list of URLs and I want to save each of their targets in a separate text file.

Here's an example of the input file containing the URLs:

~$: head -3 url.txt 
http://www.uniprot.org/uniprot/P32234.txt
http://www.uniprot.org/uniprot/P05552.txt 
http://www.uniprot.org/uniprot/P07701.txt

I'm currently using a Python custom function to accomplish this task. It works, but the main inconvenient are: user has to copy-paste URLs manually (there's no direct file input) and the output contains some 'b' characters at the beginning of each line (?binary).

~$: head -3 P32234.txt
b' ID   128UP_DROME             Reviewed;         368 AA.
'b' AC   P32234; Q9V648;
'b' DT   01-OCT-1993, integrated into UniProtKB/Swiss-Prot.

Here's the Python code:

def html_to_txt(): 
    import urllib.request 
    url = str(input('Enter URL: ')) 
    page = urllib.request.urlopen(url) 
    with open(str(input('Enter filename: ')), "w") as f: 
        for x in page: 
            f.write(str(x).replace('\\n','\n')) 
    s= 'Done' 
    return s

Is there a cleaner way of doing this using some Unix utilities?

FYI, those b'strings with a b prefix' are Python 3 bytes objects. Such data should be written to a file opened in binary mode, and not converted to str. — Tom Zych
– Tom Zych, Commented Aug 6, 2014 at 9:59

Community · Accepted Answer · 2020-06-11 12:04:56Z

Use -i option:

wget -i ./url.txt

From man wget:

-i file

--input-file=file

Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.) If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.

However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "" to the documents or by specifying --base=url on the command line.

If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.

Michael Homer · Accepted Answer · 2014-08-06 08:47:34Z

5

wget has an option for doing exactly this:

wget --input-file url.txt

will read one URL per line out of url.txt and download them into the current directory sequentially.

More generally, you can use xargs for this sort of thing, combined with wget or curl:

xargs wget < url.txt
xargs curl -O < url.txt

xargs reads each line of its input and provides it as an argument to a command you give it. Here that command is wget or curl -O, both of which download a URL and save it into the current directory. < url.txt provides the contents of url.txt as the input to the xargs command.

The problem with your Python code is that what you get out of urllib is byte data that you're then printing directly to a file, which stringifies the bytes to b'abc\00\0a...' (which is how you write byte literals).

edited Aug 6, 2014 at 8:47

answered Aug 6, 2014 at 7:51

Michael Homer

78.9k17 gold badges221 silver badges239 bronze badges

wget also has the -i (--input-file) switch for this.

Ulrich Schwarz
– Ulrich Schwarz

2014-08-06 08:06:41 +00:00
Commented Aug 6, 2014 at 8:06
Good point! Edited.

Michael Homer
– Michael Homer

2014-08-06 08:10:42 +00:00
Commented Aug 6, 2014 at 8:10
2

@MichaelHomer: xargs solution is bad one since when it will call some wget process for many urls. Use -i option calls one wget process instead.

cuonglm
– cuonglm

2014-08-06 08:22:12 +00:00
Commented Aug 6, 2014 at 8:22

Add a comment |

mikeserv · Accepted Answer · 2014-08-06 09:14:55Z

with w3m:

echo 'http://unix.stackexchange.com/questions/148670/save-html-to-text-file' |
tee - - - | 
xargs -n1 w3m -dump | 
sed '/Save html/!d;N;N;N;N;N;N;N'

It seems to me that xargs shouldn't even be necessary - surely there's a setting for multiple urls at once, but I can't grok it at the moment. In any case, xargs works:

Save html to text file

            I'd like to save some (plain HTML) web pages to text file, from URL
            stored in text files as well.

            Here's an exemple of the input file containing the URLs:

            ~$: head -3 url.txt
Save html to text file

            I'd like to save some (plain HTML) web pages to text file, from URL
            stored in text files as well.

            Here's an exemple of the input file containing the URLs:

            ~$: head -3 url.txt
Save html to text file

            I'd like to save some (plain HTML) web pages to text file, from URL
            stored in text files as well.

            Here's an exemple of the input file containing the URLs:

            ~$: head -3 url.txt
Save html to text file

            I'd like to save some (plain HTML) web pages to text file, from URL
            stored in text files as well.

            Here's an exemple of the input file containing the URLs:

            ~$: head -3 url.txt

terdon · Accepted Answer · 2014-08-06 14:32:14Z

2

I would do this in the shell with wget.

while read y; do
     wget "$y"
done < url.txt

edited Aug 6, 2014 at 14:32

terdon♦

253k69 gold badges481 silver badges719 bronze badges

answered Aug 6, 2014 at 7:51

chicks

1,1281 gold badge9 silver badges27 bronze badges

1

The cat file in a for loop is discouraged. See read builtin. You'd want to use the file as input and iterate over the lines cointained within the file.

Valentin Bajrami
– Valentin Bajrami

2014-08-06 08:28:02 +00:00
Commented Aug 6, 2014 at 8:28

Add a comment |

Valentin Bajrami · Accepted Answer · 2014-08-06 08:18:02Z

1

There are two other methods:

wget $(<file)

and

while read -r link; do wget "$link"; done < file

answered Aug 6, 2014 at 8:18

Valentin Bajrami

9,5773 gold badges28 silver badges39 bronze badges

Add a comment |

terdon · Accepted Answer · 2014-08-06 14:28:49Z

1

Personally, I would just keep the UniProt ACs in the file:

$ cat names
P32234
P05552
P07701

You can then use the same file for various operations. For example, to download the corresponding flat file from UniProt, feed it into a loop:

while read prot; do 
    wget http://www.uniprot.org/uniprot/"$prot".txt -O "$prot".flat
done < names

Since your file now just has the accessions, you can re-use it to get, for example, the corresponding IDs:

$ while read prot; do  
    printf "%s\t" "$prot"
    wget http://www.uniprot.org/uniprot/"$prot".txt -O - | 
        awk '$1=="ID"{print $2}'; 
 done 2>/dev/null < names 
P32234  128UP_DROME
P05552  ADF1_DROME
P07701  SGS5_DROME

answered Aug 6, 2014 at 14:28

terdon♦

253k69 gold badges481 silver badges719 bronze badges

That 's smart !

dovah
– dovah

2014-08-06 15:32:16 +00:00
Commented Aug 6, 2014 at 15:32

Add a comment |

Stack Exchange Network

Save multiple URL targets to text files

6 Answers 6

You must log in to answer this question.

Hot Network Questions

Save multiple URL targets to text files

6 Answers 6

You must log in to answer this question.

Related

Hot Network Questions