How to read a url directory in Python

Question

I can use the urllib module to open a url file and read its contents.

>>> import urllib
>>> urllib.urlopen('file://localhost/tmp/foobar').read()

The above does not work with a directory. I want to read (list the contents of) a URL directory. How can I do that?

Added: for some reason I am failing to explain this clearly. I have a URL of a directory that I have permission to access. It could be anything: a local directory, a remote directory, ftp://, http://, or anything://. The evidence that I have access is that I can execute urllib.urlopen(url of a file in that directory).read() and it works. My question is how to do the same for the directory itself, listing its contents.

If I guess correctly the name of a file in the url directory, then I can get to that file, as above. Then it seems to me, there should be a way to do it without guessing, that is, get the list of files first.

I could do a (very long) search over all possible names: request every 1-character name, then every 2-character combination, and so forth. Although this is impractical, it shows that in principle I can eventually get at all the names of the files. So there should be a way to do this quickly.

What do you mean by opening a file that is a directory? Do you mean when you visit a website that serves up something that looks like a directory listing?
– David Sanders, Commented Apr 22, 2014 at 20:16
Why are you using urllib to read local files instead of using the ordinary functions to access files (File objects)?
– Martin, Commented Apr 22, 2014 at 20:41
URLs are a way of providing the location of a resource, but they do not say anything about which capabilities the resource provides. The file scheme doesn't provide a method to list the contents of a directory, and for the http scheme it depends on the configuration of the web server. There are multiple other schemes, and some have no notion of directories (see en.wikipedia.org/wiki/…). You simply can't do what you think you "should" be able to do without attempting to guess the resource names.
– Martin, Commented Apr 22, 2014 at 20:59

Luigi · Accepted Answer · 2014-04-27 05:54:33Z

In short, yes, but use requests.

I'm going to give an example using the requests module, as it is generally preferred over using urllib directly (and it takes literally three lines of code).

I'll be using the following directory as an example, which I think is what you mean by 'file directory':

>>> import requests
>>> r = requests.get('http://www.tulane.edu/~howard/SPAN-NLP/mp3/')
>>> print(r.text)

This directory contains a list of podcasts. Here is the result of r.text:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /~howard/SPAN-NLP/mp3</title>
 </head>
 <body>
<h1>Index of /~howard/SPAN-NLP/mp3</h1>
<ul><li><a href="/~howard/SPAN-NLP/"> Parent Directory</a></li>
<li><a href="SPAN4350-01-Intro.MP3"> SPAN4350-01-Intro.MP3</a></li>
<li><a href="SPAN4350-02-CompLeng1.MP3"> SPAN4350-02-CompLeng1.MP3</a></li>
<li><a href="SPAN4350-03-ListasCadenas.MP3"> SPAN4350-03-ListasCadenas.MP3</a></li>
<li><a href="SPAN4350-04-Cadenas2.MP3"> SPAN4350-04-Cadenas2.MP3</a></li>
<li><a href="SPAN4350-05-Cadenas3.MP3"> SPAN4350-05-Cadenas3.MP3</a></li>
<li><a href="SPAN4350-06-Cadenas4.MP3"> SPAN4350-06-Cadenas4.MP3</a></li>
<li><a href="SPAN4350-09-UnicodeRegex.MP3"> SPAN4350-09-UnicodeRegex.MP3</a></li>
<li><a href="SPAN4350-10-Regex.MP3"> SPAN4350-10-Regex.MP3</a></li>
<li><a href="SPAN4350-11-Regextoken.MP3"> SPAN4350-11-Regextoken.MP3</a></li>
<li><a href="SPAN4350-12-NLTK.MP3"> SPAN4350-12-NLTK.MP3</a></li>
<li><a href="SPAN4350-13-NLTK_Control.MP3"> SPAN4350-13-NLTK_Control.MP3</a></li>
<li><a href="SPAN4350-14-Control2.MP3"> SPAN4350-14-Control2.MP3</a></li>
<li><a href="SPAN4350-15-Control3.MP3"> SPAN4350-15-Control3.MP3</a></li>
<li><a href="SPAN4350-16-Control4.MP3"> SPAN4350-16-Control4.MP3</a></li>
<li><a href="SPAN4350-17-Control5.MP3"> SPAN4350-17-Control5.MP3</a></li>
<li><a href="SPAN4350-18-ReciclarCodigo.MP3"> SPAN4350-18-ReciclarCodigo.MP3</a></li>
<li><a href="SPAN4350-19-Funciones.MP3"> SPAN4350-19-Funciones.MP3</a></li>
<li><a href="SPAN4350-21-Funciones2.MP3"> SPAN4350-21-Funciones2.MP3</a></li>
<li><a href="SPAN4350-22-ComputacionLeng.MP3"> SPAN4350-22-ComputacionLeng.MP3</a></li>
<li><a href="SPAN4350-23-ComputacionLeng2.MP3"> SPAN4350-23-ComputacionLeng2.MP3</a></li>
<li><a href="SPAN4350-24-ComputacionLeng3.mp3"> SPAN4350-24-ComputacionLeng3.mp3</a></li>
<li><a href="SPAN4350-25-ComputacionLeng4.MP3"> SPAN4350-25-ComputacionLeng4.MP3</a></li>
<li><a href="SPAN4350-26-ComputacionLeng5.MP3"> SPAN4350-26-ComputacionLeng5.MP3</a></li>
<li><a href="SPAN4350-27-Tuiter.MP3"> SPAN4350-27-Tuiter.MP3</a></li>
<li><a href="SPAN4350-30-Tuiter3.MP3"> SPAN4350-30-Tuiter3.MP3</a></li>
<li><a href="SPAN4350-31-Tuiter4.MP3"> SPAN4350-31-Tuiter4.MP3</a></li>
<li><a href="SPAN4350-32-Web.MP3"> SPAN4350-32-Web.MP3</a></li>
<li><a href="SPAN4350-33-Web2.MP3"> SPAN4350-33-Web2.MP3</a></li>
<li><a href="SPAN4352-34-Youtube.MP3"> SPAN4352-34-Youtube.MP3</a></li>
<li><a href="SPAN4352-35-Youtube2.MP3"> SPAN4352-35-Youtube2.MP3</a></li>
</ul>
</body></html>

As you can see, it's basically a representation of all the files in the directory as an HTML document. You could easily extract all the links using regular expressions and iterate over them to access all of the files.
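Here is a minimal sketch of that extraction step. Instead of regular expressions it uses the standard library's html.parser, which tends to be less fragile on real markup. The names LinkCollector and list_links are made up for illustration, and LISTING_HTML is just a shortened copy of the listing above so the snippet runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed sample: a shortened copy of the listing shown above.
LISTING_HTML = """
<ul><li><a href="/~howard/SPAN-NLP/"> Parent Directory</a></li>
<li><a href="SPAN4350-01-Intro.MP3"> SPAN4350-01-Intro.MP3</a></li>
<li><a href="SPAN4350-02-CompLeng1.MP3"> SPAN4350-02-CompLeng1.MP3</a></li>
</ul>
"""
BASE_URL = 'http://www.tulane.edu/~howard/SPAN-NLP/mp3/'


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def list_links(html, base_url):
    """Return absolute URLs for the entries, skipping the parent link."""
    collector = LinkCollector()
    collector.feed(html)
    # Resolve relative hrefs against the directory URL.
    urls = [urljoin(base_url, href) for href in collector.hrefs]
    # Keep only entries that live under the directory itself.
    return [u for u in urls if u.startswith(base_url) and u != base_url]


file_urls = list_links(LISTING_HTML, BASE_URL)
for url in file_urls:
    print(url)
```

With a live response you would pass r.text and r.url instead of the sample constants, and then call requests.get on each returned URL to fetch the files.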

This will only work if the server hosting the files is configured to return this type of document. Many are, but if a server is configured otherwise, I don't know of another way to do so programmatically.
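The local case from the question is a partial exception, though it steps outside the URL machinery. A file:// URL can be turned back into a filesystem path and listed with os.listdir. This is only a sketch: list_file_url is a made-up name, and it assumes the URL points at a directory on the machine running the code.

```python
import os
from urllib.parse import urlparse
from urllib.request import url2pathname


def list_file_url(url):
    """List the entries of a local directory given as a file:// URL (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError('not a file:// URL: %r' % url)
    # The host part (e.g. 'localhost') is ignored; the path is assumed to be local.
    path = url2pathname(parsed.path)
    return sorted(os.listdir(path))
```

For example, list_file_url('file://localhost/tmp') would return the sorted names in /tmp on a machine where that directory exists.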

Also, you probably shouldn't brute-force all character combinations. There are much better ways to guess: people generally use words as file names, possibly with a number at the end, and the words usually relate to the contents of the file, so you can narrow the search if you know what type of thing you're looking for.
