0

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific piece of data from each one. My problem is that I don't know how to iterate through all of the existing pages without knowing the individual URLs ahead of time. For example, I want to visit every page whose URL starts with

"http://stackoverflow.com/questions/"

Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of URLs?

3 Answers

5

Try Scrapy.

It handles all of the crawling for you and lets you focus on processing the data rather than on fetching it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.

2 Comments

+1 for Scrapy. It has a bit of a learning curve, but it's easy to use once you get the hang of it.
Thanks, I think I'll try that. My problem isn't really processing the data, but the search for it. I suppose if I had known the technical terms, I could have looked this up myself. Thanks for the help!
0

To grab a specific piece of data from a website, you could use a web scraping tool such as Scrapy.

If the required data is generated by JavaScript, you might need a browser-like tool such as Selenium WebDriver, and you would have to implement the crawling of the links by hand.
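
For pages that don't need JavaScript, crawling the links by hand can be sketched with the standard library alone. This is only a rough sketch, not a replacement for Scrapy: the names LinkParser, extract_links, fetch and crawl are made up for illustration, and the max_pages limit is an arbitrary assumption. It follows only links that start with a given prefix and remembers the URLs it has already queued, so you never need the full list of URLs up front.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_url, html):
    """Return absolute URLs (without #fragments) for the links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(page_url, href)).url for href in parser.links]


def fetch(url):
    """Download a page and return its text (no retries, no throttling)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from start_url, following only links that
    start with prefix. Yields (url, html) for each page that loads."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is fetched twice
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to load
        visited += 1
        yield url, html
        for link in extract_links(url, html):
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
```

You would loop over crawl("http://stackoverflow.com/questions/", "http://stackoverflow.com/questions/") and pull your data out of each html string. A real crawler should also respect robots.txt and pause between requests, which this sketch does not do.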

-2

For example, you can make a simple for loop, like this:

def webIterate():
    # Every question URL is the base link followed by a numeric ID.
    base_link = "http://stackoverflow.com/questions/"
    for i in range(24):  # IDs 0 through 23
        print("%s%d" % (base_link, i))

webIterate()

The output will be:

http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
http://stackoverflow.com/questions/3
...
http://stackoverflow.com/questions/23

It's just an example. You can pass in the question numbers and do whatever you want with them.
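
Passing the question numbers in could look roughly like this. The names BASE_LINK and question_urls are hypothetical, and the function is a generator, so it never builds the whole list of URLs in memory:

```python
BASE_LINK = "http://stackoverflow.com/questions/"


def question_urls(question_ids):
    """Yield one question URL per ID; question_ids is any iterable of ints."""
    for question_id in question_ids:
        yield "%s%d" % (BASE_LINK, question_id)


if __name__ == "__main__":
    # Print the same 24 URLs as the loop above.
    for url in question_urls(range(24)):
        print(url)
```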

4 Comments

I think StackOverflow was just an example. Other websites don't have such a well-defined URL scheme and need to be parsed via crawling.
Maybe. But it would be much easier to help the author if he told us the real site that needs to be aggregated :)
I see how that would work, but Stack Overflow was just an example; the site I'm trying to search doesn't use numerical values to number pages.
Give us an example and we'll try to find a solution :)
