
So I have some code that I use to scrape through my mailbox looking for certain URLs. Once this is completed, it creates a file called links.txt.

I want to run a script against that file to get an output of all the URLs in that list that are currently live. The script I have only allows me to check one URL at a time:

import urllib2

for url in ["www.google.com"]:

    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()
  • So, you just need to know how to read lines from a text file? Commented Aug 13, 2012 at 21:29

2 Answers


Use requests:

import requests

with open(filename) as f:
    good_links = []
    for link in f:
        try:
            r = requests.get(link.strip())
        except Exception:
            continue
        good_links.append(r.url)  # resolves redirects

You can also consider extracting the call to requests.get into a helper function:

def make_request(method, url, **kwargs):
    for i in range(10):  # retry up to 10 times before giving up
        try:
            r = requests.request(method, url, **kwargs)
            return r
        except requests.ConnectionError as e:
            print e.message
        except requests.HTTPError as e:
            print e.message
        except requests.RequestException as e:
            print e.message
    raise Exception("requests did not succeed")



It is trivial to make this change, given that you're already iterating over a list of URLs:

import urllib2

for url in open("urllist.txt"):   # change 1

    try:
        connection = urllib2.urlopen(url.rstrip())   # change 2
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()

Iterating over a file returns the lines of the file (complete with line endings). We use rstrip() on the URL to strip off the line endings.

There are other improvements you can make. For example, some will suggest you use with to make sure your file is closed. This is good practice but probably not necessary in this script.
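
If you do want to use with, a sketch of the same loop written that way (same urllist.txt filename and the same Python 2 style as above) could look like this:

import urllib2

with open("urllist.txt") as f:    # the file is closed automatically when the block ends
    for url in f:
        try:
            connection = urllib2.urlopen(url.rstrip())
            print connection.getcode()
            connection.close()
        except urllib2.HTTPError, e:
            print e.getcode()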

4 Comments

#!/usr/bin/python
import urllib2

for url in open("ZeuS_links"):   # change 1
    try:
        connection = urllib2.urlopen(url.rstrip())   # change 2
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError, e:
        print e.getcode()
Kindall, thanks for the assistance, the script is now working. The only thing I would like to find out now is how to push out the actual URLs instead of 200 or 401. I will continue to work with it and update as I go. Thanks again for your help.
You already have url as the URL you're trying to retrieve; just print that.
That did it; sometimes the small things are right in front of you. Thanks a lot, kindall. Awesome, awesome, awesome.
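
For reference, a small sketch of printing only the live URLs instead of the status codes, following the suggestion in the comments (the links.txt filename and treating any successful open as "live" are assumptions):

import urllib2

for url in open("links.txt"):
    url = url.rstrip()
    try:
        connection = urllib2.urlopen(url)
        connection.close()
        print url                      # the request succeeded, so report the URL as live
    except urllib2.HTTPError:
        pass                           # skip URLs that return an HTTP error status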
