I have a Python script for scraping some URLs. The URLs are listed in a txt file.

The relevant parts of the Python script are as follows:

import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://www.example.com/post/1245'

# rest of the code is here

print quote_page
print url
print title
print description
print actors
print director

I would like to run this script for multiple URLs listed in a txt file and write the output to a single txt file.

Any ideas on how I can run this for the URLs in my txt file?

1 Answer

You will likely want to use the Python with statement (introduced in PEP 343) and the built-in open() function:

# Python 2
import urllib2
from bs4 import BeautifulSoup

# Python 3 (urllib2 was split into urllib.request and urllib.error)
# from urllib.request import urlopen
# from bs4 import BeautifulSoup

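# Python 2.7+ and Python 3.1+ (one with statement managing two files)
with open('urls.txt', 'r') as url_file, open('output.txt', 'w') as output_file:

    url_list = url_file.readlines()

    for url_item in url_list:

        # quote_page = 'https://www.example.com/post/1245'
        # readlines() keeps the trailing newline, so strip it from the URL
        quote_page = url_item.strip()

        # rest of the code is here

        # Python 2 and 3: write one field per line, as the print statements did
        output_file.write(quote_page + '\n')
        output_file.write(url + '\n')
        output_file.write(title + '\n')
        output_file.write(description + '\n')
        output_file.write(actors + '\n')
        output_file.write(director + '\n')
        # Blank line between records
        output_file.write('\n')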

In this instance, we:

  1. open() file handles (url_file, output_file) to our input and output text files ('urls.txt', 'output.txt') at the same time, using 'r' for reading and 'w' for writing, respectively.

  2. Use the with statement to close these files automatically once we have finished processing our URLs. Without with, we would need to call url_file.close() and output_file.close() ourselves (e.g. at Step 5).

  3. Put our URLs into a list (url_list = url_file.readlines()).

  4. Loop through our URL list and write() the data we want to our output_file.

  5. close() both of our files automatically when the with block ends (see Step 2).
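
The steps above can be sketched as a small reusable function. This is only a minimal Python 3 sketch under stated assumptions: scrape_urls, scrape_page and fake_scrape are hypothetical names that do not appear in the original script, the second example URL is made up, and fake_scrape stands in for the real urllib/BeautifulSoup code so that the example runs without any network access.

```python
import os
import tempfile


def scrape_urls(url_path, output_path, scrape_page):
    """Call scrape_page for every URL in url_path and write the results to output_path.

    scrape_page is a hypothetical callable: it takes a URL string and returns
    a list of strings (for example title, description, actors, director).
    """
    with open(url_path, 'r') as url_file, open(output_path, 'w') as output_file:
        for line in url_file:
            url = line.strip()  # drop the trailing newline
            if not url:
                continue  # skip blank lines in the URL file
            output_file.write(url + '\n')
            for field in scrape_page(url):
                output_file.write(field + '\n')
            output_file.write('\n')  # blank line between records
    # Both files are closed here, even if scrape_page raised an exception.


def fake_scrape(url):
    # Stand-in for the real scraping code; makes no network request.
    return ['title for ' + url, 'description for ' + url]


# Build a small URL file in a temporary directory and process it.
work_dir = tempfile.mkdtemp()
url_path = os.path.join(work_dir, 'urls.txt')
output_path = os.path.join(work_dir, 'output.txt')

with open(url_path, 'w') as f:
    f.write('https://www.example.com/post/1245\n')
    f.write('\n')
    f.write('https://www.example.com/post/1246\n')

scrape_urls(url_path, output_path, fake_scrape)

with open(output_path, 'r') as f:
    print(f.read())
```

Passing the scraper in as an argument keeps the file handling separate from the parsing, so the real scraping code can replace fake_scrape without touching the loop.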

Note that to add data to an existing output file rather than replace it, use 'a' (append mode) instead of 'w' (write mode): open('output.txt', 'w') as output_file becomes open('output.txt', 'a') as output_file. This matters because 'w' truncates the file if it already exists (i.e. you lose the original data).
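
The difference between the two modes can be checked with a short experiment. This is a minimal Python 3 sketch; the temporary file location and the 'first run' / 'second run' / 'third run' strings are made up for illustration.

```python
import os
import tempfile

output_path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Create the file with some initial content.
with open(output_path, 'w') as output_file:
    output_file.write('first run\n')

# 'a' keeps the existing content and writes at the end of the file.
with open(output_path, 'a') as output_file:
    output_file.write('second run\n')

with open(output_path, 'r') as output_file:
    after_append = output_file.read()

# 'w' truncates the file as soon as it is opened, so earlier content is gone.
with open(output_path, 'w') as output_file:
    output_file.write('third run\n')

with open(output_path, 'r') as output_file:
    after_write = output_file.read()

print(repr(after_append))  # 'first run\nsecond run\n'
print(repr(after_write))   # 'third run\n'
```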
