Reading multiple urls from text file and processing web page

Question

The input to the script is a text file with multiple urls from web pages. The intended steps in the script are as follows:

read a url from the text file
strip the url to use it as a name for the output file (fname)
use the regex ‘clean_me’ to clean up the content of url/web page.
write the contents to the file (fname)
repeat for each file in the input file.

This is the contents of the input file urloutshort.txt;

http://feedproxy.google.com/~r/autonews/ColumnistsAndBloggers/~3/6HV2TNAKqGk/diesel-with-no-nox-emissions-it-may-be-possible

http://feedproxy.google.com/~r/entire-site-rss/~3/3j3Hyq2TJt0/kyocera-corp-opens-its-largest-floating-solar-power-plant-in-japan.html

http://feedproxy.google.com/~r/entire-site-rss/~3/KRhGaT-UH_Y/crews-replace-rhode-island-pole-held-together-with-duct-tape.html

This is the script :

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
    s.decompose()       
    return ' '.join(soup.stripped_strings)
with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname=(url.replace('http://',' '))
        fname = fname.replace ('/',' ')
        print (fname)
        cln = clean_me(page)
        with open (fname +'.txt', 'w') as outfile:              
        outfile.write(cln +"\n")

This is the error message;

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage_A.py", line 43, in <module>
    with open (fname +'.txt', 'w') as outfile:                              
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk 
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'

The problem appears related to reading the url(s) from the text file because if I bypass the script for reading the input file and just hard code one of the urls then the script will process the web page and save the results to a txt file with a name extracted from the url. I have searched the topic on SO, but have not found a solution.

Help with this issue will be greatly appreciated.

toheedNiaz · Accepted Answer · 2018-04-10 21:16:43Z

2

The issue is with the following piece of code :

    with open (fname +'.txt', 'w') as outfile:              
    outfile.write(cln +"\n")

fname contains "\n" which cannot be a valid file name to open. All you need to do is just change it to this

    with open (fname.rstrip() +'.txt', 'w') as outfile:              
    outfile.write(cln +"\n")

Full Code fix included:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
        return ' '.join(soup.stripped_strings)


with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = (url.replace('http://', ''))
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")

Hope this helps

edited Apr 10, 2018 at 21:16

answered Apr 9, 2018 at 20:52

toheedNiaz

1,4551 gold badge10 silver badges14 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

RJS Over a year ago

Made the change to the script as suggested. The script processes the first url as intended but not the subsequent urls in urloutshort.txt. I changed the order of the urls in the file but that did not change the outcome; the first url gets processed but not the subsequent ones.

RJS Over a year ago

python : Traceback (most recent call last): At line:1 char:1 + python webpage.py + ~~~~~~~~~~~~~~~~~ + CategoryInfo : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException + FullyQualifiedErrorId : NativeCommandError

RJS Over a year ago

File "webpage.py", line 33, in <module> page = requests.get(url.strip()) File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 72, in get return request('get', url, params=params, **kwargs)

RJS Over a year ago

File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 58, in request return session.request(method=method, url=url, **kwargs) File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 494, in request prep = self.prepare_request(req)

RJS Over a year ago

File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 437, in prepare_request hooks=merge_hooks(request.hooks, self.hooks), File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\models.py", line 305, in prepare self.prepare_url(url, params)

|

Collectives™ on Stack Overflow

Reading multiple urls from text file and processing web page

1 Answer 1

10 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Related