The input to the script is a text file containing several URLs of web pages. The script is intended to:

  • read a URL from the text file
  • strip the URL down to a string usable as the output file name (fname)
  • use the function ‘clean_me’ to clean up the content of the web page
  • write the cleaned content to the file (fname)
  • repeat for each URL in the input file.

These are the contents of the input file, urloutshort.txt:

http://feedproxy.google.com/~r/autonews/ColumnistsAndBloggers/~3/6HV2TNAKqGk/diesel-with-no-nox-emissions-it-may-be-possible

http://feedproxy.google.com/~r/entire-site-rss/~3/3j3Hyq2TJt0/kyocera-corp-opens-its-largest-floating-solar-power-plant-in-japan.html

http://feedproxy.google.com/~r/entire-site-rss/~3/KRhGaT-UH_Y/crews-replace-rhode-island-pole-held-together-with-duct-tape.html

This is the script:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)
with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname=(url.replace('http://',' '))
        fname = fname.replace ('/',' ')
        print (fname)
        cln = clean_me(page)
        with open (fname +'.txt', 'w') as outfile:
            outfile.write(cln +"\n")

This is the error message:

python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage_A.py", line 43, in <module>
    with open (fname +'.txt', 'w') as outfile:                              
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk 
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'

The problem appears to be related to reading the URLs from the text file: if I bypass the code that reads the input file and hard-code one of the URLs, the script processes the web page and saves the result to a .txt file named after the URL. I have searched for the topic on SO but have not found a solution.

Help with this issue will be greatly appreciated.

1 Answer

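The issue is with the following piece of code:

    with open (fname +'.txt', 'w') as outfile:
        outfile.write(cln +"\n")

Each line read from the input file ends with "\n". The script strips it for requests.get(url.strip()), but fname is built from the unstripped url, so it still ends with "\n", which is not valid in a file name. Stripping fname before opening the file fixes this:

    with open (fname.rstrip() +'.txt', 'w') as outfile:
        outfile.write(cln +"\n")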
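
To see the effect in isolation, the sketch below builds a name from a stripped and an unstripped line. The example.com URL is made up for illustration and stands in for a line of urloutshort.txt. Note that str.strip() returns a new string and leaves the original unchanged.

```python
# A line as it comes out of `for url in filein:` (hypothetical URL for illustration).
url = "http://example.com/some/page\n"

stripped = url.strip()  # returns a NEW string; `url` itself still ends with "\n"

# Name built from the unstripped line, as in the question's script.
fname_bad = url.replace("http://", "").replace("/", " ")
# Name built from the stripped line.
fname_good = stripped.replace("http://", "").replace("/", " ")

print(repr(fname_bad))   # 'example.com some page\n'
print(repr(fname_good))  # 'example.com some page'
```

Printing with repr() makes the trailing newline visible, which a plain print(fname) hides.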

The full script with the fix included:

import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import re
import html5lib

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    # return only after every script/style tag has been removed
    return ' '.join(soup.stripped_strings)


with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        if "http" in url:
            page = requests.get(url.strip())
            fname = (url.replace('http://', ''))
            fname = fname.replace('/', ' ')
            print(fname)
            cln = clean_me(page)
            with open(fname.rstrip() + '.txt', 'w') as outfile:
                outfile.write(cln + "\n")
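
As a possible refinement, the name handling can be pulled into one place so the line is stripped exactly once and blank lines in the input file are skipped. This is only a sketch under stated assumptions: url_to_filename and iter_urls are hypothetical helper names, not part of the original script. The set of replaced characters is the one Windows rejects in file names, and the example.com URLs are placeholders.

```python
import io
import re

def url_to_filename(url):
    """Hypothetical helper: turn a URL into a name that is safe to pass to open()."""
    name = url.strip()                       # drop the trailing newline first
    name = re.sub(r'^https?://', '', name)   # remove the scheme
    # Replace characters Windows rejects in file names, plus any whitespace.
    name = re.sub(r'[\\/:*?"<>|\s]+', '_', name)
    return name + '.txt'

def iter_urls(lines):
    """Yield stripped, non-empty lines; blank separator lines are skipped."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

# io.StringIO stands in for open('urloutshort.txt') so the sketch runs offline.
sample = io.StringIO("http://example.com/a/b\n\nhttp://example.com/c?d=1\n")
names = [url_to_filename(u) for u in iter_urls(sample)]
print(names)  # ['example.com_a_b.txt', 'example.com_c_d=1.txt']
```

In the script above, the loop body would then call requests.get(u) and open(url_to_filename(u), 'w') for each u from iter_urls(filein).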

Hope this helps

10 Comments

Made the change to the script as suggested. The script processes the first url as intended but not the subsequent urls in urloutshort.txt. I changed the order of the urls in the file but that did not change the outcome; the first url gets processed but not the subsequent ones.
python : Traceback (most recent call last):
At line:1 char:1
+ python webpage.py
+ ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "webpage.py", line 33, in <module>
    page = requests.get(url.strip())
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 494, in request
    prep = self.prepare_request(req)
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 437, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\rschafish\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\models.py", line 305, in prepare
    self.prepare_url(url, params)