The input to the script is a text file containing multiple URLs of web pages. The intended steps in the script are as follows:
- read a URL from the text file
- strip the URL down to use it as a name for the output file (fname)
- use the 'clean_me' function to clean up the content of the URL/web page
- write the contents to the file (fname)
- repeat for each URL in the input file.
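The URL-to-filename step above can be sketched as a small helper. This is only an illustration, not part of the script below: the helper name `url_to_fname` is my own, and the `strip()`/`replace()` choices are assumptions about how the name should be built.

```python
def url_to_fname(url):
    # Strip surrounding whitespace first (including the trailing newline
    # that iterating over a file leaves on each line).
    url = url.strip()
    # Drop the scheme and replace path separators with spaces; these
    # replace() calls mirror the ones used in the script.
    name = url.replace('http://', '').replace('https://', '')
    name = name.replace('/', ' ')
    return name + '.txt'

print(url_to_fname('http://example.com/some/page\n'))
# → example.com some page.txt
```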
These are the contents of the input file urloutshort.txt:
This is the script:
import os
import sys
import requests
import bs4
from bs4 import BeautifulSoup
import html5lib
import re

def clean_me(htmldoc):
    soup = BeautifulSoup(htmldoc.text.encode('UTF-8'), 'html5lib')
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

with open('urloutshort.txt', 'r') as filein:
    for url in filein:
        page = requests.get(url.strip())
        fname = url.replace('http://', ' ')
        fname = fname.replace('/', ' ')
        print(fname)
        cln = clean_me(page)
        with open(fname + '.txt', 'w') as outfile:
            outfile.write(cln + "\n")
This is the error message:
python : Traceback (most recent call last):
At line:1 char:1
+ python webpage_A.py
+ ~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
File "webpage_A.py", line 43, in <module>
with open (fname +'.txt', 'w') as outfile:
OSError: [Errno 22] Invalid argument: ' feedproxy.google.com ~r autonews ColumnistsAndBloggers ~3 6HV2TNAKqGk
diesel-with-no-nox-emissions-it-may-be-possible\n.txt'
The problem appears to be related to reading the URLs from the text file: if I bypass the file-reading part and hard-code one of the URLs, the script processes the web page and saves the results to a txt file with a name extracted from the URL. I have searched the topic on SO but have not found a solution.
Help with this issue will be greatly appreciated.