I created a web parser using Python 3.7 and BeautifulSoup, then used "find_all" to find all tags with a certain class. One important detail is that the website I am scraping contains some Chinese characters. Here's my code:
import requests
from bs4 import BeautifulSoup
response = requests.get('URL_GOES_HERE')
soup = BeautifulSoup(response.content, 'html.parser')
posts = soup.find_all(class_='CLASS_GOES_HERE')
print(posts)
saveFile = open('index.html','w+')
saveFile.write(str(posts))
saveFile.close()
I tried outputting the data in two different ways: by printing it to the console, and by writing it to an HTML file. I did each separately, commenting out the print call when writing to the file, and vice versa.
When I run only the print call, it outputs the data to the console just fine, without any errors. However, when I run the code that writes to the HTML file, I get the following encoding error:
Traceback (most recent call last):
File "postthis.py", line 11, in <module>
saveFile.write(str(posts))
File "C:\Users\atit1\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4308-4324: character maps to <undefined>
I've spent the past 2 days trying to fix this using guidance from many similar questions on Stack Overflow. Many answers suggested adding ".encode("utf-8")", so I tried that. For example, when I call .write(str(soup)), I get the encoding error. But when I write this, it runs without an error:
saveFile.write(str(soup.encode("utf-8")))
However, the problem is that this writes the website's entire HTML document into my HTML file, while I only want the elements with a certain class. Logically (er, maybe not?), I then tried adding .encode to my posts variable like this:
saveFile.write(str(posts.encode("utf-8")))
But I keep running into this error, and I can't figure out why:
Traceback (most recent call last):
File "webscraper.py", line 21, in <module>
saveFile.write(str(posts.encode("utf-8")))
File "C:\Users\atit1\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'encode'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Does anyone have some suggestions on how to fix this error? By the way, all I need is the English text from the website, so if your solution will omit/damage the special Chinese characters, that's okay.
EDIT 1: Here is part of the HTML source code that I am trying to parse. There are about 50 of these list items, and a few include foreign names, which is why I get the encoding error when I try to write the parsed output.
<li>
<div itemscope="SOME_WORDS" itemid="SOME_URL" itemtype="SOME_URL">
<meta itemprop="url" content="SOME_URL"/>
<a class="THE_CLASS_I_WANT" href="THE_URL_I_WANT">
<span itemprop="SOME_WORDS">
THE TEXT I WANT
</span>
</a>
</div>
</li>
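Since only the English text is needed from markup like this, one possible approach is to drop every non-ASCII character from each extracted string. The following is a minimal sketch using only the standard library. The helper name `ascii_only` and the sample strings are hypothetical and stand in for whatever text is pulled out of each matching tag.

```python
def ascii_only(text: str) -> str:
    """Drop every non-ASCII character, then collapse runs of whitespace."""
    stripped = text.encode("ascii", errors="ignore").decode("ascii")
    return " ".join(stripped.split())


# Hypothetical sample values standing in for the text of each matched tag.
samples = ["  THE TEXT I WANT  ", "Li Wei 李伟", "José"]
cleaned = [ascii_only(s) for s in samples]
print(cleaned)
```

Note that this also removes accented Latin letters (the "é" in the last sample is dropped), so it only suits cases where losing such characters is acceptable.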
find_all to find)? Also, why are you attempting to encode the output anyway? It should already be properly encoded, so that's unnecessary.
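One reading that fits both tracebacks: the strings themselves are fine, but open() without an encoding argument falls back to the platform default (cp1252 in the first traceback), which has no mapping for the Chinese characters. A ResultSet is a list, so it has no .encode method, but each item can be converted with str() and joined. The sketch below assumes that passing encoding="utf-8" to open() is acceptable. It uses a plain list of made-up markup strings in place of the real find_all() result, and writes to a temporary directory rather than index.html.

```python
import os
import tempfile

# Stand-in for [str(tag) for tag in posts]; the markup is invented for illustration.
posts_as_text = [
    '<a class="THE_CLASS_I_WANT" href="#1"><span>English name</span></a>',
    '<a class="THE_CLASS_I_WANT" href="#2"><span>中文名字</span></a>',
]

# Join the individual items instead of calling str() or .encode() on the whole list.
markup = "\n".join(posts_as_text)

# Check whether cp1252 (the codec named in the traceback) can represent the text.
try:
    markup.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Write with an explicit encoding so the platform default is never used.
out_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(out_path, "w", encoding="utf-8") as save_file:
    save_file.write(markup)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as check:
    round_trip = check.read()

print(cp1252_ok, round_trip == markup)
```

With the real scraper, the join would presumably become "\n".join(str(post) for post in posts), and no call to .encode is needed before writing.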