How to fix encoding error when Python writes to HTML

Question

I created a web parser using Python 3.7 and BeautifulSoup, then used find_all to find all tags with a certain class. Importantly, the website I am scraping contains some Chinese characters. Here's my code:

import requests
from bs4 import BeautifulSoup

response = requests.get('URL_GOES_HERE')
soup = BeautifulSoup(response.content, 'html.parser')
posts = soup.find_all(class_='CLASS_GOES_HERE')

print(posts)

saveFile = open('index.html','w+')
saveFile.write(str(posts))
saveFile.close()

I tried outputting the data in two different ways: by printing it to the console, and by writing it to an HTML document. I did each separately, commenting out the print call when writing to the HTML file, and vice versa.

When I run only the print call, it outputs the data to the console just fine, without any errors. However, when I run the code that writes to the HTML file, I get the following encoding error:

Traceback (most recent call last):
  File "postthis.py", line 11, in <module>
    saveFile.write(str(posts))
  File "C:\Users\atit1\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4308-4324: character maps to <undefined>

I've spent the past 2 days trying to fix this using guidance from many similar questions on Stack Overflow. Many answers suggested adding .encode("utf-8"), so I tried that. For example, when I try .write(str(soup)), I get the encoding error. But when I write this, it works perfectly:

saveFile.write(str(soup.encode("utf-8")))

However, the problem is that this writes the website's whole HTML document to my file, while I only want the elements with certain classes. Logically (er, maybe not?), I then tried to add .encode to my posts variable like this:

saveFile.write(str(posts.encode("utf-8")))

But I keep running into this error, and I can't figure out why:

Traceback (most recent call last):
  File "webscraper.py", line 21, in <module>
    saveFile.write(str(posts.encode("utf-8")))
  File "C:\Users\atit1\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'encode'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Does anyone have some suggestions on how to fix this error? By the way, all I need is the English text from the website, so if your solution will omit/damage the special Chinese characters, that's okay.

EDIT 1: Here is part of the HTML source code that I am trying to parse. There are about 50 of these list items, and a few include foreign names, so I get that encoding error when I try to parse this.

<li>
    <div itemscope="SOME_WORDS" itemid="SOME_URL" itemtype="SOME_URL">

        <meta itemprop="url" content="SOME_URL"/>
        <a class="THE_CLASS_I_WANT"  href="THE_URL_I_WANT">

            <span itemprop="SOME_WORDS">
                THE TEXT I WANT 
            </span>
        </a> 

    </div>      
</li>

add the binary option, open('index.html','wb+') any difference?
– Stan S., Commented Jun 27, 2019 at 2:23
@StanS. Thanks for the reply. I receive the same error when doing that. However, when I remove the str part and run it, I get this error: " TypeError: a bytes-like object is required, not 'ResultSet' ".
– George Orwell, Commented Jun 27, 2019 at 2:28
Without seeing the content you are parsing it's difficult to help you. Did you try the suggestions mentioned from the error output (eg. changing find_all to find)? Also why are you attempting to encode the output anyway? It should already be properly encoded, so that's unnecessary.
– l'L'l, Commented Jun 27, 2019 at 2:30
@l'L'l Hi, I edited my post to add the content I'm parsing (except I removed the URL and words because it is a part of my friend's business). Changing the find_all to find was not very helpful, and I just got more errors when doing so. Also, I am trying to encode the output because when I don't, I receive that encoding error, and some answers on Stackoverflow suggested that we add the .encode to get rid of it, but it's not working for some reason.
– George Orwell, Commented Jun 27, 2019 at 2:45

Grismar · Accepted Answer · 2019-06-27 03:42:27Z

3

You are trying to call the .encode() method of posts, which doesn't have one. posts was returned by find_all(), so it is a ResultSet (a list of elements) rather than a single element. Assuming you actually want to find all of them, you have to encode each found element separately.
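A minimal sketch of that point. To keep it runnable without a network request or bs4, a plain list of strings stands in for the ResultSet (which is itself a list subclass); the sample markup is made up for illustration.

```python
# Stand-in for the ResultSet returned by find_all(): a list of markup strings.
# The sample content is hypothetical.
posts = ['<a class="post">Hello</a>', '<a class="post">你好</a>']

# A list has no .encode() method, which is what the AttributeError is about.
print(hasattr(posts, 'encode'))  # False

# Each element can be encoded on its own.
encoded = [str(post).encode('utf-8') for post in posts]
print(all(isinstance(item, bytes) for item in encoded))  # True
```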

Also, instead of writing the entire list of posts to a file as a list, you'd probably want to create a valid HTML document, which is another problem altogether.

To do what you're doing (even though I think it won't be what you end up wanting):

saveFile = open('index.html','w+')
saveFile.write(str([post.encode('utf-8') for post in posts]))
saveFile.close()

Or, probably a bit better, but perhaps still not quite the result you might need:

saveFile = open('index.html','wb+')
for post in posts:
    saveFile.write(post.encode('utf-8'))
saveFile.close()

Note the important differences: instead of just writing a string conversion of the entire list, each element is encoded separately and the resulting bytes are written to a file that has been opened in binary (not text) mode with 'wb+'.
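An alternative the answer does not spell out, offered here as a sketch: keep the file in text mode but pass an explicit encoding to open(), so Python does not fall back to the platform default (cp1252 in the traceback above). The list of strings and the temporary path are stand-ins for the scraped elements and index.html.

```python
import os
import tempfile

# Stand-in for the scraped elements; the sample markup is hypothetical.
posts = ['<a class="post">Hello</a>', '<a class="post">你好</a>']

# Temporary location used only so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'index.html')

# encoding='utf-8' makes text mode encode to UTF-8 instead of the
# platform default, so no manual .encode() call is needed.
with open(path, 'w', encoding='utf-8') as save_file:
    for post in posts:
        save_file.write(str(post))

# Read the file back with the same encoding to check the round trip.
with open(path, encoding='utf-8') as check_file:
    content = check_file.read()
```

Since the question says losing the Chinese characters is acceptable, opening the file with encoding='ascii' and errors='ignore' would drop the unencodable characters instead of raising.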

edited Jun 27, 2019 at 3:42

answered Jun 27, 2019 at 2:52

2 Comments

George Orwell Over a year ago

Hey thanks so much friend, the first line did not work, but the second one did! I appreciate it very much, I spent way too much time on this!! :)
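One likely reason the first snippet disappoints (an assumption, since the comment does not say what went wrong): calling str() on a bytes object returns its printable representation, escape sequences included, rather than the decoded text. A small sketch with a made-up sample string:

```python
# Sample text is hypothetical.
encoded = '你好'.encode('utf-8')

# str() on bytes gives the repr, not the original characters.
as_text = str(encoded)
print(as_text)  # b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decoding is what turns the bytes back into text.
decoded = encoded.decode('utf-8')
print(decoded == '你好')  # True
```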

Grismar Over a year ago

Glad it works for you. You'll probably want to do a bit more work to make index.html an actual valid HTML document, but it's quite possible you don't need it to be more than just a combination of the selected elements.
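One way that extra work might look, as a sketch only: wrap the selected fragments in a minimal HTML5 skeleton that declares its charset. The helper name build_page, the default title, and the sample fragments are all hypothetical.

```python
def build_page(fragments, title='Posts'):
    """Wrap HTML fragments in a minimal HTML5 document (hypothetical helper)."""
    body = '\n'.join(str(fragment) for fragment in fragments)
    return (
        '<!DOCTYPE html>\n'
        '<html>\n'
        '<head>\n'
        '<meta charset="utf-8">\n'
        f'<title>{title}</title>\n'
        '</head>\n'
        '<body>\n'
        f'{body}\n'
        '</body>\n'
        '</html>\n'
    )


# Sample fragments stand in for the elements returned by find_all().
page = build_page(['<a class="post">Hello</a>', '<a class="post">你好</a>'])

# Encode once at the end; these bytes can be written to a file opened in 'wb' mode.
page_bytes = page.encode('utf-8')
```

The meta charset line matters here: it tells the browser to read the file as UTF-8, matching the encoding used when writing it.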
