I wrote a web scraper using Python 3.7 and BeautifulSoup, and used find_all to collect every tag with a certain class. One important detail: the website I am scraping contains some Chinese characters. Here's my code:

import requests
from bs4 import BeautifulSoup

response = requests.get('URL_GOES_HERE')
soup = BeautifulSoup(response.content, 'html.parser')
posts = soup.find_all(class_='CLASS_GOES_HERE')

print(posts)

saveFile = open('index.html','w+')
saveFile.write(str(posts))
saveFile.close()

I tried outputting the data in two different ways: printing it to the console, and writing it to an HTML file. I did each separately, commenting out the print call when writing to the file, and vice versa.

When I run only the print call, the data is printed to the console without any errors. However, when I run the code that writes to the HTML file, I get the following encoding error:

Traceback (most recent call last):
  File "postthis.py", line 11, in <module>
    saveFile.write(str(posts))
  File "C:\Users\atit1\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4308-4324: character maps to <undefined>

I've spent the past 2 days trying to fix this using guidance from many similar questions on Stack Overflow. Many answers suggested adding ".encode("utf-8")", so I tried that. For example, when I call .write(str(soup)), I get the encoding error. But when I write this, it works perfectly:

saveFile.write(str(soup.encode("utf-8")))

However, the problem is that this writes the website's whole HTML document to my HTML file, while I only want the elements with a certain class. Logically (er, maybe not?), I then tried to add .encode to my posts variable like this:

saveFile.write(str(posts.encode("utf-8")))

But I keep running into this error, and I can't figure out why:

Traceback (most recent call last):
  File "webscraper.py", line 21, in <module>
    saveFile.write(str(posts.encode("utf-8")))
  File "C:\Users\atit1\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'encode'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Does anyone have some suggestions on how to fix this error? By the way, all I need is the English text from the website, so if your solution will omit/damage the special Chinese characters, that's okay.

EDIT 1: Here is part of the HTML source that I am trying to parse. There are about 50 of these list items, and a few include foreign names, which is why I get the encoding error when I try to write them out.

<li>
    <div itemscope="SOME_WORDS" itemid="SOME_URL" itemtype="SOME_URL">\

        <meta itemprop="url" content="SOME_URL"/>
        <a class="THE_CLASS_I_WANT"  href="THE_URL_I_WANT">

            <span itemprop="SOME_WORDS">
                THE TEXT I WANT 
            </span>
        </a> 

    </div>      
</li>   
  • Does adding the binary option, open('index.html','wb+'), make any difference? Commented Jun 27, 2019 at 2:23
  • @StanS. Thanks for the reply. I receive the same error when doing that. However, when I remove the str part and run it, I get this error: " TypeError: a bytes-like object is required, not 'ResultSet' ". Commented Jun 27, 2019 at 2:28
  • Without seeing the content you are parsing it's difficult to help you. Did you try the suggestions mentioned from the error output (eg. changing find_all to find)? Also why are you attempting to encode the output anyway? It should already be properly encoded, so that's unnecessary. Commented Jun 27, 2019 at 2:30
  • @l'L'l Hi, I edited my post to add the content I'm parsing (except I removed the URL and words because it is a part of my friend's business). Changing find_all to find was not very helpful; I just got more errors when doing so. Also, I am trying to encode the output because when I don't, I receive that encoding error, and some answers on Stack Overflow suggested adding .encode to get rid of it, but it's not working for some reason. Commented Jun 27, 2019 at 2:45

1 Answer

You are calling .encode() on posts, but posts is the ResultSet returned by find_all(), which behaves like a list and has no such method. Assuming you really do want to find all of the matches, you have to encode each found element separately.
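
To make the difference concrete, here is a minimal sketch, assuming BeautifulSoup 4 is installed. The markup and the class name "item" are made up for illustration and stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page.
html = (
    '<ul>'
    '<li><a class="item" href="/a">Alice</a></li>'
    '<li><a class="item" href="/b">王小明</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")
posts = soup.find_all(class_="item")

# The ResultSet behaves like a list and has no encode() of its own.
try:
    posts.encode("utf-8")
    result_set_encodes = True
except AttributeError:
    result_set_encodes = False

# Each element is a Tag, and Tag.encode() returns its markup as bytes.
encoded = [post.encode("utf-8") for post in posts]
```

Here result_set_encodes ends up False, while encoded holds one bytes object per matched element.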

Also, instead of writing the entire list of posts to a file as a list, you'd probably want to create a valid html document, which is another problem altogether.
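
If a valid page is what you are after, one possible approach is sketched below. It assumes you already have each element as a string, for example from str(post). The build_document helper, the page title, and the sample fragments are all hypothetical, and the file goes to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile


def build_document(fragments, title="Scraped posts"):
    """Wrap HTML fragments in a minimal HTML5 page that declares UTF-8."""
    items = "\n".join(f"    <li>{fragment}</li>" for fragment in fragments)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{title}</title>\n"
        "</head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


# Stand-ins for str(post) of each element returned by find_all().
fragments = [
    '<a class="item" href="/a">Alice</a>',
    '<a class="item" href="/b">王小明</a>',
]
page = build_document(fragments)

# Write the page as UTF-8 so it matches the declared charset.
output_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(output_path, "w", encoding="utf-8") as save_file:
    save_file.write(page)
```

The meta charset line tells the browser to read the file as UTF-8, which has to match the encoding passed to open().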

To do what you're doing (even though I think it won't be what you end up wanting):

saveFile = open('index.html','w+')
saveFile.write(str([post.encode('utf-8') for post in posts]))
saveFile.close()

Or, probably a bit better, but perhaps still not quite the result you might need:

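# Open in binary mode so the UTF-8 bytes are written as-is.
with open('index.html', 'wb') as saveFile:
    for post in posts:
        # Each element is encoded on its own; the ResultSet has no encode().
        saveFile.write(post.encode('utf-8'))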

Note the important differences: instead of just writing a string conversion of the entire list, each element is encoded separately and the resulting bytes are written to a file that has been opened in binary (not text) mode with wb.
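
An alternative worth considering is to stay in text mode and pass an explicit encoding to open(). When no encoding is given, text mode falls back to the platform default, which is the cp1252 codec visible in the traceback. The sketch below assumes each element has already been converted with str(post); the sample strings and file names are made up, and the files go to a temporary directory to keep the example self-contained. The second file shows a lossy variant for the case where only the English text matters.

```python
import os
import tempfile

# Stand-ins for str(post) of each element returned by find_all().
posts_as_text = ['<a href="/a">Alice</a>', '<a href="/b">王小明 Bob</a>']

out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, "index.html")
ascii_path = os.path.join(out_dir, "index_ascii.html")

# Text mode with an explicit encoding: no manual .encode() calls needed.
with open(utf8_path, "w", encoding="utf-8") as save_file:
    for text in posts_as_text:
        save_file.write(text + "\n")

# Lossy alternative: silently drop every character outside ASCII.
with open(ascii_path, "w", encoding="ascii", errors="ignore") as save_file:
    for text in posts_as_text:
        save_file.write(text + "\n")

# Read both files back to inspect what was stored.
with open(utf8_path, encoding="utf-8") as f:
    utf8_content = f.read()
with open(ascii_path, encoding="ascii") as f:
    ascii_content = f.read()
```

The UTF-8 file keeps the Chinese characters intact, while the ASCII file keeps only the characters that fit in ASCII.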

2 Comments

Hey, thanks so much, friend. The first snippet did not work, but the second one did! I appreciate it very much, I spent way too much time on this!! :)
Glad it works for you. You'll probably want to do a bit more work to make index.html an actual valid html document, but it's quite possible you don't need it to be more than just a combination of the selected elements.