Python 3 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d

Question

I want to build a search engine and am following a tutorial I found online. I want to test parsing HTML:

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
        return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")

and I get this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>

I saw some solutions on the web that use encode(), but I don't know how to add encode() to this code. Can anyone help me?

What is the full traceback of the exception?

– Martijn Pieters, Commented Jun 10, 2015 at 8:34

Martijn Pieters · Accepted Answer · 2019-01-07 18:55:45Z

117

In Python 3, files are opened as text (decoded to Unicode) for you; you don't need to tell BeautifulSoup what codec to decode from.

If decoding of the data fails, that's because you didn't tell the open() call what codec to use when reading the file; add the correct codec with an encoding argument:

with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")

otherwise the file will be opened with your system default codec, which is OS dependent.
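As a minimal sketch of why the codec matters, the snippet below writes a small UTF-8 file and reads it back twice. The sample text and the temporary file are made up for illustration, and cp1252 is passed explicitly only to stand in for a typical Windows default so the failure reproduces on any platform. A curly closing quote (U+201D) is stored in UTF-8 as the bytes E2 80 9D, and 0x9D has no mapping in cp1252, which is one plausible way to end up with the error in the question.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (platform dependent).
default_codec = locale.getpreferredencoding(False)
print("default codec:", default_codec)

# Hypothetical sample text: U+201D encodes to the UTF-8 bytes E2 80 9D.
sample = "<pre>\u201cquoted\u201d</pre>"

# Write the sample as raw UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Reading with the wrong codec: byte 0x9D is undefined in cp1252.
    try:
        with open(path, encoding="cp1252") as infile:
            infile.read()
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        print(exc)

    # Reading with the codec the file was actually written in.
    with open(path, encoding="utf-8") as infile:
        text = infile.read()
finally:
    os.remove(path)
```

Passing the encoding explicitly makes the result independent of the machine the script runs on, which is the point of the fix above.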

edited Jan 7, 2019 at 18:55

answered Jun 10, 2015 at 8:36


5 Comments

Altair7852 Over a year ago

You can also add errors='ignore' to open(), in case you are not sure the file is 'utf-8' and you want to skip non-utf8 bytes, avoiding "UnicodeDecodeError: 'utf-8' codec can't decode byte" errors. From here: stackoverflow.com/questions/56453782/…

Martijn Pieters Over a year ago

@Altair7852 that’s... a risky option that only works if your input is some other ASCII superset codec.

Martijn Pieters Over a year ago

@Altair7852 the post you link to is specifically about reading a PDF file, which isn’t even a text file but a binary format. Opening it as text is the wrong thing to do.

Altair7852 Over a year ago

Martijn Pieters you are correct, the linked post is not very relevant here, except for the flag, and yes - only use it when you know what you are doing. In my defence, I ran into utf8 issues when reading html files, hence the comment.
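To illustrate the risk discussed in these comments, here is a small sketch using bytes.decode(), which accepts the same error handler names as the errors argument of open(). The sample bytes are an assumption chosen for illustration: the word "café" encoded as Latin-1, then decoded as if it were UTF-8.

```python
# Hypothetical input: "café" encoded as Latin-1, so the last byte is 0xE9.
raw = "caf\u00e9".encode("latin-1")

# Strict decoding (the default) refuses the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the byte: the result is "caf".
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where data was lost.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the codec the data was really written in loses nothing.
correct = raw.decode("latin-1")
```

With errors="ignore" the text comes back shorter and nothing signals that a character went missing, which is why finding the correct codec is usually the safer route.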

Mantu Over a year ago

I was getting the same error because my configuration file contained a few Chinese characters. I added the 'utf-8' encoding inside the config read function. Here is the code: config.read('../conf/PM_AutomaticTariffUpload_Converter.conf', encoding='utf8')
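The same idea can be sketched with the standard library's configparser, whose read() method accepts an encoding argument. The section name, key, and temporary file below are hypothetical stand-ins for the commenter's configuration file.

```python
import configparser
import os
import tempfile

# Hypothetical config content with non-ASCII (Chinese) characters.
content = "[tariff]\nname = \u8d44\u8d39\n"

# Store it as UTF-8, as an editor typically would.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".conf", delete=False
) as tmp:
    tmp.write(content)
    path = tmp.name

try:
    config = configparser.ConfigParser()
    # Without encoding=..., the platform default codec would be used.
    read_ok = config.read(path, encoding="utf-8")
    name = config["tariff"]["name"]
finally:
    os.remove(path)
```

read() returns the list of files it managed to parse, so checking that list is a cheap way to confirm the file was actually found and read.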
