UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte - Why? [closed]

Question

Closed. This question needs debugging details. It is not currently accepting answers.


I'm writing an ultra-simple web page scraper using Python and BeautifulSoup. One key piece of information is displayed as a PNG image, so I've had to reach for PIL and pytesseract.

The code is extremely simple, and it works when executed as my user. The image does load, as the print call shows, but image_to_string appears to raise the error.

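    import base64
    from io import BytesIO

    import pytesseract
    from PIL import Image

    encoded_img = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='

    # The enclosing def was not part of the posted snippet; the name is illustrative
    def extract_number(encoded_img):
        # Decode the base64 payload and open it as an image
        img_data = base64.b64decode(encoded_img)
        img_bytes = BytesIO(img_data)
        img = Image.open(img_bytes)
        print(img.format, img.size, img.mode)

        # Use pytesseract to extract the number: one text line, digits and separators only
        custom_config = r'--psm 7 -c tessedit_char_whitelist=0123456789.,'

        return pytesseract.image_to_string(img, config=custom_config).strip()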

However, when running it from a cron task (after sorting out the venv and dependencies), I get the seemingly impossible message from the title: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte

Trying to set the LANG or LC_* env variables did not help.

I'm using Python 3 on macOS Sonoma. Not sure if that matters.

Any ideas?

Why? Because that file is not UTF-8. This has nothing to do with the user's locale. The byte value 0x89 isn't a valid start byte in UTF-8: characters above the US-ASCII range (0x7F) are encoded as 2 or more bytes, and 0x89 can only appear as a continuation byte.
– Panagiotis Kanavos, Commented Nov 13 at 13:13
Nowhere, like in this almost certainly duplicate question? You started with no code at all and an incomplete exception, making it impossible to guess where the error came from. tesseract is a tool, not a Python library. PyTesseract reads the tool's output, and yes, an incorrect LC_* setting could result in decoding errors. If tesseract returns Latin-1 when Python expects UTF-8, you get a decoding error. While macOS itself is Unicode, the shell and the terminal may use non-Unicode encodings.
– Panagiotis Kanavos, Commented Nov 13 at 13:31
I don't get any error on Windows, with Python 3.14 and the latest versions of Tesseract and PyTesseract. That image though is empty. With actual screen captures I get results in both English and other languages. Only emojis are missed.
– Panagiotis Kanavos, Commented Nov 13 at 14:39
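
The two decoding failures described in these comments can be reproduced without Tesseract. The following is a minimal sketch using only the standard library; the helper name `try_utf8` and the sample byte strings are illustrative and are not taken from the asker's run. It is worth noting that 0x89 is also the first byte of the PNG file signature, which would be consistent with binary data being decoded as text, although nothing in the question confirms that this is what happened.

```python
def try_utf8(data: bytes) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error at position {exc.start}: {exc.reason}"


# Every PNG file starts with these eight bytes; 0x89 cannot start a UTF-8 sequence.
png_signature = b"\x89PNG\r\n\x1a\n"
print(try_utf8(png_signature))

# Text encoded as Latin-1 but decoded as UTF-8 (the mismatch the comment describes).
latin1_bytes = "café 9,5".encode("latin-1")
print(try_utf8(latin1_bytes))

# The same text encoded as UTF-8 round-trips cleanly.
print(try_utf8("café 9,5".encode("utf-8")))
```

Either case raises UnicodeDecodeError, so the exception alone does not say whether the bytes were binary data or text in another encoding; the failing byte value and its position are the useful clues.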

tishma · Accepted Answer · 2025-11-13 13:07:35Z

-3

After dumping my entire user environment and loading it within my script, I got the script to run successfully.

Eliminating the other variables one by one, I was down to TMPDIR, which defaulted to /tmp, and tesseract was apparently unable to write to it.

Ironically, when I pointed TMPDIR to a known directory, the script left it empty. I'm not sure whether it was cleaned up before quitting, but I'm pretty confused, and I suspect a bug in tesseract or somewhere else.

Finally, with TMPDIR set to a known, existing path (obviously not /tmp), I'm up and running.
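
For readers who would rather apply this workaround inside the script than in the cron environment, here is a minimal sketch. It assumes that pytesseract creates its temporary files through Python's tempfile module and that the tesseract child process inherits the environment; the function name `use_private_tmpdir` and the example directory are hypothetical.

```python
import os
import tempfile


def use_private_tmpdir(path):
    """Point TMPDIR at an existing, writable directory.

    Returns the directory that Python's tempfile module will use afterwards.
    """
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path   # inherited by child processes started later
    tempfile.tempdir = None       # drop the cached value so TMPDIR is read again
    return tempfile.gettempdir()


# Hypothetical usage, before the first OCR call:
# use_private_tmpdir(os.path.expanduser("~/ocr_tmp"))
```

Resetting `tempfile.tempdir` matters because the module caches the temporary directory the first time it is needed, so changing TMPDIR afterwards would otherwise have no effect within the same process.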


3 Comments

Panagiotis Kanavos Nov 13 at 13:14

Encoding errors have nothing to do with temporary file locations. Sounds like you were reading the wrong file. Or worse, trying to read a binary file as if it were text? "suspect a bug in tesseract or somewhere": no, it's a bug in the code, which was never posted.

tishma Nov 13 at 13:21

Thank you for the code reminder - just added. And no - code leaves no room for a bug.

Panagiotis Kanavos Nov 13 at 13:33

In that case you'll be able to reproduce this by saving the image to disk and using tesseract directly with the same command line. Except tesseract isn't a Python application. It's quite possible the wrong LC_* encoding makes tesseract return, e.g., Latin-1 when Python expects UTF-8.
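
The reproduction suggested in the last comment could look roughly like the sketch below: write the decoded image to disk, run the tesseract command-line tool on it with the same options, and keep stdout and stderr as raw bytes so that nothing is decoded implicitly. The helper names are hypothetical, and the sketch assumes a `tesseract` executable is on the PATH of whichever environment (interactive shell or cron) is being tested.

```python
import base64
import subprocess
import tempfile
from pathlib import Path

ENCODED_IMG = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='


def save_png(encoded, directory=None):
    """Decode the base64 payload and write it to sample.png in the given directory."""
    target_dir = Path(directory or tempfile.mkdtemp())
    path = target_dir / "sample.png"
    path.write_bytes(base64.b64decode(encoded))
    return path


def build_command(image_path):
    """Build the tesseract CLI call that mirrors the config string from the question."""
    return ["tesseract", str(image_path), "stdout",
            "--psm", "7", "-c", "tessedit_char_whitelist=0123456789.,"]


def run_raw(cmd):
    """Run the command and return (returncode, stdout, stderr) without decoding."""
    proc = subprocess.run(cmd, capture_output=True)  # no text=True, so output stays bytes
    return proc.returncode, proc.stdout, proc.stderr


# Hypothetical usage; requires tesseract to be installed:
# code, out, err = run_raw(build_command(save_png(ENCODED_IMG)))
# print(code, out[:300], err[:300])
```

Printing the raw bytes from both the interactive shell and the cron job shows whether tesseract itself emits non-UTF-8 output, and if so on which stream, before any Python decoding is involved.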
