I'm writing a very simple web page scraper with Python and Beautiful Soup. A key piece of information is displayed as a PNG image, so I had to reach for PIL and pytesseract.

The code is extremely simple and works when executed as my user. The image loads correctly, as the print call shows, but image_to_string appears to raise the error.

    import base64
    from io import BytesIO

    import pytesseract
    from PIL import Image

    encoded_img = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='

    def extract_number(encoded_img):
        # Decode the base64 payload and open it as an image
        img_data = base64.b64decode(encoded_img)
        img = Image.open(BytesIO(img_data))
        print(img.format, img.size, img.mode)

        # Treat the image as a single text line (--psm 7) and allow only digits and separators
        custom_config = r'--psm 7 -c tessedit_char_whitelist=0123456789.,'
        return pytesseract.image_to_string(img, config=custom_config).strip()

However, when running it from a cron task (after sorting out the venv and dependencies), I get the seemingly impossible message from the title: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte

Setting the LANG or LC_* environment variables did not help.

I'm using Python 3 on macOS Sonoma; not sure if that matters.

Any ideas?

  • Because that file is not UTF-8. This has nothing to do with the user's locale. The byte 0x89 is not a valid start byte in UTF-8. Characters above the US-ASCII range (0x7F) use two or more bytes. Commented Nov 13 at 13:13
  • Nowhere? What about this almost certainly duplicate question? You started with no code at all and an incomplete exception, making it impossible to guess where the error came from. tesseract is a tool, not a Python library. PyTesseract reads the tool's output, and yes, an incorrect LC_ setting could result in decoding errors. If tesseract returns Latin-1 when Python expects UTF-8, you get a decoding error. While macOS itself is Unicode, the shell and the terminal may use non-Unicode encodings. Commented Nov 13 at 13:31
  • I don't get any error on Windows, with Python 3.14 and the latest versions of Tesseract and PyTesseract. That image is empty, though. With actual screen captures I get results in both English and other languages. Only emojis are missed. Commented Nov 13 at 14:39
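
To illustrate the point made in the first comment, here is a minimal sketch using only the Python standard library. It shows that the 8-byte PNG signature begins with 0x89 and that decoding those bytes as UTF-8 fails with the same kind of message as in the question. The helper name try_decode is hypothetical, and the sketch does not establish where in pytesseract the failing decode actually happens.

```python
# Every PNG file starts with this 8-byte signature; its first byte is 0x89.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def try_decode(data, encoding="utf-8"):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# 0x89 is 0b10001001, a continuation byte, so it cannot start a UTF-8 character.
message = try_decode(PNG_SIGNATURE)
print(message)

# Latin-1 maps every byte to a character, so the same bytes decode without error.
print(repr(try_decode(PNG_SIGNATURE, "latin-1")))
```

If the traceback points at a decode of Tesseract's output or error stream, this suggests that binary data, possibly image bytes, ended up somewhere text was expected.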

1 Answer

After dumping my entire user environment and loading it within my script, I got the script to run successfully.

By eliminating every other variable, I narrowed it down to TMPDIR, which defaulted to /tmp, and tesseract was apparently unable to write there.
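
This kind of elimination can be sketched as follows. It is only an illustration, not the exact procedure used here: the helper names dump_env and diff_env and the JSON file format are assumptions. The idea is to dump the environment once from an interactive shell and once from the cron job, then compare the two.

```python
import json
import os


def dump_env(path):
    """Write the current process environment to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(dict(os.environ), fh, indent=2, sort_keys=True)


def diff_env(interactive, cron):
    """Return {name: (interactive_value, cron_value)} for variables that differ."""
    names = sorted(set(interactive) | set(cron))
    return {
        name: (interactive.get(name), cron.get(name))
        for name in names
        if interactive.get(name) != cron.get(name)
    }


# Example with made-up values: TMPDIR is set interactively but missing under cron.
example = diff_env(
    {"LANG": "en_US.UTF-8", "TMPDIR": "/some/user/tmp"},
    {"LANG": "en_US.UTF-8"},
)
print(example)
```

Calling dump_env once from the terminal and once from the cron job, then loading both files with json.load and passing them to diff_env, lists the candidate variables to test one at a time.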

Ironically, when I pointed it to a known directory, the script left it empty. I'm not sure whether it was cleaned up before quitting, but I'm pretty confused and suspect a bug in tesseract or somewhere else.

Finally, after setting TMPDIR to a known, existing path (not /tmp, obviously), I'm up and running.
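
A minimal sketch of that workaround, done from inside the script rather than in the crontab. The helper name use_private_tmpdir and the example path are hypothetical. It assumes pytesseract creates its scratch files through Python's tempfile module and removes them after each call, which would also explain why the directory looks empty afterwards.

```python
import os
import tempfile
from pathlib import Path


def use_private_tmpdir(path):
    """Create `path` if needed and make it the temp dir for this process and its children."""
    tmp = Path(path).expanduser()
    tmp.mkdir(parents=True, exist_ok=True)

    # Child processes such as the tesseract binary inherit this variable.
    os.environ["TMPDIR"] = str(tmp)

    # tempfile caches its choice; resetting it makes the next call re-read TMPDIR.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Calling use_private_tmpdir("~/scraper_tmp") (a made-up path) before the first pytesseract call should have the same effect as exporting TMPDIR in the cron entry.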

3 Comments

Encoding errors have nothing to do with temporary file locations. Sounds like you were reading the wrong file. Or worse, trying to read a binary file as if it were text? "suspect a bug in tesseract or somewhere": no, it's a bug in the code, which was never posted.
Thank you for the code reminder; I've just added it. And no, the code leaves no room for a bug.
In that case you'll be able to reproduce this by saving the image to disk and using tesseract directly with the same command line. Except tesseract isn't a Python application. It's quite possible the wrong LC encoding makes tesseract return, e.g., Latin-1 when Python expects UTF-8.
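
The reproduction suggested in the last comment could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the image is assumed to be already saved to disk, and the options are copied from the question. The output is kept as raw bytes so that any non-UTF-8 content is visible instead of raising.

```python
import shutil
import subprocess


def build_tesseract_command(image_path):
    """Mirror the options from the question; 'stdout' sends the text to standard output."""
    return [
        "tesseract", str(image_path), "stdout",
        "--psm", "7",
        "-c", "tessedit_char_whitelist=0123456789.,",
    ]


def run_tesseract_raw(image_path):
    """Run tesseract and return (returncode, stdout_bytes, stderr_bytes), undecoded."""
    if shutil.which("tesseract") is None:
        return None  # tesseract is not on PATH (cron often has a minimal PATH)
    proc = subprocess.run(build_tesseract_command(image_path), capture_output=True)
    return proc.returncode, proc.stdout, proc.stderr
```

Running print(run_tesseract_raw("captured.png")) (a made-up file name) both interactively and from cron shows whether tesseract itself emits bytes that are not valid UTF-8, and on which stream.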
