I'm writing a very simple web page scraper using Python and BeautifulSoup. Because a key piece of information is displayed as a PNG image, I had to reach for PIL and pytesseract.
The code is extremely simple and works when executed as my user. The image loads, as the print call shows, but image_to_string appears to raise the error.
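import base64
from io import BytesIO

import pytesseract
from PIL import Image

encoded_img = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='

def extract_number(encoded_img):  # function name is a placeholder; the original showed only the body
    # Decode and open as image
    img_data = base64.b64decode(encoded_img)
    img_bytes = BytesIO(img_data)
    img = Image.open(img_bytes)
    print(img.format, img.size, img.mode)
    # Use pytesseract to extract number (--psm 7 treats the image as a single text line)
    custom_config = r'--psm 7 -c tessedit_char_whitelist=0123456789.,'
    return pytesseract.image_to_string(img, config=custom_config).strip()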
However, when running from a cron job (after sorting out the venv and dependencies), I get the seemingly impossible message from the title:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte
Setting the LANG or LC_* environment variables did not help.
I'm using Python 3 on macOS Sonoma; not sure if that matters.
Any ideas?
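Since the code works interactively and fails only under cron, one way to narrow this down is to compare what the script sees in each case. The following is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper name, and the set of variables it inspects is an assumption about what is likely to differ, not a confirmed diagnosis.

```python
import locale
import os
import shutil
import sys


def describe_environment() -> dict:
    """Collect settings that commonly differ between a login shell and cron.

    Hypothetical helper: the keys below are an assumption about what is
    worth comparing, not a list taken from the original post.
    """
    return {
        # Encoding Python uses by default for text I/O
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Locale variables, which cron typically leaves unset
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # cron usually runs with a much shorter PATH than a login shell
        "PATH": os.environ.get("PATH"),
        # Which tesseract binary (if any) would be found on that PATH
        "tesseract_path": shutil.which("tesseract"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once from a terminal and once from the cron entry, then diffing the two outputs, should show whether the locale, the PATH, or the resolved tesseract binary differs between the two runs.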
Why? Because that file is not UTF-8. This has nothing to do with the user's locale. The byte value 0x89 isn't a valid start byte in UTF-8. Characters above the US-ASCII range (0x7F) use 2 or more bytes.
Nowhere? Like this almost certainly duplicate question? You started with no code at all and an incomplete exception, making it impossible to guess where the error came from.
tesseract is a tool, not a Python library. pytesseract reads the tool's output, and yes, an incorrect LC_* setting could result in decoding errors. If tesseract returns Latin-1 when Python expects UTF-8, you get a decoding error. While macOS itself is Unicode, the shell and the terminal may use non-Unicode encodings.
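The decoding behaviour described in these comments can be reproduced in a few lines. This is only an illustrative sketch: `try_decode` is a hypothetical helper, and the PNG signature is used as sample input because its first byte happens to be 0x89, which may or may not be where the byte in the traceback comes from.

```python
# The 8-byte PNG file signature; its first byte is 0x89.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        bad_byte = data[exc.start]
        return f"error: {exc.reason} (byte {bad_byte:#04x} at position {exc.start})"


# Binary data decoded as UTF-8: 0x89 cannot start a UTF-8 sequence.
print(try_decode(PNG_SIGNATURE, "utf-8"))

# Latin-1 output decoded as UTF-8: a lone 0xE9 is not valid UTF-8 either.
print(try_decode("é".encode("latin-1"), "utf-8"))

# The same character encoded as UTF-8 takes two bytes and decodes cleanly.
print(try_decode("é".encode("utf-8"), "utf-8"))
print(len("é".encode("utf-8")))
```

Both failure cases raise the same exception type, so the exception alone does not say whether the input was binary data or text in a different encoding. Inspecting the bytes around the reported position is usually the quickest way to tell.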