I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))

I am not really sure why this doesn't work. I thought that with line.encode('utf-8') the whole string would get encoded. I also tried to encode my m.group() results to UTF-8, but I got the same UnicodeDecodeError.

[unicodedecodeerror: 'ascii' codec can't decode byte in position ordinal not in range(128)]

Sample input:

Start: myUsername: myÜsername:

What am I missing?

EDIT:

Traceback (most recent call last):
  File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
    encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
  • Could you please post the example input and the stack trace of the error you mention? (Generally, your question does not seem to be an MCVE.) Commented Oct 26, 2018 at 12:33
  • You are right, sorry. I have added some more information. Commented Oct 26, 2018 at 12:37
  • Is this Python 2 or Python 3 code? I strongly suspect your problem is that you're running on Python 2, and trying to encode a str (which is a largely nonsensical thing to do). A full traceback and a minimal reproducible example would be helpful. Lastly, to be sure, split up the line so you only encode once per line, e.g. encodedline = line.encode('utf-8'), then replace line.encode('utf-8') in the re.sub with encodedline so you aren't able to confuse which encode is the problem. Commented Oct 26, 2018 at 13:00
  • I am running Python 2.7. Is there a way to solve this problem, or should I go with the "hack"? Commented Oct 26, 2018 at 13:02
  • @peacemaker: The hack is a bad idea (setdefaultencoding is deleted from sys after calling it for a reason; changing the default mid-run risks all sorts of problems from various libraries that may have cached the encoding, or the results of encoding things in it, and suddenly find that things aren't behaving the way they did at startup). I strongly suspect your code will work by deleting all calls to encode in that line; you already had UTF-8 encoded data, so trying to encode it again was the source of your problems. See my answer. Commented Oct 26, 2018 at 13:29

3 Answers

Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.

You have two problems; one you're hitting now, and one you'll hit if you fix your current code.

Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode object, so encoding it implicitly decodes it with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes it with the specified codec (or the default if none is specified). Basically, line was already UTF-8 encoded and you told Python to encode it again as UTF-8. That's nonsensical, so Python tried to decode it as ASCII first, and failed before it even tried to encode as you instructed.
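
To make the implicit step visible, here is a minimal sketch that performs the ASCII decode explicitly instead of letting encode do it behind the scenes. The helper name implicit_decode_error is made up for illustration, and the sample bytes are an assumption based on the sample input in the question.

```python
def implicit_decode_error(data, default_encoding='ascii'):
    """Mimic the decode that Python 2 runs before str.encode().

    Returns the UnicodeDecodeError that would be raised, or None if the
    bytes decode cleanly. (Hypothetical helper, for illustration only.)
    """
    try:
        data.decode(default_encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


# UTF-8 bytes for the sample value; U+00DC becomes the two bytes 0xc3 0x9c.
utf8_bytes = u'my\xdcsername'.encode('utf-8')

# The ASCII decode fails at the first non-ASCII byte (0xc3), which is the
# same byte reported in the traceback from the question.
err = implicit_decode_error(utf8_bytes)

# Decoding the same bytes as UTF-8 succeeds, so the data itself is fine.
ok = implicit_decode_error(utf8_bytes, 'utf-8')
```

The failure comes from the decode, not the encode, which is why the exception is a UnicodeDecodeError even though the code only ever calls encode.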

The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.

Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.

The reason:

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode are mostly pointless: all they do is return the original str after confirming it's legal UTF-8 by decoding it as such, so the pair otherwise acts as an expensive no-op.

To fix the second problem, just change:

m.group(4).encode()

to:

m.group(4)

That leaves your final code as:

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
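
If the same substitution ever needs to run unchanged on both Python 2 and Python 3, one possible sketch is to keep everything as bytes. The names PATTERN and hash_last_field are hypothetical, and the extra encode of the digest is only there because hexdigest returns text on Python 3; on Python 2 that branch is never taken.

```python
import hashlib
import re

# Bytes pattern, so it matches raw UTF-8 encoded input without any decoding.
PATTERN = re.compile(br'(Start:\s*)([^:]+)(:\s*)([^:]+)')


def hash_last_field(line):
    """Replace the fourth group of a UTF-8 encoded byte line with its SHA-512.

    `line` must be bytes (a plain str on Python 2). Hypothetical helper.
    """
    def repl(m):
        # The group is already bytes, so it can be hashed directly.
        digest = hashlib.sha512(m.group(4)).hexdigest()
        if not isinstance(digest, bytes):
            # Python 3 only: hexdigest() returns text, and the replacement
            # for a bytes pattern must be bytes.
            digest = digest.encode('ascii')
        return m.group(1) + m.group(2) + m.group(3) + digest

    return PATTERN.sub(repl, line)


# Sample input from the question, as UTF-8 encoded bytes.
sample = u'Start: myUsername: my\xdcsername:'.encode('utf-8')
result = hash_last_field(sample)
```

Note that no call to encode touches the input line or the matched group; only the ASCII hex digest is converted, and only where the interpreter requires it.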

Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:

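import sys  # needed for sys.exit; normally placed at the top of the file

try:
    # Decoding is used purely as a validity check; the result is discarded.
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))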

which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
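
If exiting the program is too drastic, the same check can be wrapped in a small predicate. This is only a sketch: the name is_valid_utf8 is made up here, and it assumes the argument is a byte string (a plain str on Python 2).

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 encoded bytes pass the check.
good = is_valid_utf8(u'my\xdcsername'.encode('utf-8'))

# The same text encoded as Latin-1 is not legal UTF-8: 0xdc announces a
# two-byte sequence, but the next byte is not a continuation byte.
bad = is_valid_utf8(u'my\xdcsername'.encode('latin-1'))
```

A caller could then skip or log offending lines instead of stopping at the first one.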


I found what is, in my eyes, a workaround. It doesn't feel right, but it does the job.

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

I thought it could be done with .encode('utf-8').

1 Comment

This is only a hack, not a real solution. But we can't help without knowing more about your string, etc.

import hashlib

file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()

This is because a unicode object must be encoded to a byte string before it can be hashed.
