My string comparison doesn't work.

Any ideas?

a = person.category[0].lower()
b = to_delete[5].lower()

print("test ", repr(a), "type: ", type(a))
print("test ", repr(b), "type: ", type(b))
print(a == b)
print(a is b)
print("éclairage public" == b)
print("éclairage public" == a )

returns:

test  'éclairage public' type:  <class 'str'>
test  'éclairage public' type:  <class 'str'>
False
False
False
True

So "b" doesn't have the expected composition but I don't know why!

  • What are the outputs of a simple print(a) and print(b)? Commented Apr 30, 2019 at 15:21
  • Obviously I'd like all to be true! Commented Apr 30, 2019 at 15:22
  • We can't reproduce the issue if you don't provide the contents of person.category and to_delete. Commented Apr 30, 2019 at 15:30
  • You probably have Unicode normalization issues. Compare the values of a.encode() and b.encode(). Commented Apr 30, 2019 at 15:31
  • Compare the results of b'e\xcc\x81'.decode() and b'\xc3\xa9'.decode(). Both look like é, but they are two different Unicode strings. Commented Apr 30, 2019 at 15:35

1 Answer

Your problem is almost certainly that a and b are two different Unicode strings that render identically: they contain different code points but normalize to the same value. As a simple example, consider these two ways to represent é:

>>> b'e\xcc\x81'.decode()
'é'
>>> b'\xc3\xa9'.decode()
'é'

The first is a two-character string consisting of e (U+0065) followed by the combining diacritical mark U+0301 (combining acute accent). The second is a single-character string consisting of é (U+00E9).
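To see which form a given string actually uses, it can help to list its code points. The following is a minimal sketch; the helper name describe is made up for illustration, and the two sample strings simply mirror the byte sequences decoded above.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text.

    Hypothetical helper, not part of the original question.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


decomposed = "e\u0301"   # e followed by the combining acute accent
precomposed = "\u00e9"   # single precomposed character

print(describe(decomposed))
print(describe(precomposed))
print(len(decomposed), len(precomposed))  # 2 1
```

Running describe(a) and describe(b) on the strings from the question should show where they differ, even though repr() prints them identically.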

To compare them successfully, you need to normalize them. There are four normalization forms (NFC, NFD, NFKC and NFKD), though which one you use doesn't matter much for comparison purposes as long as you apply the same one to both strings.

>>> import unicodedata
>>> x = b'e\xcc\x81'.decode()
>>> y = b'\xc3\xa9'.decode()
>>> x == y
False
>>> unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
True

NFC, for example, composes the sequence U+0065 U+0301 into the single code point U+00E9. For more information, see https://www.unicode.org/faq/normalization.html. You will probably want to normalize any user input before storing it, and you'll want to make sure that the same normalization is used for all stored data. The FAQ may help you decide which normalization is most appropriate for your use.
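If you compare strings in several places, one option is to wrap the normalization in a small helper. This is only a sketch: the function name nfc_equal and the variable names scraped and typed are assumptions made for the example, and the sample values stand in for the question's a and b, whose real contents aren't known.

```python
import unicodedata


def nfc_equal(left, right):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


# Assumed sample data: the same visible text in decomposed and precomposed form.
scraped = "e\u0301clairage public"
typed = "\u00e9clairage public"

print(scraped == typed)           # False
print(nfc_equal(scraped, typed))  # True
```

Normalizing once when the data enters your program, rather than at every comparison, avoids repeating the work and keeps plain == usable afterwards.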

2 Comments

A cleaner way should be to encode/decode the strings, but I don't know how to do that. One of the entries comes from the web (Selenium); the other is a string I created manually.
Encoding/decoding is a separate issue. You need to decode a UTF-8 byte string before you can normalize the result, and you (may) need to encode the normalized result back to UTF-8 for storage.
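
As a rough illustration of that order of operations (decode, normalize, then encode again for storage), assuming the input arrives as UTF-8 bytes; the variable names and the sample bytes are invented for the example:

```python
import unicodedata

# Assumed input: UTF-8 bytes holding the decomposed form (e + combining accent).
raw = b"e\xcc\x81clairage public"

text = raw.decode("utf-8")                       # bytes -> str
normalized = unicodedata.normalize("NFC", text)  # str -> NFC str
stored = normalized.encode("utf-8")              # NFC str -> bytes for storage

print(stored)  # b'\xc3\xa9clairage public'
```

If the values are already str objects, as they appear to be in the question, only the normalize step is needed.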
