String comparison in Python 3

Question

My string comparison doesn't work.

Any ideas?

a = person.category[0].lower()
b = to_delete[5].lower()

print("test ", repr(a), "type: ", type(a))
print("test ", repr(b), "type: ", type(b))
print(a == b)
print(a is b)
print("éclairage public" == b)
print("éclairage public" == a )

returns:

test  'éclairage public' type:  <class 'str'>
test  'éclairage public' type:  <class 'str'>
False
False
False
True

So "b" doesn't have the expected composition but I don't know why!

We can't reproduce the issue if you don't provide the contents of person.category and to_delete.
– glhr, Commented Apr 30, 2019 at 15:30
You probably have Unicode normalization issues. Compare the values of a.encode() and b.encode().
– chepner, Commented Apr 30, 2019 at 15:31
Compare the results of b'e\xcc\x81'.decode() and b'\xc3\xa9'.decode(). Both look like é, but they are two different Unicode strings.
– chepner, Commented Apr 30, 2019 at 15:35

chepner · Accepted Answer · 2019-04-30 15:49:34Z

4

Your problem is almost certainly that a and b are two different Unicode strings (different sequences of code points) that look identical and share the same normalized form. As a simple example, consider these two ways to display é:

>>> b'e\xcc\x81'.decode()
'é'
>>> b'\xc3\xa9'.decode()
'é'

The first is a two-character string consisting of e (U+0065) followed by the combining acute accent ´ (U+0301). The second is a single character, é (U+00E9).
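To see which form a given string actually uses, you can list its code points by name. This is only a minimal sketch: describe is a hypothetical helper name, and the two sample strings are written with escapes so that the difference survives copy and paste.

```python
import unicodedata

def describe(s):
    """Return one 'U+XXXX NAME' entry per code point in s (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]

decomposed = "e\u0301"   # e followed by a combining acute accent
precomposed = "\u00e9"   # the single precomposed character

print(describe(decomposed))
print(describe(precomposed))
print(len(decomposed), len(precomposed))  # the lengths differ: 2 and 1
```

Running describe on a and b from the question should show where the two strings diverge, even though repr prints them identically.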

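In order to compare them successfully, you need to normalize them. There are several normalization forms available (NFC, NFD, NFKC, NFKD). For comparison purposes it doesn't matter much which one you use, as long as you use the same one for both strings. Be aware, though, that the compatibility forms NFKC and NFKD also fold together some characters that are visually distinct, such as ligatures.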

>>> import unicodedata
>>> x = b'e\xcc\x81'.decode()
>>> y = b'\xc3\xa9'.decode()
>>> x == y
False
>>> unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
True

NFC, for example, normalizes by composing the sequence U+0065 U+0301 into the single code point U+00E9. For more information, see https://www.unicode.org/faq/normalization.html. You will probably want to normalize any user input before storing it, and you'll want to make sure that the same normalization is used for all stored data. The FAQ may help you decide which normalization is most appropriate for your use.
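Applied to the strings in the question, the comparison could be wrapped in a small helper. This is a sketch under assumptions: same_text is a hypothetical name, and the two sample values only guess at what the scraped and hand-typed strings might contain.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after applying the same normalization form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

typed = "\u00e9clairage public"     # precomposed é, as a keyboard usually produces
scraped = "e\u0301clairage public"  # e + combining accent (assumed form of the web value)

print(typed == scraped)           # False: different code point sequences
print(same_text(typed, scraped))  # True: identical once both are NFC-normalized
```

Normalizing once, at the point where the data enters the program, is usually simpler than normalizing at every comparison.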

edited Apr 30, 2019 at 15:49

answered Apr 30, 2019 at 15:42


2 Comments

Alex Over a year ago

A cleaner way should be to encode/decode the strings, but I don't know how to do that. One of the entries comes from the web (selenium); the other is a string created manually.

chepner Over a year ago

Encoding/decoding is a separate issue. You need to decode a UTF-8 byte string before you can normalize the result, and you (may) need to encode the normalized result back to UTF-8 for storage.
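The order of operations described here (decode, normalize, then encode again) might look like the following. It is a sketch only: normalize_utf8 is a hypothetical helper, and it assumes the input bytes are valid UTF-8.

```python
import unicodedata

def normalize_utf8(raw: bytes, form: str = "NFC") -> bytes:
    """Decode UTF-8 bytes, normalize the text, and re-encode it for storage."""
    text = raw.decode("utf-8")                   # bytes -> str
    normalized = unicodedata.normalize(form, text)
    return normalized.encode("utf-8")            # str -> bytes

print(normalize_utf8(b'e\xcc\x81'))  # b'\xc3\xa9'
```

Strings that are already str objects, such as text returned by selenium, skip the decode step and only need the normalize call.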
