3

I'm using beautiful soup to parse email invoices and I'm running into consistent problem involving special characters.

The text I am trying to parse is shown in the image.

enter image description here

But what I get from beautiful soup after finding the element and calling elem.text is this:

'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'

As you can see the apostrophe is now represented by "=E2=80=99", double quotes are "=E2=80=9C" and "=E2=80=9D" and there are seemingly random newlines in the text, for example "product=\r\ns". The newlines don't seem to appear in the image.

Apparently "E2 80 99" is the unicode hex representation of ' , but I don't understand why I can still see it in this form after having done email.decode('utf-8') before sending it to beautiful soup.

This is the element

<td border:="" class='3D"td"' left="" middle="" padding:="" solid="" style='3D"color:' text-align:="" v="ertical-align:">Hi Mike, It=E2=80=
=99s probably not a big drama if you are having problems separating product=
s from classes. It is not uncommon to receive an order for pole classes and=
 a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your =
system will not be able to place into a class list, hence having the extra =
sheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.</td>

I can post my code if required but I figure I must be making a simple mistake.

I checked out the answer to this question Decode Hex String in Python 3 but i think that expects the entire string to be hex rather than just having random hex parts. but I'm honestly not even sure how to search for "decode partial hex strings"

My final questions are

Q1 How do I convert

'Hi Mike, It=E2=80=\r\n=99s probably not a big drama if you are having problems separating product=\r\ns from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.\r\nAlso, remember that we will have just straight up product orders that your =\r\nsystem will not be able to place into a class list, hence having the extra =\r\nsheet for any =E2=80=9Cerroneous=E2=80=9D orders will be handy.'

into

'Hi Mike, It's probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and=\r\n a bottle of Dry Hands.Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any "erroneous" orders will be handy.'

using python 3, without manually fixing each string and writing a replace method for each possible character.

Q2 Why does this "=\r\n" appear everywhere in my string but not in the rendered html?

2

1 Answer 1

1

@JosefZ's comment lead me to the answer.

Q1 has an answer.

>>> import quopri
>>> print(quopri.decodestring(mystring).decode('utf-8'))
Hi Mike, It’s probably not a big drama if you are having problems separating products from classes. It is not uncommon to receive an order for pole classes and a bottle of Dry Hands.
Also, remember that we will have just straight up product orders that your system will not be able to place into a class list, hence having the extra sheet for any “erroneous” orders will be handy.

Q2 Thanks to @snakecharmerb I now know that the seemingly random unrepresented line endings are to enforce a line length of 80 characters.

@snakecharmerb wrote a much better answer than this one to someone with the same problem as me here. https://stackoverflow.com/a/55295640/992644

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.