I'm new to pandas and Python and have a question about replacing multiple Unicode characters across an entire DataFrame. I'm using Python 2.7 and importing from an Excel sheet. I want to replace every non-ASCII character with its ASCII equivalent, or with nothing.

examples:
u'SHOGUN JAPANESE \u2013 GRAND'
u'COMFORT INN & SUITES\xa0STONE MOUNTAIN'

This works, but is cumbersome:

rawdf = rawdf["Account_Name"].str.upper().str.replace(u'\u2013', ' ').str.replace(u'\xa0', '-') + "|" + rawdf["COID"].str.upper()

This did not work:

rawdf = rawdf.replace(u'\u2013', ' ')
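
A likely reason this has no effect: without regex=True, DataFrame.replace only matches cells whose entire value equals the search string, so a dash embedded in a longer name is never touched. A minimal sketch of the substring form follows. The sample frame and the choice of replacement characters (hyphen for the en dash, space for the no-break space) are assumptions for illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical sample frame mirroring the strings in the question
rawdf = pd.DataFrame({
    "Account_Name": [u"SHOGUN JAPANESE \u2013 GRAND",
                     u"COMFORT INN & SUITES\xa0STONE MOUNTAIN"],
    "COID": [u"a1", u"b2"],
})

# Assumed mapping: pattern -> replacement, applied as regex substitutions
replacements = {u"\u2013": u"-", u"\xa0": u" "}

# regex=... makes replace() substitute inside strings, in every string column
cleaned = rawdf.replace(regex=replacements)
print(cleaned)
```

Because the keys are treated as regular expressions, any character with special meaning in a pattern (such as "." or "+") would need escaping, for example with re.escape.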

1 Answer

You can do an encode/decode cycle like so:

rawdf["Account_Name"].str.encode('ascii', 'ignore').str.decode('ascii')

Using 'ignore' drops any character that cannot be represented in ASCII. The intermediate representation is bytes, so we need to decode it back to strings afterwards.
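
To see what the round trip does to the two strings from the question, here is a small plain-Python sketch (no pandas; the variable names are illustrative):

```python
samples = [
    u"SHOGUN JAPANESE \u2013 GRAND",
    u"COMFORT INN & SUITES\xa0STONE MOUNTAIN",
]

# encode() produces bytes with the non-ASCII characters dropped;
# decode() turns those bytes back into text
stripped = [s.encode("ascii", "ignore").decode("ascii") for s in samples]

for s in stripped:
    print(repr(s))
```

Note that dropping the characters leaves two adjacent spaces in the first string and fuses "SUITES" and "STONE" in the second, which may matter if the result is compared against another data set.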

2 Comments

Thank you for the suggestion, but now because of the ignore, the characters as you mention were dropped. I actually need to do a replacement. The downstream process needs to compare raw_set_1 to clean_set_2 to give me the difference. Currently the difference is caused by these unique characters. Other thoughts?
You can use 'replace' instead, which will keep the character position and fill it, I believe, with a "?".
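
If specific ASCII equivalents are needed instead of "?", one option is to translate known characters first and fall back to 'replace' for the rest. This is only a sketch: the function name to_ascii and the mapping ASCII_MAP are hypothetical, and the chosen equivalents are assumptions to adjust to the data.

```python
# Assumed mapping from code point to ASCII equivalent; extend as needed
ASCII_MAP = {
    0x2013: u"-",  # EN DASH -> hyphen
    0x00A0: u" ",  # NO-BREAK SPACE -> regular space
}

def to_ascii(text):
    """Apply the explicit mapping, then turn any remaining non-ASCII character into '?'."""
    mapped = text.translate(ASCII_MAP)
    return mapped.encode("ascii", "replace").decode("ascii")

print(to_ascii(u"SHOGUN JAPANESE \u2013 GRAND"))
print(to_ascii(u"COMFORT INN & SUITES\xa0STONE MOUNTAIN"))
```

Assuming the column holds only strings, this could be applied with rawdf["Account_Name"].map(to_ascii). Missing values would need to be handled first.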