Pandas dataframe replace

Question

I'm new to pandas and Python and have a question about replacing multiple Unicode characters across an entire DataFrame. I'm using Python 2.7 and importing from an Excel sheet. I want to replace all non-ASCII characters with their ASCII equivalents, or with nothing.

Examples:
u'SHOGUN JAPANESE \u2013 GRAND'
u'COMFORT INN & SUITES\xa0STONE MOUNTAIN'

This works, but is cumbersome:

rawdf = rawdf["Account_Name"].str.upper().str.replace(u'\u2013', ' ').str.replace(u'\xa0', '-') + "|" + rawdf["COID"].str.upper()

This did not work:

rawdf = rawdf.replace(u'\u2013', ' ')

mdurant · Accepted Answer · 2016-08-26 18:17:47Z

You can do an encode/decode cycle like so:

rawdf["Account_Name"].str.encode('ascii', 'ignore').str.decode('ascii')

Using 'ignore' drops any character that cannot be represented in ASCII. The intermediate representation is bytes, so we need to decode it back to strings afterwards.
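
For illustration, here is a minimal sketch of that encode/decode cycle on the two strings from the question. The small DataFrame is made-up sample data, and the snippet assumes Python 3 with a reasonably recent pandas rather than the asker's Python 2.7 setup.

```python
import pandas as pd

# Hypothetical sample data modeled on the strings in the question.
rawdf = pd.DataFrame({
    "Account_Name": [
        "SHOGUN JAPANESE \u2013 GRAND",
        "COMFORT INN & SUITES\xa0STONE MOUNTAIN",
    ]
})

# encode() produces bytes and, with errors="ignore", silently drops
# every character that has no ASCII representation.
as_bytes = rawdf["Account_Name"].str.encode("ascii", errors="ignore")

# decode() turns those bytes back into ordinary text strings.
cleaned = as_bytes.str.decode("ascii")

print(cleaned.tolist())
```

Note that the dropped characters leave no trace: the en dash disappears between its two surrounding spaces, and the two words around the non-breaking space run together.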

2 Comments

Sean Over a year ago

Thank you for the suggestion, but now because of the ignore, the characters as you mention were dropped. I actually need to do a replacement. The downstream process needs to compare raw_set_1 to clean_set_2 to give me the difference. Currently the difference is caused by these unique characters. Other thoughts?

mdurant Over a year ago

You can use 'replace' instead, which will keep the character position and fill it, I believe, with a "?".
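
A rough sketch of what that looks like, again on made-up sample data and assuming Python 3. The first part uses the 'replace' error handler from the comment above. The second part is a different technique, not something the answer proposes: since the asker wants specific substitutions, it passes a hand-written mapping to DataFrame.replace with the regex argument, which matches substrings instead of whole cell values. The mapping itself (en dash to hyphen, non-breaking space to space) is an assumption about what "ASCII equivalent" should mean here.

```python
import pandas as pd

# Hypothetical sample data modeled on the strings in the question.
rawdf = pd.DataFrame({
    "Account_Name": [
        "SHOGUN JAPANESE \u2013 GRAND",
        "COMFORT INN & SUITES\xa0STONE MOUNTAIN",
    ]
})

# Option 1: errors="replace" keeps each character position and fills
# every non-ASCII character with "?".
marked = (
    rawdf["Account_Name"]
    .str.encode("ascii", errors="replace")
    .str.decode("ascii")
)

# Option 2 (alternative technique): an explicit substitution table.
# The choices below are assumptions; adjust them to the data at hand.
ascii_map = {"\u2013": "-", "\xa0": " "}

# Passing the mapping via regex= makes pandas replace matching
# substrings in every string cell of the DataFrame.
mapped = rawdf.replace(regex=ascii_map)

print(marked.tolist())
print(mapped["Account_Name"].tolist())
```

A plain rawdf.replace(u'\u2013', ' ') without the regex argument only matches cells whose entire value equals the search string, which is likely why the attempt in the question appeared to do nothing.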
