0

I'm trying to pre-process some data for machine learning purposes. I'm currently trying to clean up some NaN values and replace them with 'unknown' and a prefix or suffix which is based on the column name.

The reason for this is that when I use one-hot encoding, I can't have multiple columns with the same name being fed into xgboost.

So what I have is the following

df = df.apply(lambda x: x.replace(np.nan, 'unknown'))

And I'd like to replace all instances of NaN in the df with 'unknown_' followed by the column name (e.g. 'unknown_colname'). Is there an easy way to do this?

2
  • 1
    Try df = df.apply(lambda x: x.replace(np.nan, f'unknown_{x.name}')). You can also use df = df.apply(lambda x: x.fillna(f'unknown_{x.name}')) Commented Sep 9, 2020 at 21:57
  • This is perfect and goes along well with my code! If you submit this as an answer, I'd like to give you the points! Commented Sep 9, 2020 at 22:13

2 Answers

2

Try df = df.apply(lambda x: x.replace(np.nan, f'unknown_{x.name}')).

You can also use df = df.apply(lambda x: x.fillna(f'unknown_{x.name}')).
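
For reference, here is a minimal runnable sketch of the fillna variant. The DataFrame and its column names ("color", "size") are made up for illustration and are not from the question; it assumes the columns hold strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration.
df = pd.DataFrame({
    "color": ["red", np.nan, "blue"],
    "size": [np.nan, "M", "L"],
})

# apply() passes each column as a Series; x.name is that column's label,
# so every column gets its own fill value.
filled = df.apply(lambda x: x.fillna(f"unknown_{x.name}"))

print(filled)
```

Because the fill value is built from x.name, the placeholder differs per column, which is what keeps the one-hot encoded column names distinct.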

1

First, let's create the array of fill values to use wherever a value is missing, one entry per column:

s = np.char.add('unknown_', df.columns.values.astype(str))

Then we can simply replace the NaNs in each column with the matching value from s:

for col_name, fill in zip(df.columns, s):
    # df.col_name would look up a column literally named "col_name"
    df[col_name] = df[col_name].fillna(fill)
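
As a possible alternative to the loop, DataFrame.fillna also accepts a dict mapping column labels to fill values. The sketch below assumes string columns; the sample data and the names fill_values and result are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "city": ["Oslo", np.nan],
    "job": [np.nan, "nurse"],
})

# Map each column label to its own fill value.
fill_values = {col: f"unknown_{col}" for col in df.columns}

# With a dict, fillna fills NaNs column by column;
# columns not present in the dict are left untouched.
result = df.fillna(fill_values)

print(result)
```

This does the same job in one call and returns a new DataFrame rather than modifying df in place.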
