The goal is to fill the NaN values in each column with a random value chosen from that same column.

I can do this one column at a time, but when iterating through all the columns in the DataFrame I get a variety of errors. When I use "random.choice" I get letters rather than column values.

 df1 = df_na
 df2 = df_nan.dropna()

 for i in range(5):
    for j in range(len(df1)):
        if np.isnan(df1.iloc[j,i]):
           df1.iloc[j,i] = np.random.choice(df2.columns[i])

 df1

Any suggestions on how to move forward?

  • Please add a small sample input and the corresponding expected output Commented Jan 23, 2019 at 21:45

2 Answers

You can do:

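import pandas as pd

# sample data
df = pd.DataFrame({'a': [1, 2, None, 18, 20, None],
                   'b': [22, 33, 44, None, 100, 32]})

# fill missing values with one random value drawn from that column
# (assigning back avoids calling fillna(inplace=True) on a column selection)
for col in df.columns:
    df[col] = df[col].fillna(df[col].dropna().sample().values[0])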

      a      b
0   1.0   22.0
1   2.0   33.0
2  20.0   44.0
3  18.0  100.0
4  20.0  100.0
5  20.0   32.0

2 Comments

Thanks that worked perfectly! This approach is different than others I have seen so it was helpful and informative.
Follow-up question: this method made it so that all NaN values in a given column were replaced by the same value. Is there a method so that each row of a column is treated independently and a new random sample is taken to fill each individual NaN value?
You can use pd.DataFrame.apply with np.random.choice:

df = df.apply(lambda s: s.fillna(np.random.choice(s.dropna())))
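To see what the lambda receives, here is a small sketch. It assumes a toy DataFrame, and the helper name inspect_column and the list seen are made up for illustration. With the default axis, apply hands each column to the function as a Series whose name is the column label:

```python
import pandas as pd

# toy data, not from the question
df = pd.DataFrame({'a': [1, 2, None], 'b': [22, None, 44]})

seen = []  # hypothetical log of what apply passes in

def inspect_column(s):
    # record the type of the argument and the column label it carries
    seen.append((type(s).__name__, s.name))
    return s

df.apply(inspect_column)
print(seen)
```

Each recorded entry pairs the type name 'Series' with a column label, which is why s.dropna() and s.fillna(...) operate on one column at a time.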

5 Comments

This worked, and it uses the same .apply function I was trying to use originally. I was getting errors when trying to iterate through columns using the for loop. Thank you for the insight!
One more question: is "s" referencing the DataFrame df? Will the variable always reference the DataFrame? For example, in speeds_df.apply(lambda sp: sp.fillna(0)), will sp reference the DataFrame speeds_df?
I've used s to stand for "series"; it represents each column. You can choose any letter you like, though.
Follow-up question: this method made it so that all NaN values in a given column were replaced by the same value. Is there a method so that each row of a column is treated independently and a new random sample is taken to fill each individual NaN value?
@Dee, Probably, but that's a new question which you should ask separately. If an answer here solves your original problem, do accept it (tick on left) so other users know.
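
For reference, one possible way to draw a separate value for every NaN is sketched below. This is not from either answer: the helper name fill_na_random and the seed 0 are assumptions for illustration. It draws as many values as there are missing entries (with replacement) from the column's non-missing values and assigns them through a boolean mask:

```python
import numpy as np
import pandas as pd

def fill_na_random(s, rng):
    """Fill each NaN in s with an independent draw from s's non-missing values."""
    mask = s.isna()
    pool = s.dropna().to_numpy()
    if mask.any() and pool.size > 0:
        s = s.copy()  # leave the original column untouched
        # one draw per missing entry, sampled with replacement
        s[mask] = rng.choice(pool, size=int(mask.sum()))
    return s

df = pd.DataFrame({'a': [1, 2, None, 18, 20, None],
                   'b': [22, 33, 44, None, 100, 32]})

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
filled = df.apply(lambda s: fill_na_random(s, rng))
print(filled)
```

Because the draws are independent, two NaN values in the same column can still receive the same value by chance; a column with no non-missing values is returned unchanged.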
