I have a dataset with a column 'Self_Employed'. This column contains the values 'Yes', 'No' and NaN. I want to replace the NaN values with a value that is calculated in calc(). I've tried some methods I found on here, but I couldn't find one that was applicable to me. Here is my code; the things I've tried are in the comments:

from random import randint  # assumed import; the original snippet did not show it

# Handling missing data - Self_Employed
SEyes = (df['Self_Employed']=='Yes').sum()
SEno = (df['Self_Employed']=='No').sum()

def calc():
    rand_SE = randint(0,(SEno+SEyes))
    if rand_SE > 81:
        return 'No'
    else:
        return 'Yes'


# df['Self_Employed'] = df['Self_Employed'].fillna(randint(0,100))
# df['Self_Employed'].isnull().apply(lambda v: calc())

# df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
# df[df['Self_Employed']]

# df_nan['Self_Employed'] = df_nan['Self_Employed'].isnull().apply(lambda v: calc())
# df_nan

# for i in range(df['Self_Employed'].isnull().sum()):
#     print(df.Self_Employed[i])


df[df['Self_Employed'].isnull()] = df[df['Self_Employed'].isnull()].apply(lambda v: calc())
df

The line where I tried it with df_nan seems to work, but then I end up with a separate set containing only the formerly missing values, whereas I want to fill the missing values in the whole dataset. For the last line I'm getting an error; I linked to a screenshot of it. Do you understand my problem and, if so, can you help?

This is the set with only the rows where Self_Employed is NaN

This is the original dataset

This is the error

3 Answers

Make sure that SEno + SEyes is not 0, and use the .loc method to set the value of Self_Employed where it is missing:

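import numpy as np

# The +1 keeps SEno + SEyes above 0, so randint always has a valid range
SEyes = (df['Self_Employed']=='Yes').sum() + 1
SEno = (df['Self_Employed']=='No').sum()

def calc():
    # np.random.randint excludes the upper bound
    rand_SE = np.random.randint(0, SEno + SEyes)
    # 81 is the hard-coded threshold taken from the question
    if rand_SE >= 81:
        return 'No'
    else:
        return 'Yes'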

df.loc[df['Self_Employed'].isna(), 'Self_Employed'] = df.loc[df['Self_Employed'].isna(), 'Self_Employed'].apply(lambda x: calc())
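
As a possible alternative to the hard-coded threshold, here is a minimal sketch that draws each replacement in proportion to how often 'Yes' and 'No' already occur in the column. The helper name `fill_missing_by_frequency`, the `seed` parameter and the toy DataFrame are assumptions made for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd

def fill_missing_by_frequency(series, seed=None):
    """Return a copy of `series` with NaNs replaced by values sampled
    at the frequencies observed among the non-missing entries.

    Hypothetical helper; assumes at least one non-missing value exists.
    """
    rng = np.random.default_rng(seed)
    # value_counts skips NaN by default; normalize=True gives proportions
    freqs = series.value_counts(normalize=True)
    mask = series.isna()
    filled = series.copy()
    # One independent draw per missing row
    filled.loc[mask] = rng.choice(
        freqs.index.to_numpy(), size=int(mask.sum()), p=freqs.to_numpy()
    )
    return filled

# Toy data standing in for the real dataset
df = pd.DataFrame({'Self_Employed': ['Yes', 'No', np.nan, 'No', np.nan, 'No']})
df['Self_Employed'] = fill_missing_by_frequency(df['Self_Employed'], seed=0)
```

Passing a seed makes the fill reproducible; leaving it as None gives a different fill on each run.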

2 Comments

This worked! I thank you for your help. Why the +1 though?
Just in case SEno + SEyes == 0, because np.random.randint(0, 0) raises a ValueError.

What about df['Self_Employed'] = df['Self_Employed'].fillna(calc())?

1 Comment

This calls calc() only once and uses that single result for every row, instead of doing the calculation per row. I want the NaNs to be filled semi-randomly with 'Yes' and 'No'.
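
The difference can be checked with a small sketch that counts how often the function runs. The call counter, the simplified `calc()` body and the toy Series are assumptions for illustration only.

```python
import random
import numpy as np
import pandas as pd

calls = 0

def calc():
    """Stand-in for the question's calc(); counts how often it is called."""
    global calls
    calls += 1
    return random.choice(['Yes', 'No'])

s = pd.Series(['Yes', np.nan, 'No', np.nan, np.nan], dtype=object)

# calc() is evaluated once, before fillna runs, so every NaN gets the same value
filled_once = s.fillna(calc())
calls_fillna = calls

# Calling calc() once per missing row gives independent draws
calls = 0
mask = s.isna()
per_row = s.copy()
per_row.loc[mask] = [calc() for _ in range(int(mask.sum()))]
calls_per_row = calls

print(calls_fillna, calls_per_row)
```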

You could first identify the locations of your NaNs like

na_loc = df.index[df['Self_Employed'].isnull()]

Count the number of NaNs in your column like

num_nas = len(na_loc)

Then generate a matching number of random values, already indexed to the NaN locations

fill_values = pd.DataFrame({'Self_Employed': [random.randint(0,100) for i in range(num_nas)]}, index = na_loc)

And finally substitute those values in your dataframe

df.loc[na_loc, 'Self_Employed'] = fill_values['Self_Employed']
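
Put together, the steps above might look like the following sketch on a toy frame. It swaps the random integers for 'Yes'/'No' draws, since that is what the question asks for, and it selects rows and column in a single .loc call so that only Self_Employed is written. The 'Income' column and the fixed seed are assumptions for illustration.

```python
import random
import numpy as np
import pandas as pd

random.seed(0)  # fixed seed only to make the sketch reproducible

# Toy data; 'Income' is a hypothetical extra column that must stay untouched
df = pd.DataFrame({
    'Self_Employed': ['Yes', np.nan, 'No', np.nan],
    'Income': [10, 20, 30, 40],
})

# 1. Locate the missing entries
na_loc = df.index[df['Self_Employed'].isnull()]
# 2. Count them
num_nas = len(na_loc)
# 3. Draw one replacement per missing entry, indexed by the NaN locations
fill_values = pd.Series(
    [random.choice(['Yes', 'No']) for _ in range(num_nas)], index=na_loc
)
# 4. Assign with rows and column in one .loc call, so no other column changes
df.loc[na_loc, 'Self_Employed'] = fill_values
```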

2 Comments

This did fill the NaNs I intended to fill in my df, but it also replaced every other value in those rows with NaN. Row 11, for example, is now: NaN NaN NaN NaN NaN No NaN NaN NaN NaN NaN.
That is because I forgot to select the Self_Employed column in the assign statement. It is corrected now
