Creating missing-value dummy indicator variables for a list of variables in a dataframe with specific values in Python (pandas)

Question

I have a large dataset in pandas. For brevity, let's say I have the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})

Now I want to create categorical variables with the suffix _missing: for every column in the dataset that contains missing values (NaN), a new column (variable) should be created that has the value 1 for NaN and 0 otherwise. For example, for col1 and col2, the corresponding variables will be col1_missing and col2_missing.

Then, for columns like col3 that contain letters where the values are supposed to be numeric, I would like a similar result, but with the number of category levels increasing with the number of distinct letters. For example, the new column corresponding to col4 will be col4_missing and will contain 0 for non-letters, 1 for b, 2 for w and 3 for z. So the resulting frame should look as below:

Is there any Python function or package to do this? As a newbie, I am honestly overwhelmed by this and would be grateful for any help.

mozway · Accepted Answer · 2022-09-24 05:52:21Z

You can map the values from a dictionary:

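def flag(s):
    # letter -> category code; anything not listed here maps to 0
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN is first replaced by 'b' so that it also ends up coded as 1
    return s.fillna('b').map(lambda x: flags.get(x, 0))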

out = (pd
 .concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
 .sort_index(axis=1)
 )

Output:

    col1  col1_missing   col2  col2_missing  col3  col3_missing col4  col4_missing   col5  col5_missing   col6  col6_missing col7  col7_missing
0  101.0             0  123.0             0     a             0    w             2     21             0   11.3             0    A             0
1  101.0             0  123.0             0   0.7             0  0.2             0      z             3  202.0             0    B             0
2  101.0             0  124.0             0   0.6             0    b             1    0.3             0    0.2             0    C             0
3  201.0             0    NaN             1  1.01             0  0.7             0    2.3             0    0.3             0    D             0
4  201.0             0  321.0             0     2             0    z             3    0.8             0   41.0             0    E             0
5  201.0             0  321.0             0     1             0    2             0      z             3   47.0             0    F             0
6    NaN             1  456.0             0     2             0    3             0  1.001             0    2.0             0    G             0
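Note that `fillna('b')` codes NaN as 1 only because 'b' happens to map to 1 in the dictionary. The following is a minimal sketch of a variant that tests for NaN explicitly, so the NaN code no longer depends on the letter table. The name `flag_explicit` is hypothetical (it is not part of the answer), and the sketch is checked here on col2 and col4 only.

```python
import numpy as np
import pandas as pd

def flag_explicit(s):
    # Same letter -> code table as in the answer
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN is coded 1 directly; letters are looked up; everything else is 0
    return s.map(lambda x: 1 if pd.isna(x) else flags.get(x, 0))

col2 = pd.Series([123, 123, 124, np.nan, 321, 321, 456], name='col2')
col4 = pd.Series(['w', 0.2, 'b', 0.7, 'z', 2, 3], name='col4')

col2_flags = flag_explicit(col2).tolist()
col4_flags = flag_explicit(col4).tolist()
print(col2_flags)  # [0, 0, 0, 1, 0, 0, 0]
print(col4_flags)  # [2, 0, 1, 0, 3, 0, 0]
```

For these two columns the result matches the col2_missing and col4_missing columns in the output above.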

Keeping only the _missing columns that contain at least one non-zero flag:

def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# flag values 
df2 = df.apply(flag).add_suffix('_missing')

# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]

out = (pd
 .concat([df, df2], axis=1)
 .sort_index(axis=1)
 )
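To see what the `df2.ne(0).any()` filter does, here is a small sketch on a toy flag table. The three columns and their values are made up for illustration and are not the full frame from the question.

```python
import pandas as pd

# Toy flag table (illustrative values only)
flags_df = pd.DataFrame({
    'col1_missing': [0, 0, 1],
    'col3_missing': [0, 0, 0],
    'col4_missing': [2, 0, 1],
})

# ne(0) marks every non-zero cell; any() reduces each column to True/False
mask = flags_df.ne(0).any()

# Boolean column selection keeps only the columns where the mask is True
kept = flags_df.loc[:, mask]

print(mask.tolist())        # [True, False, True]
print(list(kept.columns))   # ['col1_missing', 'col4_missing']
```

The all-zero column is dropped, which is why columns without NaN or listed letters do not get a _missing companion.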

edited Sep 24, 2022 at 5:52

answered Sep 24, 2022 at 3:29

mozway


5 Comments

Lomnewton Over a year ago

mozway: This looks extremely helpful. Two things are left to achieve my aim: 1. 'col6' and 'col7' should not be affected. 'col7' is already a character column and its entries are correct categories, not missing categories. Could a list of the FALSE character columns be generated and parsed? 2. I feel 'z' in one column could take the value '1' if it is the only letter in that column among the floats or integers. But I can live without this second bit, as it makes sense for 'z' to have the same meaning across the columns.

mozway Over a year ago

To ignore columns use: df.drop(columns=['col6', 'col7']).apply(...). For the rest, it depends on what you want exactly ;)
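As an alternative to listing the columns by hand, the columns to flag could be picked automatically. This is only a sketch under an assumed heuristic that is not from the thread: a column counts as "supposed to be numeric" if at least one of its values parses as a number. The helper name `looks_numeric` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
    'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
    'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
})

def looks_numeric(s):
    # Assumed heuristic: non-numeric entries become NaN under coercion,
    # so a genuinely categorical column ends up entirely NaN
    return bool(pd.to_numeric(s, errors='coerce').notna().any())

numeric_like = [c for c in df.columns if looks_numeric(df[c])]
print(numeric_like)  # ['col4', 'col6']
```

The flagging function would then be applied to `df[numeric_like]` only, leaving a purely categorical column such as col7 untouched.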

Lomnewton Over a year ago

I think with my large data, it will indeed be tedious to list all the columns like col6 and col7 that need to be dropped. Perhaps it would be helpful to check whether a column has the suffix '_missing' and contains 0s throughout, in which case it should be dropped.

mozway Over a year ago

Easy enough, see update

Lomnewton Over a year ago

This is super great. And actually the keys in the flags dictionary should start from 'a'. One last thing: can you help implement issue number 2 that I raised in my first comment? That is the option where, if a column has say only z as a letter, the corresponding missing column will code it as 1. Similarly, if say b and w are the only letters present, the missing column should code them as 1 and 2, respectively. I would like to test both options in my analysis.
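The thread does not include an answer to this last request. One possible sketch, which is an assumption and not mozway's method, ranks the letters within each column separately. The name `flag_per_column` is hypothetical, and NaN is still coded 1 as in the answer, which would collide with the first letter in a column that contains both NaN and letters.

```python
import numpy as np
import pandas as pd

def flag_per_column(s):
    # Distinct string entries of this column, in alphabetical order
    letters = sorted({x for x in s if isinstance(x, str)})
    # The first letter present gets 1, the second gets 2, and so on
    codes = {ch: i for i, ch in enumerate(letters, start=1)}
    # NaN -> 1 (assumption carried over from the answer); other values -> 0
    return s.map(lambda x: 1 if pd.isna(x) else codes.get(x, 0))

col1 = pd.Series([101, 101, 101, 201, 201, 201, np.nan], name='col1')
col4 = pd.Series(['w', 0.2, 'b', 0.7, 'z', 2, 3], name='col4')
col5 = pd.Series([21, 'z', 0.3, 2.3, 0.8, 'z', 1.001], name='col5')

col1_flags = flag_per_column(col1).tolist()
col4_flags = flag_per_column(col4).tolist()
col5_flags = flag_per_column(col5).tolist()
print(col1_flags)  # [0, 0, 0, 0, 0, 0, 1]
print(col4_flags)  # [2, 0, 1, 0, 3, 0, 0]
print(col5_flags)  # [0, 1, 0, 0, 0, 1, 0]
```

Here 'z' is coded 3 in col4 but 1 in col5, since it is the only letter in col5. A genuinely categorical column such as col7 would need to be excluded first, otherwise every one of its entries would receive a code.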
