I have a large dataset in pandas. For brevity, let's say I have the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})

[Image: initial data]

Now I want to create categorical variables with the suffix _missing: for every column in the dataset that contains missing values (NaN), a new column (variable) should be created that holds 1 where the value is NaN and 0 otherwise. For example, for col1 and col2, the corresponding variables will be col1_missing and col2_missing.

Then, for columns like col3 that contain letters where the values are supposed to be numeric, I would like a similar result, but with the number of category levels growing with the number of distinct letters. For example, the new column corresponding to col4 will be col4_missing and will contain 0 for non-letters, 1 for b, 2 for w and 3 for z. So the resulting frame should look as below:

[Image: resulting dataframe]

Is there any Python function or package to do this? As a newbie, I am honestly overwhelmed by this and would be grateful for any help.

1 Answer

You can map the values from a dictionary:

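def flag(s):
    # letters map to fixed codes; anything not in the dictionary maps to 0
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN is filled with 'b' first, so missing values are flagged as 1
    return s.fillna('b').map(lambda x: flags.get(x, 0))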

out = (pd
 .concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
 .sort_index(axis=1)
 )

Output:

    col1  col1_missing   col2  col2_missing  col3  col3_missing col4  col4_missing   col5  col5_missing   col6  col6_missing col7  col7_missing
0  101.0             0  123.0             0     a             0    w             2     21             0   11.3             0    A             0
1  101.0             0  123.0             0   0.7             0  0.2             0      z             3  202.0             0    B             0
2  101.0             0  124.0             0   0.6             0    b             1    0.3             0    0.2             0    C             0
3  201.0             0    NaN             1  1.01             0  0.7             0    2.3             0    0.3             0    D             0
4  201.0             0  321.0             0     2             0    z             3    0.8             0   41.0             0    E             0
5  201.0             0  321.0             0     1             0    2             0      z             3   47.0             0    F             0
6    NaN             1  456.0             0     2             0    3             0  1.001             0    2.0             0    G             0

To keep only the _missing columns that have at least one non-zero flag:

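def flag(s):
    # letters map to fixed codes; anything not in the dictionary maps to 0
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN is filled with 'b' first, so missing values are flagged as 1
    return s.fillna('b').map(lambda x: flags.get(x, 0))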

# flag values 
df2 = df.apply(flag).add_suffix('_missing')

# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]

out = (pd
 .concat([df, df2], axis=1)
 .sort_index(axis=1)
 )
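
If listing the columns to skip by hand is impractical, one possible approach (an assumption on my part, not something established above) is to treat a column as genuinely textual when none of its non-missing values is a number, and to leave those columns out before flagging. The helper name is_text_column is hypothetical, and flag is written here with an explicit pd.isna check instead of fillna('b'), which should give the same codes:

```python
import numbers

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})


def is_text_column(s):
    """Hypothetical helper: True if the column holds no numeric value at all."""
    values = s.dropna()
    has_number = values.map(lambda x: isinstance(x, numbers.Number)).any()
    return len(values) > 0 and not has_number


def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN -> 1, known letters -> their code, everything else -> 0
    return s.map(lambda x: 1 if pd.isna(x) else flags.get(x, 0))


# columns that are text throughout (here only col7) are left alone
text_cols = [c for c in df.columns if is_text_column(df[c])]

df2 = df.drop(columns=text_cols).apply(flag).add_suffix('_missing')
df2 = df2.loc[:, df2.ne(0).any()]  # drop all-zero flag columns

out = pd.concat([df, df2], axis=1).sort_index(axis=1)
```

With the sample frame, text_cols contains only col7. Note that col3_missing is also dropped by the all-zero filter, because 'a' is not a key in the flags dictionary.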

5 Comments

mozway: This looks extremely helpful. Two things are left to help me achieve my aim: 1. 'col6' and 'col7' should not be affected. 'col7' is already a character column and its entries are correct categories, not missing categories. Could a list of the FALSE character columns be generated and parsed? 2. I feel 'z' could take the value 1 in a column if it is the only letter in that column among the floats or integers. But I can live without this second bit, as it makes sense for 'z' to have the same meaning across the columns.
To ignore columns use: df.drop(columns=['col6', 'col7']).apply(...). For the rest, it depends on what you want exactly ;)
I think with my large data, it will indeed be tedious to list all the columns like col6 and col7 to drop. Perhaps it would be helpful to check whether a column has the suffix '_missing' and contains 0 throughout, and drop it if so.
Easy enough, see update
This is super great. And actually the keys in the flags dictionary should start from 'a'. One last thing: can you help implement issue number 2 that I raised in my first comment? That is, the option where, if a column has say only z as a letter, the corresponding missing column will render it 1. Similarly, if say b and w are the only letters present, the missing column should render them 1 and 2, respectively. I would like to test the two options in my analysis.
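
One possible reading of that last request, as a sketch rather than a definitive answer: rank the letters found in each column alphabetically, so a column containing only z codes it as 1, and a column with b and w codes them as 1 and 2. The helper name flag_per_column is hypothetical. The thread does not say how NaN should be coded in a column that also contains letters, so this sketch assumes NaN gets the next code after the letters, which reduces to 1 in columns with no letters. It also assumes every string in a column counts as a "letter", so genuine text columns such as col7 must be dropped first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})


def flag_per_column(s):
    """Hypothetical helper: code strings 1..k by alphabetical rank within s."""
    letters = sorted({x for x in s if isinstance(x, str)})
    codes = {letter: rank for rank, letter in enumerate(letters, start=1)}
    # assumption: NaN gets the code after the last letter (1 if no letters)
    nan_code = len(letters) + 1
    return s.map(lambda x: nan_code if pd.isna(x) else codes.get(x, 0))


# col6 and col7 are excluded explicitly, as suggested in the comments
df2 = (df.drop(columns=['col6', 'col7'])
         .apply(flag_per_column)
         .add_suffix('_missing'))
df2 = df2.loc[:, df2.ne(0).any()]  # keep only columns with at least one flag

out = pd.concat([df, df2], axis=1).sort_index(axis=1)
```

Here col5_missing codes z as 1 because z is the only letter in col5, while col4_missing still codes b, w and z as 1, 2 and 3. To compare with the fixed-dictionary option, swap flag_per_column for flag.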
