1

I have dataframe like this:

  Date       ID     Age      Gender        Fruits
  1.1.19     1      50         F            Apple
  2.1.19     1      50         F            Mango
  2.1.19     1      50         F            Orange
  1.1.19     2      75         M            Grapes
  4.1.19     3      20         M            Apple
  4.1.19     3      20         M            Grapes

for example i have two lists:

fruits_list = ['Apple', 'Mango', 'Orange', 'Grapes', 'Banana', 'Guava']
date_list = ['1.1.19', '2.1.19', '3.1.19', '4.1.19', '5.1.19', '6.1.19']

I want to convert the Fruit column into further columns which gives binary info yes/no for each person. And the missing date should be NaN. by using this

pd.get_dummies(df, columns=['Fruits'], prefix='', prefix_sep='').groupby('Date').max()

i will get this but i need all the elements which are in fruit_list and date_list

Date          ID     Age      Gender        Apple    Mango      Orange    Grapes 
1.1.19         1      50         F            1        0           0         0
1.1.19         2      75         M            0        0           0         1
2.1.19         1      50         F            0        1           1         0 
3.1.19         NaN    NaN        NaN          NaN     NaN        NaN       NaN
4.1.19         3      20         M            1        0           0         1

Desired output would be like this.

Date             ID     Age      Gender        Apple    Mango      Orange    Grapes    Banana   Guava   sum
    1.1.19         1      50         F            1        0           0         0     0        0       1 
    1.1.19         2      75         M            0        0           0         1     0        0       1
    2.1.19         1      50         F            0        1           1         0     0        0       2
    3.1.19         NaN    NaN        NaN          0        0           0         0     0        0       0
    4.1.19         3      20         M            1        0           0         1     0        0       2
    5.1.19         NaN    NaN        NaN          0        0           0         0     0        0       0
    6.1.19         NaN    NaN        NaN          0        0           0         0     0        0        0
1
  • @jezrael can you help? Commented May 4, 2020 at 11:05

1 Answer 1

3

First is necessary datetimes in date_list and Date column for avoid missing values, then use DataFrame.reindex for add not exist Datetimes and columns:

df1 = (pd.get_dummies(df, columns=['Fruits'], prefix='', prefix_sep='')
         .groupby('Date')
         .max()
         .reindex(index=date_list, columns=fruits_list, fill_value=0)
         .assign(sum = lambda x: x.sum(axis=1)))
print (df1)

        Apple  Mango  Orange  Grapes  Banana  Guava  sum
Date                                                    
1.1.19      1      0       0       1       0      0    2
2.1.19      0      1       1       0       0      0    2
3.1.19      0      0       0       0       0      0    0
4.1.19      1      0       0       1       0      0    2
5.1.19      0      0       0       0       0      0    0
6.1.19      0      0       0       0       0      0    0

Then add original columns by DataFrame.merge and change order of columns by reindex:

cols = ['Date','ID','Age','Gender'] + fruits_list
df = df1.reset_index().merge(df, on='Date', how='left').reindex(cols, axis=1)
print (df)
     Date   ID   Age Gender  Apple  Mango  Orange  Grapes  Banana  Guava
0  1.1.19  1.0  50.0      F      1      0       0       1       0      0
1  1.1.19  2.0  75.0      M      1      0       0       1       0      0
2  2.1.19  1.0  50.0      F      0      1       1       0       0      0
3  2.1.19  1.0  50.0      F      0      1       1       0       0      0
4  3.1.19  NaN   NaN    NaN      0      0       0       0       0      0
5  4.1.19  3.0  20.0      M      1      0       0       1       0      0
6  4.1.19  3.0  20.0      M      1      0       0       1       0      0
7  5.1.19  NaN   NaN    NaN      0      0       0       0       0      0
8  6.1.19  NaN   NaN    NaN      0      0       0       0       0      0
Sign up to request clarification or add additional context in comments.

17 Comments

Cool. I forget to mention in my question, can we also get the sum of all the columns?
Here i lose the person attributes like age and gender?
when i use that logic then i am getting all the values 0 (NaN)
@s_khan92 - Problem is with df1, first paragraph?
@s_khan92 - Are datetimes in Date column and also are datetiems in list date_list ?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.