Create a dataframe based on another dataframe using unique values

Question

If I have a Pandas dataframe like so:

colA colB
 A    A1
 B    C1
 A    B1
 B    A1

colA has 2 unique values (A, B) and colB has 3 unique values (A1, B1 and C1).

I would like to create a new dataframe that contains every combination of the colA and colB values, plus another column colC that is 1 if the combination is present in the original df and 0 otherwise.

expected result:

colA colB colC
 A    A1   1
 A    B1   1
 A    C1   0
 B    A1   1
 B    B1   0
 B    C1   1
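
To reproduce this, the sample input can be built as in the minimal sketch below. The variable name `df` is an assumption, chosen to match the name the answer uses.

```python
import pandas as pd

# Sample input from the question; the name `df` is an assumption.
df = pd.DataFrame({
    "colA": ["A", "B", "A", "B"],
    "colB": ["A1", "C1", "B1", "A1"],
})

# Count the distinct values in each column.
n_a = df["colA"].nunique()
n_b = df["colB"].nunique()
print(n_a, n_b, n_a * n_b)  # the last number is how many rows the result needs
```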

jezrael · Accepted Answer · 2019-04-16 06:43:28Z


First add a new column filled with 1 using DataFrame.assign. Then build a MultiIndex.from_product from the Series.unique values of both columns, and after DataFrame.set_index call DataFrame.reindex. The fill_value parameter sets the colC value for the rows that reindex adds, i.e. the combinations missing from the original data:

# all combinations of the unique values of both columns
mux = pd.MultiIndex.from_product([df['colA'].unique(),
                                  df['colB'].unique()], names=['colA','colB'])
# existing pairs get 1; pairs added by reindex get fill_value=0
df1 = df.assign(colC=1).set_index(['colA','colB']).reindex(mux, fill_value=0).reset_index()
print(df1)
  colA colB  colC
0    A   A1     1
1    A   C1     0
2    A   B1     1
3    B   A1     1
4    B   C1     1
5    B   B1     0

Alternatively, reshape with DataFrame.set_index, Series.unstack and DataFrame.stack:

df1 = (df.assign(colC = 1)
         .set_index(['colA','colB'])['colC']
         .unstack(fill_value=0)
         .stack()
         .reset_index(name='colC'))

print(df1)
  colA colB  colC
0    A   A1     1
1    A   B1     1
2    A   C1     0
3    B   A1     1
4    B   B1     0
5    B   C1     1

Another solution is to build a new DataFrame from itertools.product, DataFrame.merge it with the original using indicator=True, rename the _merge column to colC, and then compare it with 'both', casting the resulting True/False to 1/0 with astype(int):

from itertools import product

# every (colA, colB) pair as a plain DataFrame
combos = pd.DataFrame(list(product(df['colA'].unique(), df['colB'].unique())), columns=['colA','colB'])
# _merge is 'both' for pairs present in df and 'left_only' otherwise
df1 = combos.merge(df, how='left', indicator=True).rename(columns={'_merge':'colC'})
df1['colC'] = df1['colC'].eq('both').astype(int)
print(df1)
  colA colB  colC
0    A   A1     1
1    A   C1     0
2    A   B1     1
3    B   A1     1
4    B   C1     1
5    B   B1     0

Finally, if needed, sort by both columns with DataFrame.sort_values:

df1 = df1.sort_values(['colA','colB'])
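
To check that the three approaches above agree once sorted, something like the following self-contained sketch can be used. The function names (`via_reindex`, `via_unstack`, `via_merge`, `tidy`) are hypothetical helpers introduced only for this illustration, and the sketch assumes each (colA, colB) pair occurs at most once in the input, as in the sample data; with duplicated pairs the reindex and unstack variants would raise an error.

```python
from itertools import product

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "colA": ["A", "B", "A", "B"],
    "colB": ["A1", "C1", "B1", "A1"],
})


def via_reindex(frame):
    # Reindex against the full product of unique values; new rows get 0.
    mux = pd.MultiIndex.from_product(
        [frame["colA"].unique(), frame["colB"].unique()],
        names=["colA", "colB"],
    )
    return (frame.assign(colC=1)
                 .set_index(["colA", "colB"])
                 .reindex(mux, fill_value=0)
                 .reset_index())


def via_unstack(frame):
    # Pivot colB into columns (missing cells become 0), then stack back.
    return (frame.assign(colC=1)
                 .set_index(["colA", "colB"])["colC"]
                 .unstack(fill_value=0)
                 .stack()
                 .reset_index(name="colC"))


def via_merge(frame):
    # Left-merge the full product with the data; '_merge' flags matches.
    combos = pd.DataFrame(
        list(product(frame["colA"].unique(), frame["colB"].unique())),
        columns=["colA", "colB"],
    )
    out = combos.merge(frame, how="left", indicator=True)
    out["colC"] = out["_merge"].eq("both").astype(int)
    return out.drop(columns="_merge")


def tidy(result):
    # Sort by both key columns and renumber rows so results are comparable.
    return result.sort_values(["colA", "colB"]).reset_index(drop=True)


results = [tidy(f(df)) for f in (via_reindex, via_unstack, via_merge)]
for res in results:
    print(res)
```

After sorting, each variant gives colC as 1, 1, 0, 1, 0, 1 for the pairs A/A1 through B/C1, which matches the expected result in the question.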

edited Apr 16, 2019 at 6:43

answered Apr 16, 2019 at 6:28


2 Comments

theberzi Over a year ago

Could you add an explanation for what you're doing, especially for the assign() line?

jezrael Over a year ago

@FedericoS - Done :)
