How to create a unique identifier based on multiple columns?

Question

I have a pandas dataframe that looks something like this:

    brand       description     former_price    discounted_price
0   A           icecream        1099.0          855.0   
1   A           cheese          469.0           375.0   
2   B           catfood         179.0           119.0   
3   C           NaN             699.0           399.0   
4   NaN         icecream        769.0           549.0
5   A           icecream        769.0           669.0

I want to create a column that assigns a unique value to each brand & description combination. Note that either the brand or the description can be missing from the dataset (denoted by a NaN value). Also note that if the brand and the description are the same in two rows (duplicated), I still want both rows to get the same value.

The output should look like this:

    product_key   brand         description     former_price    discounted_price
0   1             A             icecream        1099.0          855.0   
1   2             A             cheese          469.0           375.0   
2   3             B             catfood         179.0           119.0   
3   4             C             NaN             699.0           399.0   
4   5             NaN           icecream        769.0           549.0
5   1             A             icecream        769.0           669.0

The values in product_key can be anything; I just want them to be unique for each combination of the brand and description columns. Any help is immensely appreciated!

Thanks a lot!

MrNobody33 · Accepted Answer · 2020-07-15 15:53:44Z

11

You could try factorize, called on a MultiIndex built from the two columns:

df.set_index(['brand','description']).index.factorize()[0]+1

Output:

array([1, 2, 3, 4, 5, 1])

So, to assign the result as the first column, you could try this:

df.insert(loc=0, column='product_key', value=df.set_index(['brand','description']).index.factorize()[0]+1)

Output:

df
   product_key brand description  former_price  discounted_price
0            1     A    icecream        1099.0             855.0
1            2     A      cheese         469.0             375.0
2            3     B     catfood         179.0             119.0
3            4     C         NaN         699.0             399.0
4            5   NaN    icecream         769.0             549.0
5            1     A    icecream         769.0             669.0
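For anyone who wants to reproduce this end to end, here is a minimal, self-contained sketch of the same idea. The DataFrame is rebuilt from the sample in the question; the variable names `codes` and `uniques` are just illustrative. It assumes each combination containing NaN should count as its own key, as in the expected output.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks a missing brand or description.
df = pd.DataFrame({
    "brand": ["A", "A", "B", "C", np.nan, "A"],
    "description": ["icecream", "cheese", "catfood", np.nan, "icecream", "icecream"],
    "former_price": [1099.0, 469.0, 179.0, 699.0, 769.0, 769.0],
    "discounted_price": [855.0, 375.0, 119.0, 399.0, 549.0, 669.0],
})

# Build a MultiIndex from the two key columns and factorize it.
# factorize() returns (codes, uniques); codes are 0-based integers
# assigned in order of first appearance.
codes, uniques = df.set_index(["brand", "description"]).index.factorize()

# Shift to 1-based keys and put the new column first.
df.insert(loc=0, column="product_key", value=codes + 1)

print(df)
```

The `+ 1` is only cosmetic: it makes the keys start at 1 instead of 0, and can be dropped if zero-based keys are acceptable.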

edited Jul 15, 2020 at 15:53

answered Jul 15, 2020 at 15:43


1 Comment

eagerstudent Over a year ago

This looks great! I tested it on my dataset and it works flawlessly.

anky · Accepted Answer · 2020-07-15 15:38:15Z

4

With groupby + ngroup:

(df.fillna({'brand':'','description':''})
   .groupby(['brand','description'],sort=False).ngroup()+1)
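As a rough sketch of how this could be assigned back to the frame: the snippet below rebuilds just the two key columns from the question's sample and stores the result in `product_key`. The temporary `keys` frame is an illustrative name. The fillna step matters because groupby leaves rows with NaN keys out of the groups by default; the empty-string placeholder assumes that no real brand or description is an empty string.

```python
import numpy as np
import pandas as pd

# Key columns from the question's sample; NaN marks a missing value.
df = pd.DataFrame({
    "brand": ["A", "A", "B", "C", np.nan, "A"],
    "description": ["icecream", "cheese", "catfood", np.nan, "icecream", "icecream"],
})

# Fill NaN only in a temporary copy, so the original columns keep their NaNs
# while the missing combinations still form their own groups.
keys = df.fillna({"brand": "", "description": ""})

# sort=False numbers the groups in order of first appearance;
# ngroup() is 0-based, hence the + 1.
df["product_key"] = keys.groupby(["brand", "description"], sort=False).ngroup() + 1

print(df)
```

Because the result is a Series aligned on the original index, it can be assigned directly as a new column.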

answered Jul 15, 2020 at 15:38


2 Comments

eagerstudent Over a year ago

I tested this and it works! Just one question: why do you need to add +1 to it? Is it because the ngroup count starts from 0?

anky Over a year ago

@eagerstudent Exactly for that purpose. :)
