Pandas: How to give a unique id for a string when it appears in several rows?

Question

I have a pandas dataframe looking like this:

ner_id  art_id  ner
0       0       emmanuel macron
1       0       paris
2       0       france
3       1       paris
4       0       france

I would like to change the column 'ner_id'.

For example, paris appears in the articles with id 0 and id 1 (see the art_id column).

I would like to change only the ner_id column so that paris gets a single id instead of a different id in each row.

I want to do this every time a word is repeated in the ner column, so that the repeated word always gets the same id.

How can I do it?

Expected output:

ner_id  art_id  ner
0       0       emmanuel macron
1       0       paris
2       0       france
1       1       paris
2       0       france

I would like to give a term the id of its first occurrence every time it is repeated in later rows.

ner_id is already unique. Can you include your expected output for this example?
– Paul H, Commented Nov 17, 2020 at 17:43
I think what you want is to group by ner and assign ids by these groups. If your data frame is called df, you could try df['ner_id'] = df.groupby('ner').ngroup().
– jtorca, Commented Nov 17, 2020 at 17:45
Why can't the ner column itself serve as the unique identifier?
– Paul H, Commented Nov 17, 2020 at 17:48
I've included my comment in the solution below. If you can explain why it doesn't work for you, I can update it.
– jtorca, Commented Nov 17, 2020 at 17:53

jtorca · Accepted Answer · 2020-11-17 17:51:55Z

2

I'll just put my comment into an answer. This gives the same ID for the same word.

import pandas as pd

df = pd.DataFrame({'ner':['emmanuel macron', 'paris', 'france', 'paris', 'france']})

df['ner_id'] = df.groupby('ner').ngroup()

df

               ner  ner_id
0  emmanuel macron       0
1            paris       2
2           france       1
3            paris       2
4           france       1
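Note that the ids above follow the sorted order of the strings (france gets 1, paris gets 2), which is not quite the numbering shown in the question's expected output. A minimal sketch, assuming the ids should instead follow the order in which each term first appears: passing sort=False to groupby keeps the groups in first-seen order. The art_id values are copied from the question's sample data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "art_id": [0, 0, 0, 1, 0],
    "ner": ["emmanuel macron", "paris", "france", "paris", "france"],
})

# sort=False numbers the groups in order of first appearance
# rather than in sorted (alphabetical) order of the key.
df["ner_id"] = df.groupby("ner", sort=False).ngroup()

print(df[["ner_id", "art_id", "ner"]])
```

With this variant, paris should get id 1 and france id 2 in every row where they occur.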

answered Nov 17, 2020 at 17:51

jtorca

1 Comment

user13469785 Over a year ago

Thank you very much, I'm going to try it on my big dataframe.

Vivek Kalyanarangan · Accepted Answer · 2020-11-17 18:16:32Z

2

Use pandas.factorize:

df['ner_id'] = pd.factorize(df['ner'])[0]
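As a fuller sketch of the one-liner above, using the sample data from the question: pd.factorize returns a pair (codes, uniques), and by default the codes are assigned in order of first appearance, so the result lines up with the expected output in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "art_id": [0, 0, 0, 1, 0],
    "ner": ["emmanuel macron", "paris", "france", "paris", "france"],
})

# codes: one integer per row; uniques: the distinct values,
# positioned so that uniques[code] recovers the original string.
codes, uniques = pd.factorize(df["ner"])
df["ner_id"] = codes

print(df[["ner_id", "art_id", "ner"]])
print(list(uniques))
```

Keeping uniques around is optional, but it gives a ready-made lookup from id back to term.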

Timings

@jtorca's solution:

2.12 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This one:

460 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
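The timings above depend on the data and the machine. A rough sketch for reproducing the comparison on a larger frame, assuming synthetic data (the term_N strings, the row count, and the helper names with_ngroup and with_factorize are all made up for illustration):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 100,000 rows drawn from 1,000 distinct strings.
rng = np.random.default_rng(0)
words = np.array([f"term_{i}" for i in range(1_000)])
df = pd.DataFrame({"ner": rng.choice(words, size=100_000)})


def with_ngroup():
    # The groupby-based approach from the other answer.
    return df.groupby("ner").ngroup()


def with_factorize():
    # The factorize-based approach from this answer.
    return pd.factorize(df["ner"])[0]


# Average wall-clock time per call over a small number of repetitions.
for fn in (with_ngroup, with_factorize):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

The two functions number the terms differently (sorted order versus first appearance), but both assign exactly one id per distinct term.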

answered Nov 17, 2020 at 18:16

Vivek Kalyanarangan

2 Comments

user13469785 Over a year ago

Thank you very much @Vivek Kalyanarangan, it works too.

jtorca Over a year ago

Never knew about pd.factorize. Thanks for the tip. Should be the preferred solution in this scenario.
