I have a pandas dataframe looking like this:

ner_id  art_id  ner
0       0      emmanuel macron
1       0      paris
2       0      france
3       1      paris
4       0      france

I would like to change the column 'ner_id'.

For example, paris appears in the articles with id 0 and 1 (see the art_id column), and it currently has a different ner_id in each row.

I would like to change only the ner_id column so that paris gets a single id instead of several.

More generally, whenever a word is repeated in the ner column, every occurrence should get the same id.

How can I do this?

Expected output:

ner_id  art_id  ner
    0       0      emmanuel macron
    1       0      paris
    2       0      france
    1       1      paris
    2       0      france

Whenever a term is repeated in later rows, I would like to give it the id of its first occurrence.

  • ner_id is already unique. Can you include your expected output for this example? Commented Nov 17, 2020 at 17:43
  • I think what you want is to group by ner and assign ids by these groups. If your data frame is called df, you could try df['ner_id'] = df.groupby('ner').ngroup(). Commented Nov 17, 2020 at 17:45
  • @PaulH I updated my post Commented Nov 17, 2020 at 17:48
  • Why can't the ner column itself serve as the unique identifier? Commented Nov 17, 2020 at 17:48
  • I've included my comment in the solution below. If you can explain why it doesn't work for you I can update it. Commented Nov 17, 2020 at 17:53

2 Answers

I'll just put my comment into an answer. This gives the same ID to every occurrence of the same word.

import pandas as pd

df = pd.DataFrame({'ner':['emmanuel macron', 'paris', 'france', 'paris', 'france']})

df['ner_id'] = df.groupby('ner').ngroup()

df
               ner  ner_id
0  emmanuel macron       0
1            paris       2
2           france       1
3            paris       2
4           france       1
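
If the ids should follow the order of first appearance, as in the question's expected output, one option is to pass sort=False to groupby. A minimal sketch, assuming the toy data from the question (the art_id values are copied from the question, and the variable name ids is only for illustration):

```python
import pandas as pd

# Toy data copied from the question
df = pd.DataFrame({
    "art_id": [0, 0, 0, 1, 0],
    "ner": ["emmanuel macron", "paris", "france", "paris", "france"],
})

# sort=False numbers the groups in order of first appearance
# instead of sorting the group keys alphabetically
df["ner_id"] = df.groupby("ner", sort=False).ngroup()

ids = df["ner_id"].tolist()
print(df)
```

With the default sort=True the groups are numbered in sorted key order, which is why paris gets 2 and france gets 1 in the output above.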
1 Comment

Thank you very much I'm going to try on my big dataframe

Use pandas.factorize, which numbers the values in order of first appearance:

df['ner_id'] = pd.factorize(df['ner'])[0]
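
A slightly fuller sketch of the same call, assuming the toy data from the question. pd.factorize returns both the integer codes and the unique values, so the mapping from id back to word is available as well (the names codes and uniques are only for illustration):

```python
import pandas as pd

# Toy data copied from the question
df = pd.DataFrame({
    "art_id": [0, 0, 0, 1, 0],
    "ner": ["emmanuel macron", "paris", "france", "paris", "france"],
})

# codes: one integer per row; uniques: the distinct words,
# both in order of first appearance
codes, uniques = pd.factorize(df["ner"])
df["ner_id"] = codes

print(df)
print(list(uniques))
```

Here uniques[i] is the word that received id i, which can be handy for building a lookup table later.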

Timings

@jtorca's solution -

2.12 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This one -

460 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
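
These numbers depend on the machine and on the data size. A rough way to repeat the comparison, assuming a synthetic 100,000-row frame (the frame big and its contents are made up for illustration and are not the data used for the timings above):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 rows drawn from three words
rng = np.random.default_rng(0)
words = np.array(["emmanuel macron", "paris", "france"])
big = pd.DataFrame({"ner": rng.choice(words, size=100_000)})

# Total time for 20 runs of each approach, in seconds
t_groupby = timeit.timeit(lambda: big.groupby("ner").ngroup(), number=20)
t_factorize = timeit.timeit(lambda: pd.factorize(big["ner"])[0], number=20)

print(f"groupby/ngroup: {t_groupby:.4f} s")
print(f"factorize:      {t_factorize:.4f} s")
```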

2 Comments

Thank you very much @Vivek Kalyanarangan it works also
Never knew about pd.factorize. Thanks for the tip. Should be the preferred solution in this scenario.
