Replace comma-separated values in a dataframe with values from another dataframe

Question

This is my first question on Stack Overflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows:

df1

+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2        | Descr 1     |
| 3          | Descr 2     |
| 2,3,5      | Descr 3     |
+------------+-------------+

df2

+--------+--------------+
| Ref_ID |   ShortRef   |
+--------+--------------+
|      1 | Smith (2006) |
|      2 | Mike (2009)  |
|      3 | John (2014)  |
|      4 | Cole (2007)  |
|      5 | Jill (2019)  |
|      6 | Tom (2007)   |
+--------+--------------+

Basically, Ref_ID in df2 contains the IDs that make up the comma-separated string in the References field of df1.

What I would like to do is to replace values in the References field in df1 so it looks like this:

+-------------------------------------+-------------+
|             References              | Description |
+-------------------------------------+-------------+
| Smith (2006);Mike (2009)            | Descr 1     |
| John (2014)                         | Descr 2     |
| Mike (2009);John (2014);Jill (2019) | Descr 3     |
+-------------------------------------+-------------+

So far, I have only had to deal with columns and IDs in a 1-1 relationship, and for that case the approach from "Pandas - Replacing Values by Looking Up in an Another Dataframe" works perfectly.

But I cannot get my head around this slightly different problem. The only solution I could think of is to nest for loops and if checks that compare every string in df1 against df2 and make the substitution.

This would, I am afraid, be very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat this operation on several columns similar to the References one.

Is anyone willing to point me in the right direction?

Many thanks in advance.

EDIT: thanks for the hints, I am trying them out. One thing I am now struggling with is that some cells within "References" are empty.
– Alessio Rovere, Commented Jan 7, 2020 at 10:19

Scott Boston · Accepted Answer · 2020-01-06 18:47:48Z

3

Let's try this:

import pandas as pd

df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                       'Mike (2009)',
                                                       'John (2014)',
                                                       'Cole (2007)',
                                                       'Jill (2019)',
                                                       'Tom (2007)']})

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(list))

Output:

  Reference Description                                Reference2
0       1,2     Descr 1               [Smith (2006), Mike (2009)]
1         3     Descr 2                             [John (2014)]
2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

@Datanovice thanks for the update.

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(';'.join))

Output:

  Reference Description                            Reference2
0       1,2     Descr 1              Smith (2006);Mike (2009)
1         3     Descr 2                           John (2014)
2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)
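
The question's EDIT mentions that some "References" cells are empty. Here is a minimal sketch of the same split/explode/map idea that tolerates them. It assumes empty cells are stored as NaN/None and that IDs may carry stray spaces; the column name `References2` and the sample data are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", None, "2, 3,5"],   # second cell is empty
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Lookup Series keyed by the string form of Ref_ID, so it matches the split pieces.
lookup = df2.assign(Ref_ID=df2["Ref_ID"].astype(str)).set_index("Ref_ID")["ShortRef"]

mapped = (
    df1["References"]
    .dropna()              # skip empty cells entirely
    .str.split(",")
    .explode()
    .str.strip()           # tolerate "2, 3" as well as "2,3"
    .map(lookup)
    .dropna()              # drop IDs that have no match in df2
    .groupby(level=0)
    .agg(";".join)
)

# Assignment aligns on the index, so the skipped row comes back as NaN.
df1["References2"] = mapped
print(df1)
```

Rows whose cell was empty simply stay NaN in the new column; whether unmatched IDs should be dropped, as here, or flagged is a design choice that depends on the data.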

edited Jan 6, 2020 at 18:47

answered Jan 6, 2020 at 18:36

Scott Boston


9 Comments

Scott Boston Over a year ago

One gotcha is the dtypes: check that the split values from df1 Reference (strings) and df2 Ref_ID (integers) are the same type before mapping.

hchw Over a year ago

It doesn't appear the OP wants the [...] around the references; is there a way to get rid of that?

Umar.H Over a year ago

df1['Reference2'] = df1["References"].str.split(",").explode().astype(int).map(df2.set_index("Ref_ID")["ShortRef"]).groupby(level=0).agg(';'.join)

I think this sorts out the dtypes ;)

Alessio Rovere Over a year ago

This solution works perfectly! Thank you very much, I will add a pingback to this page in my code.

Alessio Rovere Over a year ago

After implementing the solution, I noticed that in some cases values are misplaced. E.g. in the case above, instead of finding only "John (2014)" related to Reference ID 3, I also find "Smith (2006)". But this happens only at specific records, e.g. other rows with Reference ID = 3 are substituted ok. Weird behavior...
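
One possible explanation for the misplaced values, which nothing in this thread confirms: groupby(level=0) groups by index label, so if df1's index contains repeated labels (for example after concatenating or filtering frames), separate rows collapse into a single group. The frame below is hypothetical and only illustrates that effect, along with a reset_index(drop=True) workaround.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})
lookup = df2.assign(Ref_ID=df2["Ref_ID"].astype(str)).set_index("Ref_ID")["ShortRef"]

# Hypothetical frame whose two rows share the index label 0.
df1 = pd.DataFrame(
    {"Reference": ["1", "3"], "Description": ["Descr 1", "Descr 2"]},
    index=[0, 0],
)

def replace_refs(frame):
    """Split, explode, map and re-join, grouping by index label."""
    return (frame["Reference"].str.split(",").explode().map(lookup)
            .groupby(level=0).agg(";".join))

merged = replace_refs(df1)                         # both rows fall into one group
fixed = replace_refs(df1.reset_index(drop=True))   # unique labels keep rows apart

print(merged.tolist())
print(fixed.tolist())
```

If `df1.index.is_unique` is already True for the real data, this is not the cause and something else is going on.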


hchw · Accepted Answer · 2020-01-06 18:35:45Z

3

You can use a list comprehension and dict lookups, and I don't think this will be too slow.

First, make a fast-to-access mapping from ID to short reference:

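# Cast Ref_ID to str so the keys match the string pieces produced by split(',')
mapping_dict = df2.assign(Ref_ID=df2['Ref_ID'].astype(str)).set_index('Ref_ID')['ShortRef'].to_dict()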

Then, let's split the references by commas:

df1_values = [v.split(',') for v in df1['References']]

Finally, we can iterate over the lists, do the dictionary lookups, and join the results back into strings:

df1['References'] = pd.Series([';'.join([mapping_dict[v] for v in values]) for values in df1_values], index=df1.index)

Is this usable or is it too slow?
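
Since the question mentions several columns like References and some empty cells, here is a hedged sketch that wraps the dict lookup in a helper and applies it column by column. The helper name `replace_ids`, the `OtherRefs` column, and the choice to keep unknown IDs unchanged are all assumptions for illustration, not something stated in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "OtherRefs": ["6", None, "4, 1"],     # hypothetical second column
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# String keys, so they match the pieces produced by str.split.
mapping = dict(zip(df2["Ref_ID"].astype(str), df2["ShortRef"]))

def replace_ids(cell, mapping, sep_in=",", sep_out=";"):
    """Translate one comma-separated cell; leave empty cells untouched."""
    if not isinstance(cell, str) or not cell.strip():
        return cell
    ids = (part.strip() for part in cell.split(sep_in))
    # Unknown IDs are kept as-is rather than raising KeyError.
    return sep_out.join(mapping.get(i, i) for i in ids)

for col in ["References", "OtherRefs"]:
    df1[col] = df1[col].map(lambda cell: replace_ids(cell, mapping))

print(df1)
```

The mapping is built once and reused for every column, so the per-column cost is one dictionary lookup per ID.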

answered Jan 6, 2020 at 18:35

hchw


2 Comments

Umar.H Over a year ago

This is really good, but you can use explode and str.split to get it done in fewer lines of code.

hchw Over a year ago

If speed is a concern, this may be faster though.

Andy L. · Accepted Answer · 2020-01-06 19:15:40Z

1

Another solution is to use str.get_dummies and dot:

df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
               .reset_index())

Out[462]:
  Description                           References
0     Descr 1             Smith (2006);Mike (2009)
1     Descr 2                          John (2014)
2     Descr 3  Mike (2009);John (2014);Jill (2019)

answered Jan 6, 2020 at 19:15

Andy L.


1 Comment

Alessio Rovere Over a year ago

This one works as well, and the error I mention above does not occur.
