
This is my first question on StackOverflow, so please pardon me if I am not clear enough. I usually find my answers here, but this time I had no luck. Maybe I am being dense, but here we go.

I have two pandas dataframes formatted as follows:

df1

+------------+-------------+
| References | Description |
+------------+-------------+
| 1,2        | Descr 1     |
| 3          | Descr 2     |
| 2,3,5      | Descr 3     |
+------------+-------------+

df2

+--------+--------------+
| Ref_ID |   ShortRef   |
+--------+--------------+
|      1 | Smith (2006) |
|      2 | Mike (2009)  |
|      3 | John (2014)  |
|      4 | Cole (2007)  |
|      5 | Jill (2019)  |
|      6 | Tom (2007)   |
+--------+--------------+

Basically, each value in the References field of df1 is a comma-separated string of IDs taken from the Ref_ID column of df2.

What I would like to do is to replace values in the References field in df1 so it looks like this:

+-------------------------------------+-------------+
|             References              | Description |
+-------------------------------------+-------------+
| Smith (2006);Mike (2009)            | Descr 1     |
| John (2014)                         | Descr 2     |
| Mike (2009);John (2014);Jill (2019) | Descr 3     |
+-------------------------------------+-------------+

So far I have only had to deal with columns and IDs in a 1-1 relationship, and for that the approach in "Pandas - Replacing Values by Looking Up in an Another Dataframe" works perfectly.

But I cannot get my head around this slightly different problem. The only solution I could think of is to nest for loops and if checks that compare every string in df1 against df2 and make the substitution.

I am afraid this would be very slow, as I have ca. 2000 unique Ref_IDs and I have to repeat the operation on several columns similar to the References one.

Is anyone willing to point me in the right direction?

Many thanks in advance.

  • EDIT: thanks for the hints, I am trying them out. One thing I am now struggling with is that some cells within "References" are empty. Commented Jan 7, 2020 at 10:19
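
For the empty cells mentioned in this edit, one possible way to cope is to treat missing or blank cells as "no references" before splitting. This is only a sketch: the helper name replace_ids, the extra empty rows, and the choice to leave unknown IDs untouched are assumptions for illustration, not something taken from the answers below.

```python
import pandas as pd

# Sample data from the question, plus two hypothetical empty cells.
df1 = pd.DataFrame({
    "References": ["1,2", None, "2,3,5", ""],
    "Description": ["Descr 1", "Descr 2", "Descr 3", "Descr 4"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Keys are strings so they match the pieces produced by str.split.
lookup = {str(k): v for k, v in zip(df2["Ref_ID"], df2["ShortRef"])}


def replace_ids(cell):
    """Return the joined short references, or "" for a missing/blank cell."""
    if pd.isna(cell) or not str(cell).strip():
        return ""
    ids = [part.strip() for part in str(cell).split(",")]
    # Assumption: IDs missing from df2 are kept as-is instead of raising.
    return ";".join(lookup.get(i, i) for i in ids if i)


df1["References"] = df1["References"].map(replace_ids)
print(df1)
```

Whether an empty cell should end up as "" or stay as NaN depends on what the rest of the pipeline expects.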

3 Answers


Let's try this:

import pandas as pd

df1 = pd.DataFrame({'Reference':['1,2','3','1,3,5'], 'Description':['Descr 1', 'Descr 2', 'Descr 3']})
df2 = pd.DataFrame({'Ref_ID':[1,2,3,4,5,6], 'ShortRef':['Smith (2006)',
                                                       'Mike (2009)',
                                                       'John (2014)',
                                                       'Cole (2007)',
                                                       'Jill (2019)',
                                                       'Tom (2007)']})

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(list))

Output:

  Reference Description                                Reference2
0       1,2     Descr 1               [Smith (2006), Mike (2009)]
1         3     Descr 2                             [John (2014)]
2     1,3,5     Descr 3  [Smith (2006), John (2014), Jill (2019)]

To get a single string instead of a list, aggregate with ';'.join (thanks @Datanovice for the update):

df1['Reference2'] = (df1['Reference'].str.split(',')
                                     .explode()
                                     .map(df2.assign(Ref_ID=df2.Ref_ID.astype(str))
                                             .set_index('Ref_ID')['ShortRef'])
                                     .groupby(level=0).agg(';'.join))

Output:

  Reference Description                            Reference2
0       1,2     Descr 1              Smith (2006);Mike (2009)
1         3     Descr 2                           John (2014)
2     1,3,5     Descr 3  Smith (2006);John (2014);Jill (2019)

9 Comments

One gotcha is to check that the dtypes of the df1 reference column and df2 Ref_ID match.
It doesn't appear the OP wants the [...] around the references; is there a way to get rid of that?
df1['Reference2'] = df1["References"].str.split(",").explode().astype(int).map( df2.set_index("Ref_ID")["ShortRef"] ).groupby(level=0).agg(';'.join) I think this sorts out the dtypes ;)
This solution works perfectly! Thank you very much, I will add a pingback to this page in my code.
After implementing the solution, I noticed that in some cases values are misplaced. E.g. in the case above, instead of finding only "John (2014)" for Reference ID 3, I also find "Smith (2006)". But this happens only at specific records; other rows with Reference ID = 3 are substituted correctly. Weird behavior...
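
One possible cause of the misplaced values reported in the last comment (an assumption, since the affected data is not shown) is a non-unique index in df1, for example after a concat or a filter. groupby(level=0) groups by index label, so rows that share a label get their references pooled together. The sketch below reproduces that with a hypothetical df1 whose two rows both carry label 0, then shows that resetting the index first avoids it; the helper name lookup_refs is made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Hypothetical df1 with a repeated index label (both rows are labelled 0).
df1 = pd.DataFrame(
    {"References": ["1,2", "3"], "Description": ["Descr 1", "Descr 2"]},
    index=[0, 0],
)

# Lookup Series keyed by the string form of Ref_ID.
short_ref = (df2.assign(Ref_ID=df2["Ref_ID"].astype(str))
                .set_index("Ref_ID")["ShortRef"])


def lookup_refs(frame):
    """Split, explode, map and re-join, grouping by index label."""
    return (frame["References"].str.split(",")
                               .explode()
                               .map(short_ref)
                               .groupby(level=0).agg(";".join))


# With the duplicated label, both rows collapse into one group.
merged = lookup_refs(df1)
print(merged)

# Resetting the index gives each row its own label before grouping.
fixed = df1.reset_index(drop=True)
fixed["References"] = lookup_refs(fixed)
print(fixed)
```

If df1.index.is_unique is already True on the real data, the cause lies elsewhere and this sketch does not apply.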

You can use list comprehensions and dict lookups, and I don't think this will be too slow.

First, build a fast-to-access mapping from ID to short reference:

mapping_dict = df2.set_index('Ref_ID')['ShortRef'].to_dict()

Then, let's split the references on commas:

df1_values = [v.split(',') for v in df1['References']]

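Finally, we can iterate over the lists and do the dictionary lookups before joining back into strings. The split pieces are strings while the Ref_ID keys are integers, so each piece is converted with int() first:

df1['References'] = [';'.join(mapping_dict[int(v)] for v in values) for values in df1_values]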

Is this usable or is it too slow?
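
Since the question mentions several columns similar to References, the same dictionary lookup could be wrapped in a small helper and applied per column. This is a rough sketch under some assumptions: the SeeAlso column and the ids_to_refs name are invented for illustration, no cell is empty, and every ID exists in df2. The keys are converted to strings up front so they match what str.split returns.

```python
import pandas as pd

df1 = pd.DataFrame({
    "References": ["1,2", "3", "2,3,5"],
    "SeeAlso": ["4", "5,6", "1"],  # hypothetical second ID column
    "Description": ["Descr 1", "Descr 2", "Descr 3"],
})
df2 = pd.DataFrame({
    "Ref_ID": [1, 2, 3, 4, 5, 6],
    "ShortRef": ["Smith (2006)", "Mike (2009)", "John (2014)",
                 "Cole (2007)", "Jill (2019)", "Tom (2007)"],
})

# Build the mapping once; string keys match the split pieces.
mapping_dict = {str(k): v for k, v in zip(df2["Ref_ID"], df2["ShortRef"])}


def ids_to_refs(column, mapping, sep_in=",", sep_out=";"):
    """Replace each comma-separated ID string with the joined short references."""
    return pd.Series(
        [sep_out.join(mapping[i.strip()] for i in cell.split(sep_in))
         for cell in column],
        index=column.index,  # keep the original index so assignment aligns
    )


for col in ["References", "SeeAlso"]:
    df1[col] = ids_to_refs(df1[col], mapping_dict)

print(df1)
```

Building the dictionary once and reusing it keeps the per-column cost to a single pass over the cells.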

2 Comments

This is really good, but you could use explode and str.split for fewer lines of code.
If speed is a concern, this may be faster though.

Another solution is to use str.get_dummies and dot:

df3 = (df1.set_index('Description').Reference.str.get_dummies(',')
          .reindex(columns=df2.Ref_ID.astype(str).values, fill_value=0))
df_final = (df3.dot(df2.ShortRef.values+';').str.strip(';').rename('References')
               .reset_index())

Output:
  Description                           References
0     Descr 1             Smith (2006);Mike (2009)
1     Descr 2                          John (2014)
2     Descr 3  Mike (2009);John (2014);Jill (2019)
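
One thing to keep in mind with the get_dummies approach is that it goes through a 0/1 indicator matrix, so the joined names follow the column order (here the df2 order) rather than the order of the IDs inside each cell, and repeated IDs collapse. The sketch below illustrates this with hypothetical cells "3,1" and "2,2,5"; it joins the names with a plain comprehension instead of dot, purely to make the indicator step visible.

```python
import pandas as pd

# Hypothetical cells: IDs out of order in the first, a repeated ID in the second.
refs = pd.Series(["3,1", "2,2,5"])

ids = ["1", "2", "3", "4", "5", "6"]
names = ["Smith (2006)", "Mike (2009)", "John (2014)",
         "Cole (2007)", "Jill (2019)", "Tom (2007)"]

# One column per ID, 1 where the cell contains that ID, 0 otherwise.
indicator = refs.str.get_dummies(",").reindex(columns=ids, fill_value=0)
print(indicator)

# Join the names whose flag is set, walking the columns in df2 order.
joined = [";".join(name for flag, name in zip(row, names) if flag)
          for row in indicator.to_numpy()]
print(joined)
```

If the original order of IDs within a cell matters, the split/explode or dictionary approaches preserve it.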

1 Comment

This one works as well, and the error I mentioned above does not occur.
