Modifying data frame with values from another data frame

Question

Assuming I've 2 data frames, df_a and df_b as below:

import pandas as pd

df_a

df_a = pd.DataFrame({"Letters": ['E', 'H', 'O', 'N', 'M', 'K', 'T', 'X'],
                     "Greek":['epsilon', 'eta', 'omicron', 'nu', 'mu', 
                              'kappa', 'tau', 'chi']})

     Greek Letters
0  epsilon       E
1      eta       H
2  omicron       O
3       nu       N
4       mu       M
5    kappa       K
6      tau       T
7      chi       X

df_b

df_b = pd.DataFrame({"Letters": ['Y', 'E', 'N', 'X', 'B']})
df_b["Greek"] = ""

  Letters Greek
0       Y      
1       E      
2       N      
3       X      
4       B

I want to fill in column 'Greek' in df_b using corresponding values from df_a and I can do that using nested loops as follows:

for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            df_b.loc[k, "Greek"]=df_a.loc[i, "Greek"]

df_b

Letters    Greek
0       Y         
1       E  epsilon
2       N       nu
3       X      chi
4       B

First, I'm wondering if I can achieve the same result more efficiently using merge, zip, join or the other concatenation functions. Second, if I provide a matching else to the if statement, as follows:

else: df_b.loc[k, "Greek"] = float('nan')

I expected this would fill in the empty cells with NaN and not modify the other cells. Instead, I get something like:

df_b

  Letters Greek
0       Y   NaN
1       E   NaN
2       N   NaN
3       X   chi
4       B   NaN

I appreciate your thoughts on the 2 points. Any additional feedback is welcome.

EdChum · Accepted Answer · 2014-12-25 17:39:04Z

2

The fastest method would be to set 'Letters' as the index for df_a and then call map:

In [11]:

df_a = df_a.set_index('Letters')
df_b['Greek'] = df_b['Letters'].map(df_a['Greek'])
df_b
Out[11]:
  Letters    Greek
0       Y      NaN
1       E  epsilon
2       N       nu
3       X      chi
4       B      NaN

Just to demonstrate the flaw in your loop it is worth printing out what happens:

In [17]:

for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            print("match i:", i, "k:", k, "letter:", temp2)
        else:
            print("no match i:", i, "k:", k, "letter:", temp2)
no match i: 0 k: 0 letter: Y
match i: 0 k: 1 letter: E
no match i: 0 k: 2 letter: N
no match i: 0 k: 3 letter: X
no match i: 0 k: 4 letter: B
no match i: 1 k: 0 letter: Y
no match i: 1 k: 1 letter: E
no match i: 1 k: 2 letter: N
no match i: 1 k: 3 letter: X
no match i: 1 k: 4 letter: B
no match i: 2 k: 0 letter: Y
no match i: 2 k: 1 letter: E
no match i: 2 k: 2 letter: N
no match i: 2 k: 3 letter: X
no match i: 2 k: 4 letter: B
no match i: 3 k: 0 letter: Y
no match i: 3 k: 1 letter: E
match i: 3 k: 2 letter: N
no match i: 3 k: 3 letter: X
no match i: 3 k: 4 letter: B
no match i: 4 k: 0 letter: Y
no match i: 4 k: 1 letter: E
no match i: 4 k: 2 letter: N
no match i: 4 k: 3 letter: X
no match i: 4 k: 4 letter: B
no match i: 5 k: 0 letter: Y
no match i: 5 k: 1 letter: E
no match i: 5 k: 2 letter: N
no match i: 5 k: 3 letter: X
no match i: 5 k: 4 letter: B
no match i: 6 k: 0 letter: Y
no match i: 6 k: 1 letter: E
no match i: 6 k: 2 letter: N
no match i: 6 k: 3 letter: X
no match i: 6 k: 4 letter: B
no match i: 7 k: 0 letter: Y
no match i: 7 k: 1 letter: E
no match i: 7 k: 2 letter: N
match i: 7 k: 3 letter: X
no match i: 7 k: 4 letter: B

So although you make the initial match, you loop over those rows again and blitz them with NaN rather than skipping them.

Timings

In [22]:

df_a = df_a.set_index('Letters')
%timeit df_b['Greek'] = df_b['Letters'].map(df_a['Greek'])

1000 loops, best of 3: 710 µs per loop
In [24]:

%%timeit 
for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            df_b.loc[k, "Greek"]=df_a.loc[i, "Greek"]
100 loops, best of 3: 12.7 ms per loop

Calling map here is nearly 18 times faster, this is a vectorised function and will scale much better.

edited Dec 25, 2014 at 17:39

answered Dec 25, 2014 at 16:40

EdChum

397k204 gold badges837 silver badges583 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

sedeh Over a year ago

Thanks greatly. I see clearly now, esp. with the printouts. In terms of efficiency in this given example, how does the nested loops and map stack? Perhaps you can talk in terms of big O.

EdChum Over a year ago

Well your solution is O2 as you loop over 2 loops, this I suspect will be between On and nlogn, map is a vectorised function so will operate on the whole array rather than a row value at a time

sedeh Over a year ago

Thanks for the time analysis! I applied map in my actual dataset and for some reason my target column is not being filled in. I've something like this df_1 = df_1.set_index("Visit"); df_2['Result'] = df_2['Visit'].map(df_1['Result']). Not that you can glean much from that. Any random thought why the Result column in df_2 might come in as blank? Before applying map, I ensured 'Visit' columns in both data frames contain unique values using groupby.

EdChum Over a year ago

Try printing each line to see what is being returned, also does it work when running your loop code? One possibility is you have no matches or you may have done something prior to the operation but I can't guess what

sedeh Over a year ago

Oops, an extra whitespace somehow maybe into one of the rows. Once I corrected that, the values came roaring in as supposed.

Collectives™ on Stack Overflow

Modifying data frame with values from another data frame

1 Answer 1

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related