0

Assuming I've 2 data frames, df_a and df_b as below:

import pandas as pd

df_a

df_a = pd.DataFrame({"Letters": ['E', 'H', 'O', 'N', 'M', 'K', 'T', 'X'],
                     "Greek":['epsilon', 'eta', 'omicron', 'nu', 'mu', 
                              'kappa', 'tau', 'chi']})

     Greek Letters
0  epsilon       E
1      eta       H
2  omicron       O
3       nu       N
4       mu       M
5    kappa       K
6      tau       T
7      chi       X

df_b

df_b = pd.DataFrame({"Letters": ['Y', 'E', 'N', 'X', 'B']})
df_b["Greek"] = ""

  Letters Greek
0       Y      
1       E      
2       N      
3       X      
4       B      

I want to fill in column 'Greek' in df_b using corresponding values from df_a and I can do that using nested loops as follows:

for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            df_b.loc[k, "Greek"]=df_a.loc[i, "Greek"]

df_b

Letters    Greek
0       Y         
1       E  epsilon
2       N       nu
3       X      chi
4       B     

First, I'm wondering if I can achieve the same result more efficiently using merge, zip, join or the other concatenation functions. Second, if I provide a matching else to the if statement, as follows:

else: df_b.loc[k, "Greek"] = float('nan')

I expected this would fill in the empty cells with NaN and not modify the other cells. Instead, I get something like:

df_b

  Letters Greek
0       Y   NaN
1       E   NaN
2       N   NaN
3       X   chi
4       B   NaN

I appreciate your thoughts on the 2 points. Any additional feedback is welcome.

0

1 Answer 1

2

The fastest method would be to set 'Letters' as the index for df_a and then call map:

In [11]:

df_a = df_a.set_index('Letters')
df_b['Greek'] = df_b['Letters'].map(df_a['Greek'])
df_b
Out[11]:
  Letters    Greek
0       Y      NaN
1       E  epsilon
2       N       nu
3       X      chi
4       B      NaN

Just to demonstrate the flaw in your loop it is worth printing out what happens:

In [17]:

for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            print("match i:", i, "k:", k, "letter:", temp2)
        else:
            print("no match i:", i, "k:", k, "letter:", temp2)
no match i: 0 k: 0 letter: Y
match i: 0 k: 1 letter: E
no match i: 0 k: 2 letter: N
no match i: 0 k: 3 letter: X
no match i: 0 k: 4 letter: B
no match i: 1 k: 0 letter: Y
no match i: 1 k: 1 letter: E
no match i: 1 k: 2 letter: N
no match i: 1 k: 3 letter: X
no match i: 1 k: 4 letter: B
no match i: 2 k: 0 letter: Y
no match i: 2 k: 1 letter: E
no match i: 2 k: 2 letter: N
no match i: 2 k: 3 letter: X
no match i: 2 k: 4 letter: B
no match i: 3 k: 0 letter: Y
no match i: 3 k: 1 letter: E
match i: 3 k: 2 letter: N
no match i: 3 k: 3 letter: X
no match i: 3 k: 4 letter: B
no match i: 4 k: 0 letter: Y
no match i: 4 k: 1 letter: E
no match i: 4 k: 2 letter: N
no match i: 4 k: 3 letter: X
no match i: 4 k: 4 letter: B
no match i: 5 k: 0 letter: Y
no match i: 5 k: 1 letter: E
no match i: 5 k: 2 letter: N
no match i: 5 k: 3 letter: X
no match i: 5 k: 4 letter: B
no match i: 6 k: 0 letter: Y
no match i: 6 k: 1 letter: E
no match i: 6 k: 2 letter: N
no match i: 6 k: 3 letter: X
no match i: 6 k: 4 letter: B
no match i: 7 k: 0 letter: Y
no match i: 7 k: 1 letter: E
no match i: 7 k: 2 letter: N
match i: 7 k: 3 letter: X
no match i: 7 k: 4 letter: B

So although you make the initial match, you loop over those rows again and blitz them with NaN rather than skipping them.

Timings

In [22]:

df_a = df_a.set_index('Letters')
%timeit df_b['Greek'] = df_b['Letters'].map(df_a['Greek'])

1000 loops, best of 3: 710 µs per loop
In [24]:

%%timeit 
for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            df_b.loc[k, "Greek"]=df_a.loc[i, "Greek"]
100 loops, best of 3: 12.7 ms per loop

Calling map here is nearly 18 times faster, this is a vectorised function and will scale much better.

Sign up to request clarification or add additional context in comments.

5 Comments

Thanks greatly. I see clearly now, esp. with the printouts. In terms of efficiency in this given example, how does the nested loops and map stack? Perhaps you can talk in terms of big O.
Well your solution is O2 as you loop over 2 loops, this I suspect will be between On and nlogn, map is a vectorised function so will operate on the whole array rather than a row value at a time
Thanks for the time analysis! I applied map in my actual dataset and for some reason my target column is not being filled in. I've something like this df_1 = df_1.set_index("Visit"); df_2['Result'] = df_2['Visit'].map(df_1['Result']). Not that you can glean much from that. Any random thought why the Result column in df_2 might come in as blank? Before applying map, I ensured 'Visit' columns in both data frames contain unique values using groupby.
Try printing each line to see what is being returned, also does it work when running your loop code? One possibility is you have no matches or you may have done something prior to the operation but I can't guess what
Oops, an extra whitespace somehow maybe into one of the rows. Once I corrected that, the values came roaring in as supposed.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.