5

I'm wondering if there is a more efficient way to do an "index & match" type function that is popular in excel. For example - given two pandas DataFrames, update the df_1 with information found in df_2:

import pandas as pd

df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})    
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})

I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".

Below is the most efficient method that I could come up with. There has to be a better way!

name_a = []
name_b = []
for i in range(len(df_1)):

    name_a.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_b.iloc[i]].index[0]])

df_1['name_a'] = name_a
df_1['name_b'] = name_b

Resulting in:

>>> df_1.head()
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

3 Answers 3

5

High Level

  • Create a dictionary to use in a replace
  • replace, rename columns, and join

m = dict(zip(
    df_2.num.values.tolist(),
    df_2.name.values.tolist()
))

df_1.join(
    df_1.replace(m).rename(
        columns=lambda x: x.replace('num', 'name')
    )
)

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      5      c

Breakdown

replace with a dictionary should be pretty quick. There are bunch of ways to build a dictionary form df_2. As a matter of fact we could have used a pd.Series. I chose to build with dict and zip because I find that it's faster.

Building m

Option 1

m = df_2.set_index('num').name

Option 2

m = df_2.set_index('num').name.to_dict()

Option 3

m = dict(zip(df_2.num, df_2.name))

Option 4 (My Choice)

m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

m build times

1000 loops, best of 3: 325 µs per loop
1000 loops, best of 3: 376 µs per loop
10000 loops, best of 3: 32.9 µs per loop
100000 loops, best of 3: 10.4 µs per loop

%timeit df_2.set_index('num').name
%timeit df_2.set_index('num').name.to_dict()
%timeit dict(zip(df_2.num, df_2.name))
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

Replacing num

Again, we have choices, here are a few and their times.

%timeit df_1.replace(m)
%timeit df_1.applymap(lambda x: m.get(x, x))
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()

1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 959 µs per loop
1000 loops, best of 3: 925 µs per loop

I choose...

df_1.replace(m)

  num_a num_b
0     a     b
1     b     d
2     c     a
3     d     b
4     5     c

Rename columns

df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))

  name_a name_b   <-- note the column name change
0      a      b
1      b      d
2      c      a
3      d      b
4      5      c

Join

df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      5      c
Sign up to request clarification or add additional context in comments.

Comments

3

I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic vlookup. You can simulate this in pandas by using Series.map.

name_map = dict(df_2.set_index('num').name)

df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)

df_1

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.

Comments

0

Just try a conditional statement:

import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})    
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})
df_1["name_a"] = df_2["num_b"]
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"]-1]) 
print(df_1)

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.