I am trying to find the most efficient way to search a pandas DataFrame for a list (or DataFrame) of other values without resorting to brute force. Is there a way to vectorize it? I know I can loop over each element of the list (or DataFrame) and extract the data with the loc method, but I was hoping for something faster. I have a DataFrame with 1 million rows and need to search it to extract the index of 600,000 rows.

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'WholeList': np.round(1000000*(np.random.rand(1000000)),0)})
df2 = pd.DataFrame({'ThingsToFind': np.arange(50000)+50000})
df.loc[1:10,:]
#Edited, now that I think about it, the 'arange' method would have been better to populate the arrays.

I want the most efficient way to get the index of df2 in df, where it exists in df.

Thanks!

  • So, the output would be of length 1 million? Commented Apr 5, 2017 at 22:06
  • Also, what should be output if there isn't a match of df2 in df? Commented Apr 5, 2017 at 22:12
  • Did you try to use the isin() DataFrame method? Commented Apr 5, 2017 at 22:18
  • Either length would be ok now that I think about it. Commented Apr 6, 2017 at 1:37
  • @Andrew L I've mainly tried brute forcing through the loc method, but I assumed this is the most time intensive way to do it. Commented Apr 6, 2017 at 1:47

3 Answers

Pandas dataframes have an isin() method that works really well:

df[df.WholeList.isin(df2.ThingsToFind)]

It seems reasonably performant on my MBP:

CPU times: user 3 µs, sys: 5 µs, total: 8 µs
Wall time: 11 µs
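If the goal is the set of rows in df whose values appear in df2, the boolean mask from isin() can be turned into index labels or integer positions directly. This is a minimal sketch on small made-up frames (the sample values are assumptions chosen for illustration, not the question's data). Note that it reports which rows of df matched, not where each value sits in df2.

```python
import numpy as np
import pandas as pd

# Small illustrative frames (assumed values, not the 1-million-row data).
df = pd.DataFrame({"WholeList": [3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7]})
df2 = pd.DataFrame({"ThingsToFind": [2, 3, 5, 6, 7, 8, 9]})

# True for every row of df whose value occurs anywhere in df2.
mask = df["WholeList"].isin(df2["ThingsToFind"])

# Index labels of the matching rows of df.
matching_labels = df.index[mask]

# Integer positions of the matching rows, independent of the index labels.
matching_positions = np.flatnonzero(mask.to_numpy())

print(list(matching_labels))
print(matching_positions)
```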

4 Comments

But, we need to get the indexes of df2 corresponding to the matches, right?
I guess I don't understand what you mean. There's no explicit index in df2. Are you looking for the row-number index? That's simply df[df.WholeList.isin(df2.ThingsToFind)].index
@Divaker I would say that as long as I can easily have it where WholeList(index our function would provide)=ThingsToFind, then I'd be happy. I'm thinking of a MATLAB command that I love and trying to implement it in Python. Sorry if this is a newbie question, but I'm only in month 2 of the language.
I'll give isin a try. Thanks!

Here's an approach with np.searchsorted as it seems the second dataframe has its elements sorted and unique -

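def find_index(a, b, invalid_specifier=-1):
    # b must be sorted and unique; for each element of a, find its slot in b
    idx = np.searchsorted(b, a)
    # Slots past the end of b cannot be matches; clamp so b[idx] stays in bounds
    idx[idx == b.size] = 0
    # Where the value at the slot differs from a, the element was not found
    idx[b[idx] != a] = invalid_specifier
    return idx

def process_dfs(df, df2):
    # Work on the underlying 1-D NumPy arrays of the two columns
    a = df.WholeList.values.ravel()
    b = df2.ThingsToFind.values.ravel()
    return find_index(a, b, invalid_specifier=-1)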

Sample run on arrays -

In [200]: a
Out[200]: array([ 3,  5,  8,  4,  3,  2,  5,  2, 12,  6,  3,  7])

In [201]: b
Out[201]: array([2, 3, 5, 6, 7, 8, 9])

In [202]: find_index(a,b, invalid_specifier=-1)
Out[202]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
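The function above relies on b being sorted. If that cannot be guaranteed, one possible variant passes a sort order to np.searchsorted through its sorter argument and maps the result back to positions in the original b. The sketch below is an illustration, not part of the original answer: the name find_index_unsorted and the sample arrays are hypothetical, and it assumes b is non-empty and has unique values.

```python
import numpy as np

def find_index_unsorted(a, b, invalid_specifier=-1):
    """For each element of a, return its position in b, or invalid_specifier.

    Hypothetical helper: assumes b is non-empty with unique values,
    but does not require b to be sorted.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # order[k] is the position in b of the k-th smallest value.
    order = np.argsort(b, kind="stable")
    # Search in the sorted view of b without materialising it.
    pos = np.searchsorted(b, a, sorter=order)
    # Slots past the end cannot be matches; clamp to keep indexing valid.
    pos[pos == b.size] = 0
    # Translate sorted-order slots back to positions in the original b.
    candidate = order[pos]
    return np.where(b[candidate] == a, candidate, invalid_specifier)

a = np.array([3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7])
b = np.array([9, 2, 7, 3, 8, 5, 6])  # same values as the sample b, shuffled
result = find_index_unsorted(a, b)
print(result)
```

The extra argsort costs O(n log n) on b, so when b is already known to be sorted the original function avoids that work.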

Sample run on dataframes -

In [188]: df
Out[188]: 
    WholeList
0           3
1           5
2           8
3           4
4           3
5           2
6           5
7           2
8          12
9           6
10          3
11          7

In [189]: df2
Out[189]: 
   ThingsToFind
0             2
1             3
2             5
3             6
4             7
5             8
6             9

In [190]: process_dfs(df, df2)
Out[190]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
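For comparison, pandas offers Index.get_indexer, which also returns the position of each target value in an index and uses -1 for values that are missing. The sketch below rebuilds the sample frames shown above and is only an alternative illustration; it was not part of the original answer and no timing is claimed for it. It assumes the values in df2 are unique, since get_indexer is only valid on a uniquely valued index.

```python
import pandas as pd

# Same sample data as in the run above.
df = pd.DataFrame({"WholeList": [3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7]})
df2 = pd.DataFrame({"ThingsToFind": [2, 3, 5, 6, 7, 8, 9]})

# Build an index from the values to find (assumed unique), then look up
# every value of df in it; missing values come back as -1.
lookup = pd.Index(df2["ThingsToFind"])
positions = lookup.get_indexer(df["WholeList"])
print(positions)
```

Unlike the searchsorted approach, this does not need df2 to be sorted.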

2 Comments

Thanks! This is an interesting approach.
This worked beautifully. isin() didn't give me what I wanted, but this was brilliant. Thanks!

I agree with @JDLong - IMO Pandas is pretty fast:

In [49]: %timeit df[df.WholeList.isin(df2.ThingsToFind)]
1 loop, best of 3: 819 ms per loop

In [50]: %timeit df.loc[df.WholeList.isin(df2.ThingsToFind)]
1 loop, best of 3: 814 ms per loop

In [51]: %timeit df.query("WholeList in @df2.ThingsToFind")
1 loop, best of 3: 837 ms per loop
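The three expressions timed above are different spellings of the same row selection. A small sketch on made-up data (the sample values and the local name targets are assumptions for illustration) can confirm that they return identical frames:

```python
import pandas as pd

# Small illustrative frames (assumed values, not the benchmark data).
df = pd.DataFrame({"WholeList": [3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7]})
df2 = pd.DataFrame({"ThingsToFind": [2, 3, 5, 6, 7, 8, 9]})

# Local variable so the query string can refer to it with the @ prefix.
targets = df2["ThingsToFind"]

by_mask = df[df["WholeList"].isin(targets)]     # boolean indexing
by_loc = df.loc[df["WholeList"].isin(targets)]  # same mask through .loc
by_query = df.query("WholeList in @targets")    # query-string form

same = by_mask.equals(by_loc) and by_mask.equals(by_query)
print(same)
```

Since all three reduce to the same membership test, similar timings are what one would expect.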

1 Comment

Thanks. I assumed there were other approaches than the brute force + loc method. I'll give this a try.
