Python Pandas Dataframe - Optimize Search for id in another Dataframe

Question

I have two dataframes, orders and customers.

For each order, I want to find the row in the customers dataframe whose linkedCustomers column contains the order's custId. The linkedCustomers field is an array of customer ids.

The orders dataframe contains approximately 5,800,000 rows. The customers dataframe contains approximately 180,000 rows.

The following code runs but is very slow. How can I speed it up?


import pandas as pd

# demo data -- In the real scenario this data was read from csv-/json files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7], 'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})


def getMergeCustomerID(row):
    customerOrderId = row['custId']
    # Scan every customer row for a linkedCustomers set containing this id.
    # (.str.contains does not work on sets, so test membership per set.)
    isLinked = customers['linkedCustomers'].apply(lambda linked: customerOrderId in linked)
    searchMasterCustomer = customers.loc[isLinked, 'id']
    if len(searchMasterCustomer) > 0:
        # Return the scalar id, not the whole matching Series.
        return searchMasterCustomer.iloc[0]
    else:
        return customerOrderId

# Runs once per order row, so the cost grows with len(orders) * len(customers).
orders['newId'] = orders.apply(getMergeCustomerID, axis=1)


# expected result
custId  orderId  newId
     1        2      5
     2        3      5
     3        4      7
     4        5      6

@ErikK I have a similar problem: I have to merge two tables, matching each Id of the first against the largest among the smaller Ids of the second. Do you know any way to do it in linear/reasonable time? [Would you prefer that I open a separate question?] In my case the Ids are timestamps and are ordered; I want to match each row of table 1 with the last measurement taken and stored in table 2. The naive match is O(N*M); a decent algorithm should be O(N+M).
– jimifiki, Commented Jan 23, 2020 at 9:57
@jimifiki If these tables originate in a database, why not perform the join there and then export to pandas? pandas is super slow compared to databases, which are already optimised for this kind of work and written in C.
– Erik K, Commented Jan 23, 2020 at 11:42
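
For the ordered-timestamp case raised in the comment above, one option that stays inside pandas is pd.merge_asof, which matches each left row to the nearest earlier key in a sorted right table. The sketch below is only an illustration: the tables events and measurements and their column names are hypothetical, and allow_exact_matches=False is chosen on the assumption that "largest among the smaller Ids" means strictly smaller.

```python
import pandas as pd

# Hypothetical tables; merge_asof requires both to be sorted by their key column.
events = pd.DataFrame({'t': [3, 7, 12], 'event': ['a', 'b', 'c']})
measurements = pd.DataFrame({'t': [1, 5, 6, 10], 'value': [0.1, 0.5, 0.6, 1.0]})

matched = pd.merge_asof(
    events,
    measurements.rename(columns={'t': 't_measured'}),
    left_on='t',
    right_on='t_measured',
    direction='backward',        # look for keys at or before the left key
    allow_exact_matches=False,   # assumption: require a strictly smaller key
)
print(matched)
```

Each row of events ends up paired with the last measurement whose timestamp is strictly below its own; rows with no earlier measurement would get NaN in the right-hand columns.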

jimifiki · Accepted Answer · 2020-01-23 09:45:42Z

In some circumstances this approach can solve your problem. Build a dictionary first that maps each linked customer id to the id of the customer row that lists it:

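# Map every linked customer id to the id of the customer row that lists it.
myDict = {}
for masterId, linkedIds in zip(customers['id'], customers['linkedCustomers']):
    for linkedId in linkedIds:
        myDict[linkedId] = masterId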

then use the dictionary to create the new column:

orders['newId'] = [myDict[i] for i in orders['custId']]

IMO, even though this can speed up your program, it is not the most general solution. Better answers are welcome!
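
As one possible variation on the dictionary idea, the lookup table can also be built without an explicit Python loop by exploding the linkedCustomers column into one row per linked id. This is a minimal sketch on the question's demo data, not a benchmarked solution; the names pairs and lookup are introduced here for illustration, and it assumes every customer id appears in at most one linkedCustomers set (Series.map needs a unique index).

```python
import pandas as pd

orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

# One row per (master id, linked id) pair. The sets are turned into sorted
# lists first so that explode() produces a deterministic row order.
pairs = (customers
         .assign(linkedCustomers=customers['linkedCustomers'].map(sorted))
         .explode('linkedCustomers'))

# Series indexed by linked customer id whose values are the master ids.
lookup = pd.Series(pairs['id'].to_numpy(),
                   index=pairs['linkedCustomers'].astype('int64').to_numpy())

# Vectorized lookup: each custId is replaced by its master id.
orders['newId'] = orders['custId'].map(lookup)
print(orders)
```

On the demo data every custId has a master customer, so no missing values appear; ids without a match would come back as NaN and need a fallback.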


1 Comment

user8606929 Over a year ago

Thanks for your help. I modified your code a bit: orders['newId'] = [myDict[i] if i in myDict.keys() else None for i in orders['id']]. Otherwise I got an error because the value length was not equal.
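
The missing-key handling described in this comment can also be expressed with Series.map, which leaves NaN where the dictionary has no entry. The sketch below falls back to the original custId, as the question's own function does, instead of None. The demo row with custId 10 is hypothetical and added only to exercise the fallback; the dictionary is written out by hand to keep the example self-contained.

```python
import pandas as pd

# Hypothetical demo: custId 10 is not linked to any master customer.
orders = pd.DataFrame({'custId': [1, 4, 10], 'orderId': [2, 5, 6]})
myDict = {1: 5, 2: 5, 4: 6, 5: 6, 6: 6, 3: 7, 7: 7, 8: 7, 9: 7}

# map() yields NaN for ids missing from the dictionary ...
mapped = orders['custId'].map(myDict)

# ... which are then replaced by the order's own custId.
orders['newId'] = mapped.fillna(orders['custId']).astype('int64')
print(orders)
```

The final astype is needed because the intermediate NaN values turn the column into floats.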
