
Consider the following scenario.

I have two DataFrames called orders and customers.

For each order, I want to find the customer whose linkedCustomers column contains the order's custId. The linkedCustomers field is an array of customer IDs.

The orders DataFrame contains approximately 5,800,000 rows. The customers DataFrame contains approximately 180,000 rows.

The following code runs, but it is very slow. How can I speed it up?


import pandas as pd

# Demo data. In the real scenario this data was read from CSV/JSON files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7], 'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})


def getMergeCustomerID(row):
    customerOrderId = row['custId']
    # Membership test on each set; .str.contains only works on string values.
    # This scans every customer row once per order, which is why it is slow.
    isLinked = customers['linkedCustomers'].apply(lambda linked: customerOrderId in linked)
    searchMasterCustomer = customers.loc[isLinked, 'id']
    if len(searchMasterCustomer) > 0:
        # Return the first matching id as a scalar rather than a Series
        return searchMasterCustomer.iloc[0]
    else:
        return customerOrderId

orders['newId'] = orders.apply(getMergeCustomerID, axis=1)


# expected result
  custId  orderId  newId
   1        2        5
   2        3        5
   3        4        7
   4        5        6

  • Why not just pd.merge the two tables and be done with it? Commented Jan 23, 2020 at 9:43
  • @ErikK I have a similar problem: I have to merge two tables, matching each ID of the first against the largest of the smaller IDs in the second table. Do you know any way to do it in linear/reasonable time? [Would you prefer that I open a separate question?] In my case the IDs are timestamps and are ordered; I want to match each row of table 1 with the most recent measurement stored in table 2. The naive match is O(N*M); a decent algorithm should be O(N+M). Commented Jan 23, 2020 at 9:57
  • @jimifiki If these tables originate in a database, why not perform the join there and then export to pandas? pandas is very slow compared to databases, which are already optimised for this kind of work and written in C. Commented Jan 23, 2020 at 11:42
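
The pd.merge suggestion from the comments could be sketched roughly as follows, assuming the demo data from the question. The idea is to flatten linkedCustomers into one row per (master id, linked id) pair and then left-join it onto orders. The names links and merged are hypothetical and not part of the original thread.

```python
import pandas as pd

orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

# One row per (master id, linked id) pair. The sets are converted to lists
# first so that explode() treats each element as a separate row.
links = (
    customers.assign(linkedCustomers=customers['linkedCustomers'].map(list))
    .explode('linkedCustomers')
    .rename(columns={'id': 'newId', 'linkedCustomers': 'custId'})
)
# explode() leaves an object column; cast it so the join keys share a dtype.
links['custId'] = links['custId'].astype('int64')

# Left join keeps every order; orders without a master get NaN in newId.
merged = orders.merge(links, on='custId', how='left')

# Fall back to the original custId where no master customer was found.
merged['newId'] = merged['newId'].fillna(merged['custId']).astype('int64')
```

If a customer ID appears in more than one linkedCustomers set, the join produces one row per match, so the links table would need de-duplicating on custId first.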

1 Answer


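In some circumstances this approach can solve your problem. First, build a dictionary that maps each linked customer ID to the id of the customer row it belongs to:

myDict = {}
# Iterate over the two columns directly instead of indexing rows by position
for masterId, linked in zip(customers['id'], customers['linkedCustomers']):
    for linkedId in linked:
        myDict[linkedId] = masterId

Then use the dictionary to create the new column:

# Note: raises KeyError if a custId has no entry in myDict
orders['newId'] = [myDict[i] for i in orders['custId']]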
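
As a hedged sketch of the same idea, the dictionary lookup can also be written with Series.map, which yields NaN for IDs that have no entry, so those can be filled with the original custId as the question asks. This assumes the demo data from the question; the name linked_to_master is hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

# Build the linked-ID -> master-ID lookup once (hypothetical name).
# If an ID occurs in several sets, the last customer row wins.
linked_to_master = {
    linked_id: master_id
    for master_id, linked in zip(customers['id'], customers['linkedCustomers'])
    for linked_id in linked
}

# map() gives NaN for IDs without a master; fall back to the original custId.
orders['newId'] = (
    orders['custId'].map(linked_to_master).fillna(orders['custId']).astype(int)
)
```

Building the dictionary is a single pass over the customers, and each order then costs one hash lookup, instead of scanning all customers for every order.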

In my opinion, although this can speed up your program, it is not the most general solution. Better answers are welcome!


1 Comment

Thanks for your help. I modified your code a bit: orders['newId'] = [myDict[i] if i in myDict.keys() else None for i in orders['id']]. Otherwise I got an error because the value lengths were not equal.
