The following scenario is given.
I have 2 dataframes called orders and customers.
I want to look where the CustomerID from the OrderDataFrame is in the LinkedCustomer column of the Customer Dataframe. The LinkedCustomers field is an array of CustomerIds.
The orders dataframe contains approximately 5.800.000 items. The customer dataframe contains approximately 180 000 items.
I am looking for a way to optimize the following code, because this code runs but is very slow. How can I speed this up?
# demo data -- In the real scenario this data was read from csv-/json files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2,3,4,5]})
customers = pd.DataFrame({'id':[5,6,7], 'linkedCustomers': [{1,2}, {4,5,6}, {3, 7, 8, 9}]})
def getMergeCustomerID(row):
customerOrderId = row['custId']
searchMasterCustomer = customers[customers['linkedCustomers'].str.contains(str(customerOrderId))]
searchMasterCustomer = searchMasterCustomer['id']
if len(searchMasterCustomer) > 0:
return searchMasterCustomer
else:
return customerOrderId
orders['newId'] = orders.apply(lambda x: getMergeCustomerID(x), axis=1)
# expected result
custId orderId newId
1 2 5
2 3 5
3 4 7
4 5 6
pd.mergethe two tables and be done with it