
Hello, I'm trying to merge two data frames and sum visit counts by date and upc.

  1. Transaction data (date, upc, sales): 200k rows x 3 columns
  2. Visit counts (date, upc, visit count): 2 million+ rows x 3 columns

I’ve tried this

df3 = pandas.merge(df1,df2, on = ['upc','date'], how = 'left') 

Result: the merge executes, but it does not sum by date or upc.

I also tried

df3 = pandas.merge(df1, df2, left_on=['date'], right_on=['upc'], how='left')

and that didn’t work.

df3 = pandas.merge(df1, df2, left_on=['date', 'upc'], right_on=['date', 'upc'], how='left')

and that didn’t work.

I also tried

df3 = pandas.merge(df1,df2, on = ['date'], how = 'left')

and I kept getting an error message. Based on the error message, it looked like I needed to convert the date column in one of the data frames to a pandas datetime dtype.

I made that change and got the same result as my first try: the merge worked, but it did not sum the results. I then tried converting the dates in both data frames with astype(str), and that didn't work either. I found that when both date columns have the same dtype (either a datetime dtype or astype(str)), the merge fails with a memory error.

Merging on upc alone succeeded, but it creates a problem in my data: I get duplicate visit numbers, because each upc is repeated in the transaction data once per date.
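The duplicate visit numbers described above are what a many-to-many join produces: each transaction row for a upc is paired with every visits row for that upc. A minimal sketch of the effect, using the sample data from this question (the names by_upc and by_both are only illustrative, and validate requires a pandas version that supports it):

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016", "09/10/2016"],
    "upc": ["A01234", "A01234", "A56789"],
    "sales": [1000, 500, 1200],
})
df2 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016", "09/05/2016", "09/10/2016"],
    "upc": ["A01234", "A01234", "A56789", "A56789"],
    "visits": [10, 25, 26, 32],
})

# Joining on upc alone pairs each sales row with every visits row for that upc,
# so the result has more rows than df1
by_upc = df1.merge(df2, on="upc", how="left")
print(len(df1), len(by_upc))

# Joining on both keys keeps one row per sales row, provided (Date, upc) is
# unique in df2; validate raises a MergeError if that assumption is violated
by_both = df1.merge(df2, on=["Date", "upc"], how="left", validate="many_to_one")
print(by_both)
```

On 200k x 2 million rows, the same row multiplication is one plausible source of a memory error, though that is a guess and not something confirmed in this thread.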

At the end of the day, what I need is something similar to the SUMIF function in Excel.

I need to combine the two data sets by totaling the visits for each upc on each day while keeping the transaction data unchanged, i.e. a left join in SQL terms.

Sample data

df1
  Date         upc       sales
0 09/01/2016   A01234    1000
1 09/02/2016   A01234    500
2 09/10/2016   A56789    1200

df2
  Date         upc         visits
0 09/01/2016   A01234      10
1 09/02/2016   A01234      25
2 09/05/2016   A56789      26
3 09/10/2016   A56789      32


df3
  Date         upc       sales   visits
0 09/01/2016   A01234    1000    10
1 09/02/2016   A01234    500     25
2 09/10/2016   A56789    1200    32

I'm wondering if the pandasql package is what I need to use. Any help is appreciated.

    A simple df1.merge(df2, on=['Date', 'upc']) would work, would it not? Commented Dec 30, 2017 at 8:49

1 Answer


Your first merge statement should get you halfway there, but it is the second half of a two-step process. It sounds like you want to merge the visits data onto the sales data after summing the visits by Date/upc. The merge command does not aggregate on its own, so you have to do the sum first. Try:

import pandas as pd

df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()

Then left-merge this onto the sales data:

df3 = pd.merge(df1, df2_sum, on=["Date", "upc"], how="left")
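As a rough end-to-end check, the two steps above can be run on the sample data from the question. The extra df2 row for 09/01/2016 is hypothetical: it is not in the question and is added only so that the sum has something to combine.

```python
import pandas as pd

# Sales sample copied from the question
df1 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016", "09/10/2016"],
    "upc": ["A01234", "A01234", "A56789"],
    "sales": [1000, 500, 1200],
})

# Visits sample from the question, plus one HYPOTHETICAL duplicate
# (09/01/2016, A01234, 5) appended as the last row to exercise the sum
df2 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016", "09/05/2016", "09/10/2016", "09/01/2016"],
    "upc": ["A01234", "A01234", "A56789", "A56789", "A01234"],
    "visits": [10, 25, 26, 32, 5],
})

# Step 1: collapse visits to one row per (Date, upc)
df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()

# Step 2: left-merge the totals onto the sales rows, so df1 keeps its row count
df3 = pd.merge(df1, df2_sum, on=["Date", "upc"], how="left")
print(df3)
```

If the visits data already has one row per Date/upc, the groupby leaves the values unchanged and only guarantees that the merge cannot multiply rows.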

4 Comments

The visits data is already summarized. It's a query coming from BigQuery that outputs visit counts for each day and for each upc that had a visit. How would a groupby change the format?
Hi PaSTE, I tried this second method. The script runs, but the output was not successful: the visits column merges over, but it is blank. There are no values.
So I found a solution using pandasql. Combining the data and summing the values works using the GROUP BY clause.
Meant to add more notes: the issue is with the dates. I'm not 100% sure why, but when you convert the dates in both dfs to pandas datetime and then use pandasql, the merge works and the visit values are summed. The only caveat is the output: after the tables are merged, the date field shows the date with a timestamp of 00:00:00. Excel easily converts this into a date. Just wanted to note that and see if anyone knows how to manipulate the dates so they output in a simple date format, e.g. 09/02/2016.
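A blank visits column is what a left merge returns when none of the keys match. One possible cause, and this is an assumption since the real data is not shown, is that the two Date columns store the same days in different text formats. The sketch below uses hypothetical frames to show that case, parses both columns with pd.to_datetime, and then writes the merged dates back out as MM/DD/YYYY text with dt.strftime, which drops the 00:00:00 part.

```python
import pandas as pd

# HYPOTHETICAL frames: the same days, stored in two different string formats
df1 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016"],
    "upc": ["A01234", "A01234"],
    "sales": [1000, 500],
})
df2 = pd.DataFrame({
    "Date": ["2016-09-01", "2016-09-02"],
    "upc": ["A01234", "A01234"],
    "visits": [10, 25],
})

# As plain strings the keys never match, so visits comes back as NaN
unmatched = df1.merge(df2, on=["Date", "upc"], how="left")

# Parse both key columns to datetime, stating each (assumed) format explicitly
df1["Date"] = pd.to_datetime(df1["Date"], format="%m/%d/%Y")
df2["Date"] = pd.to_datetime(df2["Date"], format="%Y-%m-%d")
df3 = df1.merge(df2, on=["Date", "upc"], how="left")

# Convert the key back to MM/DD/YYYY text for output
df3["Date"] = df3["Date"].dt.strftime("%m/%d/%Y")
print(df3)
```

After strftime the Date column holds strings again, so it is best applied as the last step before exporting.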
