
I have the following two dataframes that I want to merge.

df1:
     id   time                  station
0     a   22.08.2017 12:00:00   A1
1     b   22.08.2017 12:00:00   A3
2     a   22.08.2017 13:00:00   A2
...

pivot:
      station               A1     A2     A3
0     time
1     22.08.2017 12:00:00   10     12     11
2     22.08.2017 13:00:00   9      7      3
3     22.08.2017 14:00:00   2      3      4
4     22.08.2017 15:00:00   3      2      7
...

The result should look like this:

merge:

     id   time                  station   value
0     a   22.08.2017 12:00:00   A1        10
1     b   22.08.2017 12:00:00   A3        11
2     a   22.08.2017 13:00:00   A2        7
...

Now I want to add a column to the dataframe containing the matching value from the pivot table. I could not work out how to include the pivot's column labels in the merge. I tried something like this, but it does not work:

merge = pd.merge(df1, pivot, how="left", left_on=["time", "station"], right_on=["station", pivot.columns])

Any help?

EDIT:

As advised, instead of the pivot table I tried to use the following data:

df2:
time                 station   value
22.08.2017 12:00:00  A1        10
22.08.2017 12:00:00  A2        12
22.08.2017 12:00:00  A3        11
              ...
22.08.2017 13:00:00  A1        9
22.08.2017 13:00:00  A2        7
22.08.2017 13:00:00  A3        3

The table contains about 1300 different stations for every timestamp. In total it has more than 115,000,000 rows; df1 has 5,000,000 rows.

I tried to merge df1.head(100) with df2, but all values in the result are NaN. This is the call I used:

merge = pd.merge(df1.head(100), df2, how="left", on=["time", "station"])

Another problem is that this merge already takes a few minutes, so I expect that merging the whole df1 will take several days.

  • Can you post how you got to df2 with sample data? Commented Aug 22, 2017 at 16:15
  • What do you mean by df2? If you mean the dataframe I want to end up with: I look up which time and station belong to the first id, then find the value for the same time and station in the pivot dataframe, and continue with the next row. I wrote a for-loop for that, but it is slow. That's why I want to do it with a merge instead. Commented Aug 22, 2017 at 16:23
  • Sorry, I misread; I meant the pivot dataframe. Do you have sample data to recreate this? I'm wondering if there is a better/easier way to pivot this. Commented Aug 22, 2017 at 16:26
  • Before the pivot, all values were listed below each other in one long table. Then I used pandas.pivot_table() to aggregate by time and turn the stations into column names. Commented Aug 22, 2017 at 16:33
  • Then you should just perform your merge on the dataframe you had before the pivot, as I mentioned in my answer below. Commented Aug 22, 2017 at 16:34

1 Answer


I guess you got the pivot dataframe using either pivot or pivot_table in pandas. If you can perform the merge using the dataframe you had before the pivot, it should work just fine.
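As a minimal sketch of that first option, the snippet below merges directly against a long-format table. The data is copied from the samples in the question; the name `merged` is only illustrative, and `how="left"` is an assumption based on the left merge the question attempts.

```python
import pandas as pd

# Sample rows copied from the question; df2 is the long (pre-pivot) table.
df1 = pd.DataFrame({
    "id": ["a", "b", "a"],
    "time": ["22.08.2017 12:00:00", "22.08.2017 12:00:00", "22.08.2017 13:00:00"],
    "station": ["A1", "A3", "A2"],
})
df2 = pd.DataFrame({
    "time": ["22.08.2017 12:00:00"] * 3 + ["22.08.2017 13:00:00"] * 3,
    "station": ["A1", "A2", "A3"] * 2,
    "value": [10, 12, 11, 9, 7, 3],
})

# A left merge keeps every row of df1 and attaches the value
# whose (time, station) pair matches.
merged = df1.merge(df2, on=["time", "station"], how="left")
```

Both key columns must have the same dtype on each side for the keys to match.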

Otherwise you will have to reverse the pivot using melt before merging:

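# Assumes 'time' is a regular column of pivot with a default integer index;
# if it is the index, call pivot = pivot.reset_index() first.
melt = pd.concat([pivot[['time']], pivot[['A1']].melt()], axis=1)
melt = pd.concat([melt, pd.concat([pivot[['time']], pivot[['A2']].melt()], axis=1)])
melt = pd.concat([melt, pd.concat([pivot[['time']], pivot[['A3']].melt()], axis=1)])
melt.columns = ['time', 'station', 'value']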
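With around 1300 stations, repeating one line per station is not practical. A possible shortcut is a single melt call with `id_vars`, sketched here on the question's sample pivot. It assumes `time` is the index of the pivot (as `pivot_table` would normally leave it); if `time` is already a column, the `reset_index()` call should be dropped.

```python
import pandas as pd

# Pivot table rebuilt from the question's sample; 'time' is assumed to be the index.
pivot = pd.DataFrame(
    {"A1": [10, 9, 2, 3], "A2": [12, 7, 3, 2], "A3": [11, 3, 4, 7]},
    index=pd.Index(
        ["22.08.2017 12:00:00", "22.08.2017 13:00:00",
         "22.08.2017 14:00:00", "22.08.2017 15:00:00"],
        name="time",
    ),
)
pivot.columns.name = "station"

# reset_index() turns 'time' back into a column; melt() then stacks
# every remaining (station) column into station/value pairs.
melt = pivot.reset_index().melt(id_vars="time", var_name="station", value_name="value")
```

The result has one row per (time, station) pair, with the columns time, station and value.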

Then perform the merge as you intended:

df1.merge(melt, on=['time', 'station'])

    id  time    station value
0   a   time1   A1      10
1   b   time1   A3      11
2   a   time2   A2      7

EDIT:

If your dataframes are as big as described in your edit, you will indeed have to perform the merge on chunks of them. You could try chunking both dataframes.

First, sort df1 by time so that each chunk covers only a narrow time range:

df1.sort_values('time', inplace=True)

Then take a chunk of df1, restrict the second dataframe to the rows that chunk could possibly need, and merge the two chunks:

chunk1 = df1.head(100)
chunk2 = df2.loc[df2.time.between(chunk1.time.min(), chunk1.time.max())]
merge = chunk1.merge(chunk2, on=['time', 'station'], how='left')
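To cover all of df1 rather than only its first 100 rows, the same steps can be wrapped in a loop. The following is only a sketch: the function name `merge_in_chunks` and the default `chunk_size` are illustrative choices, and it assumes the `time` columns of both dataframes already share the same dtype.

```python
import pandas as pd

def merge_in_chunks(df1, df2, chunk_size=1000):
    """Left-merge df1 with df2 on time and station, one time-sorted chunk at a time."""
    keys = ["time", "station"]
    df1 = df1.sort_values("time")
    parts = []
    for start in range(0, len(df1), chunk_size):
        chunk1 = df1.iloc[start:start + chunk_size]
        # Keep only the df2 rows whose time falls inside this chunk's time range.
        in_range = df2["time"].between(chunk1["time"].min(), chunk1["time"].max())
        parts.append(chunk1.merge(df2.loc[in_range], on=keys, how="left"))
    if not parts:
        # df1 was empty: fall back to a plain merge to get the right columns.
        return df1.merge(df2, on=keys, how="left")
    # Concatenate once at the end instead of appending to a growing dataframe.
    return pd.concat(parts, ignore_index=True)
```

Collecting the partial results in a list and calling `pd.concat` once avoids copying the accumulated result on every iteration. The chunk size is something to tune against the available memory.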

7 Comments

That could be a good solution. I will try it tomorrow. Run time might be a problem, though, because the original dataframe contains a few million rows. That is why I hoped the pivot table would allow a faster merge.
If your problem is computational cost, consider chunking both dataframes by time range and merging those time chunks before concatenating. That way you perform several smaller merges. The code will be slightly longer, but it will save you a lot of time.
I tried to merge df1 with the original data, but the values in the result are NaN. I expanded my description above. Do you know why the right values are missing?
I edited my answer to match yours. However, this doesn't address the NaN problem. Are the types of your ['time','station'] columns the same in df1 and df2?
Yes, you were right! I had saved df1.head(100) in a separate variable and forgot to convert that variable's time column to datetime. Now it works, thank you! When I merge the whole df1, should I create the chunks in a loop? For example, build chunk1 and chunk2 for every 1000 rows, merge them, and append the result to the 'merge' dataframe? Or is there a faster solution?
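The dtype mismatch behind the NaN values can be checked before merging. This is a minimal sketch on one-row samples; the format string is inferred from the timestamps shown in the question and should be treated as an assumption about the real data.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a"], "time": ["22.08.2017 12:00:00"], "station": ["A1"]})
df2 = pd.DataFrame({"time": ["22.08.2017 12:00:00"], "station": ["A1"], "value": [10]})

# Parse the timestamps in both frames with the same explicit format.
fmt = "%d.%m.%Y %H:%M:%S"
for frame in (df1, df2):
    frame["time"] = pd.to_datetime(frame["time"], format=fmt)

# The key columns should now report identical dtypes on both sides.
keys = ["time", "station"]
same_dtypes = (df1[keys].dtypes == df2[keys].dtypes).all()

merged = df1.merge(df2, on=keys, how="left")
```

Comparing `df1[keys].dtypes` with `df2[keys].dtypes` is a cheap check to run on any subset, such as a saved `head(100)`, before starting a long merge.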