3

I have two Pandas Data Frames. First one looks like (db1):

id   date_from   date_until   v1   v2
0    2015-06-17  2015-06-19    2    3
0    2015-06-17  2015-07-01    3    4 
1    2015-06-18  2015-06-20    5    6
1    2015-06-19  2015-07-12    5    4 

and the second one (db2), like this:

id      date      v3   v4
0    2015-06-17   4    3
1    2015-06-17   5    6 
1    2015-06-18   6    4  

no i want to compress them so they based on id and date. Date compression should be if db2.date is older or equal than db1.date_from and use the youngest one from db2 witch is still older or equal than db1.data_from. So compressed data should look something like this:

id   date_from   date_until      date      v1   v2   v3   v4
0    2015-06-17  2015-06-19   2015-06-17   2    3    4    3
0    2015-06-17  2015-07-01   2015-06-17   3    4    4    3
1    2015-06-18  2015-06-20   2015-06-18   5    6    6    4
1    2015-06-19  2015-07-12   2015-06-18   5    4    6    4

I can't find any solution except iterating trough all rows, but this is not good enough because my data is bigger 100k of lines. Is there any tool to do this in pandas or should i use something else?

Thank you!

2 Answers 2

3

Let's use pd.merge_asof:

pd.merge_asof(df1,df2, left_on=['date_from'], right_on=['date'], by='id')

Output:

   id  date_from  date_until  v1  v2       date  v3  v4
0   0 2015-06-17  2015-06-19   2   3 2015-06-17   4   3
1   0 2015-06-17  2015-07-01   3   4 2015-06-17   4   3
2   1 2015-06-18  2015-06-20   5   6 2015-06-18   6   4
3   1 2015-06-19  2015-07-12   5   4 2015-06-18   6   4
Sign up to request clarification or add additional context in comments.

2 Comments

Almost as if the question was tailor made for merge_asof :-)
hah! I hadn't thought of that. Just nice to see merge_asof get some attention
2

Another way to do it:

import pandas as pd

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

a = StringIO('''id,date_from,date_until,v1,v2
0,2015-06-17,2015-06-19,2,3
0,2015-06-17,2015-07-01,3,4
1,2015-06-18,2015-06-20,5,6
1,2015-06-19,2015-07-12,5,4''')

b = StringIO('''id,date,v3,v4
0,2015-06-17,4,3
1,2015-06-17,5,6 
1,2015-06-18,6,4''')

df1 = pd.read_csv(a, sep=',')
df2 = pd.read_csv(b, sep=',')
df1.merge(df2, how='left', left_on=['id', 'date_from'], right_on=['id', 'date']).ffill()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.