I have a pandas DataFrame of dates and I want to filter it so that 'date_id' is between 'start_date' and 'end_date'.

     date_id    start_date  end_date
0   2010-06-04  2008-08-01  2008-09-26
1   2010-06-04  2008-08-01  2008-09-26
2   2010-06-04  2008-08-01  2008-09-26
3   2010-06-04  2008-08-26  2008-10-26
4   2010-06-04  2010-05-01  2010-09-26
5   2010-06-04  2008-08-01  2008-09-26
6   2010-06-04  2008-08-01  2008-09-26
7   2010-09-04  2010-08-01  2010-09-26

I've tried using the code below:

df[(df['date_id'] >= df['start_date'] & df['date_id']<= df['end_date'])]

The code above raises a KeyError. I am a new pandas user, so any assistance or documentation would be incredibly helpful.

3 Answers

You can use between!

df['date_id'].between(df['start_date'],df['end_date_y'])

To filter, just use .loc:

df.loc[df['date_id'].between(df['start_date'],df['end_date_y'])]


     date_id start_date end_date_y
4 2010-06-04 2010-05-01 2010-09-26
7 2010-09-04 2010-08-01 2010-09-26
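
For anyone who wants to reproduce this end to end, here is a minimal sketch. It assumes the dates start out as ISO-formatted strings (the question does not say what dtype the columns have) and converts them with pd.to_datetime first; the names mask and filtered are only for illustration.

```python
import pandas as pd

# Sample data mirroring the question; the dates are assumed to start as strings.
df = pd.DataFrame({
    "date_id": ["2010-06-04"] * 7 + ["2010-09-04"],
    "start_date": ["2008-08-01", "2008-08-01", "2008-08-01", "2008-08-26",
                   "2010-05-01", "2008-08-01", "2008-08-01", "2010-08-01"],
    "end_date_y": ["2008-09-26", "2008-09-26", "2008-09-26", "2008-10-26",
                   "2010-09-26", "2008-09-26", "2008-09-26", "2010-09-26"],
})

# Convert every column to datetime64 so comparisons are chronological.
df = df.apply(pd.to_datetime)

# between() compares row by row and includes both endpoints by default.
mask = df["date_id"].between(df["start_date"], df["end_date_y"])
filtered = df.loc[mask]
print(filtered)
```

Keeping the boolean mask in its own variable makes it easy to inspect (for example with mask.sum()) before filtering.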

2 Comments

Also in my answer.
I guess I was working on the solution before you posted it! Good work.

You can also use query:

df.query("start_date <= date_id <=  end_date_y")

    date_id     start_date  end_date_y
4   2010-06-04  2010-05-01  2010-09-26
7   2010-09-04  2010-08-01  2010-09-26
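
As a quick sanity check, the sketch below compares the chained comparison inside query with an explicit boolean mask. The three-row frame is a shortened, hypothetical stand-in for the question's data, and via_query and via_mask are names made up for this example.

```python
import pandas as pd

# Shortened stand-in for the question's data (an assumption for illustration).
df = pd.DataFrame({
    "date_id": pd.to_datetime(["2010-06-04", "2010-06-04", "2010-09-04"]),
    "start_date": pd.to_datetime(["2008-08-01", "2010-05-01", "2010-08-01"]),
    "end_date_y": pd.to_datetime(["2008-09-26", "2010-09-26", "2010-09-26"]),
})

# query() accepts a chained comparison written as a string expression.
via_query = df.query("start_date <= date_id <= end_date_y")

# The same filter spelled out as two parenthesized comparisons.
via_mask = df[(df["start_date"] <= df["date_id"]) & (df["date_id"] <= df["end_date_y"])]

print(via_query.equals(via_mask))
```

Note that query refers to columns by bare name, so this form only works when the column names are valid Python identifiers (or are wrapped in backticks).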


I think you need to change the column name to end_date_y and add parentheses, because & binds more tightly than >= and <=:

df1 = df[(df['date_id'] >= df['start_date']) & (df['date_id']<= df['end_date_y'])]
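
To see why the parentheses matter, here is a small sketch on a hypothetical two-row frame. Without parentheses Python groups the expression as date_id >= (start_date & date_id) <= end_date_y, which fails before any filtering happens; the exact exception type may differ between pandas versions, so the sketch catches both likely ones.

```python
import pandas as pd

# Tiny hypothetical frame, just enough to show the parsing problem.
df = pd.DataFrame({
    "date_id": pd.to_datetime(["2010-06-04", "2010-09-04"]),
    "start_date": pd.to_datetime(["2008-08-01", "2010-08-01"]),
    "end_date_y": pd.to_datetime(["2008-09-26", "2010-09-26"]),
})

try:
    # Parsed as: date_id >= (start_date & date_id) <= end_date_y
    df[df["date_id"] >= df["start_date"] & df["date_id"] <= df["end_date_y"]]
    failed = False
except (TypeError, ValueError) as exc:
    failed = True
    print(type(exc).__name__, exc)

# Parenthesizing each comparison yields two boolean Series that & can combine.
ok = df[(df["date_id"] >= df["start_date"]) & (df["date_id"] <= df["end_date_y"])]
print(ok)
```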

Or use between:

df1 = df[df['date_id'].between(df['start_date'], df['end_date_y'])]
print (df1)
     date_id start_date end_date_y
4 2010-06-04 2010-05-01 2010-09-26
7 2010-09-04 2010-08-01 2010-09-26

Performance:

It depends on the number of rows and the number of matched rows, so it is best to test on real data.

#[80000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#print (df)

In [236]: %timeit df[df['date_id'].between(df['start_date'], df['end_date_y'])]
2.44 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [237]: %timeit df[(df['date_id'] >= df['start_date']) & (df['date_id']<= df['end_date_y'])]
2.42 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [238]: %timeit df.query("start_date <= date_id <=  end_date_y")
4.45 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
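
Outside IPython, where %timeit is not available, a rough equivalent can be sketched with the standard timeit module. The frame below rebuilds the question's sample data (assuming datetime columns), and the names base, big and candidates are only illustrative; the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "date_id": ["2010-06-04"] * 7 + ["2010-09-04"],
    "start_date": ["2008-08-01", "2008-08-01", "2008-08-01", "2008-08-26",
                   "2010-05-01", "2008-08-01", "2008-08-01", "2010-08-01"],
    "end_date_y": ["2008-09-26", "2008-09-26", "2008-09-26", "2008-10-26",
                   "2010-09-26", "2008-09-26", "2008-09-26", "2010-09-26"],
}).apply(pd.to_datetime)

# Repeat the 8 sample rows 10,000 times to get 80,000 rows.
big = pd.concat([base] * 10000, ignore_index=True)

candidates = {
    "between": lambda: big[big["date_id"].between(big["start_date"], big["end_date_y"])],
    "mask": lambda: big[(big["date_id"] >= big["start_date"])
                        & (big["date_id"] <= big["end_date_y"])],
    "query": lambda: big.query("start_date <= date_id <= end_date_y"),
}

runs = 20
for name, fn in candidates.items():
    # Average wall-clock time per call over a fixed number of runs.
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```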
