0
[pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]

Is there anyway to speed up this operation? Basically for a given date range I am creating all rows of dates in between them.

3
  • 1
    Doesn't make sense to create a list of series... mind explaining what you're trying to do so you can get better than surface answers that may or may not solve your problem? Commented Dec 19, 2018 at 11:29
  • 1
    Okay, good talk! Commented Dec 19, 2018 at 11:43
  • There's no need to be sarcastic. Just so you know the problem at hand is to create a df in which i can have rows corresponding to all dates between two dates and then plot it across on a ID vs date crosstab. Commented Dec 19, 2018 at 11:55

2 Answers 2

2

Use DataFrame.itertuples:

L = [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]

Or zip of both columns:

L = [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]

If want join together:

s = pd.concat(L, ignore_index=True)

Performance for 100 rows:

np.random.seed(123)

def random_dates(start, end, n=100):

    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end)})
print (df)

In [155]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
33.5 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [156]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
30.3 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [157]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
25.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [158]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
24.3 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

And for 1000 rows:

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end, n=1000)})

In [159]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
333 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [160]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
314 ms ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [161]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
243 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [162]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
246 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sign up to request clarification or add additional context in comments.

7 Comments

Is itertuples faster?
@Mysterious - sure, faster.
That join is the exact thing I am doing later on!
Zip version is even faster! Perfect.
@Mysterious - yes, the best test with real data, also it depends of size of DataFrame.
|
2

Instead of creating a pd.Series on each iteration, do:

[pd.date_range(row[1].START_DATE, row[1].END_DATE))
 for row in df[['START_DATE', 'END_DATE']].iterrows()]

And create a dataframe from the result. Here's an example:

df = pd.DataFrame([
     {'start_date': pd.datetime(2019,1,1), 'end_date': pd.datetime(2019,1,10)},
     {'start_date': pd.datetime(2019,1,2), 'end_date': pd.datetime(2019,1,8)}, 
     {'start_date': pd.datetime(2019,1,6), 'end_date': pd.datetime(2019,1,14)} 
])

dr = [pd.date_range(df.loc[i,'start_date'], df.loc[i,'end_date']) for i,_ in df.iterrows()]

pd.DataFrame(dr)

      0          1          2          3          4          5  \
0 2019-01-01 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06   
1 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06 2019-01-07   
2 2019-01-06 2019-01-07 2019-01-08 2019-01-09 2019-01-10 2019-01-11   

       6          7          8          9  
0 2019-01-07 2019-01-08 2019-01-09 2019-01-10  
1 2019-01-08        NaT        NaT        NaT  
2 2019-01-12 2019-01-13 2019-01-14        NaT  

2 Comments

I find it very hard to believe anyone would ever have uses for data in this sort of format. I am hoping OP answers my clarifications in the comments, else I am writing this off as a lost cause...
Yes, IMO suggesting other approaches makes more sense than ways to speed up the current solution in this case. As indeed a list of Series makes no sense

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.