
I need a fast approach to add rows for all the dates between two columns. I have already tried the following approaches:

df.index.repeat: add rows for all dates between two columns?

pd.melt: How can I add rows for all dates between two columns?

date_range + merge: Get all dates between startdate and enddate columns

itertuples: Add rows to dataframe each day between a range of two columns

SQL script using SQLite: crashed the machine when trying to do a join between two dates.

When they worked, they were all very slow (between 25 and 75 minutes). I have 1.2 million rows in my dataframe, which expands to over 30 million. Is there any way to do any of these solutions using something like apply or vectorization? Is there a way of doing this in just a few minutes, max?

The data has the following structure:

start_date   end_date     ID

The final structure should be:

start_date   end_date     Date     ID
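For reference, here is a tiny hypothetical example of the transformation I'm after (column names as above, dates made up), using the per-row date_range + explode pattern from the linked answers. It produces the right shape but is far too slow at my scale:

import pandas as pd

# Hypothetical two-row input in the structure described above
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01', '2020-01-03']),
    'end_date':   pd.to_datetime(['2020-01-03', '2020-01-04']),
    'ID':         [1, 2],
})

# One pd.date_range call per row, then explode: readable, but very slow
# at millions of rows because date_range runs once per row
df['Date'] = [pd.date_range(s, e) for s, e in zip(df['start_date'], df['end_date'])]
expanded = df.explode('Date').reset_index(drop=True)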
  • Do you mind posting df["start_date"].min() and df["end_date"].max()? I'll need them to create a reproducible example. Commented Jan 9, 2020 at 14:24
  • Can you share enough code/data for us to be able to test any potential solutions? Commented Jan 9, 2020 at 14:54

1 Answer


Calendrical calculation (pd.date_range) is a lot slower than generating a sequence of integers (np.arange). The answers you listed are pretty simple & elegant, but they are not very fast. Your volume of data calls for a different solution.
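A minimal sketch of that difference, with made-up dates: the same run of days can come from one anchor date plus integer offsets, instead of a calendrical call per row.

import numpy as np
import pandas as pd

# Calendrical: build the days between two dates directly
days_calendrical = pd.date_range('2020-01-01', '2020-01-10')

# Integer-based: generate day offsets with np.arange, then convert to dates once
offsets = np.arange(10)  # 0, 1, ..., 9
days_from_ints = pd.Timestamp('2020-01-01') + pd.to_timedelta(offsets, unit='D')

assert (days_calendrical == days_from_ints).all()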

This answer assumes that ID is unique for each row. It took about 4.5 seconds for 1M rows on my iMac 2015. The final dataframe has ~25M rows.

import itertools
import numpy as np
import pandas as pd

# `date_range` is slow, so we only call it once, covering the full span
all_dates = pd.date_range(df['start_date'].min(), df['end_date'].max())

# Number each day in the range as 0, 1, 2, 3, ...
rank = all_dates.to_series().rank().astype(np.int64) - 1

# Change from `2020-01-01` to "day 500 in the all_dates array", for example
start = df['start_date'].map(rank).values
end = df['end_date'].map(rank).values

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ..., 509]
indices = list(itertools.chain.from_iterable(range(s, e + 1) for s, e in zip(start, end)))

# Now map day 500 back to Jan 1, 2020, day 501 back to Jan 2, 2020, and so on
dates = np.take(all_dates, indices)

# Align the rest of the columns to the expanded dates. Using `.to_numpy()`
# avoids index alignment when the frame is reassembled below.
duration = (end - start + 1).astype(np.int64)
ids = np.repeat(df['ID'].to_numpy(), duration)
start_date = np.repeat(df['start_date'].to_numpy(), duration)
end_date = np.repeat(df['end_date'].to_numpy(), duration)

# Assemble the result
result = pd.DataFrame({
    'start_date': start_date,
    'end_date': end_date,
    'ID': ids,
    'Date': dates
})
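As a rough sanity check on a small hypothetical frame (names made up): the expanded result should have exactly end - start + 1 rows per input row, so you can predict its size up front and compare it with len(result). A mismatch like the one reported in the comments below usually means the per-row ranges are off by one.

import pandas as pd

# Hypothetical two-row frame in the question's structure
small = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01', '2020-01-05']),
    'end_date':   pd.to_datetime(['2020-01-03', '2020-01-06']),
    'ID':         ['a', 'b'],
})

# Expected number of output rows: (end - start + 1), summed over rows
expected_rows = ((small['end_date'] - small['start_date']).dt.days + 1).sum()
print(expected_rows)  # 3 + 2 = 5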

2 Comments

This did not work; it failed with "ValueError: array length 300527269 does not match index length 301745242". I also checked, and ID is not unique as I had thought it was.
I may have introduced a bug during optimization. Try changing range(s, e) to range(s, e+1).
