
I need a fast approach to add rows for all the dates between two columns. I have already tried the following approaches:

df.index.repeat: add rows for all dates between two columns?

pd.melt: How can I add rows for all dates between two columns?

date_range + merge: Get all dates between startdate and enddate columns

itertuples: Add rows to dataframe each day between a range of two columns

SQL script using SQLite: crashed the machine when trying to do a join between two dates.

When they worked, they were all very slow (between 25 and 75 minutes). I have 1.2 million rows in my dataframe, which expands to over 30 million. Is there any way to do any of these solutions using something like apply or vectorization? Is there a way of doing this in just a few minutes, max?

The data has the following structure:

start_date   end_date     ID

The final structure should be:

start_date   end_date     Date     ID
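For reference, here is a tiny hypothetical example of the transformation I'm after (column names as above, dates made up), using the per-row date_range + explode pattern from the linked answers. It produces the right shape but is far too slow at my scale:

import pandas as pd

# Hypothetical two-row input in the structure described above
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01', '2020-01-03']),
    'end_date':   pd.to_datetime(['2020-01-03', '2020-01-04']),
    'ID':         [1, 2],
})

# One pd.date_range call per row, then explode: readable, but very slow
# at millions of rows because date_range runs once per row
df['Date'] = [pd.date_range(s, e) for s, e in zip(df['start_date'], df['end_date'])]
expanded = df.explode('Date').reset_index(drop=True)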
  • Do you mind posting df["start_date"].min() and df["end_date"].max()? I'll need them to create a reproducible example. Commented Jan 9, 2020 at 14:24
  • Can you share enough code/data for us to be able to test any potential solutions? Commented Jan 9, 2020 at 14:54

1 Answer


Calendrical calculation (pd.date_range) is a lot slower than generating a sequence of integers (np.arange). The answers you listed are pretty simple & elegant, but they are not very fast. Your volume of data calls for a different solution.
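A minimal sketch of that difference, with made-up dates: the same run of days can come from one anchor date plus integer offsets, instead of a calendrical call per row.

import numpy as np
import pandas as pd

# Calendrical: build the days between two dates directly
days_calendrical = pd.date_range('2020-01-01', '2020-01-10')

# Integer-based: generate day offsets with np.arange, then convert to dates once
offsets = np.arange(10)  # 0, 1, ..., 9
days_from_ints = pd.Timestamp('2020-01-01') + pd.to_timedelta(offsets, unit='D')

assert (days_calendrical == days_from_ints).all()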

This answer assumes that ID is unique for each row. It took about 4.5 seconds for 1M rows on my iMac 2015. The final dataframe has ~25M rows.

import itertools
import numpy as np
import pandas as pd

# `date_range` is slow, so we only call it once, covering the full span
all_dates = pd.date_range(df['start_date'].min(), df['end_date'].max())

# Number each day in the range as 0, 1, 2, 3, ...
rank = all_dates.to_series().rank().astype(np.int64) - 1

# Change from `2020-01-01` to "day 500 in the all_dates array", for example
start = df['start_date'].map(rank).values
end = df['end_date'].map(rank).values

# This is where the magic happens. For each row, instead of saying
# `start_date = Jan 1, 2020` and `end_date = Jan 10, 2020`, we are
# creating a range of days: [500, 501, ..., 509]
indices = list(itertools.chain.from_iterable(range(s, e + 1) for s, e in zip(start, end)))

# Now map day 500 back to Jan 1, 2020, day 501 back to Jan 2, 2020, and so on
dates = np.take(all_dates, indices)

# Align the rest of the columns to the expanded dates. Using `.to_numpy()`
# avoids index alignment when the frame is reassembled below.
duration = (end - start + 1).astype(np.int64)
ids = np.repeat(df['ID'].to_numpy(), duration)
start_date = np.repeat(df['start_date'].to_numpy(), duration)
end_date = np.repeat(df['end_date'].to_numpy(), duration)

# Assemble the result
result = pd.DataFrame({
    'start_date': start_date,
    'end_date': end_date,
    'ID': ids,
    'Date': dates
})
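As a rough sanity check on a small hypothetical frame (names made up): the expanded result should have exactly end - start + 1 rows per input row, so you can predict its size up front and compare it with len(result). A mismatch like the one reported in the comments below usually means the per-row ranges are off by one.

import pandas as pd

# Hypothetical two-row frame in the question's structure
small = pd.DataFrame({
    'start_date': pd.to_datetime(['2020-01-01', '2020-01-05']),
    'end_date':   pd.to_datetime(['2020-01-03', '2020-01-06']),
    'ID':         ['a', 'b'],
})

# Expected number of output rows: (end - start + 1), summed over rows
expected_rows = ((small['end_date'] - small['start_date']).dt.days + 1).sum()
print(expected_rows)  # 3 + 2 = 5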

2 Comments

This did not work; it failed with "ValueError: array length 300527269 does not match index length 301745242". I also checked, and ID is not unique as I had thought it was.
I may have introduced a bug during optimization. Try changing range(s, e) to range(s, e+1).
