I have a dataframe with multiple columns, including analysis_date (datetime), and forecast_hour (int). I want to add a new column called total_hours, which is the sum of the hour component of analysis_date plus the corresponding forecast_hour in that row. Here's a visual example:

original dataframe:

analysis_date | forecast_hour
12-2-19-05    | 3
12-2-19-06    | 3
12-2-19-07    | 3
12-2-19-08    | 3

dataframe after calculation:

analysis_date | forecast_hour | total_hours
12-2-19-05    | 3             | 8
12-2-19-06    | 3             | 9
12-2-19-07    | 3             | 10
12-2-19-08    | 3             | 11

Here is the current logic that does what I want:

df['total_hours'] = df.apply(lambda row: row.analysis_date.hour + row.forecast_hour, axis=1)

Unfortunately, this is too slow for my application: it takes around 15 seconds for a dataframe with a few hundred thousand rows. I have tried the swifter library, but it took about as long as (if not longer than) my current implementation.

1 Answer

apply with axis=1 is slow because it is not vectorized: it calls the Python lambda once per row instead of operating on whole columns at once. This should do what you want (assuming df['analysis_date'] is a datetime64):

df['total_hours'] = df['analysis_date'].dt.hour + df['forecast_hour']
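
For reference, here is a minimal self-contained sketch of the vectorized version on the question's example data. The sample frame is made up for illustration, and it assumes the dates in the question mean 2 December 2019 at 05:00 through 08:00; the column is converted with pd.to_datetime first so that the .dt accessor is available.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's example
# (assumes "12-2-19-05" means 2019-12-02 05:00).
df = pd.DataFrame({
    "analysis_date": pd.to_datetime([
        "2019-12-02 05:00",
        "2019-12-02 06:00",
        "2019-12-02 07:00",
        "2019-12-02 08:00",
    ]),
    "forecast_hour": [3, 3, 3, 3],
})

# Column-wise arithmetic: the hour of every row is extracted at once
# and added to forecast_hour without a Python-level loop.
df["total_hours"] = df["analysis_date"].dt.hour + df["forecast_hour"]

print(df["total_hours"].tolist())  # [8, 9, 10, 11]
```

If analysis_date is stored as strings or objects rather than datetime64, the .dt accessor raises an error, so the conversion step matters.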

2 Comments

Similar/related question: is it possible to add the hours to analysis_date itself? Something along the lines of df['analysis_date'].dt + timedelta(hours=df['forecast_hour']), where the output is a datetime64?
@P.V. Yep, pd.to_timedelta is what you are looking for. df['analysis_date'] + pd.to_timedelta(df['forecast_hour'], unit='hours')
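
A minimal sketch of that suggestion, again on made-up sample data with the same date assumption as the question's example. The column name valid_time is hypothetical, and the short unit string "h" is used here for hours.

```python
import pandas as pd

# Hypothetical sample data (assumes 2019-12-02, hours 05 through 08).
df = pd.DataFrame({
    "analysis_date": pd.to_datetime([
        "2019-12-02 05:00",
        "2019-12-02 06:00",
        "2019-12-02 07:00",
        "2019-12-02 08:00",
    ]),
    "forecast_hour": [3, 3, 3, 3],
})

# Convert the integer hours to a timedelta column, then add it to the
# datetime column; the result is still a datetime64 column.
# "valid_time" is an illustrative name, not from the original post.
df["valid_time"] = df["analysis_date"] + pd.to_timedelta(df["forecast_hour"], unit="h")

print(df["valid_time"].iloc[0])  # 2019-12-02 08:00:00
```

Unlike total_hours, this keeps the date, so a sum that passes midnight rolls over to the next day instead of producing an hour value above 23.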
