I have a dataframe named dfw (weather data) with 14 million rows and 20 columns, and a dataframe named dfi (incident data) with 1,900 rows and 15 columns, in Python. I am trying to set a column named active in dfw to True wherever the dfw date column falls between the start and end date columns of dfi and the dfw location column equals the dfi location column. The start and end dates in dfi vary, but as long as a dfw date falls within the start/end window of at least one dfi record for the same location, active should be True. I have the code below, but I am not sure it is the most efficient way to do this, and I haven't had much luck using np.where(...) or df.where(...).
Here is what the two dataframes look like:
>>> dfi.head(5)
        start         end location
0  2016-01-01  2016-01-10     LA01
1  2016-02-05  2016-02-12     NY01
2  2016-04-03  2016-04-10     LA02
3  2016-08-09  2016-08-13     FL03
4  2016-09-17  2016-09-19     LA01
>>> dfw.head(5)
         date location
0  2016-01-01     LA01
1  2016-01-02     LA01
2  2016-01-12     LA01
3  2016-02-06     NY01
4  2016-11-05     NY02
Code:
# initialise the flag, then mark rows covered by at least one incident window
dfw['active'] = False

for index, row in dfi.iterrows():
    start = row['start']
    end = row['end']
    location = row['location']
    dfw.loc[dfw['date'].between(start, end) & (dfw['location'] == location), 'active'] = True
Output:
>>> dfw.head(5)
         date location  active
0  2016-01-01     LA01    True
1  2016-01-02     LA01    True
2  2016-01-12     LA01   False
3  2016-02-06     NY01    True
4  2016-11-05     NY02   False
I am curious if there is a more efficient way of doing this that avoids iterating over each row.
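For reference, this is the merge-based sketch I have been toying with as an alternative to the loop. It assumes dfw has a default integer index (so reset_index() yields an 'index' column holding the original row positions), and I have not tested whether the intermediate merge is workable in memory at 14 million rows:

# pair each weather row with every incident window at the same location
merged = dfw.reset_index().merge(dfi, on='location', how='inner')

# keep the pairs whose weather date falls inside the incident window
in_window = (merged['date'] >= merged['start']) & (merged['date'] <= merged['end'])

# a weather row is active if at least one incident window covers it
dfw['active'] = False
dfw.loc[merged.loc[in_window, 'index'].unique(), 'active'] = True

I am not sure whether that is actually any better than the loop at this scale.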