I have two pandas dataframes in Python: dfw (weather data) with 14 million rows and 20 columns, and dfi (incident data) with 1900 rows and 15 columns. I am trying to set a column named active in dfw to True where the dfw date falls between the start and end dates of a dfi record and the dfw location equals that record's location. The start and end dates in dfi vary, but as long as the dfw date is between the start and end date of at least one dfi record for the same location, active should be True. The code below works, but I am not sure it is the most efficient way to do it, and I haven't had much luck using np.where(...) or df.where(...).

Here is what the two dataframes look like:

>>> dfi.head(5)

       start          end    location
0 2016-01-01   2016-01-10        LA01
1 2016-02-05   2016-02-12        NY01
2 2016-04-03   2016-04-10        LA02
3 2016-08-09   2016-08-13        FL03
4 2016-09-17   2016-09-19        LA01

>>> dfw.head(5)

       date   location
0 2016-01-01      LA01
1 2016-01-02      LA01
2 2016-01-12      LA01
3 2016-02-06      NY01
4 2016-11-05      NY02

Code:

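# Default every weather row to inactive before flagging matches
dfw['active'] = False
for index, row in dfi.iterrows():
    start = row['start']
    end = row['end']
    location = row['location']
    # Flag weather rows at this location whose date is within [start, end]
    dfw.loc[dfw['date'].between(start, end) & (dfw['location'] == location), 'active'] = True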

Output:

>>> dfw.head(5)

       date   location    active
0 2016-01-01      LA01      True
1 2016-01-02      LA01      True
2 2016-01-12      LA01     False
3 2016-02-06      NY01      True
4 2016-11-05      NY02     False

I am curious if there is a more efficient way of doing this that avoids iterating over each row.

  • Kindly add an example with expected output. Commented Feb 15, 2022 at 20:15
  • Do you really mean the end dates of the last two rows in df1 are before the corresponding start dates? Commented Feb 16, 2022 at 0:42
  • @itprorh66 they should have been 2016 not 2015. Commented Feb 16, 2022 at 1:17

1 Answer

So, given the following dataframes:

import pandas as pd

dfi = pd.DataFrame(
    {
        "start": {
            0: "2016-01-01",
            1: "2016-02-05",
            2: "2016-04-03",
            3: "2016-08-09",
            4: "2016-09-17",
        },
        "end": {
            0: "2016-01-10",
            1: "2016-02-12",
            2: "2016-04-10",
            3: "2016-08-13",
            4: "2016-09-19",
        },
        "location": {0: "LA01", 1: "NY01", 2: "LA02", 3: "FL03", 4: "LA01"},
    }
)


dfw = pd.DataFrame(
    {
        "date": {
            0: "2016-01-01",
            1: "2016-01-02",
            2: "2016-01-12",
            3: "2016-02-06",
            4: "2016-11-05",
        },
        "location": {0: "LA01", 1: "LA01", 2: "LA01", 3: "NY01", 4: "NY02"},
    }
)

Here is a more idiomatic way to do it:

dfi["start"] = pd.to_datetime(dfi["start"])
dfi["end"] = pd.to_datetime(dfi["end"])

dfw = dfw.assign(date=lambda df_: pd.to_datetime(df_["date"])).assign(
    # A row is active only if a single dfi record matches BOTH its
    # location and its date, so the two conditions are tested together.
    active=lambda df_: df_.apply(
        lambda row: any(
            location == row["location"] and start <= row["date"] <= end
            for start, end, location in zip(
                dfi["start"], dfi["end"], dfi["location"]
            )
        ),
        axis=1,
    )
)

print(dfw)
# Output
        date location  active
0 2016-01-01     LA01    True
1 2016-01-02     LA01    True
2 2016-01-12     LA01   False
3 2016-02-06     NY01    True
4 2016-11-05     NY02   False
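
For the 14-million-row case in the question, a row-wise apply may still be slow, so a column-wise variant could be worth trying. The following is only a minimal sketch, assuming dfw has a unique index and that both date columns are already datetimes; the function name flag_active and the helper column row_id are made up for illustration. It merges the two frames on location so that each weather row is paired only with incidents at the same location, then tests the date range on all pairs at once.

```python
import pandas as pd


def flag_active(dfw: pd.DataFrame, dfi: pd.DataFrame) -> pd.Series:
    """Return a boolean Series aligned to dfw: True where some dfi record
    shares the row's location and its [start, end] range contains the date.

    Assumes dfw has a unique index (hypothetical helper, not from the question).
    """
    # Carry the original index along so matches can be mapped back to dfw.
    left = dfw[["date", "location"]].assign(row_id=dfw.index)
    # One row per (weather row, incident) pair that shares a location.
    pairs = left.merge(dfi[["start", "end", "location"]], on="location", how="inner")
    # Inclusive range test, evaluated column-wise on all pairs at once.
    hit = (pairs["start"] <= pairs["date"]) & (pairs["date"] <= pairs["end"])
    # A weather row is active if at least one of its pairs is a hit.
    return pd.Series(dfw.index.isin(pairs.loc[hit, "row_id"]), index=dfw.index)


dfi = pd.DataFrame(
    {
        "start": pd.to_datetime(
            ["2016-01-01", "2016-02-05", "2016-04-03", "2016-08-09", "2016-09-17"]
        ),
        "end": pd.to_datetime(
            ["2016-01-10", "2016-02-12", "2016-04-10", "2016-08-13", "2016-09-19"]
        ),
        "location": ["LA01", "NY01", "LA02", "FL03", "LA01"],
    }
)
dfw = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-01", "2016-01-02", "2016-01-12", "2016-02-06", "2016-11-05"]
        ),
        "location": ["LA01", "LA01", "LA01", "NY01", "NY02"],
    }
)

dfw["active"] = flag_active(dfw, dfi)
print(dfw)
```

The merge produces one row per weather row and matching incident, so memory grows with the number of incidents per location. If that gets too large, the same function could be applied to dfw in chunks and the results concatenated.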