2

I have two dataframes. One that has timestamps and projects in it. The other has date ranges, projects and maturities, which have to be mapped to the corresponding project within a date range.

My problem is similar to this question, however the provided answer is very slow and I have an extra condition to be fulfilled. For starters my two dataframes kind of look like this:

In:
import pandas as pd
df_a = pd.DataFrame({'time': ('06.05.2015  16:15:16', '22.06.2015  08:52:05', '28.05.2015  18:20:21','28.06.2015  16:19:21'),
                     'project': ('project1', 'project2', 'project2', 'project1')})

df_b = pd.DataFrame({'start-date':  ('02.05.2015 00:00:00', '26.06.2015 00:00:00', '16.05.2015 00:00:00', '30.05.2015 00:00:00'),
                     'end-date':('24.06.2015 00:00:00', '27.07.2015 00:00:00', '27.05.2015 00:00:00', '27.06.2015 00:00:00'),
                     'project': ('project1','project1','project2','project2'),
                     'maturity': ('one','two', 'one','two')})

My code looks like this:


for i in df_a.project.unique():
    for j in df_b.project.unique():
        if i == j:
            for index_df_a, row_df_a in df_a.iterrows():
                for index_df_b, row_df_b in df_b.iterrows():
                    if (row_df_a['time'] >= row_df_b['start-date']) & (row_df_a['time'] <= row_df_b['end-date']):
                        df_a.loc[index_df_a, 'maturity'] = row_df_b.loc['maturity']
                        break


Out:
                   time   project maturity
0  06.05.2015  16:15:16  project1      one
1  22.06.2015  08:52:05  project2      one
2  28.05.2015  18:20:21  project2      NaN
3  28.06.2015  16:19:21  project1      NaN

Expected result:

                   time   project maturity
0  06.05.2015  16:15:16  project1      one
1  22.06.2015  08:52:05  project2      one
2  28.05.2015  18:20:21  project2      two
3  28.06.2015  16:19:21  project1      two

  1. The code I wrote with the help of the similar question is very slow. When I am applying it to my dataframe (df_a has ~ 900k rows and df_b has ~1.7k maturity to map), it takes a really long time to process even a thousand rows. Is there a way to speed up the process?
  2. I think my if i==j:statement is wrong. As seen in line 4 of the result: even though the project is mapped to project1 and the timestamp 28.06.2015 16:19:21is within the range of start:26.06.2015 00:00:00 | end: 27.07.2015 00:00:00 the maturity is NaN instead of two.
  3. Lastly, is it possible to create a condition, where even if the timestamp is not within give date ranges, the next provided maturity for each project is mapped (line 3)? So if 28.05.2015 18:20:21 is not in any date range, the next date range provides the maturity. In this case two.

Excuse if I am asking too much at once. I know it is best practice to derive the answer by asking the simple questions and achieve results stepwise, however I am not nearly experienced enough to be able to split the problem into smaller parts.

1
  • Look at pandas.merge_asof. I posted a quick answer but I think I missed some nuance of your problem. Commented Nov 4, 2019 at 18:29

1 Answer 1

2
  • This should be significantly faster than the current implementation
  • If you find yourself iterating through a dataframe, you're probably doing it wrong
  • Use pd.date_range with start-date and end-date to add a d_range column to df_b, then .isin can be used to find time from df_a inside of d_range
    • d_range will be a list of dates between start and end.
    • If time is not properly formatted, it will not be match the dates inside d_range.
  • The implementation will clean up the time columns in each dataframe
    • Set columns to datetime format
    • We only care about the date, not the time
    • As already stated, the datetime format is integral. If the columns are not properly formatted, time will not be found inside d_time.

Update the two dataframes:

import pandas as pd

# create dataframes from your test set and clean-up the datetime columns
df_a['time'] = (pd.to_datetime(df_a['time'], format='%d.%m.%Y %H:%M:%S')).dt.date
df_b['start-date'] = pd.to_datetime(df_b['start-date'], format='%d.%m.%Y %H:%M:%S').dt.date
df_b['end-date'] = pd.to_datetime(df_b['end-date'], format='%d.%m.%Y %H:%M:%S').dt.date

# df_a view

       time   project
 2015-05-06  project1
 2015-06-22  project2
 2015-05-28  project2
 2015-06-28  project1

# df_b view

start-date   end-date   project maturity
2015-05-02 2015-06-24  project1      one
2015-06-26 2015-07-27  project1      two
2015-05-16 2015-05-27  project2      one
2015-05-30 2015-06-27  project2      two

# add d_range to df_b
df_b['d_range'] = df_b[['start-date', 'end-date']].apply(lambda x: pd.date_range(x[0], x[1]), axis=1)

Add maturity to df_a

  • mask is the result of searching df_b for the date from df_a
    • mask matches the date of any project
  • return only the result for the matching project
def date_query(x):
    mask = df_b[['project', 'maturity']][df_b['d_range'].apply(lambda y: y.isin([x[0]]).any())].reset_index(drop=True)
    result = mask['maturity'][mask['project'] == x[1]].reset_index(drop=True)
    return result

# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)

# df_a updated

       time   project maturity
 2015-05-06  project1      one
 2015-06-22  project2      two
 2015-05-28  project2      NaN
 2015-06-28  project1      two

Item 3:

  • result from def date_query is a pandas.Series, if there is no matching date range, it will be empty, which can be checked with .empty
  • Update def date_query to check if result is empty. Call def check_min_timedelta if result is empty.
  • If there are multiple matching minimum values, .idxmin will return the first occurence
def check_min_timedelta(x):
    """
    Create a timedelta between time and end-date
    Return maturity for the row with the minimum time date
    """
    end_diff = abs(df_b['end-date'][df_b['project'] == x[1]] - x[0]).idxmin()
    return df_b['maturity'].loc[end_diff]

# update def date_query
def date_query(x):
    mask = df_b[['project', 'maturity']][df_b['d_range'].apply(lambda y: y.isin([x[0]]).any())].reset_index(drop=True)
    result = mask['maturity'][mask['project'] == x[1]].reset_index(drop=True)
    if result.empty:
        result = check_min_timedelta(x)
    return result

# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)

# final df_a:

       time   project maturity
 2015-05-06  project1      one
 2015-06-22  project2      two
 2015-05-28  project2      one
 2015-06-28  project1      two

Alternate Approach:

  • This is similar to your original approach of using inequalities to compare time to start-date & end-date
  • This approach does not rely on creating a column of date range lists
  • def date_query(x): has been updated and .dt.date has been removed.
df_a['time'] = pd.to_datetime(df_a['time'], format='%d.%m.%Y %H:%M:%S')
df_b['start-date'] = pd.to_datetime(df_b['start-date'], format='%d.%m.%Y %H:%M:%S')
df_b['end-date'] = pd.to_datetime(df_b['end-date'], format='%d.%m.%Y %H:%M:%S')

def check_min_timedelta(x):
    """
    Create a timedelta between time and end-date
    Return maturity for the row with the minimum time date
    """
    end_diff = abs(df_b['end-date'][df_b['project'] == x[1]] - x[0]).idxmin()
    return df_b['maturity'].loc[end_diff]

# update def date_query
def date_query(x):
    mask = df_b[['project', 'maturity']][df_b[['start-date', 'end-date']].apply(lambda y: ((x[0] >= y[0]) & (x[0] <= y[1])), axis=1)].reset_index(drop=True)
    result = mask['maturity'][mask['project'] == x[1]].reset_index(drop=True)
    if result.empty:
        result = check_min_timedelta(x)
    return result

# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.