
Hi, I am working on a dataset like the following example:

[screenshot of the sample data]

The data contains start_time, end_time, id and url. For one id/url group I have different "in" and "out" values, but the in and out values are in different rows, and I want to fill the missing end_time/start_time values. For this I have to use the following logic (illustrated by the small sketch after the list):

  1. If I have a value in start_time and end_time is null, then I have to fill end_time with the closest end_time such that end_time >= start_time, and delete the used/matched row.
  2. After all the rows having a start_time are filled and the used/matched rows are deleted, if some rows with an empty start_time still remain, then I have to fill start_time with the same value as end_time.
  3. If no matching end_time value is found for a given start_time, then I have to fill end_time with the same value as start_time.
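
To make the rules concrete, here is a minimal sketch of how they would apply to one id/url group (the three-row DataFrame here is a hypothetical illustration, not my real data):

import pandas as pd

# hypothetical single id/url group, other columns omitted
group = pd.DataFrame({
    "type":       ["in", "out", "out"],
    "start_time": [pd.Timestamp("2021-08-25 15:23:37"), pd.NaT, pd.NaT],
    "end_time":   [pd.NaT,
                   pd.Timestamp("2021-08-25 15:23:52"),
                   pd.Timestamp("2021-08-25 15:10:29")],
})
# Rule 1: the "in" row (start 15:23:37) takes the closest end_time >= 15:23:37,
#         i.e. 15:23:52, and that matched/"used" "out" row is deleted.
# Rule 2: the remaining "out" row (end 15:10:29) still has no start_time,
#         so its start_time is set to its end_time.
# Rule 3: if no end_time >= 15:23:37 existed, the "in" row's end_time
#         would instead be set to its own start_time.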

Considering the above, the expected result should be similar to the following. I am giving the output in two stages so that it is easy to understand:

  1. Fill the matching end_times for rows that have a start_time and delete the used/matched rows: [screenshot of stage-1 output]

  2. Final output, filling the remaining start_time/end_time values: [screenshot of final output]

Currently I am using the following approach to achieve this, but I feel it is not optimized:

 def process(self, param, context):
    df = context['data']
    # df = df.drop_duplicates()
    key_cols = param['keys_cols']
    start_time_col = param['start_time_col']
    end_time_col = param['end_time_col']
    guid_col = param.get('guid_col','guid')
    df_groupby = df.groupby(key_cols).size().reset_index()
    final_dfs = []
    condition = ''
    for key in key_cols:
        if condition == '':
            condition = '(df[\''+str(key)+"\']==row[\'"+str(key)+"\'])"
        else:
            condition = condition + ' & ' +'(df[\'' + str(key) + "\']==row[\'" + str(key) + "\'])"
    for index, row in df_groupby.iterrows():
        sub_df = df[eval(condition)]
        if sub_df[start_time_col].isnull().sum() != len(sub_df[start_time_col]) and (sub_df[end_time_col].isnull().sum() != len(sub_df[end_time_col])):
            sub_df = sub_df.sort_values([start_time_col, end_time_col], ascending=True)
            subdf_start_time_not_null = sub_df[sub_df[start_time_col].notnull()]
            subdf_end_time_not_null = sub_df[sub_df[end_time_col].notnull()]
            subdf_end_time_not_null['combined'] = subdf_end_time_not_null[end_time_col] +"__"+ subdf_end_time_not_null[guid_col]
            end_time_values = subdf_end_time_not_null['combined'].values.tolist()
            for row_number, (stime_index, stime_row) in enumerate(subdf_start_time_not_null.iterrows()):
                delete_index = row_number
                if row_number < len(end_time_values):
                    end_time_value = np.nan
                    if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[row_number]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
                        end_time_value = end_time_values[row_number]
                        subdf_start_time_not_null.at[stime_index,end_time_col] = str(end_time_values[row_number]).split("__")[0]
                    else:
                        prev_index = end_time_values.index(end_time_values[row_number])
                        for end_time in end_time_values:
                            current_index = end_time_values.index(end_time)
                            if current_index > prev_index:
                                if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[current_index]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
                                    subdf_start_time_not_null.at[stime_index, end_time_col] = end_time_values[current_index]
                                    delete_index = current_index
                                    end_time_value = end_time_values.pop(delete_index)
                                    break
                    subdf_end_time_not_null = subdf_end_time_not_null[subdf_end_time_not_null[guid_col]!=end_time_value.split("__")[1]]
                else:
                    subdf_start_time_not_null.at[stime_index,end_time_col] = subdf_start_time_not_null.at[stime_index,start_time_col]
            subdf_end_time_not_null.drop('combined', axis=1, inplace=True)
            sub_df = pd.concat([subdf_start_time_not_null,subdf_end_time_not_null])
        sub_df[start_time_col] = np.where(sub_df[start_time_col].isnull(),sub_df[end_time_col],sub_df[start_time_col])
        sub_df[end_time_col] = np.where(sub_df[end_time_col].isnull(),sub_df[start_time_col],sub_df[end_time_col])
        final_dfs.append(sub_df)
        # LOGGER.info('do something' +str(index))
    df = pd.concat(final_dfs)
    context['data'] = df
    context['continue'] = True
    return context

where param is as follows:

param = {"keys_cols":['id', 'url'], "start_time_col":"start_time","end_time_col":"end_time"}

and "df" is the data.

Please help review this and suggest how to make it more optimized. I have more than 70,000 rows of data with more than 12,000 id/url pairs in one file.

Looking forward to your suggestions.

Thanks

2 Comments
  • A few issues with your question: I couldn't understand what a "used" row means from the description. You will be able to get more help if you simplify your ask. Did you consider putting both start and end times in a single column and then recreating new values based on sequential order? Commented Oct 22, 2021 at 12:48
  • @S2L "used" refers to the row that has been matched with the closest end time; sorry if that created confusion. I have changed my statement. About putting both values in one column: how would that solve my problem? Commented Oct 24, 2021 at 2:56

2 Answers


If I understand the requirements correctly, we can do all of this within pandas. There are essentially two steps here:

  1. use pandas.merge_asof to fill in the nearest end_time
  2. use drop_duplicates to remove the out records we used in step 1

import pandas as pd
from io import StringIO

text = StringIO(
    """
             id                url type          start_time            end_time
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:23:28
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:27:34
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:23:52
o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:23:37                 NaT
o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:43:56                 NaT  # note: no record with `end_time` after this records `start_time`
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:10:29
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:25:00
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:15:49
o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:33:37 2021-08-25T15:34:37  # additional already complete record
"""
)
df = pd.read_csv(text, delim_whitespace=True, parse_dates=["start_time", "end_time"], comment="#")

# separate out unmatched `in` records and unmatched `out` records
df_in_unmatched = (
    df[(df.type == "in") & ~df.start_time.isna() & df.end_time.isna()]
    .drop(columns=["end_time"])
    .sort_values("start_time")
)
df_out_unmatched = (
    df[(df.type == "out") & df.start_time.isna() & ~df.end_time.isna()]
    .drop(columns=["type", "start_time"])
    .sort_values("end_time")
)

# match `in` records to closest `out` record with `out.end_time` >= `in.start_time`
df_in_matched = pd.merge_asof(
    df_in_unmatched,
    df_out_unmatched,
    by=["id", "url"],
    left_on="start_time",
    right_on="end_time",
    direction="forward",
    allow_exact_matches=True,
)

# fill in missing `end_time` for records with only `start_time`
df_in_matched["end_time"] = df_in_matched["end_time"].combine_first(
    df_in_matched["start_time"]
)

# combine matched records with remaining unmatched and deduplicate
# in order to remove "used" records
df_matched = (
    pd.concat([df_in_matched, df_out_unmatched], ignore_index=True)
    .drop_duplicates(subset=["id", "url", "end_time"], keep="first")
    .dropna(subset=["end_time"])
    .fillna({"type": "out"})
)

# fill in missing `start_time` for records with only `end_time`
df_matched["start_time"] = df_matched["start_time"].combine_first(
    df_matched["end_time"]
)

# combine matched records with unprocessed records: i.e. records
# that had both `start_time` and `end_time` (if extant)
df_final = pd.concat(
    [df_matched, df.dropna(subset=["start_time", "end_time"])], ignore_index=True
)

Result:

               id               url type         start_time             end_time
0 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:23:37 2021-08-25 15:23:52
1 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:43:56 2021-08-25 15:43:56
2 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:10:29 2021-08-25 15:10:29
3 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:15:49 2021-08-25 15:15:49
4 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:23:28 2021-08-25 15:23:28
5 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:25:00 2021-08-25 15:25:00
6 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:27:34 2021-08-25 15:27:34
7 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:33:37 2021-08-25 15:34:37
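
If you want to plug this back into your process() method, a rough sketch of the same approach driven by param could look like the following (it assumes the same param keys as in your question and the columns of the sample above, and is not tested beyond that sample):

import pandas as pd

def process(self, param, context):
    df = context['data']
    key_cols = param['keys_cols']
    start_col = param['start_time_col']
    end_col = param['end_time_col']

    # unmatched "in" rows: have start_time but no end_time
    df_in = (df[df[start_col].notna() & df[end_col].isna()]
             .drop(columns=[end_col])
             .sort_values(start_col))
    # unmatched "out" rows: have end_time but no start_time
    # (drop non-key columns shared with df_in, e.g. 'type', to avoid merge suffixes)
    df_out = (df[df[start_col].isna() & df[end_col].notna()]
              .drop(columns=['type', start_col])
              .sort_values(end_col))

    # rule 1: nearest end_time >= start_time within each id/url group
    matched = pd.merge_asof(df_in, df_out, by=key_cols,
                            left_on=start_col, right_on=end_col,
                            direction="forward", allow_exact_matches=True)
    # rule 3: no match found -> end_time = start_time
    matched[end_col] = matched[end_col].combine_first(matched[start_col])

    # drop the "used" out rows, then rule 2: start_time = end_time
    filled = (pd.concat([matched, df_out], ignore_index=True)
              .drop_duplicates(subset=key_cols + [end_col], keep="first")
              .dropna(subset=[end_col])
              .fillna({'type': 'out'}))
    filled[start_col] = filled[start_col].combine_first(filled[end_col])

    # add back the rows that already had both timestamps
    context['data'] = pd.concat(
        [filled, df.dropna(subset=[start_col, end_col])], ignore_index=True)
    context['continue'] = True
    return context

Since merge_asof does the per-group nearest match in a single vectorized pass, this avoids the Python-level loop over the 12,000+ id/url groups entirely.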

1 Comment

This solution looks better, thanks for introducing pd.merge_asof to me, I wasn't aware of that method before. Though this solution also needs some tweaks, I will accept this answer as it is better in performance. Thank you @onepan

Data:

>>> import pandas as pd
>>> df = pd.DataFrame(
    {"id" : ["o6FlbuA_5565423"]*8,
     "url" : ["https://vaa.66new"]*8,
     "type" : ["out"]*3 + ["in"]*2 + ["out"]*3,
     "start_time" : ["NULL"]*3 + ['2021-08-25 15:23:37', '2021-08-25 15:23:56'] +["NULL"]*3,
     "end_time" : ['2021-08-25 15:23:28', '2021-08-25 15:27:34', '2021-08-25 15:23:52', 'NULL', 'NULL', '2021-08-25 15:10:29', '2021-08-25 15:25:00', '2021-08-25 15:15:49']}
     )
>>> df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime, errors='coerce')
>>> df

    id                  url                 type    start_time              end_time
0   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:23:28
1   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:27:34
2   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:23:52
3   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:37     NaT
4   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:56     NaT
5   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:10:29
6   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:25:00
7   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:15:49

Solution:

# Get epoch time for both 'start_time' and 'end_time' columns
>>> df['start_time_epoch'] = df.start_time.apply(lambda x: x.timestamp() if not pd.isna(x) else None).astype('Int64')
>>> df['end_time_epoch'] = df.end_time.apply(lambda x: x.timestamp() if not pd.isna(x) else None).astype('Int64')

# Get closest value
>>> to_remove = []
>>> def fun(x):
...     for i in df.sort_values("end_time_epoch").end_time_epoch:
...         if i >= x.start_time_epoch:
...             to_remove.append(i)
...             return pd.to_datetime(i, unit='s')
...     else:
...         return pd.to_datetime(x.start_time_epoch, unit='s')
>>> r = df[df.start_time.notna() & df.end_time.isna()].apply(fun, axis=1).to_list()

# Fill end_time with the matched values
>>> df.loc[df.start_time.notna() & df.end_time.isna(), 'end_time'] = r

# Remove the rows whose end_time we used to fill missing values.
>>> df = df[~df.end_time_epoch.isin(to_remove)]

# Fill 'start_time' with 'end_time'
>>> df.loc[df.start_time.isna(), 'start_time'] = df.loc[df.start_time.isna(), 'end_time'].to_list()

# Drop the helper columns.
>>> df.drop(["start_time_epoch", "end_time_epoch"], axis=1, inplace=True)

>>> df

    id                  url                 type    start_time              end_time
0   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:23:28     2021-08-25 15:23:28
1   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:27:34     2021-08-25 15:27:34
3   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:37     2021-08-25 15:23:52
4   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:56     2021-08-25 15:25:00
5   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:10:29     2021-08-25 15:10:29
7   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:15:49     2021-08-25 15:15:49

4 Comments

Thanks for the reply. The solution seemingly works fine, but sometimes it gives a problem while handling None values and converting them to Int64; when that happens the whole df ends up empty.
Also, if the data has only one row, with a null start_time, then it fails too:
Please share the data where the solution fails.
id: o6FlbuA_5565423, url: https://vaa.66new, type: in, start_time: None, end_time: 2021-08-25 15:23:52
