Hi guys, I am working on a dataset like the following example:
The data contains start_time, end_time, id and url. For one (id, url) group I have different in and out values; the problem is that the in and out values sit in different rows, and I want to fill the missing end_time/start_time values. For this I have to use the following logic (there is a small sketch of it right after the list):
- If a row has a start_time and its end_time is null, I have to fill the end_time with the closest end_time such that end_time >= start_time, and delete the used/matched row.
- After all rows having a start_time are filled and the used/matched rows are deleted, if some rows with an empty start_time still remain, I have to fill their start_time with the same value as the end_time.
- If no matching end_time value is found for a given start_time, I have to fill the end_time value with the same start_time value.
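To make these rules concrete, here is a rough, self-contained sketch of what I mean for a single (id, url) group. The function name fill_group is just mine for illustration; it assumes the unmatched rows carry exactly one of the two times and that the time columns compare correctly (datetimes or consistently formatted strings):

import bisect

def fill_group(g, start_col='start_time', end_col='end_time'):
    # rows that only have a start_time, and candidate rows that only have an end_time
    g = g.copy()
    starts = g[g[start_col].notna() & g[end_col].isna()].sort_values(start_col)
    ends = g[g[end_col].notna() & g[start_col].isna()].sort_values(end_col)
    end_times = ends[end_col].tolist()    # sorted candidate end_times
    end_rows = ends.index.tolist()        # their row labels
    used = []
    for idx, start in starts[start_col].items():
        pos = bisect.bisect_left(end_times, start)    # first candidate with end_time >= start_time
        if pos < len(end_times):
            g.at[idx, end_col] = end_times.pop(pos)   # rule 1: take the closest end_time and consume it
            used.append(end_rows.pop(pos))
        else:
            g.at[idx, end_col] = start                # rule 3: no match, end_time = start_time
    g = g.drop(index=used)                            # delete the used/matched rows
    g[start_col] = g[start_col].fillna(g[end_col])    # rule 2: leftover rows get start_time = end_time
    return g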
Keeping the above in mind, the expected result should be similar to the following; I am giving the output in two stages so that it is easier to understand.
Currently I am using the following approach to achieve this, but I feel it is not optimized:
import numpy as np
import pandas as pd


def process(self, param, context):
    df = context['data']
    # df = df.drop_duplicates()
    key_cols = param['keys_cols']
    start_time_col = param['start_time_col']
    end_time_col = param['end_time_col']
    guid_col = param.get('guid_col', 'guid')
    df_groupby = df.groupby(key_cols).size().reset_index()
    final_dfs = []

    # build a boolean-mask expression over the key columns; it is re-evaluated
    # against the full frame for every row of df_groupby below
    condition = ''
    for key in key_cols:
        if condition == '':
            condition = "(df['" + str(key) + "'] == row['" + str(key) + "'])"
        else:
            condition = condition + " & (df['" + str(key) + "'] == row['" + str(key) + "'])"

    for index, row in df_groupby.iterrows():
        sub_df = df[eval(condition)]
        # only try to match if the group has at least one start_time and one end_time
        if (sub_df[start_time_col].isnull().sum() != len(sub_df[start_time_col])
                and sub_df[end_time_col].isnull().sum() != len(sub_df[end_time_col])):
            sub_df = sub_df.sort_values([start_time_col, end_time_col], ascending=True)
            subdf_start_time_not_null = sub_df[sub_df[start_time_col].notnull()].copy()
            subdf_end_time_not_null = sub_df[sub_df[end_time_col].notnull()].copy()
            # tag every candidate end_time with its guid so the matched row can be dropped later
            subdf_end_time_not_null['combined'] = subdf_end_time_not_null[end_time_col] + "__" + subdf_end_time_not_null[guid_col]
            end_time_values = subdf_end_time_not_null['combined'].values.tolist()

            for row_number, (stime_index, stime_row) in enumerate(subdf_start_time_not_null.iterrows()):
                delete_index = row_number
                if row_number < len(end_time_values):
                    # compare timestamps as plain digit strings turned into integers
                    start_key = int(str(subdf_start_time_not_null.at[stime_index, start_time_col]).replace(":", "").replace(" ", "").replace("-", ""))
                    if start_key <= int(str(end_time_values[row_number]).split("__")[0].replace(":", "").replace(" ", "").replace("-", "")):
                        # the candidate in the same position already satisfies end_time >= start_time
                        subdf_start_time_not_null.at[stime_index, end_time_col] = str(end_time_values[row_number]).split("__")[0]
                    else:
                        # otherwise scan forward for the first candidate with end_time >= start_time
                        prev_index = end_time_values.index(end_time_values[row_number])
                        for end_time in end_time_values:
                            current_index = end_time_values.index(end_time)
                            if current_index > prev_index:
                                if start_key <= int(str(end_time_values[current_index]).split("__")[0].replace(":", "").replace(" ", "").replace("-", "")):
                                    subdf_start_time_not_null.at[stime_index, end_time_col] = str(end_time_values[current_index]).split("__")[0]
                                    delete_index = current_index
                                    break
                    # consume the matched candidate and drop its source row from the end_time frame
                    end_time_value = end_time_values.pop(delete_index)
                    subdf_end_time_not_null = subdf_end_time_not_null[subdf_end_time_not_null[guid_col] != end_time_value.split("__")[1]]
                else:
                    # no candidate end_time left for this start_time: fill end_time with start_time
                    subdf_start_time_not_null.at[stime_index, end_time_col] = subdf_start_time_not_null.at[stime_index, start_time_col]

            subdf_end_time_not_null.drop('combined', axis=1, inplace=True)
            sub_df = pd.concat([subdf_start_time_not_null, subdf_end_time_not_null])

        # whatever is still missing on one side is filled from the other side
        sub_df[start_time_col] = np.where(sub_df[start_time_col].isnull(), sub_df[end_time_col], sub_df[start_time_col])
        sub_df[end_time_col] = np.where(sub_df[end_time_col].isnull(), sub_df[start_time_col], sub_df[end_time_col])
        final_dfs.append(sub_df)
        # LOGGER.info('do something' + str(index))

    df = pd.concat(final_dfs)
    context['data'] = df
    context['continue'] = True
    return context
where param is as follows:
param = {"keys_cols": ['id', 'url'], "start_time_col": "start_time", "end_time_col": "end_time"}
and "df" is the data.
Please help me review this and suggest how to make it more optimized; I have more than 70,000 rows of data with more than 12,000 (id, url) pairs in one file.
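One direction I was already wondering about myself: instead of building the condition string and running eval against the full df for every group, I could probably let pandas do the grouping and apply the per-group logic directly, reusing a function like the fill_group sketch above. Roughly:

filled = df.groupby(param['keys_cols'], group_keys=False, sort=False).apply(fill_group)

Does that direction make sense, or is there an even better vectorised way (something like pd.merge_asof), keeping in mind that each end_time must be used at most once?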
Looking forward to your suggestions.
Thanks


