
Hi, I am working on a dataset like the following example:

[screenshot of the sample data]

The data contains start_time, end_time, id and url. For one id/url group I have different "in" and "out" values, but the in and out values are in different rows, and I want to fill the missing end_time/start_time values. For this I have to use the following logic (illustrated by the small sketch after the list):

  1. If I have a value in start_time and end_time is null, then I have to fill end_time with the closest end_time such that end_time >= start_time, and delete the used/matched row.
  2. After all the rows having a start_time are filled and the used/matched rows are deleted, if some rows with an empty start_time still remain, then I have to fill start_time with the same value as end_time.
  3. If no matching end_time value is found for a given start_time, then I have to fill end_time with the same value as start_time.
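
To make the rules concrete, here is a minimal sketch of how they would apply to one id/url group (the three-row DataFrame here is a hypothetical illustration, not my real data):

import pandas as pd

# hypothetical single id/url group, other columns omitted
group = pd.DataFrame({
    "type":       ["in", "out", "out"],
    "start_time": [pd.Timestamp("2021-08-25 15:23:37"), pd.NaT, pd.NaT],
    "end_time":   [pd.NaT,
                   pd.Timestamp("2021-08-25 15:23:52"),
                   pd.Timestamp("2021-08-25 15:10:29")],
})
# Rule 1: the "in" row (start 15:23:37) takes the closest end_time >= 15:23:37,
#         i.e. 15:23:52, and that matched/"used" "out" row is deleted.
# Rule 2: the remaining "out" row (end 15:10:29) still has no start_time,
#         so its start_time is set to its end_time.
# Rule 3: if no end_time >= 15:23:37 existed, the "in" row's end_time
#         would instead be set to its own start_time.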

Considering the above, the expected result should be similar to the following. I am giving the output in two stages so that it is easy to understand:

  1. Fill the matching end_times for rows that have a start_time and delete the used/matched rows: [screenshot of stage-1 output]

  2. Final output, filling the remaining start_time/end_time values: [screenshot of final output]

Currently I am using the following approach to achieve this, but I feel it is not optimized:

 def process(self, param, context):
    df = context['data']
    # df = df.drop_duplicates()
    key_cols = param['keys_cols']
    start_time_col = param['start_time_col']
    end_time_col = param['end_time_col']
    guid_col = param.get('guid_col','guid')
    df_groupby = df.groupby(key_cols).size().reset_index()
    final_dfs = []
    condition = ''
    for key in key_cols:
        if condition == '':
            condition = '(df[\''+str(key)+"\']==row[\'"+str(key)+"\'])"
        else:
            condition = condition + ' & ' +'(df[\'' + str(key) + "\']==row[\'" + str(key) + "\'])"
    for index, row in df_groupby.iterrows():
        sub_df = df[eval(condition)]
        if sub_df[start_time_col].isnull().sum() != len(sub_df[start_time_col]) and (sub_df[end_time_col].isnull().sum() != len(sub_df[end_time_col])):
            sub_df = sub_df.sort_values([start_time_col, end_time_col], ascending=True)
            subdf_start_time_not_null = sub_df[sub_df[start_time_col].notnull()]
            subdf_end_time_not_null = sub_df[sub_df[end_time_col].notnull()]
            subdf_end_time_not_null['combined'] = subdf_end_time_not_null[end_time_col] +"__"+ subdf_end_time_not_null[guid_col]
            end_time_values = subdf_end_time_not_null['combined'].values.tolist()
            for row_number, (stime_index, stime_row) in enumerate(subdf_start_time_not_null.iterrows()):
                delete_index = row_number
                if row_number < len(end_time_values):
                    end_time_value = np.nan
                    if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[row_number]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
                        end_time_value = end_time_values[row_number]
                        subdf_start_time_not_null.at[stime_index,end_time_col] = str(end_time_values[row_number]).split("__")[0]
                    else:
                        prev_index = end_time_values.index(end_time_values[row_number])
                        for end_time in end_time_values:
                            current_index = end_time_values.index(end_time)
                            if current_index > prev_index:
                                if int(str(subdf_start_time_not_null.at[stime_index,start_time_col]).replace(":","").replace(" ","").replace("-","")) <= int(str(end_time_values[current_index]).split("__")[0].replace(":","").replace(" ","").replace("-","")):
                                    subdf_start_time_not_null.at[stime_index, end_time_col] = end_time_values[current_index]
                                    delete_index = current_index
                                    end_time_value = end_time_values.pop(delete_index)
                                    break
                    subdf_end_time_not_null = subdf_end_time_not_null[subdf_end_time_not_null[guid_col]!=end_time_value.split("__")[1]]
                else:
                    subdf_start_time_not_null.at[stime_index,end_time_col] = subdf_start_time_not_null.at[stime_index,start_time_col]
            subdf_end_time_not_null.drop('combined', axis=1, inplace=True)
            sub_df = pd.concat([subdf_start_time_not_null,subdf_end_time_not_null])
        sub_df[start_time_col] = np.where(sub_df[start_time_col].isnull(),sub_df[end_time_col],sub_df[start_time_col])
        sub_df[end_time_col] = np.where(sub_df[end_time_col].isnull(),sub_df[start_time_col],sub_df[end_time_col])
        final_dfs.append(sub_df)
        # LOGGER.info('do something' +str(index))
    df = pd.concat(final_dfs)
    context['data'] = df
    context['continue'] = True
    return context

where param is as follows:

param = {"keys_cols":['id', 'url'], "start_time_col":"start_time","end_time_col":"end_time"}

and "df" is the data.

Please help review this and suggest how to make it more optimized. I have more than 70,000 rows of data with more than 12,000 id/url pairs in one file.

Looking forward to your suggestions.

Thanks

2 Comments
  • A few issues with your question: I couldn't understand what a "used" row means from the description. You will be able to get more help if you simplify your ask. Did you consider putting both start and end times in a single column and then recreating new values based on sequential order? Commented Oct 22, 2021 at 12:48
  • @S2L "used" refers to the row that has been matched with the closest end time; sorry if that created confusion. I have changed my statement. About putting both values in one column: how would that solve my problem? Commented Oct 24, 2021 at 2:56

2 Answers


If I understand the requirements correctly, we can do all of this within pandas. There are essentially two steps here:

  1. use pandas.merge_asof to fill in the nearest end_time
  2. use drop_duplicates to remove the out records we used in step 1

import pandas as pd
from io import StringIO

text = StringIO(
    """
             id                url type          start_time            end_time
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:23:28
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:27:34
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:23:52
o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:23:37                 NaT
o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:43:56                 NaT  # note: no record with `end_time` after this records `start_time`
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:10:29
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:25:00
o6FlbuA_5565423  https://vaa.66new  out                 NaT 2021-08-25T15:15:49
o6FlbuA_5565423  https://vaa.66new   in 2021-08-25T15:33:37 2021-08-25T15:34:37  # additional already complete record
"""
)
df = pd.read_csv(text, delim_whitespace=True, parse_dates=["start_time", "end_time"], comment="#")

# separate out unmatched `in` records and unmatched `out` records
df_in_unmatched = (
    df[(df.type == "in") & ~df.start_time.isna() & df.end_time.isna()]
    .drop(columns=["end_time"])
    .sort_values("start_time")
)
df_out_unmatched = (
    df[(df.type == "out") & df.start_time.isna() & ~df.end_time.isna()]
    .drop(columns=["type", "start_time"])
    .sort_values("end_time")
)

# match `in` records to closest `out` record with `out.end_time` >= `in.start_time`
df_in_matched = pd.merge_asof(
    df_in_unmatched,
    df_out_unmatched,
    by=["id", "url"],
    left_on="start_time",
    right_on="end_time",
    direction="forward",
    allow_exact_matches=True,
)

# fill in missing `end_time` for records with only `start_time`
df_in_matched["end_time"] = df_in_matched["end_time"].combine_first(
    df_in_matched["start_time"]
)

# combine matched records with remaining unmatched and deduplicate
# in order to remove "used" records
df_matched = (
    pd.concat([df_in_matched, df_out_unmatched], ignore_index=True)
    .drop_duplicates(subset=["id", "url", "end_time"], keep="first")
    .dropna(subset=["end_time"])
    .fillna({"type": "out"})
)

# fill in missing `start_time` for records with only `end_time`
df_matched["start_time"] = df_matched["start_time"].combine_first(
    df_matched["end_time"]
)

# combine matched records with unprocessed records: i.e. records
# that had both `start_time` and `end_time` (if extant)
df_final = pd.concat(
    [df_matched, df.dropna(subset=["start_time", "end_time"])], ignore_index=True
)

Result:

               id               url type         start_time             end_time
0 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:23:37 2021-08-25 15:23:52
1 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:43:56 2021-08-25 15:43:56
2 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:10:29 2021-08-25 15:10:29
3 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:15:49 2021-08-25 15:15:49
4 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:23:28 2021-08-25 15:23:28
5 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:25:00 2021-08-25 15:25:00
6 o6FlbuA_5565423 https://vaa.66new  out 2021-08-25 15:27:34 2021-08-25 15:27:34
7 o6FlbuA_5565423 https://vaa.66new   in 2021-08-25 15:33:37 2021-08-25 15:34:37
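
If you want to plug this back into your process() method, a rough sketch of the same approach driven by param could look like the following (it assumes the same param keys as in your question and the columns of the sample above, and is not tested beyond that sample):

import pandas as pd

def process(self, param, context):
    df = context['data']
    key_cols = param['keys_cols']
    start_col = param['start_time_col']
    end_col = param['end_time_col']

    # unmatched "in" rows: have start_time but no end_time
    df_in = (df[df[start_col].notna() & df[end_col].isna()]
             .drop(columns=[end_col])
             .sort_values(start_col))
    # unmatched "out" rows: have end_time but no start_time
    # (drop non-key columns shared with df_in, e.g. 'type', to avoid merge suffixes)
    df_out = (df[df[start_col].isna() & df[end_col].notna()]
              .drop(columns=['type', start_col])
              .sort_values(end_col))

    # rule 1: nearest end_time >= start_time within each id/url group
    matched = pd.merge_asof(df_in, df_out, by=key_cols,
                            left_on=start_col, right_on=end_col,
                            direction="forward", allow_exact_matches=True)
    # rule 3: no match found -> end_time = start_time
    matched[end_col] = matched[end_col].combine_first(matched[start_col])

    # drop the "used" out rows, then rule 2: start_time = end_time
    filled = (pd.concat([matched, df_out], ignore_index=True)
              .drop_duplicates(subset=key_cols + [end_col], keep="first")
              .dropna(subset=[end_col])
              .fillna({'type': 'out'}))
    filled[start_col] = filled[start_col].combine_first(filled[end_col])

    # add back the rows that already had both timestamps
    context['data'] = pd.concat(
        [filled, df.dropna(subset=[start_col, end_col])], ignore_index=True)
    context['continue'] = True
    return context

Since merge_asof does the per-group nearest match in a single vectorized pass, this avoids the Python-level loop over the 12,000+ id/url groups entirely.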

1 Comment

This solution looks better, thanks for introducing pd.merge_asof to me, I wasn't aware of that method before. Though this solution also needs some tweaks, I will accept this answer as it is better in performance. Thank you @onepan

Data:

>>> import pandas as pd
>>> df = pd.DataFrame(
    {"id" : ["o6FlbuA_5565423"]*8,
     "url" : ["https://vaa.66new"]*8,
     "type" : ["out"]*3 + ["in"]*2 + ["out"]*3,
     "start_time" : ["NULL"]*3 + ['2021-08-25 15:23:37', '2021-08-25 15:23:56'] +["NULL"]*3,
     "end_time" : ['2021-08-25 15:23:28', '2021-08-25 15:27:34', '2021-08-25 15:23:52', 'NULL', 'NULL', '2021-08-25 15:10:29', '2021-08-25 15:25:00', '2021-08-25 15:15:49']}
     )
>>> df[['start_time', 'end_time']] = df[['start_time', 'end_time']].apply(pd.to_datetime, errors='coerce')
>>> df

    id                  url                 type    start_time              end_time
0   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:23:28
1   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:27:34
2   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:23:52
3   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:37     NaT
4   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:56     NaT
5   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:10:29
6   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:25:00
7   o6FlbuA_5565423     https://vaa.66new   out     NaT                     2021-08-25 15:15:49

Solution:

# Get epoch time for both 'start_time' and 'end_time' columns
>>> df['start_time_epoch'] = df.start_time.apply(lambda x: x.timestamp() if not pd.isna(x) else None).astype('Int64')
>>> df['end_time_epoch'] = df.end_time.apply(lambda x: x.timestamp() if not pd.isna(x) else None).astype('Int64')

# Get closest value
>>> to_remove = []
>>> def fun(x):
...     for i in df.sort_values("end_time_epoch").end_time_epoch:
...         if i >= x.start_time_epoch:
...             to_remove.append(i)
...             return pd.to_datetime(i, unit='s')
...     else:
...         return pd.to_datetime(x.start_time_epoch, unit='s')
>>> r = df[df.start_time.notna() & df.end_time.isna()].apply(fun, axis=1).to_list()

# Fill end_time with the matched values
>>> df.loc[df.start_time.notna() & df.end_time.isna(), 'end_time'] = r

# Remove the rows whose end_time we used to fill missing values.
>>> df = df[~df.end_time_epoch.isin(to_remove)]

# Fill 'start_time' with 'end_time'
>>> df.loc[df.start_time.isna(), 'start_time'] = df.loc[df.start_time.isna(), 'end_time'].to_list()

# Drop the helper columns.
>>> df.drop(["start_time_epoch", "end_time_epoch"], axis=1, inplace=True)

>>> df

    id                  url                 type    start_time              end_time
0   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:23:28     2021-08-25 15:23:28
1   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:27:34     2021-08-25 15:27:34
3   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:37     2021-08-25 15:23:52
4   o6FlbuA_5565423     https://vaa.66new   in      2021-08-25 15:23:56     2021-08-25 15:25:00
5   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:10:29     2021-08-25 15:10:29
7   o6FlbuA_5565423     https://vaa.66new   out     2021-08-25 15:15:49     2021-08-25 15:15:49

4 Comments

Thanks for the reply. The solution seemingly works fine, but sometimes it gives a problem while handling None values and converting them to Int64; when that happens the whole df ends up empty.
Also, if the data has only one row, with a null start_time, then it fails too:
Please share the data where the solution fails.
id: o6FlbuA_5565423, url: https://vaa.66new, type: in, start_time: None, end_time: 2021-08-25 15:23:52
