I have a dataframe with two columns: date and bill_id. The dates span one year, from 01-01-2017 to 30-12-2017. There are 1000 unique bill_ids, and each bill_id occurs at least once in the bill_id column. The result is a DataFrame of 2 columns and 1,000,000 rows:

        dt  bill_id
01-01-2017  bill_1
01-01-2017  bill_2
02-01-2017  bill_1
02-01-2017  bill_3
03-01-2017  bill_4
03-01-2017  bill_4

So some bill_ids may occur on a given day while others do not.

What I want to achieve is a dataframe in which all unique bill_ids are columns, all unique dates are rows, and each cell holds 0, 1 or 2 for the corresponding day, where 0 = did not appear on that date yet, 1 = appeared on that date, and 2 = did not appear on that date but existed before. For example:

if a bill_id first appeared on 02-01-2017, it would have 0 on 01-01-2017, 1 on 02-01-2017, and 2 on 03-01-2017 and all subsequent days on which it is absent.
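
To make the target concrete, for the sample above the desired frame would be:

        dt  bill_1  bill_2  bill_3  bill_4
01-01-2017       1       1       0       0
02-01-2017       1       2       1       0
03-01-2017       2       2       2       1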

I did it in a few steps, but the code is slow and does not scale:

def map_values(row, df_z, c):
    # All rows of the helper frame for this row's date.
    subs = df_z[[c, 'bill_id', 'date']].loc[df_z['date'] == row['dt']]
    if c not in subs['bill_id'].values:  # .values: plain 'in' on a Series checks the index, not the data
        row[c] = max(subs[c].tolist())
    else:
        val = df_z[c].loc[(df_z['date'] == row['dt']) & (df_z['bill_id'] == c)].values
        assert len(val) == 1
        row[c] = val[0]
    return row


def map_to_one(x):
    # Mark the bills that appear in this date group with 1.
    bills_x = x['bill_id'].tolist()

    for b in bills_x:
        try:
            x.loc[x['bill_id'] == b, b] = 1  # single .loc avoids chained assignment
        except KeyError:  # a bare except would also swallow real errors
            pass
    return x


def replace_val(df_groupped, col):
    # Index labels of the rows where this bill was marked with 1.
    is_bill = df_groupped['bill_id'] == col
    mask = df_groupped.index[is_bill & (df_groupped[col] == 1)]

    min_dt = df_groupped.loc[mask.min(), 'date']
    max_dt = df_groupped.loc[mask.max(), 'date']

    # 0 before the first appearance, 1 in between, 2 after the last one.
    df_groupped.loc[df_groupped['date'] < min_dt, col] = 0
    df_groupped.loc[(df_groupped['date'] >= min_dt) & (df_groupped['date'] <= max_dt), col] = 1
    df_groupped.loc[df_groupped['date'] > max_dt, col] = 2
    return df_groupped


def reduce_cols(row):
    # Collapse the per-bill columns into one: pick this row's own bill column.
    col_id = row['bill_id']
    row['val'] = row[col_id]
    return row


df = df.sort_values(by='date')
df = df[pd.notnull(df['bill_id'])]
bills = list(set(df['bill_id'].tolist()))

for col in bills:
    df[col] = 9  # sentinel value, overwritten below

df_groupped = df.groupby('date')
df_groupped = df_groupped.apply(lambda x: map_to_one(x))
df_groupped = df_groupped.reset_index()
# persist the intermediate result and reload it as a plain frame
df_groupped.to_csv('groupped_in.csv', index=False)
df_groupped = pd.read_csv('groupped_in.csv')

for col in bills:
    df_groupped = replace_val(df_groupped, col)

df_groupped = df_groupped.apply(lambda row: reduce_cols(row), axis=1)
df_groupped.to_csv('out.csv', index=False)

cols = [x for x in df_groupped.columns if x not in ['index', 'date', 'bill_id', 'val']]
col_dt = sorted(list(set(df_groupped['date'].tolist())))
dd = {x:[0]*len(col_dt) for x in cols}
dd['dt'] = col_dt
df_mapped = pd.DataFrame(data=dd).set_index('dt').reset_index()

for c in cols:
    df_mapped = df_mapped.apply(lambda row: map_values(row, df_groupped[[c, 'bill_id', 'date']], c), axis=1)

EDIT:

The answer from Joe is fine, but I decided to go with another option instead:

  1. get date.min() and date.max() of the whole frame
  2. df_groupped = groupby bill_id
  3. df_groupped.apply a function in which I take date_x.min() and date_x.max() per group and compare date.min() with date_x.min() and date.max() with date_x.max(); that tells me where the 0s, 1s and 2s go :) (a sketch of this idea follows below)
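
A minimal sketch of that idea (helper names are mine; it assumes df has columns 'date', parsed as datetime, and 'bill_id', as in the code above):

import pandas as pd

# all calendar days in the frame's overall range
all_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

def encode(group):
    first = group['date'].min()             # first day this bill_id appears
    out = pd.Series(2, index=all_dates)     # default: existed before, absent that day
    out[all_dates < first] = 0              # has not appeared yet
    out[out.index.isin(group['date'])] = 1  # appears on that day
    return out

result = df.groupby('bill_id').apply(encode).T  # rows = dates, columns = bill_ids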
  • If you used another solution, don't post it in the text of the question, but rather write it as an answer. In general, if you found any answer useful, you can upvote it: meta.stackexchange.com/questions/173399/… Commented Nov 14, 2018 at 13:48

1 Answer


I hope I understood your desired output correctly.
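
For reference, here is a minimal frame reproducing the sample from the question:

import pandas as pd

df = pd.DataFrame({
    'dt': ['01-01-2017', '01-01-2017', '02-01-2017',
           '02-01-2017', '03-01-2017', '03-01-2017'],
    'bill_id': ['bill_1', 'bill_2', 'bill_1',
                'bill_3', 'bill_4', 'bill_4'],
})
# For the full year you would parse the dates first, e.g.
# df['dt'] = pd.to_datetime(df['dt'], format='%d-%m-%Y'),
# so that the crosstab rows come out in chronological order.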

First make a crosstab:

df1 = pd.crosstab(df['dt'], df['bill_id'])

Output:

bill_id     bill_1  bill_2  bill_3  bill_4
dt
01-01-2017       1       1       0       0
02-01-2017       1       0       1       0
03-01-2017       0       0       0       2

From here you modify the df step by step. First, create a copy that you will use as a mask:

df2 = df1.copy()

Replace the 0s that come after a 1 (or after other values > 1) with a forward fill:

# a 0 takes the last non-zero value seen above it in the column
for col in df2.columns:
    df2[col] = df2[col].replace(to_replace=0, method='ffill')

bill_id     bill_1  bill_2  bill_3  bill_4
dt
01-01-2017       1       1       0       0
02-01-2017       1       1       1       0
03-01-2017       1       1       1       2
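
(Recent pandas versions deprecate the method argument of replace; if that bites, a mask/ffill combination should be equivalent:)

# treat zeros as missing, forward-fill the last nonzero count, restore leading zeros
df2 = df1.mask(df1.eq(0)).ffill().fillna(0).astype(int)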

Now subtract the two DataFrames:

df3 = df1-df2

These are the changed values:

bill_id     bill_1  bill_2  bill_3  bill_4
dt
01-01-2017       0       0       0       0
02-01-2017       0      -1       0       0
03-01-2017      -1      -1      -1       0

Replace these values with 2:

for col in df3.columns:
    df3[col] = df3[col].replace(-1, 2)

Go back to the original df1 and change the values greater than 1 to 1:

for col in df1.columns:
    df1[col] = df1[col].apply(lambda x: x if x < 2 else 1)
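
The loop works; the same clamping can also be written as a single vectorized call:

df1 = df1.clip(upper=1)  # any count of 2 or more becomes 1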

In the end, sum this last df with df3:

df_add = df1.add(df3, fill_value=0)

Output:

bill_id     bill_1  bill_2  bill_3  bill_4
dt
01-01-2017       1       1       0       0
02-01-2017       1       2       1       0
03-01-2017       2       2       2       1

To complete, replace any remaining negative values with 2:

for col in df_add.columns:
    df_add[col] = df_add[col].apply(lambda x: 2 if x < 0 else x)
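
As a footnote, the whole walkthrough can, as far as I can tell, be collapsed into a few vectorized lines (a sketch, assuming dt is parsed as datetime so the crosstab rows come out chronologically):

ct = pd.crosstab(df['dt'], df['bill_id'])
seen = ct.cummax().gt(0)  # True once a bill has appeared on or before this date
result = seen.astype(int) + (seen & ct.eq(0)).astype(int)  # yields the 0/1/2 encoding

seen alone gives 0 before the first appearance and 1 afterwards; the second term bumps the absent-but-existing days up to 2.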