1

I have a sample DataFrame as follows:

track_id track_date status status_info
track_1 2021-01-01 approved None
track_2 2021-01-02 None accredited
track_3 2021-01-03 approved accredited
track_4 2021-01-04 approved approved
track_5 2021-01-05 approved|approved accredited|cancelled
track_6 2021-01-06 None accredited|cancelled

And I need to split both status and status_info into rows, so it gives me an output similar to the one below:

track_id track_date status status_info
track_1 2021-01-01 approved None
track_2 2021-01-02 None accredited
track_3 2021-01-03 approved accredited
track_4 2021-01-04 approved approved
track_5 2021-01-05 approved accredited
track_5 2021-01-05 approved cancelled
track_6 2021-01-06 None accredited
track_6 2021-01-06 None cancelled

I have already tried the code below, using this answer in another question as reference:

# splitting string values into lists
new_status = df['status'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)
new_status_info = df['status_info'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)

# generating a temporary DataFrame to join later (error here)
df_split = pd.concat([new_status, new_status_info], axis=1, keys=['status', 'status_info'])

# then, we join both DataFrames
df.drop(columns=['status','status_info'], axis=1).join(df_split).reset_index(drop=True)

But it gives me a ValueError:

ValueError: cannot reindex from a duplicate axis

When I modify .reset_index(level=1, drop=True) to .reset_index(drop=True) on the split step, the join operation only brings me one of the values, and not the expected two:

track_id track_date status status_info
track_1 2021-01-01 approved None
track_2 2021-01-02 None accredited
track_3 2021-01-03 approved accredited
track_4 2021-01-04 approved approved
track_5 2021-01-05 approved cancelled
track_6 2021-01-06 None accredited

2 Answers 2

2

You can melt, explode, pivot:

cols = ['track_id', 'track_date']

(df.melt(cols, ignore_index=False).reset_index()
   .assign(value=lambda d: d['value'].str.split('|'))
   .explode('value')
   .assign(n=lambda d: d.groupby(level=0).cumcount())
   .pivot(index=cols+['index', 'n'], columns='variable', values='value')
   .reset_index(cols).droplevel('n')
   .groupby(level=0).ffill()
   .rename_axis(index=df.index.name, columns=df.columns.name)
)

Output:

  track_id  track_date    status status_info
0  track_1  2021-01-01  approved        None
1  track_2  2021-01-02      None  accredited
2  track_3  2021-01-03  approved  accredited
3  track_4  2021-01-04  approved    approved
4  track_5  2021-01-05  approved  accredited
4  track_5  2021-01-05  approved   cancelled
5  track_6  2021-01-06      None  accredited
5  track_6  2021-01-06      None   cancelled
Sign up to request clarification or add additional context in comments.

Comments

1

You can try to use itertools.zip_longest:

from itertools import zip_longest

df['status'] = df['status'].str.split('|')
df['status_info'] = df['status_info'].str.split('|')

# make `status`, `status_info` columns the same length
df[['status', 'status_info']] = df[['status', 'status_info']].apply(lambda x: [*zip(*zip_longest(x['status'], x['status_info']))], axis=1, result_type='expand')
print(df.explode(['status', 'status_info']))

Prints:

  track_id  track_date    status status_info
0  track_1  2021-01-01  approved        None
1  track_2  2021-01-02      None  accredited
2  track_3  2021-01-03  approved  accredited
3  track_4  2021-01-04  approved    approved
4  track_5  2021-01-05  approved  accredited
4  track_5  2021-01-05  approved   cancelled
5  track_6  2021-01-06      None  accredited
5  track_6  2021-01-06      None   cancelled

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.