3

I've been given a few sets of MS-Excel worksheets with a lot of nested data in areas, and I have researching for a few hours looking for a way to reduce each 'id' row to single rows. Specifically merging 'Step ID', 'Install Steps', and 'Expected step' into single lines with some formatting.

Here is shortened simple of the data within the Excel sheets I need to convert.

Name ID Host Step ID Install Step Expected step Extra
Test1 4 Cat 1 Move x to y x is with y x will protest
2 move x away from y x and y are not together y will protest
Test2 5 Dog 1 remove x from tank y is alone
2 Drop duplicate of y, y2 in tank y1 is not alone y1 will protest
3 Drop more duplicates of y into tank, y3 and y4 y1 and y2 will protest
test 3 6 Dog 1 empty tank nothing is in tank

And I am looking to transform this excel sheet into the following

Name ID Host Install Step Expected step Extra
Test1 4 Cat 1 - Move x to y
2 - move x away from y
1 - x is with y
2 - x and y are not together
1 - x will protest
2 - y will protest
Test2 5 Dog 1 - remove x from tank
2 - Drop duplicate of y, y2 in tank
3 - Drop more duplicates of y into tank, y3 and y4
1 - y is alone
2 - y1 is not alone

2 - y1 will protest <br / > y1 and y2 will protest
Test3 6 Dog 1 - empty tank 1 - nothing is in tank

I have testing a few of the other Stackoverflow questions and repsonses for pandas, but the few that closely match my need just fill in the empty areas with duplicate data.

0

2 Answers 2

2

If you melt() the dataframe:

melt = df.melt(['Name', 'ID', 'Host', 'Step ID']).ffill()

#      Name   ID  Host  Step ID      variable               value
# 0   Test1  4.0   Cat        1  Install Step         Move x to y
# 1   Test1  4.0   Cat        2  Install Step  move x away from y
# ...
# 16  Test2  5.0   Dog        3         Extra     y1 will protest
# 17  Test3  6.0   Dog        1         Extra     y1 will protest

You can combine the Step ID and value columns in one shot:

melt.value = melt['Step ID'].astype(str) + ' - ' + melt.value
melt = melt.drop('Step ID', axis=1)

#      Name   ID  Host      variable                   value
# 0   Test1  4.0   Cat  Install Step         1 - Move x to y
# 1   Test1  4.0   Cat  Install Step  2 - move x away from y
# ...
# 16  Test2  5.0   Dog         Extra     3 - y1 will protest
# 17  Test3  6.0   Dog         Extra     1 - y1 will protest

Then join each group's value list together with \n and unstack() to pivot back to the wide table:

melt.groupby(['Name', 'ID', 'Host', 'variable']).agg('\n'.join).unstack()
Name ID Host Expected Step Extra Install Step
0 Test1 4.0 Cat 1 - x is with y\n2 - x and y are not together 1 - x will protest\n2 - y will protest 1 - Move x to y\n2 - move x away from y
1 Test2 5.0 Dog 1 - y is alone\n2 - y1 is not alone\n3 - y1 an... 1 - y will protest\n2 - y1 will protest\n3 - y... 1 - remove x from tank\n2 - Drop duplicate of ...
2 Test3 6.0 Dog 1 - nothing is in tank 1 - y1 will protest 1 - empty tank
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you for all the help. Looks like melt will resolve it nicely.
Great, no problem.
1

For the formatting bit:

df = df.ffill()  # forward fill the empty areas
step_id_str = df['Step ID'].astype(str)
for col in ['Install Step','Expected Step','Extra']:
    df[col] = step_id_str + ' - ' + df[col]

For merging the rows:

group = df.groupby('ID')
for col in ['Install Step','Expected Step','Extra']:
    df[col] = group[col].transform(lambda s: '\n'.join(s))
df = df.drop('Step ID', axis=1)
df = df.drop_duplicates()
df

1 Comment

I was able to use part of you answer also. Thank you for the help

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.