Duplicate rows in a DataFrame based on column values, outputting column name

Question

I have a DataFrame that looks something like this:

import pandas as pd

data = [
    ['item 1', 'Some text', 0.0, 1, 0.25],
    ['item 2', 'Some other text', 0.5, 0.0, 0.0],
    ['item 3', 'Etc.', 0.0, 0.25, 0.0],
]

df = pd.DataFrame(data, columns=['item_name', 'description', 'class1', 'class2', 'class3'])

print(df)

  item_name      description  class1  class2  class3
0    item 1        Some text     0.0    1.00    0.25
1    item 2  Some other text     0.5    0.00    0.00
2    item 3             Etc.     0.0    0.25    0.00

I would like to duplicate each row once for every value greater than 0 found in columns class1 to class3, outputting item_name, description, and the name of the class column. The expected result is:

  item_name      description    class
0    item 1        Some text   class2
1    item 1        Some text   class3
2    item 2  Some other text   class1
3    item 3             Etc.   class2

I managed to get output that goes in the right direction by using iterrows; however, I can only access the class value, not its name:

data_transf = []
for index, row in df.iterrows():
    for col in row.loc['class1':'class3']:
        if col > 0:
            data_transf.append([row['item_name'], row['description'], col])

df_new = pd.DataFrame(data_transf, columns=['item_name', 'description', 'class'])

print(df_new)

  item_name      description  class
0    item 1        Some text   1.00
1    item 1        Some text   0.25
2    item 2  Some other text   0.50
3    item 3             Etc.   0.25

The problem is that col is a float, and I can't find a way to access its index position to retrieve the class name. How can this be achieved? Perhaps there is a more elegant way to do this using built-ins or comprehensions?

pandas.pydata.org/pandas-docs/stable/reference/api/…

– user1558604, Commented Jul 5, 2020 at 17:06

akuiper · Accepted Answer · 2020-07-05 17:14:42Z

3

You can do this by transforming the data frame to long format with stack and then keeping only the values that are greater than 0:

# stack and filter
ldf = df.set_index(['item_name', 'description']).stack()[lambda x: x > 0]

# reset index
ldf = ldf.reset_index().drop(0, axis=1).rename(columns={'level_2': 'class'})

print(ldf)

#  item_name      description   class
#0    item 1        Some text  class2
#1    item 1        Some text  class3
#2    item 2  Some other text  class1
#3    item 3             Etc.  class2
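To make the intermediate steps visible, here is a minimal sketch of the same stack-and-filter idea split into stages. It assumes the DataFrame from the question; the variable names (stacked, positive, result) and the temporary 'value' column are illustrative only. Naming the index levels with rename_axis is a small variation that avoids relying on the auto-generated 'level_2' label.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [
        ['item 1', 'Some text', 0.0, 1, 0.25],
        ['item 2', 'Some other text', 0.5, 0.0, 0.0],
        ['item 3', 'Etc.', 0.0, 0.25, 0.0],
    ],
    columns=['item_name', 'description', 'class1', 'class2', 'class3'],
)

# Step 1: move the id columns into the index, then stack the class columns.
# The result is a Series whose third index level holds the column names.
stacked = df.set_index(['item_name', 'description']).stack()

# Step 2: keep only the positive entries.
positive = stacked[stacked > 0]

# Step 3: name the index levels explicitly, turn them back into columns,
# and drop the numeric values ('value' is a placeholder name).
result = (
    positive
    .rename_axis(['item_name', 'description', 'class'])
    .reset_index(name='value')
    .drop(columns='value')
)

print(result)
```

Because stack walks the frame row by row, the rows come out grouped by item, in the same order as the expected result in the question.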


1 Comment

Ch3steR Over a year ago

One-liner version of the answer: df.set_index(['item_name', 'description']).stack().to_frame('val').query("val>0").reset_index().drop(columns='val')

Ch3steR · Accepted Answer · 2020-07-05 17:30:35Z

1

Alternative using df.melt:

(df.melt(id_vars=['item_name', 'description'],var_name='class').
    query("value>0").drop(columns='value'))

  item_name      description   class
1    item 2  Some other text  class1
3    item 1        Some text  class2
5    item 3             Etc.  class2
6    item 1        Some text  class3
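Note that melt returns the rows grouped by class rather than by item, which is why the index above reads 1, 3, 5, 6. If the row order of the expected result matters, one possible sketch is to keep the original row labels during the melt and sort on them afterwards. This assumes pandas 1.1 or later for the ignore_index parameter; the names class_cols, long_df and result are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [
        ['item 1', 'Some text', 0.0, 1, 0.25],
        ['item 2', 'Some other text', 0.5, 0.0, 0.0],
        ['item 3', 'Etc.', 0.0, 0.25, 0.0],
    ],
    columns=['item_name', 'description', 'class1', 'class2', 'class3'],
)

class_cols = ['class1', 'class2', 'class3']

# ignore_index=False keeps each source row's label, so every melted row
# remembers which original row it came from.
long_df = df.melt(
    id_vars=['item_name', 'description'],
    value_vars=class_cols,
    var_name='class',
    ignore_index=False,
)

result = (
    long_df[long_df['value'] > 0]
    .drop(columns='value')
    # A stable sort on the original labels regroups rows by item while
    # preserving the class1 -> class3 order within each item.
    .sort_index(kind='mergesort')
    .reset_index(drop=True)
)

print(result)
```

Sorting on the original index rather than on item_name avoids lexicographic surprises such as 'item 10' sorting before 'item 2'.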
