Create index/rows from distinct values in a column DataFrame

Question

I didn't really know how to give a good descriptive title, but here's my question. Let's consider a DataFrame df:

     col_name
0    Category1
1     item1()
2     item2()
3    Category2
4     item3()
5     item4()
6     item5()

I need to get this:

     categories   items
0     Category1   item1
1     Category1   item2
2     Category2   item3
3     Category2   item4
4     Category2   item5

But categories could be continents and items could be countries. I know that all the items have () with an expression inside, so I can easily provide a boolean mask and then create a list of categories with:

msk = df[~df['col_name'].str.contains(r'[^A-Za-z\s]')]['col_name'].tolist()

But now I'm stuck. Could you please give me some advice?

BENY · Accepted Answer · 2020-05-06 15:13:07Z

6

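Use startswith to find the category rows, keep only those values, and fill them down onto the item rows with ffill:

df['category'] = df.col_name.where(df.col_name.str.startswith('Category')).ffill()
# Alternative that does not depend on the category names: blank out the item rows instead
# df['category'] = df.col_name.mask(df.col_name.str.endswith(')')).ffill()
df = df[df.category != df.col_name]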
df
Out[241]: 
  col_name   category
1  item1()  Category1
2  item2()  Category1
4  item3()  Category2
5  item4()  Category2
6  item5()  Category2
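
To make the intermediate steps visible, here is a minimal self-contained sketch of the mask-and-ffill idea. It rebuilds the sample frame from the question and assumes, as the question states, that every item row ends with a closing parenthesis; the names is_item, only_categories and result are illustrative and not part of the answer.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"col_name": ["Category1", "item1()", "item2()",
                                "Category2", "item3()", "item4()", "item5()"]})

# Assumption: item rows are exactly the rows that end with ')'.
is_item = df["col_name"].str.endswith(")")

# mask() replaces the rows where the condition is True with NaN,
# so only the category labels survive ...
only_categories = df["col_name"].mask(is_item)

# ... and ffill() copies each label down onto the item rows below it.
df["category"] = only_categories.ffill()

# Keep the item rows; the category rows are now redundant.
result = df[is_item]
print(result)
```

The category rows could equally be dropped by comparing the two columns, as the answer does; filtering with the boolean mask just avoids a second comparison.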

edited May 6, 2020 at 15:13

answered May 6, 2020 at 15:09

BENY


2 Comments

Mit Over a year ago

That will only work if all of the continents start with the same substring, which I doubt will be the case!

BENY Over a year ago

@Mit check the commented-out line: df['category']=df.col_name.mask(df.col_name.str.endswith(')')).ffill()

jezrael · Accepted Answer · 2020-05-06 15:16:49Z

You need to specify how to tell category values apart from non-category values. This solution tests whether a value contains (, replaces the matching values with missing values and forward-fills them, then removes the parentheses, and finally filters by the original mask:

m = df['col_name'].str.contains('(', regex=False)
df['categories'] = df['col_name'].mask(m).ffill()
df['items'] = df.pop('col_name').str.replace(r'[()]', '', regex=True)
df = df[m]

print(df)
  categories  items
1  Category1  item1
2  Category1  item2
4  Category2  item3
5  Category2  item4
6  Category2  item5
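
The output above keeps the original row labels (1, 2, 4, ...), whereas the question shows a fresh 0 to 4 index. Below is a small sketch of the same approach with reset_index(drop=True) appended. The sample frame is rebuilt from the question, the name out is illustrative, and regex=True is passed explicitly because pandas 2.0 and later treat the str.replace pattern as a literal string by default.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"col_name": ["Category1", "item1()", "item2()",
                                "Category2", "item3()", "item4()", "item5()"]})

# True for item rows: they contain a literal '('.
m = df["col_name"].str.contains("(", regex=False)

out = pd.DataFrame({
    # Blank out item rows, then fill each category label downwards.
    "categories": df["col_name"].mask(m).ffill(),
    # Strip the parentheses from every value.
    "items": df["col_name"].str.replace(r"[()]", "", regex=True),
})

# Keep only the item rows and renumber them from 0.
out = out[m].reset_index(drop=True)
print(out)
```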

If you use your own mask with digits added to the character class, the solution becomes:

m = df['col_name'].str.contains(r'[^A-Za-z0-9\s]')
df['categories'] = df['col_name'].mask(m).ffill()
df['items'] = df.pop('col_name').str.replace(r'[()]', '', regex=True)
df = df[m]

print(df)
  categories  items
1  Category1  item1
2  Category1  item2
4  Category2  item3
5  Category2  item4
6  Category2  item5
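
To see why the digits matter: with the question's original character class [^A-Za-z\s], the 1 in Category1 counts as a disallowed character, so category rows are flagged as well. A quick check on two sample values (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(["Category1", "item1()"])

# Without digits in the class, the '1' in 'Category1' matches,
# so both rows are flagged.
without_digits = s.str.contains(r"[^A-Za-z\s]")

# With 0-9 allowed, only the row containing parentheses is flagged.
with_digits = s.str.contains(r"[^A-Za-z0-9\s]")

print(without_digits.tolist())  # [True, True]
print(with_digits.tolist())     # [False, True]
```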
