
I didn't really know how to give a good descriptive title, but here's my question. Let's consider a DataFrame df:

     col_name
0    Category1
1     item1()
2     item2()
3    Category2
4     item3()
5     item4()
6     item5()

I need to get this:

     categories   items
0     Category1   item1
1     Category1   item2
2     Category2   item3
3     Category2   item4
4     Category2   item5

In the real data, the categories could be continents and the items could be countries. I know that every item contains () with an expression inside, so I can easily build a boolean mask and then create a list of categories with:

msk = df[~df['col_name'].str.contains(r'[^A-Za-z\s]')]['col_name'].tolist()

But now I'm stuck. Could you please give me some advice?

2 Answers

Use startswith to find the category rows, then create the other column with ffill:

# keep only the category rows (everything else becomes NaN), then forward-fill
df['category']=df.col_name.where(df.col_name.str.startswith('Category')).ffill()
#df['category']=df.col_name.mask(df.col_name.str.endswith(')')).ffill()
df=df[df.category!=df.col_name]
df
Out[241]: 
  col_name   category
1  item1()  Category1
2  item2()  Category1
4  item3()  Category2
5  item4()  Category2
6  item5()  Category2
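
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the commented-out variant (the one that keys on the trailing `)` rather than on a shared prefix). The sample frame is rebuilt from the question, and the names `is_item` and `result` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"col_name": ["Category1", "item1()", "item2()",
                                "Category2", "item3()", "item4()", "item5()"]})

# True for item rows, False for category rows
is_item = df["col_name"].str.endswith(")")

# Blank out the item rows (NaN), then carry the last category name downwards
category = df["col_name"].mask(is_item).ffill()

# Attach the new column and keep only the item rows
result = df.assign(category=category)[is_item]
print(result)
```

Because the category rows are dropped with the boolean mask, the original index labels of the item rows are kept.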

2 Comments

That will only work if the continents all start with the same substring, which I doubt will be the case!
@Mit check the commented-out line: df['category']=df.col_name.mask(df.col_name.str.endswith(')')).ffill()

Here it is necessary to specify how to distinguish category values from non-category values. This solution tests whether a value contains (, replaces those values with missing values and forward-fills them, then removes the () and finally filters by the original mask:

m = df['col_name'].str.contains('(', regex=False)
df['categories'] = df['col_name'].mask(m).ffill()
# regex=True is needed in recent pandas, where str.replace is literal by default
df['items'] = df.pop('col_name').str.replace(r'[()]', '', regex=True)
df = df[m]

print (df)
  categories  items
1  Category1  item1
2  Category1  item2
4  Category2  item3
5  Category2  item4
6  Category2  item5

With your mask, extended to allow digits, the solution becomes:

m = df['col_name'].str.contains(r'[^A-Za-z0-9\s]')
df['categories'] = df['col_name'].mask(m).ffill()
df['items'] = df.pop('col_name').str.replace(r'[()]', '', regex=True)
df = df[m]

print (df)
  categories  items
1  Category1  item1
2  Category1  item2
4  Category2  item3
5  Category2  item4
6  Category2  item5
