3

I have a pandas dataframe and I am cleaning the data using applymap on all elements using a custom function and storing the cleaned values in separate column.

tag0           tag1                         tag2            tag3
1.Kharif       3.Pest and Disease Management    4.Grasshopper   2.Paddy
1.Kharif       2.Brinjal                    3.Crop Growth Management    
1.Kharif       3.Pest and Disease Management    4.Caterpillar   2.Black Gram
1.Kharif       3.Pest and Disease Management    4.Caterpillar   2.Cotton    

The above is a part of the entire dataframe.

I wrote the function written below.

def tag_cleaner(tag):
'''
this function takes an argument called tag and checks if it starts with 1 then
 it puts it in a new column called season and so on. It is performed row-wise 
and at the end the dataframe will have columnar values
'''
    if tag.startswith('1'):
        df_tags['season'] = tag
    elif tag.startswith('2'):
        df_tags['crop'] = tag
    elif tag.startswith('3'):
        df_tags['maintopic'] = tag
    elif tag.startswith('4'):
        df_tags['subtopic'] = tag
    elif tag.startswith('5'):
        df_tags['issue'] = tag
    else:
        return tag

and then applied the applymap function

df_tags.applymap(tag_cleaner)

I expect the output to be like this

season          crop            maintopic                      subtopic
1. Kharif       2.Paddy         3. Pest and Disease Management     4. Grasshopper
1. Kharif       2. Brinjal      3. Crop Growth Management        NA
1. Kharif       2. Black Gram   3. Pest and Disease Management     4. Catterpillar
1. Kharif       2. Cotton       3. Pest and Disease Management     4. Catterpillar

The command is able to make the new columns as I wanted, but in all the columns there is same value. It looks like this. This is the same value copied in the entire dataframe.

season    crop    maintopic                 subtopic
1.Kharif  2.Paddy 3.Crop Growth Management  4. Caterpillar 

But I am getting this error

AttributeError: ("'float' object has no attribute 'startswith'", 'occurred at index tag2')

I am a beginner and I do not know where am I mistake. I think I am making a logical error in the function that I have defined which is why the last run of the function copies the value to the entire series of the dataframe. Please help.

2
  • the error means that there are some float values in column tag2, obviously further down in your entire dataframe. Maybe you could just change it to if str(tag).startswith etc. in your function. Commented Aug 30, 2019 at 5:20
  • yes, i did that also but when i do that then it gives me the error str object has no attribute str Commented Aug 30, 2019 at 5:22

1 Answer 1

2

Use:

#reshape DataFrame with remove original columns names
df = df.stack().to_frame('a').reset_index(level=1, drop=True).reset_index()
#get values before .
df['b'] = df['a'].str.split('.').str[0]
#dictionary for new columns names
d = {'1': 'season', '2': 'crop', '3': 'maintopic', '4':'subtopic','5':'issue'}
#pivoting and get new columns names
df = df.pivot('index','b','a').rename(columns=d).rename_axis(None, axis=1).rename_axis(None)

print (df)
     season          crop                      maintopic       subtopic
0  1.Kharif       2.Paddy  3.Pest and Disease Management  4.Grasshopper
1  1.Kharif     2.Brinjal       3.Crop Growth Management            NaN
2  1.Kharif  2.Black Gram  3.Pest and Disease Management  4.Caterpillar
3  1.Kharif      2.Cotton  3.Pest and Disease Management  4.Caterpillar

EDIT: Error means here there is multiple values with same number per rows, solution is use pivot_table with aggregtae function join:

print (df)
       tag0                           tag1                      tag2  \
0  1.Kharif  1.Pest and Disease Management             4.Grasshopper   
1  1.Kharif                      2.Brinjal  3.Crop Growth Management   
2  1.Kharif  3.Pest and Disease Management             4.Caterpillar   
3  1.Kharif  3.Pest and Disease Management             4.Caterpillar   

           tag3  
0       2.Paddy  
1           NaN  
2  2.Black Gram  
3      2.Cotton  

df = df.stack().to_frame('a').reset_index(level=1, drop=True).reset_index()
df['b'] = df['a'].str.split('.').str[0]
d = {'1': 'season', '2': 'crop', '3': 'maintopic', '4':'subtopic','5':'issue'}

df = df.pivot_table(index='index',columns='b',values='a', aggfunc=','.join).rename(columns=d).rename_axis(None, axis=1).rename_axis(None)

print (df)
                                   season          crop  \
0  1.Kharif,1.Pest and Disease Management       2.Paddy   
1                                1.Kharif     2.Brinjal   
2                                1.Kharif  2.Black Gram   
3                                1.Kharif      2.Cotton   

                       maintopic       subtopic  
0                            NaN  4.Grasshopper  
1       3.Crop Growth Management            NaN  
2  3.Pest and Disease Management  4.Caterpillar  
3  3.Pest and Disease Management  4.Caterpillar 
Sign up to request clarification or add additional context in comments.

1 Comment

I am getting this error ValueError: Index contains duplicate entries, cannot reshape

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.