6

I am quite new to pandas. I need to aggregate 'Names' if they have the same name and then make an average for 'Rating' and 'NumsHelpful' (without counting NaN). 'Review' should get concatenated whilst 'Weight(Pounds)'should remain untouched:

col names: ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']

Name             'Brand'                             'Name'
1534             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1535             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1536             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1537             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1538             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1539             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1540             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   

        'NumsHelpful'     'Rating'       'Weight'
1534          NaN            2              4.5   
1535          NaN            2              4.5   
1536          NaN            NaN            4.5   
1537          NaN            NaN            4.5   
1538          2              NaN            4.5   
1539          3              5              4.5   
1540          5              NaN            4.5   

                        'Review'
1534                                     Yummy - Delish  
1535  The best Bloody Mary mix! - The best Bloody Ma...  
1536  Best Taste by far - I've tried several if not ...  
1537  Best bloody mary mix ever - This is also good ...  
1538  Outstanding - Has a small kick to it but very ...  
1539   OMG! So Good! - Spicy, terrific Bloody Mary mix!  
1540                      Good stuff - This is the best  

So the output should be something like this:

 'Brand'                'Name'                   'NumsHelpful'    'Rating' 
Zing Zang    Zing Zang Bloody Mary Mix, 32 fl oz     3.33             3

 'Weight'               'Review'
   4.5      Review1 / Review2 / ... / ReviewN

How shall I procede? Thanks.

3 Answers 3

10

Use DataFrameGroupBy.agg with dictionary of columns and aggregated functions - columns Weight and Brand are agregated by first - it means first values per groups:

d = {'NumsHelpful':'mean', 
     'Review':'/'.join, 
     'Weight':'first',
     'Brand':'first', 
     'Rating':'mean'}
df = df.groupby('Name').agg(d).reset_index()
print (df)
                                  Name  NumsHelpful  \
0  Zing Zang Bloody Mary Mix, 32 fl oz     3.333333   

                                              Review  Weight      Brand  \
0  Yummy - Delish/The best Bloody Mary mix! - The...     4.5  Zing Zang   

   Rating  
0     3.0  

Also in pandas 0.23.1 pandas version get:

FutureWarning: 'Name' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version

Solution is remove index name Name:

df.index.name = None

Or:

df = df.rename_axis(None)

Another possible solution is not aggregate by first, but add these column to groupby:

d = {'NumsHelpful':'mean',  'Review':'/'.join, 'Rating':'mean'}
df = df.groupby(['Name', 'Weight','Brand']).agg(d).reset_index()

Both solutions return same output if per groups there are same values.

EDIT:

If need convert string (object) column to numeric first try convert by astype:

df['Weight(Pounds)'] = df['Weight(Pounds)'].astype(float)

And if it failed use to_numeric with parameter errors='coerce' for convert non parseable strings to NaNs:

df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
Sign up to request clarification or add additional context in comments.

3 Comments

I tried using your suggestions but the my error still persists :/
@StefanoPozzi - are data confidental?
No but managed to fix it :) had to cast columns that I wanted to contain strings to the string type. Thx for help!
2

You can aggregate with a different function for each column using groupby + agg, together with a dictionary mapping series to functions. For example:

d = {'Rating': 'mean',
     'NumsHelpful': 'mean',
     'Review': ' | '.join,
     'Weight(Pounds)': 'first'}

res = df.groupby('Name').agg(d)

8 Comments

where 'mean' is np.mean or simply the string? Coz when I try to run the code it gives me a 'TypeError: sequence item 0: expected str instance, float found'
this is the output 'Brand object - Name object - NumsHelpful float64 - Rating float64 - Weight(Pounds) object - Review object' does this mean that weight has some non float arguments?
@StefanoPozzi - What is your pandas version?
@StefanoPozzi - object means it is obviously string column.
version = 0.22.0 / I tried using this -> pd.to_numeric(df['Weight(Pounds)']) but doesn't seem to make any change (still object)
|
0

I've seen this happen because when creating the index you chose to keep the column in the list, usually the column goes to the index is excluded from the table, so do the following:

# dataset_A was created with the option # drop = False
df_dataset_new = dataset_A.copy()
index_df = ['month', 'scop']
# dataset_new will be create`enter code here`d with the option # drop = True
df_dataset_new.set_index(index_df, drop=True, inplace=True, verify_integrity=True) 

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.