
This is my first question on Stack Overflow, so I will simplify the problem I have at the moment. I am trying to clean a dataset for a user-based collaborative filtering recommendation system.

Here's an oversimplified version of the dataset I have, covering all the use cases:

data = pd.DataFrame({'name':    ['John' ,'Jane' ,'Joe'  ,'John' ,'Jane' ,   'Joe'],
                     'movie1':  [''     , 'bad' , 'avg' , 'nice', ''    , ''    ],
                     'movie2':  ['good' , ''    , ''    , ''    , 'poor', ''    ],
                     'movie3':  ['bad'  , ''    , 'good', ''    , ''    , ''    ],
                     })

From how I sourced my data, I know that even though John, Jane and Joe might appear any number of times, they will never have more than one rating for any given movie.

I want to aggregate repeated users into a single row, so that my output in the terminal looks like this:

   name movie1 movie2 movie3
0  John   nice   good    bad
1  Jane    bad   poor       
2   Joe    avg          good

This problem is very similar to this question, but the difference is that I'm dealing with string objects and not numbers, so I can't use numeric aggregation functions: How can I "merge" rows by same value in a column in Pandas with aggregation functions?

My real dataset has 4260 columns and 24169 rows, so I can't apply something like df.groupby(['name','month'])['text'].apply(','.join).reset_index() because it isn't feasible to write out all the column names (from: Concatenate strings from several rows using Pandas groupby). The pattern I mean is sketched below.
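For reference, this is roughly the pattern from that linked answer, only as a sketch (the column names name, month and text come from that question, not from my data); it only works when the columns to join can be named explicitly:

import pandas as pd

# Toy frame using the linked question's column names (illustrative only)
df = pd.DataFrame({'name':  ['a', 'a', 'b'],
                   'month': [1, 1, 2],
                   'text':  ['x', 'y', 'z']})

# Concatenate the 'text' strings per (name, month) group -- the column must be named
joined = df.groupby(['name', 'month'])['text'].apply(','.join).reset_index()
print(joined)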

I tried following the answers to this question, but I either got errors or my dataframe stayed the same: Pandas | merge rows with same id

Even though it didn't make sense to me logically, I tried data.groupby('name').ffill().drop_duplicates('name', keep='last') and got the following error: KeyError: Index(['name'], dtype='object')

Passing as_index=False to the groupby gave me the exact same error: data.groupby('name', as_index=False).ffill().reset_index().drop_duplicates('name', keep='last')

The closest I've gotten is this: data = data.groupby('name', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])

The output it gives me only drops the repeated rows but doesn't merge the ratings into the remaining ones:

   name movie1 movie2 movie3
0  Jane    bad              
1   Joe    avg          good
2  John          good    bad

Complete Code:

import pandas as pd

data = pd.DataFrame({'name':    ['John' ,'Jane' ,'Joe'  ,'John' ,'Jane' ,   'Joe'],
                     'movie1':  [''     , 'bad' , 'avg' , 'nice', ''    , ''    ],
                     'movie2':  ['good' , ''    , ''    , ''    , 'poor', ''    ],
                     'movie3':  ['bad'  , ''    , 'good', ''    , ''    , ''    ],
                     })
print('Baseline:')
print(data.head())

# Earlier attempts (all commented out):
#data = data.join(data['name'])
#data.groupby('name').ffill().drop_duplicates('name', keep='last')
#data.groupby('name', as_index=False).ffill().reset_index().drop_duplicates('name', keep='last')
#data.groupby('name').ffill().drop_duplicates('name', keep='last')
#data = data.groupby(['name'])[['movie1','movie2','movie3']].apply('.'.join).reset_index()

# Closest attempt so far: keeps one row per name but loses the other ratings
data = data.groupby('name', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])
print('End result:')
print(data.head())

1 Answer


IIUC, you can use groupby + first. The trick is to replace the empty strings with NaN, then roll back after selecting the first valid value:

>>> import numpy as np
>>> data.replace('', np.nan).groupby('name', as_index=False, sort=False).first().fillna('')

   name movie1 movie2 movie3
0  John   nice   good    bad
1  Jane    bad   poor       
2   Joe    avg          good
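
For completeness, here is a self-contained version of the above, only as a sketch (it assumes pandas and numpy are installed; np.nan is just a temporary placeholder for the empty strings):

import pandas as pd
import numpy as np

data = pd.DataFrame({'name':   ['John', 'Jane', 'Joe' , 'John', 'Jane', 'Joe'],
                     'movie1': [''    , 'bad' , 'avg' , 'nice', ''    , ''   ],
                     'movie2': ['good', ''    , ''    , ''    , 'poor', ''   ],
                     'movie3': ['bad' , ''    , 'good', ''    , ''    , ''   ]})

# 1. Treat empty strings as missing values
# 2. Keep the first non-missing rating per name
# 3. Restore empty strings for display
result = (data.replace('', np.nan)
              .groupby('name', as_index=False, sort=False)
              .first()
              .fillna(''))
print(result)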

1 Comment

Yes! Thanks a ton, I only got a warning but it did the trick! "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()" from user_data = user_data.replace('', np.nan).groupby('user', as_index=False, sort=False).first().fillna('')
