Pandas groupby result into a dataframe

Question

I'm working with data that contains duplicates. If two rows have the same "similarity_index", they are duplicates. I'm trying to merge these duplicates.

Here is my DataFrame:

           ad    soyad similarity_index
0       hakan  özdemir                0
1       hasan    yaman                1
2        naci    şenli                2
3      naciye      şen                2
4       osman    uygur                3
5        elif    sözen                4
6        irem   derici                5

Here is what I tried to do:

test_df.set_index("similarity_index").sort_index()

Here is the output:

                          ad    soyad
similarity_index                     
0                      hakan  özdemir
0                 hakan utku  özdemir
1                      hasan    yaman
2                       naci    şenli
2                     naciye      şen
3                      osman    uygur
4                       elif    sözen
5                       irem   derici
5                       irem   delici
6                       hako  özdemir

Here is what I want:

                          ad    soyad
similarity_index                     
0                      hakan  özdemir
                  hakan utku  özdemir
1                      hasan    yaman
2                       naci    şenli
                      naciye      şen
3                      osman    uygur
4                       elif    sözen
5                       irem   derici
                        irem   delici
6                       hako  özdemir

What I'm trying to accomplish is selecting the duplicate rows that share the same index value. I tried groupby() and pivot_table(), but I couldn't find a proper way to do it.

John · Accepted Answer · 2018-08-22 09:56:36Z

What you want is a small wrapper around pandas' default boolean indexing: a function that selects the rows whose "similarity_index" equals a given value.

import pandas as pd

def index_duplicates_with_same_index(df, index, column_name):
    # Return every row whose value in column_name equals index.
    return df[df[column_name] == index]

df = pd.DataFrame(
    [['hakan', 'özdemir', 0], ['hasan', 'yaman', 1],
     ['naci', 'şenli', 2], ['naciye', 'şen', 2]],
    columns=['ad', 'soyad', 'similarity_index'])
print(df)

# Select the rows that share similarity_index 2 (naci and naciye).
print(index_duplicates_with_same_index(df, 2, 'similarity_index'))
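
The function above selects one group at a time. If the goal is to see all duplicate groups at once, or to get the grouped layout shown in the question, the following is a minimal sketch and not necessarily the only way to do it. It assumes the sample rows below (taken from the question's output) and uses a second index level named "position", a name chosen here for illustration. With a two-level index, pandas prints repeated values of the first level as blanks, which resembles the desired output.

```python
import pandas as pd

# Sample rows taken from the output shown in the question.
df = pd.DataFrame(
    {
        "ad": ["hakan", "hakan utku", "hasan", "naci", "naciye", "osman"],
        "soyad": ["özdemir", "özdemir", "yaman", "şenli", "şen", "uygur"],
        "similarity_index": [0, 0, 1, 2, 2, 3],
    }
)

# keep=False flags every member of a duplicated group, not only the repeats.
dupes = df[df.duplicated("similarity_index", keep=False)]

# Number the rows inside each group (0, 1, ...) and use that counter
# as a second index level. "position" is a hypothetical name.
position = df.groupby("similarity_index").cumcount().rename("position")
grouped = df.set_index(["similarity_index", position]).sort_index()

print(dupes)           # only the rows that belong to a duplicated group
print(grouped)         # all rows, grouped under their similarity_index
print(grouped.loc[2])  # the rows that share similarity_index 2
```

Here grouped.loc[2] plays the same role as index_duplicates_with_same_index(df, 2, 'similarity_index'), while dupes collects every duplicated group in a single DataFrame.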
