I'm working with a dataset that contains duplicates. If two rows have the same "similarity_index", they are duplicates of each other. I'm trying to merge these duplicates.

Here is my DataFrame:

           ad    soyad similarity_index
0       hakan  özdemir                0
1       hasan    yaman                1
2        naci    şenli                2
3      naciye      şen                2
4       osman    uygur                3
5        elif    sözen                4
6        irem   derici                5

Here is what I tried to do:

test_df.set_index("similarity_index").sort_index()

Here is the output:

                          ad    soyad
similarity_index                     
0                      hakan  özdemir
0                 hakan utku  özdemir
1                      hasan    yaman
2                       naci    şenli
2                     naciye      şen
3                      osman    uygur
4                       elif    sözen
5                       irem   derici
5                       irem   delici
6                       hako  özdemir

Here is what I want:

                          ad    soyad
similarity_index                     
0                      hakan  özdemir
                  hakan utku  özdemir
1                      hasan    yaman
2                       naci    şenli
                      naciye      şen
3                      osman    uygur
4                       elif    sözen
5                       irem   derici
                        irem   delici
6                       hako  özdemir

My goal is to be able to select the duplicate rows that share the same index. I tried groupby() and pivot_table(), but I couldn't find a proper way to do it.

1 Answer

What you want is a small helper function built on top of pandas' default boolean indexing: given a value of "similarity_index", it returns every row that carries that value.

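import pandas as pd

def index_duplicates_with_same_index(df, index, column_name):
    # Boolean mask: keep only the rows whose value in column_name equals index.
    return df[df[column_name] == index]

# Small sample: rows 2 and 3 share similarity_index 2, so they are duplicates.
df = pd.DataFrame([['hakan', 'özdemir', 0], ['hasan', 'yaman', 1], ['naci', 'şenli', 2], ['naciye', 'şen', 2]], columns=['ad', 'soyad', 'similarity_index'])
print(df)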


print(index_duplicates_with_same_index(df, 2, 'similarity_index'))
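If you do not know in advance which index values are duplicated, one possible variation is to let pandas find them with duplicated(keep=False) and then split the result with groupby(). This is only a sketch on the same sample data; the names dup_mask, duplicates, and groups are made up for illustration.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame(
    [["hakan", "özdemir", 0], ["hasan", "yaman", 1],
     ["naci", "şenli", 2], ["naciye", "şen", 2]],
    columns=["ad", "soyad", "similarity_index"],
)

# keep=False flags every member of a duplicated group, not only the repeats.
dup_mask = df.duplicated(subset="similarity_index", keep=False)
duplicates = df[dup_mask]
print(duplicates)

# One DataFrame per duplicated similarity_index value (hypothetical helper dict).
groups = {key: group for key, group in duplicates.groupby("similarity_index")}
print(groups[2])
```

Rows whose similarity_index occurs only once are dropped by the mask, so groups contains only the values that really have duplicates.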

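If the goal is the layout from the question, where a repeated similarity_index is printed only once, a two-level index may get close to it. The sketch below assumes a hypothetical helper column named "member" that numbers the rows inside each group; the sample rows are taken from the output in the question. By default pandas prints a repeated outer label of a MultiIndex only once (the display.multi_sparse option).

```python
import pandas as pd

# A few rows taken from the output shown in the question.
df = pd.DataFrame(
    [["hakan", "özdemir", 0], ["hakan utku", "özdemir", 0],
     ["hasan", "yaman", 1], ["naci", "şenli", 2], ["naciye", "şen", 2]],
    columns=["ad", "soyad", "similarity_index"],
)

# "member" is a made-up column name: position of the row inside its group (0, 1, ...).
df["member"] = df.groupby("similarity_index").cumcount()

# Two-level index: similarity_index on the outside, member on the inside.
grouped = (
    df.sort_values(["similarity_index", "member"])
      .set_index(["similarity_index", "member"])
)
print(grouped)

# Selecting by the outer level returns all rows of that duplicate group.
print(grouped.loc[2])
```

The extra index level costs one column in the printout, but it keeps every row addressable: grouped.loc[2] returns the whole group, and grouped.loc[(2, 1)] returns a single member.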
