How to find duplicate values (not rows) in an entire pandas dataframe?

Question

Consider this dataframe.

import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
  one two three
0   a   e     a
1   b   f     j
2   c   g     h
3   d   h     a

How can I output all duplicate values and their respective index? The output can look something like this.

Andrej Kesely · Accepted Answer · 2021-09-17 21:02:42Z

2

Try .melt + .duplicated:

x = df.reset_index().melt("index")
print(
    x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
    .reset_index(drop=True)
    .rename(columns={"index": "id"})
)

Prints:

   id value
0   0     a
1   3     h
2   0     a
3   2     h
4   3     a
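
As a rough sketch building on the same idea: melt also keeps the source column name in its "variable" column, so that can be reported next to the row id if it is useful. The names long_form, mask, dupes and the "column" label are illustrative choices here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# Long form: one row per cell, with the original row label in "index"
# and the original column name in "variable".
long_form = df.reset_index().melt("index")

# keep=False flags every occurrence of a repeated value,
# not only the second and later ones.
mask = long_form.duplicated(["value"], keep=False)

dupes = (
    long_form.loc[mask, ["index", "variable", "value"]]
    .rename(columns={"index": "id", "variable": "column"})  # illustrative labels
    .reset_index(drop=True)
)
print(dupes)
```

Dropping "column" from the selection gives back the two-column id/value result shown above.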

answered Sep 17, 2021 at 21:02


2 Comments

user2962397 Over a year ago

Many replied quickly but I appreciate the simplicity of your approach. In fact, makes me feel silly for not thinking of this. Thanks!

Andrej Kesely Over a year ago

@user2962397 Thanks, happy coding! :)

Henry Ecker · Accepted Answer · 2021-09-17 21:16:56Z

2

We can stack the DataFrame into long form, use Series.loc to keep only the values flagged by Series.duplicated(keep=False), then call Series.reset_index to convert the result to a DataFrame:

new_df = (
    df.stack()  # Convert to Long Form
        .droplevel(-1).rename_axis('id')  # Handle MultiIndex
        .loc[lambda x: x.duplicated(keep=False)]  # Filter Values
        .reset_index(name='value')  # Make Series a DataFrame
)

new_df:

   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a
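
The keep=False argument matters here. A minimal sketch of the difference, assuming the same df as in the question (the variable names values, later_only and every_copy are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# Same long-form Series as in the answer: values in row-major order,
# indexed by the original row label.
values = df.stack().droplevel(-1).rename_axis('id')

# Default keep='first': the first occurrence of each value is NOT flagged.
later_only = values[values.duplicated()]

# keep=False: every occurrence of a repeated value is flagged.
every_copy = values[values.duplicated(keep=False)]

print(later_only)
print(every_copy)
```

With the default, the first 'a' and the first 'h' are left out, so keep=False is what returns every duplicated value together with its index.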

edited Sep 17, 2021 at 21:16

answered Sep 17, 2021 at 21:02


mozway · Accepted Answer · 2021-09-17 21:18:22Z

2

Here I used melt to reshape and duplicated(keep=False) to select the duplicates:

(df.rename_axis('id')
   .reset_index()
   .melt(id_vars='id')
   .loc[lambda d: d['value'].duplicated(keep=False), ['id','value']]
   .sort_values(by='id')
   .reset_index(drop=True)
 )

Output:

   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a
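
If one entry per repeated value is easier to read than a flat table, the filtered rows can be grouped afterwards. This is only a sketch of one possible presentation, not something the question asks for; long_form, dupes and locations are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# Reshape to long form, keeping the original row label as "id".
long_form = (
    df.rename_axis('id')
      .reset_index()
      .melt(id_vars='id', var_name='column')
)

# Keep every cell whose value occurs more than once anywhere in the frame.
dupes = long_form[long_form['value'].duplicated(keep=False)]

# One entry per repeated value, listing the row ids where it occurs.
locations = dupes.groupby('value')['id'].agg(list)
print(locations)
```

Within each group the ids appear in melt order (column by column), so sort the lists if a particular order matters.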

edited Sep 17, 2021 at 21:18

answered Sep 17, 2021 at 21:04


2 Comments

Henry Ecker Over a year ago

I believe loc with lambda would have less overhead than assign + query + drop. df.reset_index().melt(id_vars='index').loc[lambda d: d['value'].duplicated(keep=False), ['index','value']] (although that would be almost identical to Andrej's answer)

mozway Over a year ago

Yes you're right, I just didn't think of it at the time ;)
