How can "unique" show duplicate values in a dataframe?

Question

Background: I am very confused by my dataframe (df), which when I do some simple analyses is producing random rows for a specific value within my column named 'ID' (specifically, when ID == 42). As a result, I have started to do some troubleshooting.

When I try to list all the rows where ID = 42, I do:

data = df.loc[df['ID'] == 42]

And the rows look correct in this new variable called 'data'. However, when I scroll manually through the original dataframe df (e.g., in the Variable Explorer on Spyder), I can see there are way more rows for ID=42 that are not being printed to 'data'.

Then, to double check why the 'ID' values are showing this weird behavior, I did

print(df['ID'].unique())

And, weirdly, I get this:

[ 20. 31. 42. 42. 84. 142. 198. 248. 280. 288. 352. 378. 459. 498.] -- note that 42 is repeated!

My question is, how can there be two 42s when I use the .unique() function? I thought it was supposed to output all the unique values? If I could understand this better, I could start to understand the rest of the problems that ensue...

Am I missing something about how 'unique' works?

P.S. My files are big so I haven't included them, but if I need to provide more (numerical) context please let me know.

Thanks!

Mostly because floats do not compare well for equality. There is probably a small difference between your two versions of 42, and that difference is the answer to your question.
– jlandercy, Commented Apr 11, 2022 at 19:35
Similar to the above: if you use print(df['ID'].astype(int).unique()), do you still get a strange result?
– Rawson, Commented Apr 11, 2022 at 19:36
As a reminder, never use a float as a key (e.g. for indexing), mainly for this reason.
– jlandercy, Commented Apr 11, 2022 at 19:38
Hi @SRawson, thanks for the code. When I do that, 42 only shows up once! I thought I had avoided having my IDs as floats by using df['ID'] = pd.to_numeric(df['ID']), but they were still showing up as floats. I had tried df['ID'] = df['ID'].astype(int) before, but this gave me the error: Cannot convert non-finite values (NA or inf) to integer. Thanks for the assistance, I will see what I can do from here...
– g.is.stuck, Commented Apr 11, 2022 at 19:42
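
For readers following along, here is a minimal sketch of why pd.to_numeric can still leave floats and why astype(int) then fails. The series `raw` is made-up data, not the asker's file; the assumption is that the real 'ID' column contains at least one missing value, which is what the quoted error message suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: numeric strings plus one missing entry.
raw = pd.Series(["20", "42", None, "84"])

# to_numeric parses the strings, but the missing entry becomes NaN,
# and NaN can only live in a float column, so the dtype is float64.
as_numbers = pd.to_numeric(raw)
print(as_numbers.dtype)

# A plain integer dtype has no way to store NaN, so the cast is refused.
conversion_failed = False
try:
    as_numbers.astype(int)
except (ValueError, TypeError) as exc:
    conversion_failed = True
    print(f"astype(int) failed: {exc}")
```

So to_numeric did its job; it is the missing value that forces the float dtype and blocks the integer cast.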

Rawson · Accepted Answer · 2022-04-11 19:47:26Z

Moving my comment to an answer, as it solved the problem:

print(df['ID'].astype(int).unique())
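
A small sketch of what is probably going on, using made-up data (the frame below and the size of the error, 1e-9, are assumptions; the asker's actual values are not shown in the thread). Two floats that both print as 42. can still differ in a far decimal place, so unique() keeps both and an exact == 42 filter matches only one of them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third ID is 42 plus a tiny floating-point error.
df = pd.DataFrame({"ID": [20.0, 42.0, 42.0 + 1e-9, 84.0]})

# unique() compares exact float values, so both near-42 entries survive.
unique_floats = df["ID"].unique()
print(unique_floats)           # short array display hides the difference
print(unique_floats.tolist())  # full precision reveals it

# Exact equality misses the slightly-off row; a tolerance catches it.
exact_matches = df.loc[df["ID"] == 42]
close_matches = df.loc[np.isclose(df["ID"], 42)]
print(len(exact_matches), len(close_matches))

# Casting to int collapses the two values. Rounding first is an extra
# precaution (not part of the answer above): astype(int) truncates,
# so a value just below 42 would otherwise become 41.
unique_ints = df["ID"].round().astype(int).unique()
print(unique_ints)
```

Whether rounding is appropriate depends on the IDs really being whole numbers, which the thread implies but does not state outright.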

1 Comment

g.is.stuck Over a year ago

Thanks, I now did df = df.dropna(subset = ["ID"]) followed by df['ID'] = df['ID'].astype(int) before my analysis, and all my problems are solved. Thank you!
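
The cleanup described in this comment can be sketched as follows. The frame, the `value` column, and the round() call are illustrative assumptions rather than the asker's actual code; the second half shows pandas' nullable "Int64" dtype as one possible alternative when rows with a missing ID should be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing ID and one ID that is slightly off 42.
df = pd.DataFrame({
    "ID": [42.0, np.nan, 42.0 + 1e-9, 20.0],
    "value": [1, 2, 3, 4],
})

# Approach from the comment: drop rows without an ID, then cast to int.
# round() is added here so that tiny errors on either side of a whole
# number land on the intended integer.
cleaned = df.dropna(subset=["ID"]).copy()
cleaned["ID"] = cleaned["ID"].round().astype(int)
print(cleaned["ID"].unique())

# Alternative: keep every row and use the nullable integer dtype,
# which stores missing values as <NA> instead of forcing floats.
nullable = df.copy()
nullable["ID"] = nullable["ID"].round().astype("Int64")
print(nullable["ID"])
```

Dropping rows is simplest when a missing ID makes the row unusable anyway; the nullable dtype is worth considering when those rows still carry useful data.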