Background: I am confused by my dataframe (df). When I run some simple analyses, it produces unexpected rows for one specific value in my column named 'ID' (specifically, when ID == 42). As a result, I have started troubleshooting.

When I try to list all the rows where ID == 42, I do:

data=df.loc[df['ID'] == 42]

The rows in this new variable 'data' look correct. However, when I scroll manually through the original dataframe df (e.g., in the Variable Explorer in Spyder), I can see many more rows with ID == 42 that are not included in 'data'.

Then, to double check why the 'ID' values are showing this weird behavior, I did

print(df['ID'].unique())

And, weirdly, I get this:

[ 20. 31. 42. 42. 84. 142. 198. 248. 280. 288. 352. 378. 459. 498.]

Note that 42 is repeated!

My question is: how can there be two 42s when I use the .unique() function? I thought it was supposed to output only the unique values. If I could understand this better, I could start to understand the rest of the problems that follow from it...

Am I missing something about how 'unique' works?

P.S. My files are big, so I haven't included them, but if I need to provide more (numerical) context, please let me know.

Thanks!

5 Comments
  • It should be float Commented Apr 11, 2022 at 19:35
  • Mostly because floats do not compare well for equality. There is probably a small difference between your two versions of 42, and that difference is the answer to your question. Commented Apr 11, 2022 at 19:35
  • Similar to the above: if you use print(df['ID'].astype(int).unique()), do you still get a strange result? Commented Apr 11, 2022 at 19:36
  • As a reminder, never use floats as keys (e.g., for indexing), mainly for this reason. Commented Apr 11, 2022 at 19:38
  • Hi @SRawson, thanks for the code. When I do that, 42 only shows up once! I thought I had avoided having my IDs as floats by using df['ID'] = pd.to_numeric(df['ID']), but they were still showing up as floats... I had tried df['ID'] = df['ID'].astype(int) before, but this gave me the error: Cannot convert non-finite values (NA or inf) to integer. Thanks for the assistance, I will see what I can do from here... Commented Apr 11, 2022 at 19:42
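
To illustrate the point made in the comments, here is a minimal sketch using made-up data (the values are hypothetical, not taken from the asker's file). Two floats that differ by a tiny amount are distinct as far as .unique() is concerned, yet at NumPy's default print precision they are likely to display identically. The same tiny difference would also explain why an exact == 42 filter misses some rows.

```python
import pandas as pd

# Hypothetical data: the second "42" carries a tiny floating-point error.
df = pd.DataFrame({"ID": [20.0, 42.0, 42.0000000001, 84.0]})

uniques = df["ID"].unique()
print(uniques)  # the two 42s may look the same at the default print precision

# repr() of a Python float shows enough digits to tell the values apart.
full_precision = [repr(float(x)) for x in uniques]
print(full_precision)

# An exact equality filter only matches the value that is exactly 42.0.
exact_matches = df.loc[df["ID"] == 42]
print(len(exact_matches))
```

Printing the values at full precision like this is a quick way to check whether apparent duplicates are really different numbers.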

1 Answer

Moving my comment to an answer, as it solved the problem:

print(df['ID'].astype(int).unique())
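
One caveat worth keeping in mind: astype(int) truncates toward zero rather than rounding, so a value sitting just below an integer would land on the wrong ID. The sketch below uses hypothetical values to show the difference. Rounding first is one possible safeguard, assuming the IDs are meant to be whole numbers.

```python
import pandas as pd

# Hypothetical IDs: one slightly above 42 and one slightly below it.
ids = pd.Series([42.0, 42.0000000001, 41.9999999999])

# Plain truncation: the value just below 42 becomes 41.
truncated = ids.astype(int)

# Rounding to the nearest whole number first keeps all three at 42.
rounded = ids.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```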

1 Comment

Thanks, I now did df = df.dropna(subset = ["ID"]) followed by df['ID'] = df['ID'].astype(int) before my analysis, and all my problems are solved. Thank you!
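
For anyone landing here later, the steps in this comment can be sketched roughly as follows. The data and the 'value' column are hypothetical. The first part reproduces the kind of error mentioned in the comments (integer columns cannot hold NaN), and the second part drops the missing IDs before casting. The .copy() call is an addition of this sketch, not something the comment used.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing ID and a slightly-off 42.
df = pd.DataFrame({"ID": [42.0, np.nan, 42.0000000001, 84.0],
                   "value": [1, 2, 3, 4]})

# Casting directly fails because NaN has no integer representation.
try:
    df["ID"].astype(int)
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print(err)

# Drop rows with a missing ID, then cast the column to int.
clean = df.dropna(subset=["ID"]).copy()
clean["ID"] = clean["ID"].astype(int)

# The equality filter now finds both rows for ID 42.
data = clean.loc[clean["ID"] == 42]
print(clean["ID"].unique())
print(len(data))
```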
