DataFrame.drop_duplicates and DataFrame.drop not removing rows

Question

I have read a CSV file into a pandas DataFrame with five columns. Certain rows have duplicate values only in the second column. I want to remove these rows from the DataFrame, but neither drop nor drop_duplicates is working.

Here is my implementation:

import pandas as pd

# Read CSV
df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])

print(df.b)

dropRows = []
# Sanitize the data to get rid of duplicates
for indx, val in enumerate(df.b):  # for all the values
    if indx == 0:  # skip first indx
        continue

    if val == df.b[indx-1]:  # this is a duplicate rtc value
        dropRows.append(indx)

print(dropRows)

df.drop(dropRows)  # this doesn't work
df.drop_duplicates('b')  # this doesn't work either

print(df.b)

When I print the series df.b before and after, both are the same length and I can still see the duplicates. Is there something wrong in my implementation?

drop and drop_duplicates create new DataFrames, so you want something like: df = df.drop_duplicates('b')
– Karl D., Commented Sep 6, 2014 at 1:47
By default, drop (and in fact most pandas operations) returns a copy. Some of these functions, including drop and drop_duplicates, accept the parameter inplace=True to perform the operation on the original df instead of returning a copy.
– EdChum, Commented Sep 6, 2014 at 7:06
I believe that the API was designed this way to ensure that original data in memory were not accidentally written over. It's kinda helpful, if one thinks about it.
– ericmjl, Commented Sep 7, 2014 at 9:15
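
To make the point in these comments concrete, here is a minimal sketch on made-up data (the column names and values are assumptions, not the asker's CSV). It shows that the plain call leaves the original DataFrame untouched, while the inplace=True call changes it and returns None.

```python
import pandas as pd

# Made-up data; column 'b' repeats the value 2.
df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 2]})

# Without inplace=True the call returns a new DataFrame and leaves df untouched.
result = df.drop_duplicates('b')
print(len(df), len(result))  # 3 2

# With inplace=True the call modifies df itself and returns None.
returned = df.drop_duplicates('b', inplace=True)
print(returned, len(df))  # None 2
```

This is also why writing df = df.drop_duplicates('b', inplace=True) would be a mistake: it would rebind df to None.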

tktk · Accepted Answer · 2014-09-07 08:01:47Z

18

As mentioned in the comments, drop and drop_duplicates each return a new DataFrame unless called with inplace=True. Any of these options would work:

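# Alternatives: use one of these, not all four in sequence.
df = df.drop(dropRows)                  # reassign the returned DataFrame
df = df.drop_duplicates('b')            # reassign the returned DataFrame
df.drop(dropRows, inplace=True)         # modify df in place
df.drop_duplicates('b', inplace=True)   # modify df in place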
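
One thing to keep in mind when choosing between the two calls: they do not remove the same rows. The loop in the question flags a row only when its value equals the one directly above it, whereas drop_duplicates('b') removes every later repeat of a value. A rough sketch of the difference, using made-up data and a shift-based comparison as a stand-in for the loop (my substitution, not the asker's code):

```python
import pandas as pd

# Made-up data: 5 repeats back to back, 7 repeats with a gap in between.
df = pd.DataFrame({'b': [5, 5, 7, 8, 7]})

# Keep a row only if its 'b' differs from the row directly above it
# (a vectorized stand-in for the loop in the question).
no_consecutive = df[df['b'] != df['b'].shift()]

# Keep only the first occurrence of each 'b' value anywhere in the column.
no_repeats = df.drop_duplicates('b')

print(no_consecutive['b'].tolist())  # [5, 7, 8, 7]
print(no_repeats['b'].tolist())      # [5, 7, 8]
```

Both expressions return new DataFrames, so the result still has to be assigned to a name to have any effect.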

answered Sep 7, 2014 at 8:01

community wiki

tktk


johnecon · Answer · 2019-02-25 08:59:14Z

4

In my case the issue was that I was concatenating dfs with columns of different types:

import pandas as pd

s1 = pd.DataFrame([['a', 1]], columns=['letter', 'code'])
s2 = pd.DataFrame([['a', '1']], columns=['letter', 'code'])
df = pd.concat([s1, s2])
df = df.reset_index(drop=True)
df.drop_duplicates(inplace=True)

# 2 rows
print(df)

# int
print(type(df.at[0, 'code']))
# string
print(type(df.at[1, 'code']))

# Fix:
df['code'] = df['code'].astype(str)
df.drop_duplicates(inplace=True)

# 1 row
print(df)
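
If you suspect this is what is happening, one way to check before calling drop_duplicates is to look for columns that hold more than one Python type. The sketch below is only an illustration: mixed_type_columns is a hypothetical helper I made up, not a pandas function.

```python
import pandas as pd

def mixed_type_columns(frame):
    """Return names of columns whose values have more than one Python type."""
    # Hypothetical helper, not part of pandas.
    return [col for col in frame.columns
            if frame[col].map(type).nunique() > 1]

s1 = pd.DataFrame([['a', 1]], columns=['letter', 'code'])
s2 = pd.DataFrame([['a', '1']], columns=['letter', 'code'])
df = pd.concat([s1, s2], ignore_index=True)

print(mixed_type_columns(df))  # ['code']
```

Any column it reports can then be cast to a single type, as in the astype(str) fix above, before removing duplicates.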

answered Feb 25, 2019 at 8:59

johnecon
