11

I made some changes to my data by grouping on two columns.

I generated a file using Python, and it ended up with two duplicate columns. How can I remove duplicate columns from a DataFrame?

1
  • Do they have the same column name? Commented Jun 5, 2013 at 11:38

6 Answers

23

It's probably easiest to use a groupby (assuming they have duplicate names too):

In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4

If they have different names, you can call drop_duplicates on the transpose:

In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4

read_csv will usually ensure they have different names...
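As a small illustration of the read_csv remark, here is a minimal sketch. The CSV text is made up for the example, and the "B.1" name reflects the default renaming in the pandas versions I am aware of, so treat that as an assumption about your installed version.

```python
import io

import pandas as pd

# Hypothetical CSV text whose header repeats the name "B".
csv_text = "A,B,B\na,4,4\nb,4,4\nc,4,4\n"

df = pd.read_csv(io.StringIO(csv_text))

# By default the repeated header is renamed on read (assumed: "B" -> "B.1").
print(list(df.columns))

# The names now differ, so compare the values via the transpose instead.
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))
```

Because the names are made unique on read, the groupby-by-name approach would find nothing to merge here; the value-based drop_duplicates on the transpose is the one that applies.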

1 Comment

FYI @Andy, there is a new option in 0.11.1, mangle_dup_cols, that controls this. The default is to mangle (i.e. produce unique column names); in 0.12 this will change to leave duplicates in place.
4

Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442

1 Comment

Just a note for others that the best answer is not the accepted one in that thread. Best answer -> stackoverflow.com/a/40435354/2507197
3

This is the best I have found so far. It compares columns by value, so it also catches duplicates that have different names:

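import numpy as np

# Collect every column whose values equal those of an earlier column.
remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        if np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])

# Drop the later duplicates in place; the first occurrence is kept.
df.drop(remove, axis=1, inplace=True)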

https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code

3

This is already answered in "python pandas remove duplicate columns". The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column name has already been seen. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Invert that vector and call the result column_selector.

Passing this vector to the loc method of df, which selects rows and columns, removes the duplicate columns: df.loc[:, column_selector] keeps only the first occurrence of each column name.

column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
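A self-contained sketch of the same idea, using a made-up two-row frame with the column names from the example above. Note that this compares column names only, not values.

```python
import pandas as pd

# Hypothetical frame with a repeated column name.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Col1", "Col2", "Col1"])

# True wherever a column name has already appeared to the left.
is_repeat = df.columns.duplicated()

# Invert the mask to keep the first occurrence of each name.
column_selector = ~is_repeat
deduped = df.loc[:, column_selector]

print(is_repeat.tolist())
print(deduped)
```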

3 Comments

This is the best answer as it actually drops only the duplicate columns. Most of the other answers I've seen will drop the original and the duplicates.
.columns is not a callable.
Can you check the type of the object you are calling this on? It works only for a pandas DataFrame. You can use type(<varname>) to check the type.
0

I understand that this is an old question, but I recently had the same issue, and none of these solutions worked for me; the looping suggestion also seemed like overkill. In the end, I simply found the index of the undesirable duplicate column and dropped the column at that index. So, provided you know the index of the column (which you could probably find via debugging or print statements), this will work:

df.drop(df.columns[i], axis=1)
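One caveat worth sketching: df.columns[i] resolves to a label, and drop removes every column carrying that label, so when the duplicates share a name both copies go. A possible workaround, shown here on a made-up frame with a hypothetical position i, is to select by position with iloc.

```python
import pandas as pd

# Hypothetical frame in which the duplicate columns share the name "B".
df = pd.DataFrame([["a", 4, 4], ["b", 4, 4]], columns=["A", "B", "B"])

i = 2  # assumed position of the unwanted duplicate

# Label-based drop: removes every column named "B", not just position 2.
by_label = df.drop(df.columns[i], axis=1)

# Position-based selection: keeps everything except position i.
keep = [pos for pos in range(df.shape[1]) if pos != i]
by_position = df.iloc[:, keep]

print(list(by_label.columns))
print(list(by_position.columns))
```

If the duplicate columns have distinct names, the label-based drop and the positional selection give the same result.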

0

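A fast approach for datasets without NaNs is to detect duplicate columns on a random sample of the rows (here 5%) and then keep only the surviving columns in the full frame. Because only the sampled rows are compared, columns that differ solely in unsampled rows will be treated as duplicates: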

share = 0.05
dfx = df.sample(int(df.shape[0]*share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]
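To make the trade-off of row sampling concrete, here is a minimal sketch on a made-up frame. It uses head(3) as a deterministic stand-in for df.sample so the outcome is reproducible; that substitution is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical frame: "B" truly duplicates "A"; "C" differs only in the last row.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [1, 2, 3, 5]})

# Deterministic stand-in for df.sample(...): the first three rows.
dfx = df.head(3)

# Columns that survive when only the subset of rows is compared.
kept_from_subset = list(dfx.T.drop_duplicates().T.columns)

# Columns that survive when every row is compared.
kept_from_full = list(df.T.drop_duplicates().T.columns)

print(kept_from_subset)
print(kept_from_full)
```

On the row subset, "C" looks identical to "A" and is dropped; on the full data it is kept. A larger share lowers that risk at the cost of speed.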
