How to remove duplicate columns from a dataframe using python pandas

Question

I made some changes to my data by grouping on two columns.

I generated a file using Python, and the result has 2 duplicate columns. How can I remove duplicate columns from a DataFrame?

Do they have same column name?

– waitingkuo, Commented Jun 5, 2013 at 11:38

Andy Hayden · Accepted Answer · 2013-06-05 12:11:46Z

23

It's probably easiest to use a groupby (assuming they have duplicate names too):

In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4
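A self-contained version of the groupby approach, as a minimal sketch. The frame is rebuilt from the session shown above. The `infer_objects()` call is an addition that is not part of the original answer: transposing a mixed-type frame leaves every column with `object` dtype, and this call tries to restore more specific dtypes.

```python
import pandas as pd

# Rebuild the example frame: two columns share the name "B".
df = pd.DataFrame(
    [["a", 4, 4], ["b", 4, 4], ["c", 4, 4]],
    columns=["A", "B", "B"],
)

# Transpose so the column names become the index, keep the first row
# for each name, then transpose back.
deduped = df.T.groupby(level=0).first().T

# The round trip through a mixed-type transpose yields object columns;
# infer_objects() attempts to recover more specific dtypes.
deduped = deduped.infer_objects()

print(deduped)
```

Note that groupby sorts the group keys by default, so the resulting columns may come back in a different order than in the original frame.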

If they have different names you can drop_duplicates on the transpose:

In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4
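The same idea as a runnable sketch, with the example frame rebuilt from the session above. Here the comparison is on values, so the column names do not matter.

```python
import pandas as pd

# Columns "B" and "C" hold identical values under different names.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4]})

# After transposing, each original column is a row, so drop_duplicates()
# removes every later row (column) whose values repeat an earlier one.
deduped = df.T.drop_duplicates().T

print(list(deduped.columns))
```

By default the first occurrence is kept; `drop_duplicates(keep="last")` would keep "C" instead of "B".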

read_csv will usually ensure they have different names...
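A quick way to check the renaming behavior, assuming a recent pandas version with default settings. The CSV text is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content whose header repeats the name "B".
csv_text = "A,B,B\na,4,4\nb,4,4\nc,4,4\n"

df = pd.read_csv(io.StringIO(csv_text))

# The repeated header is renamed with a numeric suffix on read.
print(list(df.columns))

# The names now differ, so compare by value instead of by name.
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))
```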

edited Jun 5, 2013 at 12:11

answered Jun 5, 2013 at 12:05

Andy Hayden


1 Comment

Jeff Over a year ago

FYI @Andy, there is a new read_csv option in 0.11.1 that controls this, mangle_dupe_cols. The default is to mangle (i.e. produce unique column names); in 0.12 this will change to leave duplicates in place.

Community · Accepted Answer · 2017-05-23 12:18:14Z

4

Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442

edited May 23, 2017 at 12:18

CommunityBot


answered Oct 6, 2015 at 3:24

kalu


1 Comment

Alter Over a year ago

Just a note for others that the best answer is not the accepted one in that thread. Best answer -> stackoverflow.com/a/40435354/2507197

Francisco López-Sancho · Accepted Answer · 2016-04-10 12:06:04Z

3

This is the best I found so far.

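import numpy as np

# Compare every pair of columns by value and collect the later duplicates.
# Assumes the column names are unique and only the values are duplicated.
remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        # Skip columns already marked so each name is recorded once.
        if cols[j] not in remove and np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])

df.drop(remove, axis=1, inplace=True)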

https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
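A variant of the pairwise comparison as a function, offered as a sketch and not as the linked author's method. The function name `drop_duplicate_value_columns` is hypothetical. It swaps `np.array_equal` for `Series.equals`, which treats NaNs in the same position as equal, and it selects columns by position so that repeated column names do not break the lookup.

```python
import numpy as np
import pandas as pd


def drop_duplicate_value_columns(df):
    """Keep the first column of every set of value-identical columns."""
    keep = []  # positions of the columns to keep
    for j in range(df.shape[1]):
        col = df.iloc[:, j]
        # Compare only against columns already kept, which skips
        # redundant checks against known duplicates.
        if not any(col.equals(df.iloc[:, i]) for i in keep):
            keep.append(j)
    return df.iloc[:, keep].copy()


df = pd.DataFrame(
    {"A": [1.0, np.nan], "B": [1.0, np.nan], "C": [2.0, 3.0]}
)
result = drop_duplicate_value_columns(df)
print(list(result.columns))
```

`Series.equals` also requires matching dtypes, so an integer column and a float column holding the same numbers are not treated as duplicates here.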

answered Apr 10, 2016 at 12:06

Francisco López-Sancho


yugandhar · Accepted Answer · 2019-12-13 09:16:46Z

3

This is already answered in the question "python pandas remove duplicate columns". The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column name has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Invert that vector and call it column_selector.

The loc method of df selects rows and columns, so passing this vector as the column indexer, df.loc[:, column_selector], keeps only the first occurrence of each column name.

column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
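A runnable sketch of this approach on the column names from the example. The cell values are made up.

```python
import pandas as pd

# Hypothetical data; only the column names matter for this method.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["Col1", "Col2", "Col1"],
)

# duplicated() is a method of the Index; columns itself is an attribute.
column_selector = ~df.columns.duplicated()
print(column_selector)

# A boolean mask in the column position of loc keeps the True columns.
deduped = df.loc[:, column_selector]
print(list(deduped.columns))
```

This compares names only, not values, so two columns with different names and identical contents are both kept. `duplicated(keep="last")` would keep the last occurrence of each name instead of the first.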

answered Dec 13, 2019 at 9:16

yugandhar


3 Comments

davidavr Over a year ago

This is the best answer as it actually drops only the duplicate columns. Most of the other answers I've seen will drop the original and the duplicates.

Kots Over a year ago

.columns is not a callable.

yugandhar Over a year ago

Can you check the type of the object you are calling this on? It works only for a pandas DataFrame. You can use type(<varname>) to check the type.

Dee Carter · Accepted Answer · 2017-06-21 17:17:41Z

0

I understand that this is an old question, but I recently had the same issue, and either these solutions did not work for me or the looping suggestion seemed like overkill. In the end, I simply found the index of the unwanted duplicate column and dropped the column at that index. So provided you know the index i of the column (which you could find via debugging or print statements), this will work:

df.drop(df.columns[i], axis=1)
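One caveat worth checking, shown as a sketch with made-up data: `df.columns[i]` returns a label, and `drop` removes every column carrying that label. If the duplicate shares its name with the original, both go. Selecting by position with `iloc` avoids this.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which the duplicate shares the name "B".
df = pd.DataFrame(
    [["a", 4, 4], ["b", 4, 4]],
    columns=["A", "B", "B"],
)
i = 2  # position of the unwanted duplicate

# Dropping by label removes every column named "B".
by_label = df.drop(df.columns[i], axis=1)
print(list(by_label.columns))

# Dropping by position removes only the column at index i.
by_position = df.iloc[:, np.arange(df.shape[1]) != i]
print(list(by_position.columns))
```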

answered Jun 21, 2017 at 17:17

Dee Carter


Alexandr Kosolapov · Accepted Answer · 2022-04-30 18:23:21Z

0

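A fast solution for datasets without NaNs:

share = 0.05  # fraction of rows to sample
# Look for duplicates on a random sample of rows only, which keeps the transpose small.
dfx = df.sample(int(df.shape[0] * share))
dfx = dfx.T.drop_duplicates().T
# Keep the surviving columns. Columns that match only on the sampled rows are dropped too.
df = df[dfx.columns]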
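Because the sample can miss the rows where two columns differ, one option is to use the sample only as a cheap pre-filter and confirm each match on the full data. The sketch below is an extension of the snippet above and not part of the original answer. The function name and the `random_state` parameter are hypothetical, and it assumes a non-empty frame with unique column names.

```python
import pandas as pd


def drop_duplicate_columns_prefiltered(df, share=0.05, random_state=0):
    """Drop value-duplicate columns, using a row sample as a fast pre-check."""
    n = max(1, int(len(df) * share))
    sample = df.sample(n, random_state=random_state)
    kept = []
    for col in df.columns:
        # The cheap comparison on the sample runs first; the full
        # comparison runs only when the sampled rows already match.
        is_duplicate = any(
            sample[col].equals(sample[k]) and df[col].equals(df[k])
            for k in kept
        )
        if not is_duplicate:
            kept.append(col)
    return df[kept]


df = pd.DataFrame(
    {
        "A": range(100),
        "B": range(100),        # exact duplicate of "A"
        "C": [0] * 99 + [1],    # differs from "D" in the last row only
        "D": [0] * 100,
    }
)
result = drop_duplicate_columns_prefiltered(df)
print(list(result.columns))
```

Columns "C" and "D" agree on most samples, but the full comparison keeps both, since they differ in one row.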

answered Apr 30, 2022 at 18:23

Alexandr Kosolapov
