How to deal with variable number of columns in dataframe

Question

In the dataframe I have there are so many columns of which I only need a few. For instance

Col_A      Col_B      Col_C      Col_D      Col_E      Col_F
...        ...        ...        ...        ...        ...

I only need columns Col_A, Col_C and Col_E, so currently I do df = df[['Col_A', 'Col_C', 'Col_E']]. The problem is that columns A, C and E will not always be present; some or all of them may be missing. So I need something like: if Col_A is in df.columns, keep it, and so on for the others. Is there a simple way to do this without writing many if statements? Right now, if a column is missing, I get KeyError: "['Col_C'] not in index".

jezrael · Accepted Answer · 2020-05-29 06:06:59Z

1

Use Index.intersection:

df[df.columns.intersection(['Col_A','Col_C','Col_E'], sort=False)]
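
To make the behaviour concrete, here is a minimal sketch on a toy frame. The data and the names wanted, present and subset are illustrative assumptions, not part of the original answer. Col_C is deliberately left out to show that no KeyError is raised.

```python
import pandas as pd

# Toy frame in which Col_C is deliberately missing (illustrative data only).
df = pd.DataFrame({
    "Col_A": [1, 2],
    "Col_B": [3, 4],
    "Col_E": [5, 6],
    "Col_F": [7, 8],
})

wanted = ["Col_A", "Col_C", "Col_E"]

# Index.intersection keeps only the labels found in both df.columns and
# `wanted`; sort=False asks pandas not to sort the result.
present = df.columns.intersection(wanted, sort=False)
subset = df[present]

print(list(subset.columns))
```

Because the lookup is done with labels that are known to exist, the missing Col_C is simply skipped.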

answered May 29, 2020 at 6:06

jezrael


4 Comments

user13494862 Over a year ago

This is way faster than the other answers. Can it be used in groupby, like df.groupby(by=df.columns.intersection(['Col_A','Col_C','Col_E']), as_index=False)?

jezrael Over a year ago

@Derik81 - Very similar, like df.groupby(by=df.columns.intersection(['Col_A','Col_C','Col_E']), as_index=False)

user13494862 Over a year ago

I think there is an issue with this: it gives ValueError: Grouper and axis must be same length. When I check the result of df.columns.intersection(['Col_A','Col_C','Col_E']), it prints Index(['Col_A','Col_C','Col_E'], dtype='object'). There is no problem when I do .groupby(by=['Col_A','Col_C','Col_E']), but passing it the way you suggested throws an error.

jezrael Over a year ago

@Derik81 - I forgot, it needs to be converted to a list: df.groupby(by=df.columns.intersection(['Col_A','Col_C','Col_E']).tolist(), as_index=False)
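
A small sketch of the list-based groupby from this comment thread, assuming a toy frame with a hypothetical value column to aggregate. The data and the names wanted, keys and totals are illustrative only.

```python
import pandas as pd

# Illustrative data; "value" is a hypothetical column to aggregate.
df = pd.DataFrame({
    "Col_A": ["x", "x", "y"],
    "Col_E": ["p", "p", "q"],
    "value": [1, 2, 3],
})

wanted = ["Col_A", "Col_C", "Col_E"]  # Col_C is absent from df

# .tolist() turns the Index into a plain list of column names, which is
# the form the thread reports as working with groupby(by=...).
keys = df.columns.intersection(wanted, sort=False).tolist()

totals = df.groupby(by=keys, as_index=False)["value"].sum()
print(totals)
```

Only the columns that actually exist end up as grouping keys, so the same call works whether or not Col_C is present.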

Dishin Goyani (edited by jezrael) · Accepted Answer · 2020-05-29 06:07:44Z

1

You can use loc and isin:

df.loc[:, df.columns.isin(['Col_A','Col_C','Col_E'])]
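
As a rough illustration of the boolean-mask approach, the sketch below uses a toy frame whose columns are deliberately out of order. The data and the names wanted, mask and subset are assumptions made for the example.

```python
import pandas as pd

# Illustrative frame; the columns are intentionally not in alphabetical order.
df = pd.DataFrame({
    "Col_E": [5, 6],
    "Col_B": [3, 4],
    "Col_A": [1, 2],
})

wanted = ["Col_A", "Col_C", "Col_E"]  # Col_C is absent from df

# isin yields one boolean per existing column; labels in `wanted` that are
# missing from df never match, so no KeyError is raised.
mask = df.columns.isin(wanted)
subset = df.loc[:, mask]

print(list(subset.columns))
```

Since the mask is aligned to df.columns, the selected columns keep the frame's own order rather than the order of the requested list.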

edited May 29, 2020 at 6:07

jezrael


answered May 29, 2020 at 6:04

Dishin Goyani


Michael Delgado · Accepted Answer · 2020-05-29 05:48:44Z

0

You could use a list comprehension. For example:

test_columns = ['Col_A', 'Col_C', 'Col_E']
df = df[[c for c in test_columns if c in df.columns]]
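
The same idea can be wrapped in a small helper. This is only a sketch: select_existing is a hypothetical name, and the toy data is illustrative.

```python
import pandas as pd

def select_existing(df, wanted):
    """Return df restricted to the columns in `wanted` that exist in df."""
    # Iterating over `wanted` means the result follows the requested order.
    existing = [c for c in wanted if c in df.columns]
    return df[existing]

# Illustrative frame in which Col_C is missing.
df = pd.DataFrame({"Col_E": [5, 6], "Col_B": [3, 4], "Col_A": [1, 2]})

subset = select_existing(df, ["Col_A", "Col_C", "Col_E"])
print(list(subset.columns))
```

Unlike a boolean mask over df.columns, the comprehension returns the columns in the order they were asked for.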

answered May 29, 2020 at 5:48

Michael Delgado


Yash Randive · Accepted Answer · 2020-05-29 05:51:26Z

0

As I understand the question, you could keep a copy of df in another variable and then drop the columns you don't need:

df_copy = df.copy()
# errors='ignore' skips any listed column that is not present
df = df.drop(['Col_B', 'Col_D', 'Col_F'], axis=1, errors='ignore')

# If you later want to add one of the dropped columns back to df
df['Col_B'] = df_copy['Col_B']
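
A minimal sketch of a drop-based variant, assuming the unwanted columns are known in advance. The toy data and the names unwanted and trimmed are illustrative. DataFrame.drop accepts errors="ignore", which suppresses the KeyError for labels that do not exist.

```python
import pandas as pd

# Illustrative frame; Col_D and Col_F are deliberately absent.
df = pd.DataFrame({"Col_A": [1, 2], "Col_B": [3, 4], "Col_E": [5, 6]})

unwanted = ["Col_B", "Col_D", "Col_F"]

# errors="ignore" tells drop to skip labels that are not in the frame
# instead of raising KeyError.
trimmed = df.drop(columns=unwanted, errors="ignore")

print(list(trimmed.columns))
```

This only suits cases where the columns to discard can be listed; when the frame has many unknown columns, selecting the wanted ones is usually simpler.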

answered May 29, 2020 at 5:51

Yash Randive
