40

I have a large number of columns in a PySpark DataFrame, say 200. I want to select all the columns except, say, 3-4 of them. How do I select these columns without having to manually type the names of all the columns I want to keep?

3
  • use drop with columns you'd like to exclude. Commented Jun 13, 2018 at 13:14
  • 3
    df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}]) Commented Jun 13, 2018 at 14:18
  • 2
    Possible duplicate of How to exclude multiple columns in Spark dataframe in Python Commented Jun 13, 2018 at 14:18

5 Answers

71

In the end, I settled on the following:

  • Drop:

    df.drop('column_1', 'column_2', 'column_3')

  • Select:

    df.select([c for c in df.columns if c not in {'column_1', 'column_2', 'column_3'}])


1 Comment

this works like a charm
7

PySpark SQL: "SELECT * EXCEPT(col6, col7, col8)"
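A minimal sketch of one way to run that, assuming a Spark/Databricks SQL version that supports the EXCEPT clause in a star expansion; the temp view name my_table is only illustrative:

df.createOrReplaceTempView("my_table")  # "my_table" is a hypothetical view name
result = spark.sql("SELECT * EXCEPT (col6, col7, col8) FROM my_table")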

1 Comment

Exactly what I'm looking for, and it works as expected. Looks like it's widely implemented as well. Can't believe I've never come across this.
4

This might be helpful:

df_cols = list(set(df.columns) - {'<col1>', '<col2>', ...})

df.select(df_cols).show()


3
cols_to_drop = [...]  # list of column names to drop, possibly built programmatically
df.drop(*cols_to_drop)

Useful if the list of columns to drop is long, or if the list can be derived programmatically.
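A minimal sketch of deriving that list programmatically, assuming (purely for illustration) that the unwanted columns share a tmp_ prefix:

# build the drop list from a naming convention instead of typing each name
cols_to_drop = [c for c in df.columns if c.startswith("tmp_")]
df_trimmed = df.drop(*cols_to_drop)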


2

Another way to do it in PySpark:

import pyspark.sql.functions as F

df.select(F.expr("* EXCEPT(column_1, column_2)"))

or

df.selectExpr("* EXCEPT(column_1, column_2)")
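Note that support for the EXCEPT star clause depends on the Spark/Databricks SQL version you are running; if it isn't available, the drop or select approaches above still work.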

