I have a large number of columns in a PySpark dataframe, say 200. I want to select all the columns except, say, 3-4 of them. How do I select these columns without having to manually type the names of all the columns I want to keep?
Use drop with the columns you'd like to exclude, or select the complement:
df.select([c for c in df.columns if c not in {'GpuName', 'GPU1_TwoPartHwID'}])
In the end, I settled for the following :
Drop:
df.drop('column_1', 'column_2', 'column_3')
Select:
df.select([c for c in df.columns if c not in {'column_1', 'column_2', 'column_3'}])
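The filtering expression passed to df.select() is plain Python over the list of column names, so its logic can be sketched without a Spark session. A minimal illustration, with made-up column names standing in for df.columns:

```python
# Stand-in for df.columns; names are hypothetical for the example.
all_columns = ['column_%d' % i for i in range(1, 8)]
to_exclude = {'column_1', 'column_2', 'column_3'}

# Same expression as in df.select([...]) above.
keep = [c for c in all_columns if c not in to_exclude]
print(keep)  # ['column_4', 'column_5', 'column_6', 'column_7']
```

With a real DataFrame this becomes df.select(keep).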
PySpark SQL: "SELECT * except(col6, col7, col8)"
df_cols = list(set(df.columns) - {'<col1>', '<col2>', ...})
df.select(df_cols).show()
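One caveat worth noting about the set-difference approach (my observation, not part of the answer above): a Python set does not preserve insertion order, so the resulting column order is not guaranteed to match the original DataFrame. The list comprehension from the earlier answers keeps the order of df.columns:

```python
# Stand-in for df.columns; names are hypothetical.
columns = ['a', 'b', 'c', 'd']
exclude = {'b', 'd'}

via_set = list(set(columns) - exclude)            # order not guaranteed
via_comp = [c for c in columns if c not in exclude]
print(via_comp)  # ['a', 'c'], original order preserved
```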
cols_to_drop = [...]  # list of columns to drop
df.drop(*cols_to_drop)
Useful if the list of columns to drop is huge, or if the list can be derived programmatically.
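For instance, the drop list might be derived from a naming convention. A small sketch, using a hypothetical convention (drop every column whose name ends in '_tmp') and made-up column names in place of df.columns:

```python
# Stand-in for df.columns; names and the '_tmp' convention are hypothetical.
columns = ['id', 'name', 'score_tmp', 'debug_tmp']

cols_to_drop = [c for c in columns if c.endswith('_tmp')]
print(cols_to_drop)  # ['score_tmp', 'debug_tmp']

# With a real DataFrame this would be: df.drop(*cols_to_drop)
```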
Another way to do it in PySpark:
import pyspark.sql.functions as F
df.select(F.expr("* EXCEPT(column_1, column_2)"))
or
df.selectExpr("* EXCEPT(column_1, column_2)")