
I have a dataframe which looks like the one below:

id   1id  id2  ac1  2ac tre tye

I want to delete the columns whose names contain "id" or "ac" and retain the others.

How can I achieve this in PySpark?

I tried select statements, but they don't work.

How should I use a regex on the column names here?

1 Answer

Use a simple list comprehension over `df.columns`:

  • Using Select

    from pyspark.sql.functions import col

    df.select(*[col(c) for c in df.columns if not ("id" in c or "ac" in c)]).show()
    
  • Using Drop

    df.drop(*[c for c in df.columns if "id" in c or "ac" in c]).show()
    
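The column-filtering logic in both variants is plain Python over the list of column names, so it can be checked without a Spark session. The sketch below uses the column names from the question and also shows a regex variant with `re.search`, since the question asks about regexes; the names `keep`, `drop`, and `keep_re` are illustrative, not part of any API.

```python
import re

# Column names from the question's dataframe.
columns = ["id", "1id", "id2", "ac1", "2ac", "tre", "tye"]

# Columns to keep: names containing neither "id" nor "ac"
# (this is the predicate passed to df.select in the answer).
keep = [c for c in columns if not ("id" in c or "ac" in c)]

# Columns to drop: names containing "id" or "ac"
# (this is the predicate passed to df.drop in the answer).
drop = [c for c in columns if "id" in c or "ac" in c]

# Regex variant: re.search matches the pattern anywhere in the name.
keep_re = [c for c in columns if not re.search(r"id|ac", c)]

print(keep)     # ['tre', 'tye']
print(drop)     # ['id', '1id', 'id2', 'ac1', '2ac']
print(keep_re)  # ['tre', 'tye']
```

In PySpark, either list would then be splatted into `df.select(*keep)` or `df.drop(*drop)`; `drop` accepts plain column-name strings, so no `col()` wrapper is needed there.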