
I have a table with ~5k columns and ~1 M rows that looks like this:

ID   Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11
ID1  0    1    0    1    0    2    1    1    2    2     0
ID2  1    0    1    0    1    2    0    0    0    2     1
ID3  1    0    0    0    0    2    1    0    1    1     1
ID4  0    1    1    1    1    1    1    1    1    0     0
ID5  0    1    1    1    1    1    0    1    0    0     0

I want to select columns whose names come from different lists and subset the rows according to different criteria. For example, suppose list1 contains col1, col3, col4, col11 and list2 contains col2, col6, col9, col10. I want to keep the rows where every column in list1 equals 0 AND every column in list2 equals 1, e.g. df1 = df.filter((df.col1 == 0) & (df.col3 == 0) & (df.col4 == 0) & (df.col11 == 0) & (df.col2 == 1) & (df.col6 == 1) & (df.col9 == 1) & (df.col10 == 1)). Instead of adding each column name by hand, I want the columns to be taken from the two lists. How can I achieve this using PySpark?

  • Does list1 == 0 mean that all values for all selected columns are 0? Anyway, it looks quite simple: select columns + filter on desired values on all or part of the columns... or did I miss something? Commented Oct 4, 2021 at 19:51
  • Yes, all matching columns in list1 should have a value of 0 and all matching columns in list2 should have a value of 1. I am confused about how to filter on two or more different criteria, e.g. df1 = df.filter((df.col1 == 0) & (df.col3 == 0) & (df.col4 == 0) & (df.col6 == 1) & (df.col9 == 1) & (df.col10 == 1)). Instead of adding each column name, I want these columns to be taken from two different lists. Commented Oct 4, 2021 at 20:10

3 Answers


If you need to compare many columns like this, consider a tuple-wise comparison:

from pyspark.sql.functions import col, lit, struct

# Pack the columns of interest into a single struct ("tuple") column.
source_tuple = struct(col("col1"), col("col3"), col("col4"), col("col6"), col("col9"), col("col10"))
# The target values, in the same order as the columns above.
target_tuple1 = struct(lit(0), lit(0), lit(0), lit(1), lit(1), lit(1))
# Keep only the rows whose value tuple matches one of the targets.
df1 = df.where(source_tuple.isin([target_tuple1]))

You can build the left and right sides dynamically and add multiple tuples to the right-hand side. If you need multiple left-hand sides, combine the resulting DataFrames using UNION ALL.
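For instance, a minimal sketch of building both sides dynamically from the two lists in the question (the list contents and the 0/1 targets are taken from the question's example):

from pyspark.sql.functions import col, lit, struct

list1 = ["col1", "col3", "col4", "col11"]   # columns that must equal 0
list2 = ["col2", "col6", "col9", "col10"]   # columns that must equal 1

# Build both sides in the same column order.
source_tuple = struct(*[col(c) for c in list1 + list2])
target_tuple1 = struct(*([lit(0) for _ in list1] + [lit(1) for _ in list2]))

df1 = df.where(source_tuple.isin([target_tuple1]))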




I found the solution I was looking for. I was able to filter a large number of columns using the following steps:

  1. Created separate lists with the column IDs. My column IDs were in a DataFrame, so I converted that column to a Python list for the filtering step:

list1 = ped.filter(ped.pheno == 2).select("IID")
list1 = list1.rdd.map(lambda row: row[0]).collect()
list2 = ped.filter(ped.pheno == 1).select("IID")
list2 = list2.rdd.map(lambda row: row[0]).collect()

  2. Using these lists, I filtered the rows as follows:

df1 = df.where(" AND ".join(["(%s == 1)" % c for c in list1]))
df1 = df1.where(" AND ".join(["(%s == 0)" % c for c in list2]))
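An equivalent way, avoiding SQL strings, is to combine Column conditions; a minimal sketch assuming the same list1/list2 and the same 1/0 criteria as above:

from functools import reduce
from operator import and_
from pyspark.sql.functions import col

# Every column in list1 must equal 1, and every column in list2 must equal 0.
cond1 = reduce(and_, [col(c) == 1 for c in list1])
cond2 = reduce(and_, [col(c) == 0 for c in list2])
df1 = df.where(cond1 & cond2)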

Thanks for the other helpful solutions as well!



Try this; it worked for me:

from functools import reduce
from pyspark.sql.functions import col

tmp_cols = ['col1', 'col2']
filter_conditions = [col(c) == 1 for c in tmp_cols]
# OR the conditions together pairwise.
nestor = lambda u, v: u | v
df = df.filter(reduce(nestor, filter_conditions))
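Note that | keeps a row if any of the listed columns equals 1. For the question's all-columns-must-match semantics, swap the combiner to & (same names as above):

nestor = lambda u, v: u & v   # AND: every column in tmp_cols must equal 1
df = df.filter(reduce(nestor, filter_conditions))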

