
I want to get all possible combinations of size 2 of a column in a PySpark dataframe. My dataframe looks like:

| id |
|----|
|  1 |
|  2 |
|  3 |
|  4 |

For the above input, I want the output to be:

| id1 | id2 |
|-----|-----|
|  1  |  2  |
|  1  |  3  |
|  1  |  4  |
|  2  |  3  |

and so on.

One way would be to collect the column values into a Python iterable (a list or a pandas dataframe) and use itertools.combinations to generate all combinations:

import itertools
import pyspark.sql.functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))

However, I want to avoid collecting the dataframe column to the driver, since the number of rows can be extremely large. Is there a better way to achieve this using Spark APIs?

1 Answer

You can use the crossJoin method and then filter out the rows where id1 >= id2, which removes both self-pairs and duplicate orderings.

df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
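For intuition, the cross-join-then-filter logic can be sketched in plain Python, with itertools.product standing in for crossJoin and a list standing in for the single-column dataframe (this is just an illustration of which rows survive the filter, not a Spark-side implementation):

```python
from itertools import product

ids = [1, 2, 3, 4]  # stands in for the 'id' column

# Cross join: every (id1, id2) pairing, including self-pairs
# like (1, 1) and reversed duplicates like (2, 1).
crossed = product(ids, ids)

# Keep only pairs with id1 < id2, which drops self-pairs and
# keeps exactly one ordering of each combination.
pairs = [(a, b) for a, b in crossed if a < b]
# pairs == [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Note that a crossJoin produces n² rows before the filter runs, so this is still an expensive operation for very large columns, but it stays distributed and never collects the data to the driver.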