
I want to get all possible combinations of size 2 of a column in a PySpark dataframe. My dataframe looks like:

| id |
|----|
|  1 |
|  2 |
|  3 |
|  4 |

For the above input, I want the output to be:

| id1 | id2 |
|-----|-----|
|  1  |  2  |
|  1  |  3  |
|  1  |  4  |
|  2  |  3  |

and so on.

One way would be to collect the column values into a Python iterable (a list or a pandas dataframe) and use itertools.combinations to generate all combinations:

import itertools
import pyspark.sql.functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))

However, I want to avoid collecting the dataframe column to the driver, since the number of rows can be extremely large. Is there a better way to achieve this using Spark APIs?

1 Answer

You can use the crossJoin method and then filter out the rows where id1 >= id2, which removes both self-pairs and duplicate orderings.

df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
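For intuition, the cross-join-then-filter logic can be sketched in plain Python, with itertools.product standing in for crossJoin and a list standing in for the single-column dataframe (this is just an illustration of which rows survive the filter, not a Spark-side implementation):

```python
from itertools import product

ids = [1, 2, 3, 4]  # stands in for the 'id' column

# Cross join: every (id1, id2) pairing, including self-pairs
# like (1, 1) and reversed duplicates like (2, 1).
crossed = product(ids, ids)

# Keep only pairs with id1 < id2, which drops self-pairs and
# keeps exactly one ordering of each combination.
pairs = [(a, b) for a, b in crossed if a < b]
# pairs == [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Note that a crossJoin produces n² rows before the filter runs, so this is still an expensive operation for very large columns, but it stays distributed and never collects the data to the driver.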