
Given a PySpark dataframe with two key columns, ColA and ColB, I want to split the dataset into two dataframes: one where the combination of ColA and ColB is unique, and one where it is non-unique.

So for example:

ColA | ColB | ColC
1    | 2    | 4
1    | 3    | 3
2    | 2    | 2
1    | 2    | 6
2    | 2    | 8
2    | 3    | 9

The unique A/B pairings are:

1-3
2-3

The non-unique A/B pairings are:

1-2
2-2

So one dataframe would have all rows with the unique A/B pairs and the other would contain the non-unique A/B pairs. ColC doesn't matter for the filter, but needs to be retained. How can I run this filter?

Edit: Note that dropDuplicates doesn't solve this: even applied to the combination ['ColA', 'ColB'], it keeps one representative row per duplicated pair instead of splitting the unique pairs from the non-unique ones.
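
To see why, a minimal sketch, assuming a SparkSession named spark and the example data above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build the example data; the dataframe name `df` is an assumption
df = spark.createDataFrame(
    [(1, 2, 4), (1, 3, 3), (2, 2, 2), (1, 2, 6), (2, 2, 8), (2, 3, 9)],
    ["ColA", "ColB", "ColC"],
)

# dropDuplicates accepts the column pair, but it keeps one row per
# (ColA, ColB) value rather than separating unique pairs from
# non-unique ones:
df.dropDuplicates(["ColA", "ColB"]).show()
# the result still contains one row each for the duplicated pairs
# 1-2 and 2-2, so it cannot produce the two dataframes asked for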


1 Answer

# count rows per (ColA, ColB) pair with a window function
dfWithRowCount = df.selectExpr("*", "count(*) over (partition by ColA, ColB) as nrows")

# split on the per-pair count: exactly one row vs. more than one
uniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows == 1).drop('nrows')
nonUniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows > 1).drop('nrows')

nonUniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   2|   2|   2|
|   2|   2|   8|
|   1|   2|   4|
|   1|   2|   6|
+----+----+----+

uniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   1|   3|   3|
|   2|   3|   9|
+----+----+----+
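
The same window count can also be written with the DataFrame API instead of selectExpr; a minimal sketch, assuming the same df as above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# count rows over a (ColA, ColB) window, then split on the count
w = Window.partitionBy("ColA", "ColB")
dfWithRowCount = df.withColumn("nrows", F.count(F.lit(1)).over(w))

uniqueAB = dfWithRowCount.filter(F.col("nrows") == 1).drop("nrows")
nonUniqueAB = dfWithRowCount.filter(F.col("nrows") > 1).drop("nrows")

Both forms compute the count in a single pass per partition, avoiding a separate groupBy-and-join round trip.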