
Given a PySpark dataframe with two key columns, ColA and ColB, I want to split the dataset into two dataframes: one where the combination of ColA and ColB is unique, and one where it is non-unique.

So for example:

ColA | ColB | ColC
1    | 2    | 4
1    | 3    | 3
2    | 2    | 2
1    | 2    | 6
2    | 2    | 8
2    | 3    | 9

The unique A/B pairings are:

1-3
2-3

The non-unique A/B pairings are:

1-2
2-2

So one dataframe would have all rows with the unique A/B pairs and the other would contain the non-unique A/B pairs. ColC doesn't matter for the filter, but needs to be retained. How can I run this filter?

Edit: Note that dropDuplicates doesn't solve this: even applied to the combination ['ColA', 'ColB'], it keeps one representative row per duplicated pair instead of splitting the unique pairs from the non-unique ones.
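
To see why, a minimal sketch, assuming a SparkSession named spark and the example data above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build the example data; the dataframe name `df` is an assumption
df = spark.createDataFrame(
    [(1, 2, 4), (1, 3, 3), (2, 2, 2), (1, 2, 6), (2, 2, 8), (2, 3, 9)],
    ["ColA", "ColB", "ColC"],
)

# dropDuplicates accepts the column pair, but it keeps one row per
# (ColA, ColB) value rather than separating unique pairs from
# non-unique ones:
df.dropDuplicates(["ColA", "ColB"]).show()
# the result still contains one row each for the duplicated pairs
# 1-2 and 2-2, so it cannot produce the two dataframes asked for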


1 Answer

# count rows per (ColA, ColB) pair with a window function
dfWithRowCount = df.selectExpr("*", "count(*) over (partition by ColA, ColB) as nrows")

# split on the per-pair count: exactly one row vs. more than one
uniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows == 1).drop('nrows')
nonUniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows > 1).drop('nrows')

nonUniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   2|   2|   2|
|   2|   2|   8|
|   1|   2|   4|
|   1|   2|   6|
+----+----+----+

uniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   1|   3|   3|
|   2|   3|   9|
+----+----+----+
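
The same window count can also be written with the DataFrame API instead of selectExpr; a minimal sketch, assuming the same df as above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# count rows over a (ColA, ColB) window, then split on the count
w = Window.partitionBy("ColA", "ColB")
dfWithRowCount = df.withColumn("nrows", F.count(F.lit(1)).over(w))

uniqueAB = dfWithRowCount.filter(F.col("nrows") == 1).drop("nrows")
nonUniqueAB = dfWithRowCount.filter(F.col("nrows") > 1).drop("nrows")

Both forms compute the count in a single pass per partition, avoiding a separate groupBy-and-join round trip.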