
I need a PySpark solution for pandas drop_duplicates(keep=False). Unfortunately, the keep=False option is not available in PySpark...

Pandas Example:

import pandas as pd

df_data = {'A': ['foo', 'foo', 'bar'],
           'B': [3, 3, 5],
           'C': ['one', 'two', 'three']}
df = pd.DataFrame(data=df_data)
df = df.drop_duplicates(subset=['A', 'B'], keep=False)
print(df)

Expected output:

     A  B       C
2  bar  5  three

A conversion with .toPandas() and back to PySpark is not an option.

Thanks!

  • Group by ['A', 'B'], get the count of each group, and remove all groups with size > 1. Not sure how to do this with PySpark, but that's how I'd do it in the absence of keep=False (see the pandas sketch after these comments). Commented Jan 9, 2019 at 18:49
  • Possible duplicate of Remove all rows that are duplicates with respect to some rows Commented Jan 9, 2019 at 23:02
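A minimal pandas sketch of the approach described in the first comment (group sizes broadcast back with transform; column names taken from the question's example):

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                   'B': [3, 3, 5],
                   'C': ['one', 'two', 'three']})

# size of each (A, B) group, broadcast back to every row
sizes = df.groupby(['A', 'B'])['A'].transform('size')

# keep only rows whose (A, B) combination occurs exactly once
print(df[sizes == 1])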

2 Answers


Use a window function to count the number of rows for each A / B combination, then filter the result to keep only the rows that are unique:

import pyspark.sql.functions as f

df.selectExpr(
  '*', 
  'count(*) over (partition by A, B) as cnt'
).filter(f.col('cnt') == 1).drop('cnt').show()

+---+---+-----+
|  A|  B|    C|
+---+---+-----+
|bar|  5|three|
+---+---+-----+
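
The same idea can also be written with the explicit Window API instead of selectExpr; a minimal sketch, assuming df is the Spark DataFrame built from the question's data:

from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy('A', 'B')

(df.withColumn('cnt', f.count('*').over(w))  # rows per (A, B) combination
   .filter(f.col('cnt') == 1)                # keep only unique combinations
   .drop('cnt')
   .show())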

Or, another option using a grouped-map pandas_udf:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# keep_unique returns the group unchanged if it has exactly one row;
# otherwise it drops the whole group
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def keep_unique(df):
    return df.iloc[:0] if len(df) > 1 else df

df.groupBy('A', 'B').apply(keep_unique).show()

+---+---+-----+
|  A|  B|    C|
+---+---+-----+
|bar|  5|three|
+---+---+-----+
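
On Spark 3.0+, PandasUDFType.GROUPED_MAP is deprecated; the same group-wise filter can be written with applyInPandas instead (a sketch under that assumption, reusing the question's df):

def keep_unique(pdf):
    # return the group only if its (A, B) combination occurs once; otherwise drop it
    return pdf if len(pdf) == 1 else pdf.iloc[:0]

df.groupBy('A', 'B').applyInPandas(keep_unique, schema=df.schema).show()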

A simple way is to count how often each A / B combination occurs, keep only the combinations that appear exactly once, join those back to the original DataFrame (so the other columns survive), and then drop the helper column:

import pyspark.sql.functions as f

counts = df.groupBy('A', 'B').agg(f.expr('count(*)').alias('Frequency'))

# join the counts back so the remaining columns (e.g. C) are preserved
df = df.join(counts, on=['A', 'B']).where(f.col('Frequency') == 1).drop('Frequency')
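
An equivalent way to phrase the same count-and-remove idea is a left anti-join against the duplicated keys (also a sketch, reusing the column names from the question):

# keys that occur more than once
dup_keys = df.groupBy('A', 'B').count().filter(f.col('count') > 1).select('A', 'B')

# keep only rows whose key is not among the duplicated ones
df.join(dup_keys, on=['A', 'B'], how='left_anti').show()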
