
I have the following Dataframe:

ID Payment Value Date
1 Cash 200 2020-01-01
1 Credit Card 500 2020-01-06
2 Cash 300 2020-02-01
3 Credit Card 400 2020-02-02
3 Credit Card 500 2020-01-03
3 Cash 200 2020-01-04

What I'd like to do is count how many IDs have used both Cash and Credit Card.

For example, in this case there would be two IDs that used both Cash and Credit Card.

How would I do that in PySpark?

1 Answer

You can use collect_set to gather the distinct payment methods per ID, then take the size of that set to see how many methods each ID used.

from pyspark.sql import functions as F

(df
    .groupBy('ID')
    .agg(F.collect_set('Payment').alias('methods'))
    .withColumn('methods_size', F.size('methods'))
    .show()
)

# +---+-------------------+------------+
# | ID|            methods|methods_size|
# +---+-------------------+------------+
# |  1|[Credit Card, Cash]|           2|
# |  3|[Credit Card, Cash]|           2|
# |  2|             [Cash]|           1|
# +---+-------------------+------------+
