Convert Spark SQL to Dataframe API

Question

I am new to PySpark. I would like to convert the Spark SQL query below to the DataFrame API:

sql("""SELECT
t.transaction_category_id,
sum(t.transaction_amount) AS sum_amount,
count(DISTINCT t.user_id) AS num_users
FROM transactions t
JOIN users u USING (user_id)
WHERE t.is_blocked = False
AND u.is_active = 1
GROUP BY t.transaction_category_id
ORDER BY sum_amount DESC""").show()

The tables are unevenly sized: transactions is a large table. Can I apply a broadcast join or salting here?

Subash · Accepted Answer · 2022-06-02 19:34:20Z

1

You can also use the following:

  import pyspark.sql.functions as func
  # broadcast() lives in pyspark.sql.functions; it hints Spark to ship the small users table to every executor
  output_df = (
      transactions
      .join(func.broadcast(users), users.user_id == transactions.user_id)
      .where((transactions.is_blocked == False) & (users.is_active == 1))
      .groupBy(transactions.transaction_category_id)
      .agg(
          func.countDistinct(users.user_id).alias('num_users'),
          func.sum(transactions.transaction_amount).alias('sum_amount'),
      )
      .select(transactions.transaction_category_id, 'num_users', 'sum_amount')
      .orderBy(func.desc('sum_amount'))  # matches ORDER BY sum_amount DESC
  )
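To sanity-check a DataFrame translation on a handful of rows, the semantics of the SQL query can be written out in plain Python. This is only a reference sketch: the function name `aggregate_reference` and the sample rows are made up for illustration, and it assumes `user_id` is unique in `users` (otherwise the inner join would duplicate transaction rows).

```python
from collections import defaultdict


def aggregate_reference(transactions, users):
    """Plain-Python equivalent of the SQL query, for small sample data only."""
    # WHERE u.is_active = 1 (users assumed unique by user_id)
    active_users = {u["user_id"] for u in users if u["is_active"] == 1}

    sums = defaultdict(float)     # category -> sum(transaction_amount)
    user_sets = defaultdict(set)  # category -> distinct user_ids

    for t in transactions:
        # Inner join on user_id, plus WHERE t.is_blocked = False
        if t["user_id"] in active_users and not t["is_blocked"]:
            category = t["transaction_category_id"]
            sums[category] += t["transaction_amount"]
            user_sets[category].add(t["user_id"])

    rows = [(cat, sums[cat], len(user_sets[cat])) for cat in sums]
    # ORDER BY sum_amount DESC
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical sample rows
transactions = [
    {"user_id": 1, "transaction_category_id": 10, "transaction_amount": 5.0, "is_blocked": False},
    {"user_id": 2, "transaction_category_id": 10, "transaction_amount": 7.0, "is_blocked": False},
    {"user_id": 1, "transaction_category_id": 20, "transaction_amount": 20.0, "is_blocked": False},
    {"user_id": 3, "transaction_category_id": 20, "transaction_amount": 100.0, "is_blocked": False},
    {"user_id": 1, "transaction_category_id": 10, "transaction_amount": 50.0, "is_blocked": True},
]
users = [
    {"user_id": 1, "is_active": 1},
    {"user_id": 2, "is_active": 1},
    {"user_id": 3, "is_active": 0},
]

# Each tuple is (transaction_category_id, sum_amount, num_users)
print(aggregate_reference(transactions, users))
# [(20, 20.0, 1), (10, 12.0, 2)]
```

Loading the same sample rows into DataFrames and comparing against `output_df.collect()` should give the same categories, sums, and user counts in the same order.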


ARCrow · Accepted Answer · 2022-06-02 19:15:53Z

1

The full query, with a broadcast hint on the join, would look like:

import pyspark.sql.functions as f
output_df = (
    transactions.alias('t')
    .join(users.alias('u').hint('broadcast'), ['user_id'], 'inner')
    .where((f.col('t.is_blocked') == False) & (f.col('u.is_active') == 1))
    .groupBy(f.col('t.transaction_category_id'))
    .agg(
        f.sum(f.col('t.transaction_amount')).alias('sum_amount'),
        f.count_distinct(f.col('t.user_id')).alias('num_users')
    )
    .orderBy(f.col('sum_amount').desc())
)

edited Jun 2, 2022 at 19:15

answered Jun 2, 2022 at 17:54


4 Comments

Durga Over a year ago

It's failing with the error: 'dict' object has no attribute 'alias'

ARCrow Over a year ago

Aren't transactions and users DataFrames?

Durga Over a year ago

OK, I changed it now. Any idea how to do the sum and group by?

ARCrow Over a year ago

Updated the query.
