
I have a SQL query which I want to convert to Spark Scala:

SELECT aid, DId, BM, BY
FROM (SELECT DISTINCT aid, DId, BM, BY, TO FROM SU WHERE cd = 2) t
GROUP BY aid, DId, BM, BY HAVING COUNT(*) > 1;

SU is my DataFrame. I did this with:

sqlContext.sql("""
  SELECT aid, DId, BM, BY
  FROM (SELECT DISTINCT aid, DId, BM, BY, TO FROM SU WHERE cd = 2) t
  GROUP BY aid, DId, BM, BY HAVING COUNT(*) > 1
""")

Instead of that, I need to express this directly with DataFrame operations.

  • show what you have tried so far Commented Jan 25, 2017 at 10:13
  • If SU is your DataFrame, then to use it the way you mentioned, you first need to register it as a temp table with SU.registerTempTable("table_name") and use that table name in your query. Commented Jan 25, 2017 at 10:18
  • @RaphaelRoth val GP = SU.groupBy("aid","DId","BM","BY").agg(countDistinct("aid","DId","BM","BY","TO").alias("count") > 1 ).show . I had registered it as a temp table, but I don't want to use a SQL query. Commented Jan 25, 2017 at 10:29
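The registration step mentioned in the comments can be sketched as follows (a minimal sketch, assuming the SU DataFrame and sqlContext from the question; note that registerTempTable was later deprecated in Spark 2.x in favor of createOrReplaceTempView):

```scala
// Register the DataFrame so it can be referenced by name in SQL.
SU.registerTempTable("SU")  // Spark 2.x+: SU.createOrReplaceTempView("SU")

// The registered name can now be used in sqlContext.sql(...) queries.
val result = sqlContext.sql("SELECT aid, DId, BM, BY FROM SU WHERE cd = 2")
```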

1 Answer


This should be the DataFrame equivalent:

SU.filter($"cd" === 2)
  .select("aid","DId","BM","BY","TO")
  .distinct()
  .groupBy("aid","DId","BM","BY")
  .count()
  .filter($"count" > 1)
  .select("aid","DId","BM","BY")

2 Comments

Thanks, it worked fine... but the query is taking a long time to execute.
The distinct operation is usually expensive. Also, you can look at the number of shuffles and try to rewrite your query to decrease them.
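One way to see where the shuffles happen is to print the physical plan, where each shuffle appears as an Exchange operator. A minimal sketch, assuming the answer's DataFrame pipeline and an active Spark session with `$` syntax imported:

```scala
// Build the same pipeline as in the answer, then inspect its plan.
val result = SU.filter($"cd" === 2)
  .select("aid", "DId", "BM", "BY", "TO")
  .distinct()                            // shuffle 1: deduplicate rows
  .groupBy("aid", "DId", "BM", "BY")
  .count()                               // shuffle 2: aggregate by key
  .filter($"count" > 1)
  .select("aid", "DId", "BM", "BY")

// Exchange nodes in the printed plan mark where data is shuffled.
result.explain()
```

Both distinct() and groupBy() repartition by their key columns, so the plan above typically shows two exchanges; reducing the column set before the distinct, or combining the deduplication into the aggregation, are the usual places to look for savings.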
