
I have a SQL query which I want to convert to Spark Scala:

SELECT aid, DId, BM, BY
FROM (SELECT DISTINCT aid, DId, BM, BY, TO FROM SU WHERE cd = 2) t
GROUP BY aid, DId, BM, BY HAVING COUNT(*) > 1;

SU is my DataFrame. I did this with:

sqlContext.sql("""
  SELECT aid, DId, BM, BY
  FROM (SELECT DISTINCT aid, DId, BM, BY, TO FROM SU WHERE cd = 2) t
  GROUP BY aid, DId, BM, BY HAVING COUNT(*) > 1
""")

Instead of that, I need to express this directly with DataFrame operations.

  • show what you have tried so far Commented Jan 25, 2017 at 10:13
  • If SU is your DataFrame, then to use it the way you mentioned, you first need to register it as a temp table with SU.registerTempTable("table_name") and use that table name in your query. Commented Jan 25, 2017 at 10:18
  • @RaphaelRoth val GP = SU.groupBy("aid","DId","BM","BY").agg(countDistinct("aid","DId","BM","BY","TO").alias("count") > 1 ).show . I had registered it as a temp table, but I don't want to use a SQL query. Commented Jan 25, 2017 at 10:29
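The registration step mentioned in the comments can be sketched as follows (a minimal sketch, assuming the SU DataFrame and sqlContext from the question; note that registerTempTable was later deprecated in Spark 2.x in favor of createOrReplaceTempView):

```scala
// Register the DataFrame so it can be referenced by name in SQL.
SU.registerTempTable("SU")  // Spark 2.x+: SU.createOrReplaceTempView("SU")

// The registered name can now be used in sqlContext.sql(...) queries.
val result = sqlContext.sql("SELECT aid, DId, BM, BY FROM SU WHERE cd = 2")
```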

1 Answer


This should be the DataFrame equivalent:

SU.filter($"cd" === 2)
  .select("aid","DId","BM","BY","TO")
  .distinct()
  .groupBy("aid","DId","BM","BY")
  .count()
  .filter($"count" > 1)
  .select("aid","DId","BM","BY")

2 Comments

Thanks, it worked fine... but the query is taking a long time to execute.
The distinct operation is usually expensive. Also, you can look at the number of shuffles and try to rewrite your query to decrease them.
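One way to see where the shuffles happen is to print the physical plan, where each shuffle appears as an Exchange operator. A minimal sketch, assuming the answer's DataFrame pipeline and an active Spark session with `$` syntax imported:

```scala
// Build the same pipeline as in the answer, then inspect its plan.
val result = SU.filter($"cd" === 2)
  .select("aid", "DId", "BM", "BY", "TO")
  .distinct()                            // shuffle 1: deduplicate rows
  .groupBy("aid", "DId", "BM", "BY")
  .count()                               // shuffle 2: aggregate by key
  .filter($"count" > 1)
  .select("aid", "DId", "BM", "BY")

// Exchange nodes in the printed plan mark where data is shuffled.
result.explain()
```

Both distinct() and groupBy() repartition by their key columns, so the plan above typically shows two exchanges; reducing the column set before the distinct, or combining the deduplication into the aggregation, are the usual places to look for savings.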
