Currently I am doing:

val DF = sqlSession.sql("select itemIdDig as itemId, "
      + "title, "
      + "timestamp as time "
      + "from itemTable ")

val tempDF = sqlSession.sql("select itemIdDig as itemId "
      + "from itemTable "
      + "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()


// keep the rows of DF whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF: _*)).toDF

But this is very slow. Can someone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF. (I tried group by with having, which gives me the unique itemIds, but I want to preserve the duplicates.)

1 Answer

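Just use an anti join (the complement of a semi join). Keep tempDF as a DataFrame, i.e. without the .rdd.map(...).collect() step, so that it can be joined:

DF.join(tempDF, Seq("itemId"), "leftanti")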
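
For a fuller picture, here is a minimal sketch of the same idea end to end. It is not the asker's actual job: the sample rows, the object name `AntiJoinSketch`, the helper `keepRare`, and the local `SparkSession` are all assumptions made for illustration, and the threshold is lowered from 10 to 3 to keep the sample small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AntiJoinSketch {
  // Hypothetical helper: keep every row whose itemId occurs fewer than
  // minCount times, preserving duplicate rows.
  def keepRare(items: DataFrame, minCount: Long): DataFrame = {
    // Ids that occur at least minCount times. This stays a DataFrame:
    // no .rdd, no collect(), so nothing is pulled back to the driver.
    val frequentIds = items
      .groupBy("itemId")
      .count()
      .filter(col("count") >= minCount)
      .select("itemId")

    // left_anti keeps the rows of `items` that have no match in frequentIds.
    items.join(frequentIds, Seq("itemId"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for itemTable.
    val items = Seq(
      (1, "a", 100L), (1, "a", 101L),
      (2, "b", 102L),
      (3, "c", 103L), (3, "c", 104L), (3, "c", 105L)
    ).toDF("itemId", "title", "time")

    // itemId 3 occurs three times and is dropped;
    // itemId 1 (both rows) and itemId 2 remain.
    keepRare(items, 3).show()

    spark.stop()
  }
}
```

One difference to be aware of: the two approaches can return different row counts when itemId contains nulls. `!col("itemId").isin(...)` evaluates to null for a null itemId, so the filter drops that row, whereas a left anti join keeps it because a null key never matches anything.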
2 Comments

Care to explain semi join?
I think it is "left_anti". Your way gives me a different result size than my way.
