Filter a DataFrame based on another DataFrame in Scala

Question

Currently I am doing:

val DF = sqlSession.sql("select itemIdDig as itemId, "
      + "title, "
      + "timestamp as time "
      + "from itemTable ")

val tempDF = sqlSession.sql("select itemIdDig as itemId "
      + "from itemTable "
      + "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()


// keep the rows of DF whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF: _*)).toDF

But this is very slow. Can someone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF. (I tried group by with having, which gives me the unique itemIds, but I want to preserve the duplicates.)

Community · Accepted Answer · 2018-01-17 04:36:03Z

2

Just use a left anti join, keeping tempDF as a DataFrame (drop the .rdd.map(r => r(0)).collect() step):

DF.join(tempDF, Seq("itemId"), "leftanti")
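
The join above can be sketched end to end as follows. This is a minimal sketch, assuming a local SparkSession and made-up sample rows; the object name AntiJoinSketch, the helper rowsNotInFrequent, and the parameter minCount are hypothetical names introduced for illustration. The frequent ids are computed here with the DataFrame API rather than the SQL from the question, and they stay a DataFrame so nothing is collected to the driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AntiJoinSketch {
  // Keep every row of `items` whose itemId occurs fewer than `minCount` times.
  // Duplicate rows of the surviving ids are preserved.
  def rowsNotInFrequent(items: DataFrame, minCount: Long): DataFrame = {
    // One row per itemId that reaches the threshold (the role of tempDF).
    val frequentIds = items
      .groupBy("itemId")
      .count()
      .filter(col("count") >= minCount)
      .select("itemId")

    // Left anti join: rows of `items` with no matching itemId in `frequentIds`.
    items.join(frequentIds, Seq("itemId"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: "c" appears three times, "a" twice, "b" once.
    val items = Seq(
      ("a", "t1", 1L), ("a", "t2", 2L),
      ("b", "t3", 3L),
      ("c", "t4", 4L), ("c", "t5", 5L), ("c", "t6", 6L)
    ).toDF("itemId", "title", "time")

    // With minCount = 3 the id "c" is excluded; both "a" rows and the "b" row remain.
    rowsNotInFrequent(items, 3L).show()

    spark.stop()
  }
}
```

Spark accepts both "leftanti" and "left_anti" as the join type string.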

edited Jan 17, 2018 at 4:36

CommunityBot


answered Jan 16, 2018 at 22:46

user9226757



2 Comments

user3407267 Over a year ago

Care to explain the semi join?

user3407267 Over a year ago

I think it is "left_anti". Also, my approach and yours give me results of different sizes.
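
One possible cause of such a size difference, offered as a guess since the thread does not show the data, is null handling: `!col("itemId").isin(...)` evaluates to null for rows whose itemId is null, so the filter drops them, while a left anti join keeps them because a null key never matches. The sketch below illustrates this with made-up rows; the object and method names (NullHandlingSketch, viaIsin, viaAntiJoin) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullHandlingSketch {
  // The filter from the question: NOT (itemId IN (ids...)).
  def viaIsin(items: DataFrame, ids: Seq[String]): DataFrame =
    items.filter(!col("itemId").isin(ids: _*))

  // The join from the answer, with the ids kept as a DataFrame.
  def viaAntiJoin(items: DataFrame, ids: DataFrame): DataFrame =
    items.join(ids, Seq("itemId"), "left_anti")

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-handling-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with one null itemId.
    val items = Seq[(String, Int)](("a", 1), ("b", 2), (null, 3)).toDF("itemId", "n")
    val ids = Seq("a")

    // Keeps only the "b" row: the null row is dropped because the predicate is null.
    viaIsin(items, ids).show()

    // Keeps the "b" row and the null row: null never matches "a".
    viaAntiJoin(items, ids.toDF("itemId")).show()

    spark.stop()
  }
}
```

If the two approaches must agree, the null rows can be filtered out explicitly before the join, or added back to the isin filter with an `isNull` check.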
