
I have two MongoDB collections, each with millions of records (say, 30 GB each). I am trying to perform a join on the data after loading the DataFrames into Spark. What I have noticed is that Spark does lazy loading in this case (the transformation/action scenario).

My PySpark code looks like this:

df1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con2).load()
res = df1.join(df2, df1.xyz == df2.xyz, "inner")

Now, all these lines execute almost immediately. But when I run:

res.show()

it takes ages to execute. My assessment is that the join is taking place within MongoDB, not in Spark.

Is there any way to avoid lazy loading and make sure that both DataFrames are in Spark and the join takes place in Spark? (I am trying to explore whether there is any visible difference between joining the data in MongoDB vs. in Spark.)
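For example, would explicitly persisting both DataFrames and forcing an action on each one be enough to guarantee that the join itself happens inside Spark? A rough sketch of what I mean (reusing df1 and df2 from above; xyz is just a placeholder join key):

# keep the loaded partitions in Spark memory/disk instead of re-reading from MongoDB
df1 = df1.persist()
df2 = df2.persist()
df1.count()  # action: forces the actual read of df1 from MongoDB
df2.count()  # action: forces the actual read of df2 from MongoDB

# the join should now run over the cached data inside Spark
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
res.show()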

  • There is a huge misunderstanding here about how Spark works. Spark will evaluate the first 3 lines in a lazy manner. This means it will just compute the DAG, and that's about it. On the other hand, res.show will trigger an action on a partition of the data, which will trigger the load for df1 and df2 and then perform the join. Mongo just sends the data Spark asked for. Commented Nov 15, 2018 at 15:04
  • Also, one problem might be that the MongoDB connector does not support pushing the join down to MongoDB itself (as in: Spark asking MongoDB to do the join rather than doing the join itself). Joining 30 GB on each side in memory might be really slow, also depending on how many workers you have and how much memory they can use. Databases are usually good at doing joins :) (with the right indexes). Commented Nov 15, 2018 at 16:33
  • @eliasah I think I have not phrased it correctly, hence the misunderstanding. I know that the action won't be triggered while I am asking for transformations, and will only be triggered upon res.show(). I am trying to find a way to perform the join operation in Spark instead of MongoDB. My assumption was that if both DataFrames are in Spark's memory, the join will be performed there. Commented Nov 19, 2018 at 7:08
  • @Frank This is true, but I want to see if Spark's distributed environment can somehow speed up join performance vs. Mongo. I know Spark is asking MongoDB to do the join, but I want it to somehow do the join itself. (How?) Commented Nov 19, 2018 at 7:09
  • Here is the thing: the last time I checked, the Mongo Spark connector supports only select and where (match in the MongoDB DSL) pushdown predicates. Commented Nov 19, 2018 at 8:23
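One way to sanity-check where the join actually runs is to look at the physical plan Spark builds for it: the connector's reads appear as the data sources, and the join itself should show up as a Spark join operator (e.g. SortMergeJoin or BroadcastHashJoin), with at most projections and filters pushed down to MongoDB. A minimal sketch, reusing df1 and df2 from the question:

res = df1.join(df2, df1.xyz == df2.xyz, "inner")
res.explain()  # prints the physical plan: the join is a Spark operator over the two Mongo reads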
