I have two MongoDB collections, each with millions of records (roughly 30 GB each). I am trying to join the data after loading both collections into Spark DataFrames. What I have noticed is that Spark evaluates these steps lazily (the usual transformation/action behaviour).
My PySpark code looks like this:
df1=spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri",con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri",con2).load()
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
All of these lines execute almost immediately. But when I run:
res.show()
It takes ages to execute. My assessment is that the join is being pushed down to MongoDB rather than running in Spark.
Is there any way to avoid the lazy evaluation and make sure that both DataFrames are fully loaded into Spark, so that the join happens in Spark? (I am trying to see whether there is any visible difference between joining the data in MongoDB versus in Spark.)
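To illustrate what I mean, something like the following sketch is what I have in mind, assuming that calling persist() and then triggering an action such as count() is enough to pull both collections into Spark before the join (con, con2, and the xyz join key are the same as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load both collections through the MongoDB Spark connector, same as before
df1 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con).load()
df2 = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", con2).load()

# Persist so Spark keeps the data in its own memory/disk after the first action
df1 = df1.persist()
df2 = df2.persist()

# Trigger an action on each DataFrame so the actual load from MongoDB happens now
print(df1.count(), df2.count())

# The join should then run against the cached data inside Spark
res = df1.join(df2, df1.xyz == df2.xyz, "inner")
res.show()

Is this the right way to do it, or is there a better one?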