
When I tried to write a Spark DataFrame into MongoDB, I found that Spark only creates one task to do it. This causes bad performance, because only one executor is actually running even if I allocate many executors for the job.

My partial PySpark code:

df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", connectionString) \
    .save()

Could Spark run multiple tasks in this case? Thanks.

Spark submit:

spark-submit --master yarn --num-executors 3 --executor-memory 5g --jars $JARS_PATH/mongo-java-driver-3.5.0.jar,$JARS_PATH/mongodb-driver-core-3.5.0.jar,$JARS_PATH/mongo-spark-connector_2.11-2.2.1.jar spark-mongo.py

I found a log that contains this INFO:

INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, linxtd-itbigd04, executor 1, partition 0, PROCESS_LOCAL, 4660 bytes)
INFO BlockManagerMasterEndpoint: Registering block manager linxtd-itbigd04:36793 with 1458.6 MB RAM, BlockManagerId(1, linxtd-itbigd04, 36793, None)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on linxtd-itbigd04:36793 (size: 19.7 KB, free: 1458.6 MB)
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 17364 ms on linxtd-itbigd04 (executor 1) (1/1)
  • And the question is? You have not asked any question. What you wrote here is a "statement" of what you did and what you "think". What would you be asking? Commented Nov 10, 2017 at 5:06
  • Share your spark submit command and your environment details Commented Nov 10, 2017 at 5:40
  • @Barry what type of tasks are we talking about here? If you mean saving data, normally this is what's happening. The bottleneck would be network IO and how much data Mongo can ingest before it crashes. Commented Nov 10, 2017 at 8:56
  • @eliasah The tasks are Spark tasks in a Spark job. I hope to import my data from MSSQL into MongoDB by using Spark and to do this distributed across my cluster (because I have 10+ billion records). But I found that Spark only creates ONE job with ONE task running on ONE node. That's my question. Commented Nov 14, 2017 at 3:16

1 Answer


As I suspected, and mentioned in the comments, your data wasn't partitioned, so Spark created one task to deal with it.
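As an illustration (not from the original answer), a minimal sketch of checking the partition count and forcing more partitions before the write; df and connectionString are the names from the question, and the target of 12 partitions is an arbitrary example:

# check how many partitions the DataFrame has; 1 here means the save runs as a single task
print(df.rdd.getNumPartitions())

# repartition so the MongoDB write is split across several tasks (12 is illustrative)
df.repartition(12) \
    .write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", connectionString) \
    .save()

Each resulting partition is written by its own task, so the write can be spread across the allocated executors.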

You have to be careful when using the JDBC source: if you don't provide partitioning options, reading and writing the data won't be parallelized and you end up with one task.
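For example (a sketch only; the URL, column name, bounds, and partition count below are hypothetical and have to match your actual table), a range-partitioned JDBC read looks roughly like this, whereas omitting partitionColumn/lowerBound/upperBound/numPartitions leaves you with a single partition:

# Spark splits the [lowerBound, upperBound] range of the partition column
# into numPartitions queries, one per task
jdbc_df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://<host>;databaseName=<db>") \
    .option("dbtable", "dbo.my_table") \
    .option("user", user) \
    .option("password", password) \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "10000000000") \
    .option("numPartitions", "100") \
    .load()

Spark then issues one query per range of the id column, and each resulting partition can be written out by a separate task.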

You can read more about this topic in one of my Spark gotchas: Reading data using the JDBC source.

Disclaimer: I'm one of the co-authors of that repo.
