
When I tried to write a Spark DataFrame into MongoDB, I found that Spark only creates one task to do it. This causes bad performance, because only one executor is actually running even if I allocate many executors for the job.

My partial PySpark code:

df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", connectionString) \
    .save()

Could Spark run multiple tasks in this case? Thanks.

Spark submit:

spark-submit --master yarn --num-executors 3 --executor-memory 5g --jars $JARS_PATH/mongo-java-driver-3.5.0.jar,$JARS_PATH/mongodb-driver-core-3.5.0.jar,$JARS_PATH/mongo-spark-connector_2.11-2.2.1.jar spark-mongo.py

I found a log that contains this INFO:

INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, linxtd-itbigd04, executor 1, partition 0, PROCESS_LOCAL, 4660 bytes)
INFO BlockManagerMasterEndpoint: Registering block manager linxtd-itbigd04:36793 with 1458.6 MB RAM, BlockManagerId(1, linxtd-itbigd04, 36793, None)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on linxtd-itbigd04:36793 (size: 19.7 KB, free: 1458.6 MB)
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 17364 ms on linxtd-itbigd04 (executor 1) (1/1)
  • And the question is? You have not asked any question. What you wrote here is a "statement" of what you did and what you "think". What would you be asking? Commented Nov 10, 2017 at 5:06
  • Share your spark submit command and your environment details Commented Nov 10, 2017 at 5:40
  • @Barry what type of tasks are we talking about here? If you mean saving data, normally this is what's happening. The bottleneck would be network IO and how much data Mongo can ingest before it crashes. Commented Nov 10, 2017 at 8:56
  • @eliasah The tasks are Spark tasks in a Spark job. I hope to import my data from MSSQL into MongoDB by using Spark and to do this distributed across my cluster (because I have 10+ billion records). But I found that Spark only creates ONE job with ONE task running on ONE node. That's my question. Commented Nov 14, 2017 at 3:16

1 Answer


As I suspected, and mentioned in the comments, your data wasn't partitioned, so Spark created one task to deal with it.
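As an illustration (not from the original answer), a minimal sketch of checking the partition count and forcing more partitions before the write; df and connectionString are the names from the question, and the target of 12 partitions is an arbitrary example:

# check how many partitions the DataFrame has; 1 here means the save runs as a single task
print(df.rdd.getNumPartitions())

# repartition so the MongoDB write is split across several tasks (12 is illustrative)
df.repartition(12) \
    .write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", connectionString) \
    .save()

Each resulting partition is written by its own task, so the write can be spread across the allocated executors.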

You have to be careful when using the JDBC source: if you don't provide partitioning options, reading and writing the data won't be parallelized and you end up with one task.
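For example (a sketch only; the URL, column name, bounds, and partition count below are hypothetical and have to match your actual table), a range-partitioned JDBC read looks roughly like this, whereas omitting partitionColumn/lowerBound/upperBound/numPartitions leaves you with a single partition:

# Spark splits the [lowerBound, upperBound] range of the partition column
# into numPartitions queries, one per task
jdbc_df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://<host>;databaseName=<db>") \
    .option("dbtable", "dbo.my_table") \
    .option("user", user) \
    .option("password", password) \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "10000000000") \
    .option("numPartitions", "100") \
    .load()

Spark then issues one query per range of the id column, and each resulting partition can be written out by a separate task.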

You can read more about this topic in one of my Spark gotchas: Reading data using the JDBC source.

Disclaimer: I'm one of the co-authors of that repo.
