
I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit and it will manage scheduling automatically based on available resources, but I want to guarantee that the jobs will run in order, waiting for the previous job to complete. I understand that I can write a script that just submits the jobs one after another, but I'm wondering if Spark has a built-in mechanism to handle these kinds of submissions.

What's more, I have several of these series of jobs. Supposing I have a series of jobs A -> B -> C and another D -> E -> F, I'd be fine with any one of A, B, or C running concurrently with any of D, E, or F, but not with any of A, B, or C running concurrently with each other. Does Spark have a built-in mechanism to handle this use case?
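Absent a dedicated scheduler, this exact constraint (chains may overlap with each other, but jobs within one chain must run in sequence) can be sketched with one thread per chain. The job commands below are placeholders standing in for spark-submit invocations:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_chain(commands):
    # Within one chain, jobs run strictly in sequence.
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            raise RuntimeError(f"job {cmd!r} failed; aborting this chain")

# Two independent chains; in practice each entry would be a
# spark-submit command line (the job names below are placeholders).
chain_abc = [[sys.executable, "-c", f"print('{j} done')"] for j in "ABC"]
chain_def = [[sys.executable, "-c", f"print('{j} done')"] for j in "DEF"]

# Chains run concurrently with each other, sequentially internally.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_chain, c) for c in (chain_abc, chain_def)]
    for future in futures:
        future.result()  # re-raise any chain failure
```

This is only a sketch of the ordering logic on the submitting host; a workflow tool would add retries, monitoring, and persistence on top of the same dependency structure.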

I've read a little about yarn's queueing mechanism allowing for multiple queues, but I'm not sure if this is the solution I'm looking for.

Thanks!

That's not Spark's or Yarn's purpose. You need to use a task scheduler/workflow tool to do that: Airflow, Azkaban, ... There are plenty. Commented Feb 25, 2023 at 12:25

1 Answer


Yarn's role is to distribute resources among your jobs.

If you submit all your jobs at the same time, they will start in an order determined by the resources requested, the queue priority, the queue strategy (FIFO or fair), and so on.

What you could do is create three different queues with different priorities and submit all the jobs at the same time, but that seems pretty dangerous.

You are basically looking for a scheduler like Airflow or Oozie.
