I'm trying to run an experiment that involves transforming a lot of files through a pipeline like the one below. A and B each take a file as input and produce either a file or some text on stdout. C takes those outputs and puts them into a database. (A, B, and C are local CLI programs, not servers.)

File ---------> A ----->
          |                C   ---> Database
          +---> B ----->

I want:

  • to queue thousands of files and process them through this pipeline in parallel
  • a web UI (or some human-friendly CLI) that allows me to
    • monitor progress, i.e. how many tasks have been processed/failed (task = one run of the above pipeline for one input file)
    • inspect failed tasks (input, error logs, etc.)
    • manually re-enqueue failed tasks (I don't need auto retry)
    • (optional) pause/resume/empty the queue (It doesn't need to be able to pause ongoing tasks)
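
To make these requirements concrete, here is a minimal sketch of the bookkeeping I have in mind, using only the Python standard library. It is not an existing tool: the table layout and the function names (`open_queue`, `enqueue`, `run_pending`, `progress`, `failed`, `requeue_failed`) are hypothetical, and `task` stands for any callable that runs the pipeline on one file and raises on failure.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def open_queue(db_path=":memory:"):
    """Open (or create) a task table; the schema is an assumption for illustration."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "path TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    return conn


def enqueue(conn, paths):
    """Add files as pending tasks; files already in the table are left alone."""
    conn.executemany(
        "INSERT OR IGNORE INTO tasks VALUES (?, 'pending', NULL)",
        [(p,) for p in paths],
    )
    conn.commit()


def _attempt(task, path):
    """Run one task and turn any exception into a 'failed' record."""
    try:
        task(path)
        return path, "done", None
    except Exception as exc:
        return path, "failed", repr(exc)


def run_pending(conn, task, workers=4):
    """Run every pending task in a thread pool, then record the outcomes."""
    paths = [r[0] for r in conn.execute(
        "SELECT path FROM tasks WHERE status = 'pending'")]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: _attempt(task, p), paths))
    # Only the main thread touches the SQLite connection.
    conn.executemany(
        "UPDATE tasks SET status = ?, error = ? WHERE path = ?",
        [(status, error, path) for path, status, error in results],
    )
    conn.commit()


def progress(conn):
    """Return a {status: count} mapping, e.g. for a progress display."""
    return dict(conn.execute(
        "SELECT status, COUNT(*) FROM tasks GROUP BY status"))


def failed(conn):
    """List failed tasks with their recorded error text."""
    return conn.execute(
        "SELECT path, error FROM tasks WHERE status = 'failed'").fetchall()


def requeue_failed(conn):
    """Manually put all failed tasks back into the pending state."""
    conn.execute(
        "UPDATE tasks SET status = 'pending', error = NULL "
        "WHERE status = 'failed'")
    conn.commit()
```

In this sketch the statuses are written only after the whole batch finishes, so it does not give live progress, and there is no pause/resume. I'm hoping an existing tool covers those parts, plus a nicer UI.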

(Note that A and B inside each task don't need to run in parallel, and the task scheduler doesn't have to be aware of the structure of the pipeline. I could write the whole pipeline as a simple sequential bash script that takes a file name as an argument, and just schedule that script for each input file.)
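
For illustration, the per-file wrapper could look roughly like this in Python instead of bash. This is a sketch under assumptions: `run_pipeline` is a hypothetical name, `cmd_a`, `cmd_b`, and `cmd_c` are placeholders for the real command lines, A and B are assumed to write to stdout, and C is assumed to read its input from stdin.

```python
import subprocess


def run_pipeline(path, cmd_a, cmd_b, cmd_c):
    """Run A and B on one file sequentially, then pass both outputs to C.

    Each cmd_* is an argument list such as ["A", "--some-flag"] (placeholders).
    check=True raises CalledProcessError if any step exits non-zero, so a
    scheduler can mark the whole task as failed.
    """
    out_a = subprocess.run(
        cmd_a + [path], capture_output=True, text=True, check=True
    ).stdout
    out_b = subprocess.run(
        cmd_b + [path], capture_output=True, text=True, check=True
    ).stdout
    # Assumption: C accepts the concatenated outputs of A and B on stdin.
    result = subprocess.run(
        cmd_c, input=out_a + out_b, capture_output=True, text=True, check=True
    )
    return result.stdout
```

The scheduler would then only need to call `run_pipeline` once per input file and record whether it raised.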

Are there any good free, open-source CLI programs or libraries (preferably in Python) for this use case?

I checked out Airflow but it seems like it's meant for running exactly the same static task (DAG) repeatedly, NOT for processing a lot of different files through a single DAG... I'm also trying out Dramatiq but it doesn't seem to allow you to easily re-run failed tasks (which is maybe more of a limitation of RabbitMQ?).

I'm not very familiar with these kinds of tools so I'm sorry if I'm asking something stupid.

  • Looks like an ETL task. So ETL tool could probably help (BO Data Integrator or IBM DataStage, for example). If not - write a program. Perl would be my choice. Commented Jul 1, 2022 at 12:30
  • @WhiteOwl Thanks for the info. I should've clarified I'm looking for free open source programs... Looks like those two are paid software? Anyway I'll look into free ETL tools (I didn't even know the term) Commented Jul 2, 2022 at 8:07
  • "I checked out Airflow but it seems like it's meant for running exactly the same static task (DAG) repeatedly, NOT" ? Check Airflow. The list of tasks can be generated... Commented Jul 3, 2022 at 11:10
