I'm trying to run an experiment that involves transforming a lot of files through a pipeline like this. A and B take a file as input and produce a file, or some text on stdout. C takes those outputs and puts them into a database. (A, B, and C are local CLI programs, not servers, btw.)
File -----> A -----+
  |                +---> C ---> Database
  +-------> B -----+
I want:
- to queue thousands of files and process them through this pipeline in parallel
- web UI (or some human-friendly CLI) that allows me to
- monitor progress, i.e. how many tasks have been processed/failed (a task = one run of the above pipeline for a single input file)
- inspect failed tasks (input, error logs, etc.)
- manually re-enqueue failed tasks (I don't need auto retry)
- (optional) pause/resume/empty the queue (It doesn't need to be able to pause ongoing tasks)
(Note that A and B inside each task don't need to run in parallel, and that the task scheduler doesn't have to be aware of the structure of the pipeline. I could write the whole pipeline as a simple sequential bash script that takes a file name as an argument, and just schedule that script once per input file.)
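To make it concrete, here's a minimal sketch of that per-file script. The real A, B, and C aren't specified here, so the three stages below are trivial stand-ins (uppercase, word count, append to a flat file) just to show the shape; any failing stage should make the whole task count as failed:

```shell
# process_one: run the A -> / B -> C pipeline for one input file.
# A, B, and C below are stand-in commands (assumptions), not the real programs.
process_one() {
  input="$1"
  workdir="$(mktemp -d)" || return 1

  # stage A (stand-in: uppercase the file into a new file)
  tr 'a-z' 'A-Z' < "$input" > "$workdir/a.out" || return 1
  # stage B (stand-in: emit a word count on stdout, captured to a file)
  wc -w < "$input" > "$workdir/b.out" || return 1
  # stage C (stand-in: load both outputs into a flat-file "database")
  cat "$workdir/a.out" "$workdir/b.out" >> results.db || return 1

  rm -r "$workdir"
}
```

With that, even something crude like `ls inputs/ | xargs -P 8 -n 1 ./process_one.sh` gives the parallelism, but none of the monitoring/re-enqueue features I'm after, which is why I'm looking for a proper tool.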
Are there any good free open-source CLI programs, or libraries (preferably in python) for this use case?
I checked out Airflow but it seems like it's meant for running exactly the same static task (DAG) repeatedly, NOT for processing a lot of different files through a single DAG... I'm also trying out Dramatiq but it doesn't seem to allow you to easily re-run failed tasks (which is maybe more of a limitation of RabbitMQ?).
I'm not very familiar with these kinds of tools so I'm sorry if I'm asking something stupid.