
I have a dockerized Flask application running locally that consists of several REST API endpoints. It's all working as expected: when a GET request is performed on one of the endpoints, data is retrieved from the Postgres database and displayed in the browser as JSON. Great. The database so far holds just test data, and now I need to continually update it with real data.

I have the script that pulls data from the web, and I understand how to add it to the database with POST and PUT requests. What I don't understand is how and where to have this script continually running so that it doesn't interfere with the REST API portion of my server and vice versa, almost as though it's a completely separate entity within the backend.
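For reference, the update step in the script is currently just a plain HTTP call to the API, roughly like this (the endpoint and payload are placeholders, not my real schema):

```python
import requests

# Hypothetical endpoint and record -- the real script pulls data from
# the web and POSTs each record to the Flask API.
API_URL = "http://localhost:5000/api/items"

record = {"name": "example", "value": 42}
response = requests.post(API_URL, json=record, timeout=10)
response.raise_for_status()
```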

To do this, would I create an entirely new Flask app that runs on its own server and continually runs the script, adding the scraped data to the database so that the other Flask app, which contains the API endpoints, can access it when needed? I feel as though I am way off here, and any input on the best way to move forward is extremely appreciated. Thank you!

  • Check out blog.miguelgrinberg.com/post/… Highly recommended Flask tutorial. In your case, check out the Background Jobs section. Commented Mar 12, 2019 at 19:27
  • When you say 'interfere with the REST API', what do you mean? Is the REST API a bottleneck? Do you expect your scaling needs for web traffic to be much different from your scaling needs for your scraping jobs? Commented Mar 12, 2019 at 19:37
  • @aedry Thank you for the resource. It sounds like task queues might be exactly what I need (a minimal example is sketched below these comments). Commented Mar 12, 2019 at 19:40
  • @ThomasIngalls By 'interfere with the REST API', I meant that during development I could stop the server, make changes to the REST API portion, then rerun it, all without ever stopping and restarting the scripts that are constantly scraping the web and adding to the db. And yes, the scaling needs for web traffic will be minimal compared to the scaling needs of the scraping jobs. Thanks for pointing that out, and for your suggestion of two databases. Commented Mar 12, 2019 at 19:40
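A minimal sketch of the task-queue idea from these comments, using RQ (the `scrape_site` function and module are hypothetical):

```python
from redis import Redis
from rq import Queue

from scraper import scrape_site  # hypothetical module with the scraping job

# Enqueue scraping jobs onto Redis so they run in a separate worker
# process, completely outside the Flask API.
queue = Queue("scraping", connection=Redis())
job = queue.enqueue(scrape_site, "https://example.com")
print(job.id)
```

A worker started with `rq worker scraping` then picks the jobs up independently of the API server.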

1 Answer


You are not far off at all, in my opinion.

I would say, let your API stand on its own - as the gateway to your database. In its own container.

The scraping you want to do is another process - and you should not mix it into the Flask API application. Instead, since you are already in the Docker realm here - consider creating another image that does the scraping for you. This can be a bash script or a Python app - it's not important, as long as you keep it as simple as possible.
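A minimal docker-compose sketch of that layout (service names, paths and credentials are illustrative):

```yaml
version: "3.8"

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example

  api:
    build: ./api        # the Flask REST API, unchanged
    ports:
      - "5000:5000"
    depends_on:
      - db

  scraper:
    build: ./scraper    # separate image that only scrapes and writes to db
    depends_on:
      - db
```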

You could even consider building that application/script image in such a way that you can run multiple instances of it in parallel.
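With a layout like the compose sketch above, that is then mostly `docker compose up --scale scraper=3` - assuming the scraper containers are stateless and coordinate their input so they don't scrape the same things (see the comments below).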

Yes, you will have two images to maintain. But they will each be smaller on their own, and less complex. And, if done right, you can scale the activity up if needed.

Consider the first two statements of the UNIX philosophy:

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
  2. Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.

Maintainability is king in the game of software development. Big cluttered projects have a hard time surviving in the long run.

Afterthought: If your project is experimental and you just want to prove a concept - then do that, and don't overthink the design. Too many projects die from that too!

Those are my thoughts, at least.


3 Comments

This is exactly what I was hoping to hear. Not mixing the scraping into the Flask API application sounds way less messy, especially since the scraping is going to end up being quite extensive. I will go with two images. 'You could even consider building that application/script image in such a way that you can run multiple instances of it in parallel.' - This is exactly what I need to do, but I am a little unsure of how to go about it. Any direction you could point me to for doing something like this? Thank you very much; your answer has helped a ton.
It depends a bit on how you are managing the scraping input. Is it a list of URLs? Are you recursing to a depth of n? If you have a number of containers scraping for you and you don't want them to overlap on each other's scrapes, they will need to work from a single queue/list of input. You then need some transactional logic to protect the queue and hand out batches of tasks. How do you get the input for the scrapers?
A single queue/list of input sounds like what is needed, since I'll be scraping from a list of URLs, to which new URLs are continually being added based on which websites are in the database.
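A common way to implement that transactional hand-out in Postgres is row locking with `FOR UPDATE SKIP LOCKED`. A rough sketch, assuming a made-up `urls` table with `id`, `url` and `status` columns:

```python
import psycopg2

conn = psycopg2.connect("dbname=scraper user=postgres")  # placeholder DSN

def claim_batch(size=10):
    """Atomically claim a batch of pending URLs for one worker.

    SKIP LOCKED makes parallel workers skip rows another worker has
    already locked, so batches never overlap.
    """
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE urls
                   SET status = 'in_progress'
                 WHERE id IN (
                       SELECT id FROM urls
                        WHERE status = 'pending'
                        ORDER BY id
                        LIMIT %s
                        FOR UPDATE SKIP LOCKED)
                RETURNING id, url
                """,
                (size,),
            )
            return cur.fetchall()
```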
