
Are you aware of any way to ingest data from an HTTP endpoint in a Dataflow pipeline coded in Python?

My current solution is to schedule calls to this endpoint, which returns JSON-formatted data, save the result to disk, and have the pipeline ingest the file.
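That workaround could look roughly like the following sketch (the function name, URL, and output path are placeholders; the `opener` parameter is only there so the function can be exercised without a live endpoint):

```python
import json
from urllib.request import urlopen

# Sketch of the current workaround: a scheduled job fetches the endpoint
# and writes the JSON to disk so the pipeline can ingest the file later.
def fetch_and_save(url, out_path, opener=urlopen):
    # `opener` is injectable for offline testing; by default it performs
    # a real HTTP request via urllib.
    with opener(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    with open(out_path, "w") as f:
        json.dump(data, f)
    return data
```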

What I would now like is for Dataflow to read this HTTP endpoint on a regular basis.

1 Comment
    I imagine you could use the requests module to make a synchronous API call within the implementation of a transform. Commented Mar 15, 2018 at 15:16

1 Answer


As Andrew suggested, you can read the data inside a transform (a ParDo). The data can then be processed downstream.
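A minimal sketch of that approach, assuming the Apache Beam Python SDK (the class name is hypothetical, and the `beam.DoFn` subclassing is noted in a comment so the fetch/parse logic here stays stdlib-only):

```python
import json
from urllib.request import urlopen

# Sketch of a DoFn that fetches JSON from an HTTP endpoint inside the
# pipeline, as suggested in the comments.
class FetchJsonFn:  # in a real pipeline: class FetchJsonFn(beam.DoFn)
    def __init__(self, opener=urlopen):
        self._opener = opener  # injectable for offline testing

    def process(self, url):
        # One synchronous HTTP call per input element; each JSON record
        # in the response becomes one pipeline element.
        with self._opener(url) as resp:
            for record in json.loads(resp.read().decode("utf-8")):
                yield record
```

In the pipeline itself this would be wired up along the lines of `p | beam.Create(["https://example.com/data"]) | beam.ParDo(FetchJsonFn())`, with an external scheduler (e.g. cron or Cloud Scheduler) launching the pipeline to cover the "regular basis" requirement.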


3 Comments

Alright. I have one question though: ParDo is by definition parallelized. Should I (if possible) specify that the API call not be parallelized?
If you have a small amount of data, it's probably fine not to parallelize it (for example, send in one key via an in-memory PCollection). If you have a lot of data and an easy way to partition it, you can parallelize the calls as well. For example, if you were reading a collection of filenames, you could have a ParDo that reads the file for each key; that way you would read all of the files in parallel. If you have some way to know which section of the data each RPC covers, this should work fine. If not, it can be harder to parallelize.
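The fan-out pattern described in the comment above can be sketched as follows (hypothetical class name; the `beam.DoFn` subclassing and pipeline wiring are noted in comments so the example stays stdlib-only):

```python
import json

# Each element of the input PCollection is a filename; the ParDo reads
# that file and emits its records, so the files are read in parallel
# across workers.
class ReadFileFn:  # in a real pipeline: class ReadFileFn(beam.DoFn)
    def process(self, filename):
        with open(filename) as f:
            for record in json.load(f):
                yield record

# Wiring sketch: p | beam.Create(filenames) | beam.ParDo(ReadFileFn())
```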
Thanks Lara, I'll try that!
