
Are you aware of any way to ingest data from an HTTP endpoint in a Dataflow pipeline coded in Python?

My current solution is to schedule calls to this endpoint, which returns JSON-formatted data, save the result to disk, and have the pipeline ingest the file.
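That workaround could look roughly like the following sketch (the function name, URL, and output path are placeholders; the `opener` parameter is only there so the function can be exercised without a live endpoint):

```python
import json
from urllib.request import urlopen

# Sketch of the current workaround: a scheduled job fetches the endpoint
# and writes the JSON to disk so the pipeline can ingest the file later.
def fetch_and_save(url, out_path, opener=urlopen):
    # `opener` is injectable for offline testing; by default it performs
    # a real HTTP request via urllib.
    with opener(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    with open(out_path, "w") as f:
        json.dump(data, f)
    return data
```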

What I would now like is for Dataflow to read this HTTP endpoint on a regular basis.

1 Comment
    I imagine you could use the requests module to make a synchronous API call within the implementation of a transform. Commented Mar 15, 2018 at 15:16

1 Answer


As Andrew suggested, you can read the data inside a transform (a ParDo). The data can then be processed downstream.
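A minimal sketch of that approach, assuming the Apache Beam Python SDK (the class name is hypothetical, and the `beam.DoFn` subclassing is noted in a comment so the fetch/parse logic here stays stdlib-only):

```python
import json
from urllib.request import urlopen

# Sketch of a DoFn that fetches JSON from an HTTP endpoint inside the
# pipeline, as suggested in the comments.
class FetchJsonFn:  # in a real pipeline: class FetchJsonFn(beam.DoFn)
    def __init__(self, opener=urlopen):
        self._opener = opener  # injectable for offline testing

    def process(self, url):
        # One synchronous HTTP call per input element; each JSON record
        # in the response becomes one pipeline element.
        with self._opener(url) as resp:
            for record in json.loads(resp.read().decode("utf-8")):
                yield record
```

In the pipeline itself this would be wired up along the lines of `p | beam.Create(["https://example.com/data"]) | beam.ParDo(FetchJsonFn())`, with an external scheduler (e.g. cron or Cloud Scheduler) launching the pipeline to cover the "regular basis" requirement.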


3 Comments

Alright. I have one question though: ParDo is by definition parallelized. Should I (if possible) specify that the API call not be parallelized?
If you have a small amount of data, it's probably fine not to parallelize it (for example, send in one key via an in-memory PCollection). If you have a lot of data and an easy way to partition it, you can parallelize the calls as well. For example, if you were reading a collection of filenames, you could have a ParDo that reads the file for each key; that way you would read all of the files in parallel. If you have some way to know which section of the data each RPC covers, this should work fine. If not, it can be harder to parallelize.
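The fan-out pattern described in the comment above can be sketched as follows (hypothetical class name; the `beam.DoFn` subclassing and pipeline wiring are noted in comments so the example stays stdlib-only):

```python
import json

# Each element of the input PCollection is a filename; the ParDo reads
# that file and emits its records, so the files are read in parallel
# across workers.
class ReadFileFn:  # in a real pipeline: class ReadFileFn(beam.DoFn)
    def process(self, filename):
        with open(filename) as f:
            for record in json.load(f):
                yield record

# Wiring sketch: p | beam.Create(filenames) | beam.ParDo(ReadFileFn())
```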
Thanks Lara, I'll try that!
