
I am writing a Dataflow job which reads from BigQuery and does a few transformations.

data = (
    pipeline
    | beam.io.ReadFromBigQuery(query='''
    SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
    ''', use_standard_sql=True)
    | beam.Map(print)
)

But my requirement is to read from BigQuery only after receiving a notification from a PubSub topic. The above Dataflow job should start reading data from BigQuery only if the message below is received. If it has a different job id or a different status, then no action should be taken.

PubSub message: {'job_id': 101, 'status': 'Success'}

Any help on this part?

  • Alternatively, you can use Beam's Wait transform (example here: rmannibucau.metawerx.net/post/…). (I am not the author of the post.)

2 Answers


That is fairly easy; the code would look like this:

import json

pubsub_msg = (
    pipeline
    # note: ReadFromPubSub takes either a topic or a subscription, not both
    | beam.io.gcp.pubsub.ReadFromPubSub(subscription=my_subscription)
    # PubSub payloads arrive as bytes; decode them into dicts first
    | beam.Map(json.loads)
)

bigquery_data = (
    pubsub_msg
    | beam.Filter(lambda msg: msg['job_id'] == 101)   # you might want to use a more sophisticated filter condition
    | beam.io.ReadFromBigQuery(query='''
    SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
    ''', use_standard_sql=True)
)
bigquery_data | beam.Map(print)

However, if you do it like that, you will have a streaming Dataflow job running (indefinitely, or until you cancel the job), since using ReadFromPubSub automatically results in a streaming job. Consequently, this does not start a Dataflow job when a message arrives in PubSub; rather, one job is already running and listening to the topic for something to do.

If you actually want to trigger a Dataflow batch job, I would recommend using a Dataflow template and starting this template with a Cloud Function which listens to your PubSub topic. The filtering logic would then live within this Cloud Function (as a simple if condition).
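For illustration, such a Cloud Function could look roughly like this. This is a sketch, not code from the original answer: it assumes a classic template already staged in GCS, and PROJECT, REGION, TEMPLATE_PATH and the job name are placeholder values you would replace with your own.

import base64
import json

from googleapiclient.discovery import build

PROJECT = 'my-project'                                  # placeholder: your GCP project id
REGION = 'europe-west1'                                 # placeholder: your Dataflow region
TEMPLATE_PATH = 'gs://my-bucket/templates/my-template'  # placeholder: GCS path of the staged template

def trigger_dataflow(event, context):
    """Background Cloud Function triggered by a message on the PubSub topic."""
    # PubSub delivers the payload base64-encoded in event['data'].
    message = json.loads(base64.b64decode(event['data']).decode('utf-8'))

    # the filtering logic from the question, as a simple if condition
    if message.get('job_id') != 101 or message.get('status') != 'Success':
        return  # different job id or status: do nothing

    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_PATH,
        body={'jobName': 'bq-read-{}'.format(message['job_id'])},
    )
    print(request.execute())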


2 Comments

Thanks for your answer. But how can I check if the status is 'Success' for the job_id and only then proceed to the next step? The code looks like it will move to the ReadFromBigQuery task even if the Filter doesn't give any results.
beam.Filter expects a custom function that returns either true or false. Only those elements for which the function returns true are propagated down the line. So if your function always returns false, ReadFromBigQuery is never executed. You may use any complex custom function you like (including checking on status in your case); see the documentation.
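For illustration, such a filter function could look roughly like this (a sketch; it assumes the PubSub payload has already been decoded into a dict, as in the pipeline above):

def is_successful_run(msg, expected_job_id=101):
    # keep only messages for the expected job id whose status is 'Success'
    return msg.get('job_id') == expected_job_id and msg.get('status') == 'Success'

# used in the pipeline as: pubsub_msg | beam.Filter(is_successful_run) | ...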

I ended up using Cloud Functions, added the filtering logic in it, and started the Dataflow job from there. I found the link below useful: How to trigger a dataflow with a cloud function? (Python SDK)

