
I'm new to Pub/Sub and Dataflow/Beam. I have done a task with Spark and Kafka, and I want to do the same using Pub/Sub and Dataflow/Beam. From what I understand so far, Kafka is similar to Pub/Sub, and Spark is similar to Dataflow/Beam.

The task is to take a JSON file and write its contents to a Pub/Sub topic. Then, using Beam/Dataflow, I need to get that data into a PCollection. How can I achieve this?
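For the file -> Pub/Sub half, a publish step with the google-cloud-pubsub client library might look like the sketch below; the project name, topic name, and file path are placeholders.

from google.cloud import pubsub_v1

# Sketch: publish each line of a newline-delimited JSON file to a topic.
# 'my-project', 'my-topic', and 'input.json' are placeholders.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

with open('input.json') as f:
    for line in f:
        # Pub/Sub message payloads must be bytes.
        future = publisher.publish(topic_path, data=line.encode('utf-8'))
        future.result()  # block until this message is accepted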

2 Comments

  • The Apache Beam Python SDK does not support reading from Pub/Sub. Reference: Built-in I/O Transforms. Commented Mar 15, 2018 at 15:12
  • Beam-PubSub What about this? Commented Mar 15, 2018 at 15:15

2 Answers


I solved the problem above. I'm able to continuously read data from a Pub/Sub topic, do some processing, and then write the results to Datastore:

import apache_beam as beam
from apache_beam.transforms import window
# Import path for the Datastore sink in the SDK version used here.
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore

with beam.Pipeline(options=options) as p:

    # Read from Pub/Sub into a PCollection of strings.
    # (Newer SDKs replace this with beam.io.ReadFromPubSub, which yields bytes.)
    lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic)

    # Split, parse, window, and aggregate each JSON object.
    transformed = (lines
                   | 'Split' >> beam.FlatMap(lambda x: x.split('\n'))
                   | 'jsonParse' >> beam.ParDo(jsonParse())
                   | 'Window' >> beam.WindowInto(window.FixedWindows(15, 0))
                   | 'Combine' >> beam.CombinePerKey(sum))

    # Wrap each aggregated element in a Datastore entity.
    transformed = transformed | 'create entity' >> beam.Map(
        EntityWrapper(config.NAMESPACE, config.KIND, config.ANCESTOR).make_entity)

    # Write to Datastore.
    transformed | 'write to datastore' >> WriteToDatastore(known_args.dataset_id)
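The jsonParse DoFn used above is not shown in the answer; a minimal sketch of what it might look like, assuming each element is one JSON object and the pipeline counts occurrences per key via CombinePerKey(sum):

import json
import apache_beam as beam

class jsonParse(beam.DoFn):
    # Sketch only (not from the original answer): parse one JSON object
    # and emit a (key, 1) pair for the CombinePerKey(sum) step.
    # 'name' is a placeholder for whatever field you group on.
    def process(self, element):
        record = json.loads(element)
        yield (record['name'], 1)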

1 Comment

Which runner do you use? It seems this doesn't work with the SparkRunner or FlinkRunner. Please take a look at this question if you are still working in this domain: beam.apache.org/documentation/runners/spark/… Many thanks.

Pub/Sub is a streaming source/sink (it doesn't make sense to read from or write to it only once), and streaming support is not yet available in the Dataflow Python SDK.

Documentation: https://cloud.google.com/dataflow/release-notes/release-notes-python.

Once streaming is available, you should be able to do this pretty trivially.

However, if you are going from file -> Pub/Sub and then Pub/Sub -> PCollection, you should be able to do this with a batch pipeline and drop the Pub/Sub aspect entirely. You can look at the basic file I/O for Beam.
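A minimal sketch of such a batch pipeline, assuming one JSON object per line and a placeholder file path:

import json
import apache_beam as beam

# Sketch: batch pipeline that reads a file straight into a PCollection,
# with no Pub/Sub involved. 'input.json' is a placeholder path.
with beam.Pipeline() as p:
    records = (p
               | 'Read' >> beam.io.ReadFromText('input.json')
               | 'Parse' >> beam.Map(json.loads))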

2 Comments

If you are interested in Python streaming, you can email [email protected] with questions about when it will be available.
This is now supported in Python: cloud.google.com/blog/products/data-analytics/…
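For reference, a minimal streaming read in current SDK versions might look like the sketch below; note that beam.io.ReadFromPubSub yields bytes, and the topic path here is a placeholder.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch for newer SDKs: ReadStringsFromPubSub has been deprecated in
# favor of ReadFromPubSub, which yields bytes. Topic path is a placeholder.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    lines = (p
             | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
             | beam.Map(lambda b: b.decode('utf-8')))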
