
I'm new to Pub/Sub and Dataflow/Beam. I have done a task with Spark and Kafka, and I want to do the same using Pub/Sub and Dataflow/Beam. From what I understand so far, Kafka is similar to Pub/Sub, and Spark is similar to Dataflow/Beam.

The task is to take a JSON file and write its contents to a Pub/Sub topic. Then, using Beam/Dataflow, I need to get that data into a PCollection. How can I achieve this?
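For the file -> Pub/Sub half, a publish step with the google-cloud-pubsub client library might look like the sketch below; the project name, topic name, and file path are placeholders.

from google.cloud import pubsub_v1

# Sketch: publish each line of a newline-delimited JSON file to a topic.
# 'my-project', 'my-topic', and 'input.json' are placeholders.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

with open('input.json') as f:
    for line in f:
        # Pub/Sub message payloads must be bytes.
        future = publisher.publish(topic_path, data=line.encode('utf-8'))
        future.result()  # block until this message is accepted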

2 Comments

  • The Apache Beam Python SDK does not support reading from Pub/Sub. Reference: Built-in I/O Transforms. Commented Mar 15, 2018 at 15:12
  • Beam-PubSub What about this? Commented Mar 15, 2018 at 15:15

2 Answers


I solved the problem above. I'm able to continuously read data from a Pub/Sub topic, do some processing, and then write the results to Datastore:

import apache_beam as beam
from apache_beam.transforms import window
# Import path for the Datastore sink in the SDK version used here.
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore

with beam.Pipeline(options=options) as p:

    # Read from Pub/Sub into a PCollection of strings.
    # (Newer SDKs replace this with beam.io.ReadFromPubSub, which yields bytes.)
    lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic)

    # Split, parse, window, and aggregate each JSON object.
    transformed = (lines
                   | 'Split' >> beam.FlatMap(lambda x: x.split('\n'))
                   | 'jsonParse' >> beam.ParDo(jsonParse())
                   | 'Window' >> beam.WindowInto(window.FixedWindows(15, 0))
                   | 'Combine' >> beam.CombinePerKey(sum))

    # Wrap each aggregated element in a Datastore entity.
    transformed = transformed | 'create entity' >> beam.Map(
        EntityWrapper(config.NAMESPACE, config.KIND, config.ANCESTOR).make_entity)

    # Write to Datastore.
    transformed | 'write to datastore' >> WriteToDatastore(known_args.dataset_id)
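The jsonParse DoFn used above is not shown in the answer; a minimal sketch of what it might look like, assuming each element is one JSON object and the pipeline counts occurrences per key via CombinePerKey(sum):

import json
import apache_beam as beam

class jsonParse(beam.DoFn):
    # Sketch only (not from the original answer): parse one JSON object
    # and emit a (key, 1) pair for the CombinePerKey(sum) step.
    # 'name' is a placeholder for whatever field you group on.
    def process(self, element):
        record = json.loads(element)
        yield (record['name'], 1)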

1 Comment

Which runner do you use? It seems this doesn't work with the SparkRunner or FlinkRunner. Please take a look at this question if you are still working in this domain: beam.apache.org/documentation/runners/spark/… Many thanks.

Pub/Sub is a streaming source/sink (it doesn't make sense to read from or write to it only once), and streaming support is not yet available in the Dataflow Python SDK.

Documentation: https://cloud.google.com/dataflow/release-notes/release-notes-python.

Once streaming is available, you should be able to do this pretty trivially.

However, if you are going from file -> Pub/Sub and then Pub/Sub -> PCollection, you should be able to do this with a batch pipeline and drop the Pub/Sub aspect entirely. You can look at the basic file I/O for Beam.
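A minimal sketch of such a batch pipeline, assuming one JSON object per line and a placeholder file path:

import json
import apache_beam as beam

# Sketch: batch pipeline that reads a file straight into a PCollection,
# with no Pub/Sub involved. 'input.json' is a placeholder path.
with beam.Pipeline() as p:
    records = (p
               | 'Read' >> beam.io.ReadFromText('input.json')
               | 'Parse' >> beam.Map(json.loads))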

2 Comments

If you are interested in Python streaming, you can email [email protected] with questions about when it will be available.
This is now supported in Python: cloud.google.com/blog/products/data-analytics/…
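For reference, a minimal streaming read in current SDK versions might look like the sketch below; note that beam.io.ReadFromPubSub yields bytes, and the topic path here is a placeholder.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch for newer SDKs: ReadStringsFromPubSub has been deprecated in
# favor of ReadFromPubSub, which yields bytes. Topic path is a placeholder.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    lines = (p
             | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
             | beam.Map(lambda b: b.decode('utf-8')))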
