Structured Streaming Python API

Question

The docs say that stateful operations like mapGroupsWithState in Structured Streaming are supported only in Scala and Java, but I need stateful capabilities in Python. What should I do?

maverik · Accepted Answer · 2018-04-13 21:52:34Z

6

If you insist on using PySpark, there are two options:

1. Perform the preprocessing in one Spark job, then write the necessary "state" stream to a file sink. In a second job, read that stream back and perform the output action. This adds memory, disk, and latency overhead.
2. Use the updateStateByKey API instead. This requires the DStreams approach rather than Structured Streaming.
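As a rough illustration of the updateStateByKey route, here is a minimal sketch of a running word count with DStreams. The names `update_count`, `build_word_counts`, and `main`, the socket source on localhost:9999, and the checkpoint path are all assumptions made for the example, not part of the original question. The update function is plain Python, so it can be checked without a cluster.

```python
from typing import List, Optional


def update_count(new_values: List[int], running_count: Optional[int]) -> int:
    """State update function for updateStateByKey.

    new_values holds this batch's values for one key; running_count is the
    previously stored state, or None the first time the key is seen.
    """
    return sum(new_values) + (running_count or 0)


def build_word_counts(ssc, host: str = "localhost", port: int = 9999):
    """Wire update_count into a DStream pipeline (hypothetical source/paths)."""
    # updateStateByKey needs a checkpoint directory; this path is a placeholder.
    ssc.checkpoint("/tmp/stateful-checkpoint")
    lines = ssc.socketTextStream(host, port)
    pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
    return pairs.updateStateByKey(update_count)


def main():
    # Imported here so the update function above stays usable without pyspark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StatefulWordCount")
    ssc = StreamingContext(sc, 5)  # 5-second batches
    build_word_counts(ssc).pprint()
    ssc.start()
    ssc.awaitTermination()
```

Calling `main()` starts the stream. Keeping the state logic in a standalone function also makes it easy to unit test before running it on real data.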

Neither approach is great. If you need the latest and greatest API features, I'd recommend transitioning to Scala now; otherwise you will run into this problem repeatedly as your project progresses. Because Spark is written in Scala, the Python API tends to lag behind.

