4

In the doc it says that Stateful Operations like mapGroupsWithState in Structured Streaming supported only in Scala and Java but I do need statful capabilities in Python. What should I do?

1 Answer 1

6

If you insist on using Pyspark -

  1. Perform the preprocessing action in one spark job, then store the necessary "state" stream to a file sink. In another job, read this stream and perform the output action. There's an extra memory/disk/latency overhead involved.

  2. Use updateStateByKey API instead. This will require DStreams approach instead of Structured Streaming.

Neither approach is great. If you need the latest and the greatest API features, I'd recommend transitioning to Scala now. As your project progresses, you will run into this problem repeatedly. Since Spark is written in Scala, the Python API always lags behind.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.