In the doc it says that Stateful Operations like mapGroupsWithState in Structured Streaming supported only in Scala and Java but I do need statful capabilities in Python. What should I do?
1 Answer
If you insist on using Pyspark -
Perform the preprocessing action in one spark job, then store the necessary "state" stream to a file sink. In another job, read this stream and perform the output action. There's an extra memory/disk/latency overhead involved.
Use updateStateByKey API instead. This will require DStreams approach instead of Structured Streaming.
Neither approach is great. If you need the latest and the greatest API features, I'd recommend transitioning to Scala now. As your project progresses, you will run into this problem repeatedly. Since Spark is written in Scala, the Python API always lags behind.