I'm trying to fill missing values in a Spark DataFrame using PySpark, but I can't find a proper way to do it. My task is to fill the missing values of some rows based on their previous or following rows. Concretely, I want to replace the 0.0 value of a row with the value of the previous row, while leaving non-zero rows unchanged. I did see the Window function in Spark, but it seems to support only simple operations like max, min, and mean, which are not suitable for my case. Ideally, we could have a user-defined function sliding over the given Window. Does anybody have a good idea?
- Please share example data, the code you tried, and the expected output. – mtoto, Jul 17, 2016 at 12:01
- How would you define "the previous row"? Any sorting? – Jacek Laskowski, Nov 25, 2016 at 12:08
1 Answer
Use the Spark window API to access the previous row's data. If you work with time series data, see also this package for missing data imputation.
1 Comment
Milad Khajavi
@wayag If the answer works for you, accept the answer :)