I'm trying to fill missing values in a Spark DataFrame using PySpark, but I can't find a proper way to do it. My task is to fill the missing values of some rows based on their previous or following rows. Concretely, I want to replace the 0.0 value of a row with the value of the previous row, while leaving non-zero rows unchanged. I did see the Window functions in Spark, but they only seem to support simple operations like max, min, and mean, which are not suitable for my case. Ideally, a user-defined function could slide over the given window. Does anybody have a good idea?

  • Please share example data, the code you tried, and the expected output. Commented Jul 17, 2016 at 12:01
  • How would you define "the previous row"? Any sorting? Commented Nov 25, 2016 at 12:08

1 Answer

Use the Spark window API to access the previous row's data, for example with the lag function over an ordered window. If you work with time series data, see also this package for missing data imputation.

1 Comment

@wayag If the answer works for you, accept the answer :)
