
I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid and, on each square of the grid, apply a sampling function called "sample()" that takes three inputs, then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in a map-reduce fashion:

val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
    val latMin = row._1
    val latMax = latMin + 0.0001
    val lonMin = row._2
    val lonMax = lonMin + 0.0001
    val filterDF = featuresDS
      .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
      .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
    val sampleDS = filterDF.sample(false, 0.05, 1234)
    sampleDS
})


val output = sampleDataRDD.reduce(_ union _)

I've tried various ways of dealing with this, including converting sampleDS to an RDD and to a List, but I still get a NullPointerException when calling "collect" on output.

I'm thinking I need to find a different solution, but I don't see it.
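One way around the nested-Dataset restriction (a sketch, not from the original post, assuming uniform 0.0001-degree cells and the `Latitude`/`Longitude` column names above) is to never build a Dataset inside another: derive a grid-cell id with column expressions, and sample rows directly. With the same 5% fraction in every cell, a single Bernoulli `sample()` over the tagged Dataset is equivalent to sampling each cell independently:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor}

object GridSampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("grid-sample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the poster's featuresDS.
    val featuresDS = Seq(
      (40.71281, -74.00601),
      (40.71292, -74.00613),
      (40.71305, -74.00627)
    ).toDF("Latitude", "Longitude")

    // Tag each feature with the 0.0001-degree cell it falls in.
    val tagged = featuresDS
      .withColumn("cellLat", floor(col("Latitude") / 0.0001))
      .withColumn("cellLon", floor(col("Longitude") / 0.0001))

    // Each row is kept with probability 0.05 regardless of its cell,
    // which matches per-cell sampling at a uniform rate.
    val output = tagged.sample(false, 0.05, 1234)
    output.show()

    spark.stop()
  }
}
```

If the sampling fraction had to vary per cell, `tagged.stat.sampleBy` over a string cell key would do stratified sampling, though its `fractions` map must be built on the driver, so it only fits a modest number of strata.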

I've referenced these questions thus far:

Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset

Creating a Spark DataFrame from an RDD of lists

  • There are many ways to do this, but there's not enough info in the question to determine which one is the best. One way would be to use partitionBy to divide the dataset into grids, but that may not be viable, or may produce skewed partitions, depending on the data distribution. You could also use lists, but only if each grid has a relatively small number of elements. It also depends on what you're trying to do. Are you just trying to split the data into squares? If so, partitionBy is the way to go. If not, and you also have to calculate some aggregate, you could write a UDAF Commented Mar 12, 2018 at 21:35
  • How large is boundaryValuesDS? Would it be feasible to collect it to the driver? Commented Mar 13, 2018 at 2:20
  • There are about 350,000 rows in the dataset (with 4 columns), so I think it's too big for Lists, and I've tried collecting on the driver, which generally crashes. @RobertoCongiu is it possible to partitionBy a certain value in the dataset? (i.e. latitude and longitude coordinates) Commented Mar 13, 2018 at 15:14
  • It is possible, but again, it depends on the specifics of your data. If the grid is rectangular with non-overlapping tiles and no gaps, all you need is a function f(x,y) that produces a number between 0 and the number of tiles. For instance, if your latitudes are between 20 and 40 and longitudes between 50 and 110, you could use something like floor((lat-20)*100/(40-20))*100 + floor((lon-50)*100/(110-50)). This will give you 10,000 tiles, from 0 to 9999, with the first two digits giving you the row number and the last 2 the column number. Commented Mar 13, 2018 at 21:03
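The tile formula from the last comment can be sketched as a plain function (the bounds 20–40 and 50–110 are the commenter's illustrative values, not from the question):

```scala
object TileIndexSketch {
  // Maps (lat, lon) in [20, 40) x [50, 110) onto a 100x100 grid,
  // returning a tile number 0..9999 where the first two digits are
  // the row and the last two are the column.
  def tileIndex(lat: Double, lon: Double): Int = {
    val row = ((lat - 20.0) * 100 / (40.0 - 20.0)).toInt  // 0..99
    val col = ((lon - 50.0) * 100 / (110.0 - 50.0)).toInt // 0..99
    row * 100 + col
  }

  def main(args: Array[String]): Unit = {
    println(tileIndex(20.0, 50.0))    // bottom-left corner -> tile 0
    println(tileIndex(39.99, 109.99)) // top-right corner -> tile 9999
  }
}
```

A column of such tile numbers is exactly what partitionBy (or a groupBy key) needs to split the data into squares without nesting Datasets.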
