I have a Dataset of geospatial data that I need to sample in a grid like fashion. I want divide the experiment area into a grid, and use a sampling function called "sample()" that takes three inputs, on each square of the grid, and then merge the sampled datasets back together. My current method utilized a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
val latMin = row._1
val latMax = latMin + 0.0001
val lonMin = row._2
val lonMax = lonMin + 0.0001
val filterDF = featuresDS.filter($"Latitude" > latMin).filter($"Latitude"< latMax).filter($"Longitude">lonMin).filter($"Longitude"< lonMin)
val sampleDS = filterDF.sample(false, 0.05, 1234)
(sampleDS)
})
val output = sampleDataDS.reduce(_ union _)
I've tried various ways of dealing with this. Converting sampleDS to an RDD and to a List, but I still continue to get a NullPointerExcpetion when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
partitionByto divide the dataset in grids, but that may not be viable or skewed, it depends on the data distribution. You could also use lists, but only if each grid has a relatively small number of elements. It also depends on what you're trying to do, are you just trying to split the data into squares ? If so,partitionByis the way to go. If not, and you have also to calculate some aggregate, you could write a UDAFboundaryValuesDS? Would it be feasable to collect it to the driver?partitionBya certain value in the dataset? (i.e. latitude and longitude coordinates)f(x,y)that produces a number between 0 and the number of tiles. For instance, if your latitudes are between 20 and 40 and longitudes between 50 and 110 and you could use something like100*(lat -20)*100/(40-20) + (long-50)*100/(110-50), this will give you 10,000 tiles, from 0 to 9999, with the first two digits giving you the row number and the last 2 the column number.