
I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid and, on each square of the grid, apply a sampling function called "sample()" that takes three inputs, then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in a map-reduce fashion:

val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
    val latMin = row._1
    val latMax = latMin + 0.0001
    val lonMin = row._2
    val lonMax = lonMin + 0.0001
    val filterDF = featuresDS
      .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
      .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
    val sampleDS = filterDF.sample(false, 0.05, 1234)
    sampleDS
})


val output = sampleDataRDD.reduce(_ union _)

I've tried various ways of dealing with this, including converting sampleDS to an RDD and to a List, but I still get a NullPointerException when calling "collect" on output.

I'm thinking I need to find a different solution, but I don't see it.
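One way around the nested-Dataset restriction (a sketch, not from the original post, assuming uniform 0.0001-degree cells and the `Latitude`/`Longitude` column names above) is to never build a Dataset inside another: derive a grid-cell id with column expressions, and sample rows directly. With the same 5% fraction in every cell, a single Bernoulli `sample()` over the tagged Dataset is equivalent to sampling each cell independently:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor}

object GridSampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("grid-sample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the poster's featuresDS.
    val featuresDS = Seq(
      (40.71281, -74.00601),
      (40.71292, -74.00613),
      (40.71305, -74.00627)
    ).toDF("Latitude", "Longitude")

    // Tag each feature with the 0.0001-degree cell it falls in.
    val tagged = featuresDS
      .withColumn("cellLat", floor(col("Latitude") / 0.0001))
      .withColumn("cellLon", floor(col("Longitude") / 0.0001))

    // Each row is kept with probability 0.05 regardless of its cell,
    // which matches per-cell sampling at a uniform rate.
    val output = tagged.sample(false, 0.05, 1234)
    output.show()

    spark.stop()
  }
}
```

If the sampling fraction had to vary per cell, `tagged.stat.sampleBy` over a string cell key would do stratified sampling, though its `fractions` map must be built on the driver, so it only fits a modest number of strata.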

I've referenced these questions thus far:

Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset

Creating a Spark DataFrame from an RDD of lists

  • There are many ways to do this, but there's not enough info in the question to determine which one is the best. One way would be to use partitionBy to divide the dataset into grids, but that may not be viable, or may produce skewed partitions, depending on the data distribution. You could also use lists, but only if each grid has a relatively small number of elements. It also depends on what you're trying to do. Are you just trying to split the data into squares? If so, partitionBy is the way to go. If not, and you also have to calculate some aggregate, you could write a UDAF Commented Mar 12, 2018 at 21:35
  • How large is boundaryValuesDS? Would it be feasible to collect it to the driver? Commented Mar 13, 2018 at 2:20
  • There are about 350,000 rows in the dataset (with 4 columns), so I think it's too big for Lists, and I've tried collecting on the driver, which generally crashes. @RobertoCongiu is it possible to partitionBy a certain value in the dataset? (i.e. latitude and longitude coordinates) Commented Mar 13, 2018 at 15:14
  • It is possible, but again, it depends on the specifics of your data. If the grid is rectangular with non-overlapping tiles and no gaps, all you need is a function f(x,y) that produces a number between 0 and the number of tiles. For instance, if your latitudes are between 20 and 40 and longitudes between 50 and 110, you could use something like floor((lat-20)*100/(40-20))*100 + floor((lon-50)*100/(110-50)). This will give you 10,000 tiles, from 0 to 9999, with the first two digits giving you the row number and the last 2 the column number. Commented Mar 13, 2018 at 21:03
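The tile formula from the last comment can be sketched as a plain function (the bounds 20–40 and 50–110 are the commenter's illustrative values, not from the question):

```scala
object TileIndexSketch {
  // Maps (lat, lon) in [20, 40) x [50, 110) onto a 100x100 grid,
  // returning a tile number 0..9999 where the first two digits are
  // the row and the last two are the column.
  def tileIndex(lat: Double, lon: Double): Int = {
    val row = ((lat - 20.0) * 100 / (40.0 - 20.0)).toInt  // 0..99
    val col = ((lon - 50.0) * 100 / (110.0 - 50.0)).toInt // 0..99
    row * 100 + col
  }

  def main(args: Array[String]): Unit = {
    println(tileIndex(20.0, 50.0))    // bottom-left corner -> tile 0
    println(tileIndex(39.99, 109.99)) // top-right corner -> tile 9999
  }
}
```

A column of such tile numbers is exactly what partitionBy (or a groupBy key) needs to split the data into squares without nesting Datasets.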
