5

Q1. I am trying to draw a simple random sample from a Spark DataFrame (13 rows) using the sample function with withReplacement = false and fraction = 0.6, but it returns samples of different sizes every time I run it, although it works fine when I set the third parameter (seed). Why is that?

Q2. How is the sample obtained after random number generation?

Thanks in advance

2 Answers

4

How is the sample obtained after random number generation?

Depending on the fraction you want to sample, one of two different algorithms is used. See Justin Pihony's answer to "SPARK Is sample method on Dataframes uniform sampling?".

it gives me samples of different sizes every time I run it, though it works fine when I set the third parameter (seed). Why so?

If fraction is above RandomSampler.defaultMaxGapSamplingFraction, sampling is done by a simple filter:

items.filter { _ => rng.nextDouble() <= fraction }

Otherwise, simplifying things a little, it repeatedly calls the drop method with a random integer gap and then takes the next item.
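To make the "drop a random number of items, then take the next one" idea concrete, here is a minimal sketch. It is not Spark's GapSamplingIterator; the names GapSamplingSketch and gapSample, and the 1e-10 floor on the random draw, are assumptions made for illustration. The gap lengths are drawn from a geometric distribution so that each item still ends up in the sample with probability fraction.

```scala
import scala.util.Random

// Illustrative only: a simplified gap sampler, not Spark's actual implementation.
object GapSamplingSketch {
  def gapSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnq = math.log1p(-fraction) // ln(1 - fraction), always negative here

    // Skip a geometrically distributed number of items.
    def dropGap(): Unit = {
      val u = math.max(rng.nextDouble(), 1e-10) // assumed floor, avoids log(0)
      val k = (math.log(u) / lnq).toInt         // number of items to skip
      var i = 0
      while (i < k && items.hasNext) { items.next(); i += 1 }
    }

    dropGap() // initial gap before the first sampled item
    new Iterator[T] {
      def hasNext: Boolean = items.hasNext
      def next(): T = {
        val r = items.next() // take the item that follows the gap
        dropGap()            // then skip ahead to the next candidate
        r
      }
    }
  }
}
```

Because only one random number is drawn per sampled item rather than one per input item, this style of sampling is cheaper when fraction is small, which is presumably why it is reserved for fractions below the threshold.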

Keeping that in mind, it should be clear that the number of returned elements is random, with a mean (assuming nothing is wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set the seed, you get the same sequence of random numbers and, as a consequence, the same elements are included in the sample.
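The effect can be reproduced without Spark. The sketch below applies the same filter to 13 rows with fraction 0.6, so the expected size is 0.6 * 13 = 7.8, but each run keeps a different number of rows unless the generator is seeded. The names BernoulliSketch and bernoulliSample are hypothetical, and scala.util.Random stands in for whatever generator Spark uses internally.

```scala
import scala.util.Random

// Illustrative only: mimics the per-item filter, not Spark's sampler classes.
object BernoulliSketch {
  // Keep each item independently with probability `fraction`.
  def bernoulliSample[T](items: Seq[T], fraction: Double, rng: Random): Seq[T] =
    items.filter(_ => rng.nextDouble() <= fraction)

  def main(args: Array[String]): Unit = {
    val rows = 1 to 13

    // Unseeded generators: the sample size typically differs between draws.
    val sizes = Seq.fill(5)(bernoulliSample(rows, 0.6, new Random()).size)
    println(s"unseeded sizes: ${sizes.mkString(", ")}")

    // Same seed: same random sequence, hence the same rows are kept.
    val a = bernoulliSample(rows, 0.6, new Random(42L))
    val b = bernoulliSample(rows, 0.6, new Random(42L))
    println(s"seeded samples identical: ${a == b}")
  }
}
```

With this filter the sample size follows a Binomial(13, 0.6) distribution, so any size from 0 to 13 is possible, with sizes near 8 being the most likely.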


0

The RDD API includes takeSample, which returns a "sample of specified size in an array". It works by calling sample until it gets a sample at least as large as the requested size, then randomly taking the specified number of elements from it. The code comments note that it should rarely have to iterate, because the sampling fraction is biased toward large sample sizes.
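The oversample-then-trim idea can be sketched in plain Scala as follows. This is not Spark's takeSample code: the names TakeSampleSketch and takeExactly are hypothetical, and the padded fraction (1.5 * num / n + 0.1) is an arbitrary assumption standing in for the fraction Spark computes internally.

```scala
import scala.util.Random

// Illustrative only: a simplified fixed-size sampler without replacement.
object TakeSampleSketch {
  def takeExactly[T](items: Seq[T], num: Int, rng: Random): Seq[T] = {
    require(num >= 0, "num must be non-negative")
    if (num >= items.size) {
      rng.shuffle(items) // nothing to trim: return everything in random order
    } else {
      // Assumed padding so that one pass usually yields at least `num` items.
      val fraction = math.min(1.0, 1.5 * num / items.size + 0.1)
      var sample = items.filter(_ => rng.nextDouble() <= fraction)
      // Resample in the (hopefully rare) case that the draw came up short.
      while (sample.size < num) {
        sample = items.filter(_ => rng.nextDouble() <= fraction)
      }
      // Trim the oversized sample down to exactly `num` items at random.
      rng.shuffle(sample).take(num)
    }
  }
}
```

For example, TakeSampleSketch.takeExactly(1 to 13, 8, new Random(1L)) always returns exactly 8 distinct rows, which is the fixed-size behaviour the question was looking for.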
