Spark - sample() function duplicating data?

Question

I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:

DataFrame df= sqlContext.sql("SELECT * " +
        "                     FROM temptable" +
        "                     WHERE conditions");
DataFrame df1 = df.select(df.col("col1"))
        .where(df.col("col1").isNotNull())
        .distinct()
        .orderBy(df.col("col1"));
df1.show();
System.out.println(df1.count());

Up until now, everything is OK. I get the output:

+-----------+
|col1       |
+-----------+
|      10016|
|      10022|
|     100281|
|      10032|
|     100427|
|     100445|
|      10049|
|      10070|
|      10076|
|      10079|
|      10081|
|      10082|
|     100884|
|      10092|
|      10099|
|      10102|
|      10103|
|     101039|
|     101134|
|     101187|
+-----------+
only showing top 20 rows

10512

That is 10512 records with no duplicates. But then I sample and limit:

df1 = df1.sample(true, 0.5).limit(200);
df1.show();
System.out.println(df1.count());

This returns 200 rows full of duplicates:

+-----------+
|col1       |
+-----------+
|      10022|
|     100445|
|     100445|
|      10049|
|      10079|
|      10079|
|      10081|
|      10081|
|      10082|
|      10092|
|      10102|
|      10102|
|     101039|
|     101134|
|     101134|
|     101134|
|     101345|
|     101345|
|      10140|
|      10141|
+-----------+
only showing top 20 rows

200

Can anyone tell me why? This is driving me crazy. Thank you!

2 revs · Accepted Answer · 2016-09-09 15:30:32Z

2

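You explicitly asked for a sample with replacement: the first argument of sample() is withReplacement, and you pass true. With replacement the same row can be drawn more than once, so getting duplicates is expected. Pass false instead, as in sample(false, 0.5), if each row should appear at most once: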

public Dataset<T> sample(boolean withReplacement, double fraction)
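To see why the flag matters, here is a minimal plain-Java sketch of the two sampling modes on a list of distinct values. It is a simplified model for illustration, not Spark's actual sampler: the class SamplingSketch and its methods are hypothetical names, and the with-replacement variant simply makes round(n * fraction) independent draws, whereas in Spark the fraction only determines the expected sample size.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of Spark.
class SamplingSketch {

    // Without replacement: each element is kept at most once,
    // independently with probability `fraction`.
    static List<Integer> sampleWithoutReplacement(List<Integer> data, double fraction, Random rng) {
        List<Integer> out = new ArrayList<>();
        for (Integer value : data) {
            if (rng.nextDouble() < fraction) {
                out.add(value);
            }
        }
        return out;
    }

    // With replacement (simplified model): every draw picks from the full
    // input again, so the same element can be picked several times.
    static List<Integer> sampleWithReplacement(List<Integer> data, double fraction, Random rng) {
        int draws = (int) Math.round(data.size() * fraction);
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < draws; i++) {
            out.add(data.get(rng.nextInt(data.size())));
        }
        return out;
    }

    // Number of entries that repeat a value already present in the sample.
    static int countDuplicates(List<Integer> sample) {
        return sample.size() - new HashSet<>(sample).size();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10512; i++) {
            data.add(i); // distinct values, like col1 after distinct()
        }
        Random rng = new Random(42);
        System.out.println("duplicates without replacement: "
                + countDuplicates(sampleWithoutReplacement(data, 0.5, rng)));
        System.out.println("duplicates with replacement: "
                + countDuplicates(sampleWithReplacement(data, 0.5, rng)));
    }
}
```

On distinct input the first method can never produce a duplicate, while the second almost always will once the number of draws is a sizeable share of the input, which matches the repeated col1 values in the question.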
