I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:

DataFrame df = sqlContext.sql("SELECT * " +
        "FROM temptable " +
        "WHERE conditions");
// Keep only the non-null, distinct values of col1, sorted
DataFrame df1 = df.select(df.col("col1"))
        .where(df.col("col1").isNotNull())
        .distinct()
        .orderBy(df.col("col1"));
df1.show();
System.out.println(df1.count());

Up until now, everything is OK. I get the output:

+-----------+
|col1       |
+-----------+
|      10016|
|      10022|
|     100281|
|      10032|
|     100427|
|     100445|
|      10049|
|      10070|
|      10076|
|      10079|
|      10081|
|      10082|
|     100884|
|      10092|
|      10099|
|      10102|
|      10103|
|     101039|
|     101134|
|     101187|
+-----------+
only showing top 20 rows

10512

That is 10512 records with no duplicates. But then:

// sample(withReplacement = true, fraction = 0.5), then keep at most 200 rows
df1 = df1.sample(true, 0.5).limit(200);
df1.show();
System.out.println(df1.count());

This returns 200 rows full of duplicates:

+-----------+
|col1       |
+-----------+
|      10022|
|     100445|
|     100445|
|      10049|
|      10079|
|      10079|
|      10081|
|      10081|
|      10082|
|      10092|
|      10102|
|      10102|
|     101039|
|     101134|
|     101134|
|     101134|
|     101345|
|     101345|
|      10140|
|      10141|
+-----------+
only showing top 20 rows

200

Can anyone tell me why? This is driving me crazy. Thank you!

1 Answer

You explicitly ask for a sample with replacement so there is nothing unexpected about getting duplicates:

public Dataset<T> sample(boolean withReplacement, double fraction)
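The first argument is the withReplacement flag, so passing false instead, as in sample(false, 0.5).limit(200), should keep each row at most once. To see why the flag matters, here is a minimal sketch in plain Java, with no Spark dependency. It is a simplified model of the two sampling modes, not Spark's actual implementation, and the class and method names (SamplingDemo, sampleWithReplacement, and so on) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Hypothetical demo class: a simplified model of sampling with and
// without replacement, not Spark's implementation.
class SamplingDemo {

    // Without replacement: one coin flip per row, so a row is kept at most once.
    static List<Integer> sampleWithoutReplacement(List<Integer> rows, double fraction, Random rng) {
        List<Integer> out = new ArrayList<>();
        for (Integer row : rows) {
            if (rng.nextDouble() < fraction) {
                out.add(row);
            }
        }
        return out;
    }

    // With replacement: every draw picks from the full input again,
    // so the same row can be picked several times.
    static List<Integer> sampleWithReplacement(List<Integer> rows, double fraction, Random rng) {
        int draws = (int) Math.round(rows.size() * fraction);
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < draws; i++) {
            out.add(rows.get(rng.nextInt(rows.size())));
        }
        return out;
    }

    // True if any value occurs more than once.
    static boolean hasDuplicates(List<Integer> rows) {
        return new HashSet<>(rows).size() != rows.size();
    }

    public static void main(String[] args) {
        // Distinct input, like the col1 values after distinct()
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        Random rng = new Random(42);

        List<Integer> without = sampleWithoutReplacement(rows, 0.5, rng);
        List<Integer> with = sampleWithReplacement(rows, 0.5, rng);

        // Always false for distinct input
        System.out.println("without replacement, duplicates: " + hasDuplicates(without));
        System.out.println("with replacement, duplicates: " + hasDuplicates(with));
    }
}
```

If the data could already contain duplicates before sampling, applying distinct() after the sample would also remove them, at the cost of a smaller result.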