30

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

Not quite sure why as I seem to be following the syntax in the latest documentation.

#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()

3 Answers 3

50

It is not an import problem. You simply call .dropDuplicates() on a wrong object. While class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide dropDuplicates method. What you want is something like this:

 (df1 = sqlContext
     .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
     .dropDuplicates())

 df1.collect()
Sign up to request clarification or add additional context in comments.

Comments

25

if you have a data frame and want to remove all duplicates -- with reference to duplicates in a specific column (called 'colName'):

count before dedupe:

df.count()

do the de-dupe (convert the column you are de-duping to string type):

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

df.drop_duplicates(subset=['colName']).count()

can use a sorted groupby to check to see that duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)

Comments

3

In summary, distinct() and dropDuplicates() methods remove duplicates with one difference, which is essential.

dropDuplicates() is more suitable by considering only a subset of the columns

data = [("James","","Smith","36636","M",60000),
        ("James","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Maria","Mary","Brown","","F",0)]

columns = ["first_name","middle_name","last_name","dob","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

df.groupBy('first_name').agg(count(
  'first_name').alias("count_duplicates")).filter(
  col('count_duplicates') >= 2).show()

df.dropDuplicates(['first_name']).show()

# output

+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob  |gender|salary|
+----------+-----------+---------+-----+------+------+
|James     |           |Smith    |36636|M     |60000 |
|James     |Rose       |         |40288|M     |70000 |
|Robert    |           |Williams |42114|      |400000|
|Maria     |Anne       |Jones    |39192|F     |500000|
|Maria     |Mary       |Brown    |     |F     |0     |
+----------+-----------+---------+-----+------+------+

+----------+----------------+
|first_name|count_duplicates|
+----------+----------------+
|     James|               2|
|     Maria|               2|
+----------+----------------+

+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|  dob|gender|salary|
+----------+-----------+---------+-----+------+------+
|     James|           |    Smith|36636|     M| 60000|
|     Maria|       Anne|    Jones|39192|     F|500000|
|    Robert|           | Williams|42114|      |400000|
+----------+-----------+---------+-----+------+------+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.