
I have a PySpark dataframe like the one below, where c1, c2, c3, c4, c5, c6 are the columns:

+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6|
+---+---+---+---+---+---+
|  a|  x|  y|  z|  g|  h|
|  b|  m|  f|  l|  n|  o|
|  c|  x|  y|  z|  g|  h|
|  d|  m|  f|  l|  n|  o|
|  e|  x|  y|  z|  g|  i|
+---+---+---+---+---+---+

I want to extract the c1 values for rows that have the same c2, c3, c4, c5 values but different c1 values. For example, the 1st, 3rd and 5th rows have the same values for c2, c3, c4 and c5 but different c1 values, so the output should be a, c and e.
(Update) Similarly, the 2nd and 4th rows have the same values for c2, c3, c4 and c5 but different c1 values, so the output should also contain b and d.

How can I obtain such a result? I have tried applying groupBy, but I don't understand how to obtain the distinct c1 values for each group.

UPDATE:

The output should be a dataframe of c1 values:

# +-------+
# |c1_dups|
# +-------+
# |  a,c,e|
# |    b,d|
# +-------+   

My Approach:

m = data.groupBy('c2', 'c3', 'c4', 'c5')

but I don't understand how to retrieve the values from m. I'm new to PySpark dataframes, hence very confused.
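
The furthest I could guess is something like the sketch below (collect_set is only a guess on my part, and it still doesn't keep only the groups that actually repeat, nor produce the output shape above):

from pyspark.sql import functions as F

# guess: collect the c1 values for every (c2, c3, c4, c5) combination
m = data.groupBy('c2', 'c3', 'c4', 'c5') \
        .agg(F.collect_set('c1').alias('c1_values'))

m.show()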

  • It's a little bit hard to understand your issue. Can you write the output dataframe example? Commented Sep 1, 2016 at 11:11
  • Thanks :) Please check my update Commented Sep 1, 2016 at 11:29
  • Can you update with the groupBy approach that you have tried? Commented Sep 1, 2016 at 11:50
  • It's a minimal approach, but yeah, updated. Commented Sep 1, 2016 at 12:04

1 Answer


This is actually very simple; let's create some data first:

schema = ['c1','c2','c3','c4','c5','c6']

rdd = sc.parallelize(["a,x,y,z,g,h","b,x,y,z,l,h","c,x,y,z,g,h","d,x,f,y,g,i","e,x,y,z,g,i"]) \
        .map(lambda x : x.split(","))

df = sqlContext.createDataFrame(rdd,schema)
# +---+---+---+---+---+---+
# | c1| c2| c3| c4| c5| c6|
# +---+---+---+---+---+---+
# |  a|  x|  y|  z|  g|  h|
# |  b|  x|  y|  z|  l|  h|
# |  c|  x|  y|  z|  g|  h|
# |  d|  x|  f|  y|  g|  i|
# |  e|  x|  y|  z|  g|  i|
# +---+---+---+---+---+---+

Now the fun part: you just need to import some functions, then group by, aggregate, filter, and explode as follows:

from pyspark.sql.functions import *

# collect the c1 values as a list and count them in the same aggregation,
# then keep only the groups that actually contain duplicates
dupes = df.groupBy('c2', 'c3', 'c4', 'c5') \
          .agg(collect_list('c1').alias("c1s"), count('c1').alias("count")) \
          .filter(col('count') > 1)

df2 = dupes.select(explode("c1s").alias("c1_dups"))

df2.show()
# +-------+
# |c1_dups|
# +-------+
# |      a|
# |      c|
# |      e|
# +-------+
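
As a side note, if you want exactly the comma-separated rows from your update (one row per group of duplicates), one possible variation, just a sketch reusing the dupes dataframe from above, is to join the collected list with concat_ws instead of exploding it:

# one row per duplicate group, e.g. "a,c,e" for the sample data above
# (element order inside the collected list is not guaranteed)
df3 = dupes.select(concat_ws(",", "c1s").alias("c1_dups"))

df3.show()

With this, every duplicated (c2, c3, c4, c5) group produces its own row, so the second group from your example would show up as well.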

I hope this answers your question.


2 Comments

But what if I have multiple such duplicate groups? Check the question for my edit to the desired output.
I believe you should ask a follow-up question instead of editing this one; it will be a mess otherwise.
