Convert spark Rdd column to rows in Pyspark [duplicate]

Question

I have a spark Rdd which is of form Row(id,Words) Where words contains a list of words. I want to convert this list into a single column. Input

ID  Words
1   [w1,w2,w3]
2   [w3,w4]

I want to convert this to the output format

ID  Word
1   w1
1   w2
1   w3
2   w3
2   w4

pault · Accepted Answer · 2018-02-13 23:02:31Z

5

If you want to work rdd, you need to use flatMap():

rdd.flatMap(lambda x: [(x['ID'], w) for w in x["Words"]]).collect()
#[(1, u'w1'), (1, u'w2'), (1, u'w3'), (2, u'w3'), (2, u'w4')]

However, if you are open to using DataFrames (recommended) you can use pyspark.sql.functions.explode:

import pyspark.sql.functions as f
df = rdd.toDF()
df.select('ID', f.explode("Words").alias("Word")).show()
#+---+----+
#| ID|Word|
#+---+----+
#|  1|  w1|
#|  1|  w2|
#|  1|  w3|
#|  2|  w3|
#|  2|  w4|
#+---+----+

Or better yet, skip the rdd all together and create a DataFrame directly:

data = [
    (1, ['w1','w2','w3']),
    (2, ['w3','w4'])
]
df = sqlCtx.createDataFrame(data, ["ID", "Words"])
df.show()
#+---+------------+
#| ID|       Words|
#+---+------------+
#|  1|[w1, w2, w3]|
#|  2|    [w3, w4]|
#+---+------------+

edited Feb 13, 2018 at 23:02

answered Feb 13, 2018 at 22:44

pault

43.7k17 gold badges121 silver badges161 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

vish Over a year ago

Flat map function creates multiple lists,I don't want the output to be in list format. Can I use explode function on two columns at once.I got another column which contains the scores of words.So I have to explode both the colums together

pault Over a year ago

@vish I don't understand. flatMap() should not create multiple lists. I think the best thing for you to do is to edit your question with your actual use case (or ask a new question). It's hard to give an answer without seeing the inputs and desired output but you may be looking for rdd.flatMap(lambda x: [(x['ID'], w, s) for w, s in zip(x["Words"], x["Scores"])]).collect()

Collectives™ on Stack Overflow

Convert spark Rdd column to rows in Pyspark [duplicate]

1 Answer 1

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Linked

Related