pyspark dataframe from rdd containing key and values as list of lists

Question

I have an RDD like the one below, where each key maps to a list of lists containing some parameters.

(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]])
(32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]])

I want to create a dataframe with rows and columns like this:

32719, '200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0
32719, '177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0
32897, '200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0

Alternatively, a dataframe of all the values grouped by key would also work. How can I do this?

Himaprasoon · Accepted Answer · 2017-03-31 13:12:15Z

3

Use flatMapValues:

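# Sample data: each key maps to a list that wraps the list of records.
a = [(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]]),
     (32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]]])]

# sc is the SparkContext that the pyspark shell provides.
rdd = sc.parallelize(a)

# x[0] drops the extra outer list, so flatMapValues emits one (key, record) pair per record;
# map then prepends the key to each record to build a flat row.
rdd.flatMapValues(lambda x: x[0]).map(lambda x: [x[0]] + x[1]).toDF().show()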

Output

+-------+----------------+---------------+----+----+-------+-----+----+
|  _1   |       _2       |      _3       | _4 | _5 |  _6   | _7  | _8 |
+-------+----------------+---------------+----+----+-------+-----+----+
| 32719 | 200.73.55.34   | 192.16.48.217 |  0 |  6 | 10163 | 443 |  0 |
| 32719 | 177.207.76.243 | 192.16.58.8   |  0 |  6 | 59575 |  80 |  0 |
| 32897 | 200.73.55.34   | 193.16.48.217 |  0 |  6 | 10163 | 443 |  0 |
| 32897 | 167.207.76.243 | 194.16.58.8   |  0 |  6 | 59575 |  80 |  0 |
+-------+----------------+---------------+----+----+-------+-----+----+
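
To see what the chain above does to each record without starting Spark, the two steps can be mimicked in plain Python. This is only a sketch: flat_map_values is a made-up helper name, not a PySpark function, and the data assumes the triple-nested shape used in this answer.

```python
# Plain-Python sketch of rdd.flatMapValues(f).map(g); no Spark required.
# flat_map_values is a hypothetical helper, not part of PySpark.

def flat_map_values(pairs, f):
    """Yield (key, item) for every item in f(value), keeping the key."""
    for key, value in pairs:
        for item in f(value):
            yield (key, item)


a = [
    (32719, [[['200.73.55.34', '192.16.48.217', 0, 6, 10163, 443, 0],
              ['177.207.76.243', '192.16.58.8', 0, 6, 59575, 80, 0]]]),
    (32897, [[['200.73.55.34', '193.16.48.217', 0, 6, 10163, 443, 0],
              ['167.207.76.243', '194.16.58.8', 0, 6, 59575, 80, 0]]]),
]

# x[0] strips the extra outer list, so each inner list becomes one record.
pairs = flat_map_values(a, lambda x: x[0])

# Prepend the key to each record to get one flat row per record.
rows = [[key] + values for key, values in pairs]

for row in rows:
    print(row)
```

Each printed row has eight fields, the key followed by the seven values of one record, which is the shape that toDF() turns into columns _1 through _8.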



3 Comments

user2825083 Over a year ago

Thank you. A follow-up question: if I want to create multiple dataframes from this dataframe, grouped by the values in column one, how would one do that?

Himaprasoon Over a year ago

What are you gonna do with those multiple dataframes?

user2825083 Over a year ago

Each of the dataframes would go into separate tables, depending on the key in the first column.
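
The thread does not answer this follow-up. As a rough illustration of the idea only (one group of rows per key), here is a plain-Python sketch that assumes the rows have already been flattened as in the answer above; the names rows and rows_by_key are illustrative. In Spark, one possible counterpart would be filtering the dataframe once per distinct key, which this sketch does not attempt to show.

```python
from collections import defaultdict

# Flattened rows: the key comes first, followed by the record's values.
rows = [
    [32719, '200.73.55.34', '192.16.48.217', 0, 6, 10163, 443, 0],
    [32719, '177.207.76.243', '192.16.58.8', 0, 6, 59575, 80, 0],
    [32897, '200.73.55.34', '193.16.48.217', 0, 6, 10163, 443, 0],
    [32897, '167.207.76.243', '194.16.58.8', 0, 6, 59575, 80, 0],
]

# Collect the rows that share a key; each group could feed its own table.
rows_by_key = defaultdict(list)
for row in rows:
    rows_by_key[row[0]].append(row)

for key, group in sorted(rows_by_key.items()):
    print(key, len(group))
```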

Suresh · Accepted Answer · 2017-03-31 12:23:40Z

0

You can use map to prepend the key to each value list and then create the dataframe. Here is what I tried:

>>> dat1 = [(32719, [[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]), (32897, [[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]])]

>>> rdd1 = sc.parallelize(dat1).map(lambda x: [[x[0]] + i for i in x[1]]).flatMap(lambda x: x)
>>> df = rdd1.toDF(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'])
>>> df.show()
+-----+--------------+-------------+----+----+-----+----+----+
| col1|          col2|         col3|col4|col5| col6|col7|col8|
+-----+--------------+-------------+----+----+-----+----+----+
|32719|  200.73.55.34|192.16.48.217|   0|   6|10163| 443|   0|
|32719|177.207.76.243|  192.16.58.8|   0|   6|59575|  80|   0|
|32897|  200.73.55.34|193.16.48.217|   0|   6|10163| 443|   0|
|32897|167.207.76.243|  194.16.58.8|   0|   6|59575|  80|   0|
+-----+--------------+-------------+----+----+-----+----+----+



2 Comments

Himaprasoon Over a year ago

Check your input data and the OP's data. There is a small difference in the square brackets for each item.

Suresh Over a year ago

The input in the question itself has inconsistent brackets.
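
The bracket difference discussed in these comments comes down to nesting depth: one answer's sample data wraps the records in an extra list, the other's does not. A small plain-Python sketch of handling both, assuming only those two shapes occur and that the value is never empty; unwrap_records is a hypothetical helper name.

```python
def unwrap_records(value):
    """Return the list of records, dropping one extra wrapping list if present."""
    first = value[0]
    # In the doubly nested shape, value[0] is a record whose first field is a string.
    # In the triply nested shape, value[0] is a list of records, so its first
    # element is itself a list; in that case unwrap one level.
    if first and isinstance(first[0], list):
        return first
    return value


# Triple nesting: an extra list around the records.
triple = [[['200.73.55.34', '192.16.48.217', 0, 6, 10163, 443, 0],
           ['177.207.76.243', '192.16.58.8', 0, 6, 59575, 80, 0]]]

# Double nesting: the records sit directly under the key's value.
double = triple[0]

# Both shapes reduce to the same list of records.
print(unwrap_records(triple) == unwrap_records(double))
```

With a helper like this, the same flattening step could be applied regardless of which bracket layout the real data uses.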
