I have an RDD like the one below, where each key maps to a list of lists containing some parameters.

(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]])
(32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]])

I want to create a DataFrame with rows and columns like this:

32719, '200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0
32719, '177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0
32897, '200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0

Or just a DataFrame of all the values, grouped by the key. How can I do this?

2 Answers

Use flatMapValues to emit one (key, row) pair per inner list, then prepend the key to each row:

a = [(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]]),
     (32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]]])]

# sc is an existing SparkContext, e.g. the one the pyspark shell provides
rdd = sc.parallelize(a)

# flatMapValues yields one (key, row) pair per inner list; map then prepends the key
rdd.flatMapValues(lambda x: x[0]).map(lambda x: [x[0]] + x[1]).toDF().show()

Output

+-------+----------------+---------------+----+----+-------+-----+----+
|  _1   |       _2       |      _3       | _4 | _5 |  _6   | _7  | _8 |
+-------+----------------+---------------+----+----+-------+-----+----+
| 32719 | 200.73.55.34   | 192.16.48.217 |  0 |  6 | 10163 | 443 |  0 |
| 32719 | 177.207.76.243 | 192.16.58.8   |  0 |  6 | 59575 |  80 |  0 |
| 32897 | 200.73.55.34   | 193.16.48.217 |  0 |  6 | 10163 | 443 |  0 |
| 32897 | 167.207.76.243 | 194.16.58.8   |  0 |  6 | 59575 |  80 |  0 |
+-------+----------------+---------------+----+----+-------+-----+----+
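
To make the two steps easier to follow, here is a plain-Python sketch of the same transformation that runs without Spark. The helper flat_map_values is a hypothetical stand-in written for illustration, not part of PySpark; it only mimics what RDD.flatMapValues does to each (key, value) pair.

```python
def flat_map_values(pairs, f):
    """Plain-Python stand-in for RDD.flatMapValues (hypothetical helper):
    emit one (key, item) pair for every item in f(value)."""
    return [(key, item) for key, value in pairs for item in f(value)]


# Same sample data as in the answer: each value wraps the rows in one extra list.
a = [
    (32719, [[['200.73.55.34', '192.16.48.217', 0, 6, 10163, 443, 0],
              ['177.207.76.243', '192.16.58.8', 0, 6, 59575, 80, 0]]]),
    (32897, [[['200.73.55.34', '193.16.48.217', 0, 6, 10163, 443, 0],
              ['167.207.76.243', '194.16.58.8', 0, 6, 59575, 80, 0]]]),
]

# Step 1: x[0] strips the extra wrapper list, so each inner row gets paired with its key.
pairs = flat_map_values(a, lambda x: x[0])

# Step 2: prepend the key to each row, which gives one flat list per DataFrame row.
rows = [[key] + values for key, values in pairs]

for row in rows:
    print(row)
```

Each printed list corresponds to one row of the table above, with the key in the first position.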

3 Comments

Thank you. A follow-up question: if I want to create multiple DataFrames from this one by grouping on the values in the first column, how would I do that?
What are you gonna do with those multiple dataframes?
Each of the dataframes would go into separate tables, depending on the key in the first column.
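
One possible way to approach the follow-up, sketched under assumptions: collect the distinct keys to the driver and build one filtered DataFrame per key. The function name split_by_key and the table names in the usage comment are hypothetical, and the key column is assumed to be "_1", the default name toDF() assigns to the first column. This is only reasonable when the number of distinct keys is small.

```python
def split_by_key(df, key_col="_1"):
    """Return a dict mapping each distinct key to the rows that carry it.

    df is assumed to be a PySpark DataFrame; key_col="_1" assumes the
    default column name that toDF() gives to the first column.
    """
    # Collect the distinct keys to the driver (assumes there are only a few).
    keys = [row[key_col] for row in df.select(key_col).distinct().collect()]
    # Build one lazily evaluated, filtered DataFrame per key.
    return {key: df.filter(df[key_col] == key) for key in keys}


# Usage sketch (needs a live Spark session; the table names are hypothetical):
# for key, part in split_by_key(df).items():
#     part.write.mode("overwrite").saveAsTable("flows_{}".format(key))
```

If the goal is only to separate the stored data by key, df.write.partitionBy("_1") may be a simpler option, since it writes one directory per key in a single pass.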

You can also use map to prepend the key to each inner list, flatten the result with flatMap, and then create the DataFrame. This is what I tried:

>>> dat1 = [(32719, [[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]), (32897, [[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]])]

>>> rdd1 = sc.parallelize(dat1).map(lambda x: [[x[0]] + i for i in x[1]]).flatMap(lambda x: x)
>>> df = rdd1.toDF(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'])
>>> df.show()
+-----+--------------+-------------+----+----+-----+----+----+
| col1|          col2|         col3|col4|col5| col6|col7|col8|
+-----+--------------+-------------+----+----+-----+----+----+
|32719|  200.73.55.34|192.16.48.217|   0|   6|10163| 443|   0|
|32719|177.207.76.243|  192.16.58.8|   0|   6|59575|  80|   0|
|32897|  200.73.55.34|193.16.48.217|   0|   6|10163| 443|   0|
|32897|167.207.76.243|  194.16.58.8|   0|   6|59575|  80|   0|
+-----+--------------+-------------+----+----+-----+----+----+

2 Comments

Check your input data and the OP's data. There is a small difference in the square brackets for each item.
The input in question itself has difference in brackets.
