PySpark: how to map by first item in array

Question

My initial RDD is a list of blocks, where each block is itself a list of lines. So it looks like

[infos_var1, infos_var2]

and each block is

var_name, var_value1, var_value2, var_value3

The original data looks like this:

[[u'::852-YF-007\t',
  u'2016-05-10 00:00:00\t0',
  u'2016-05-09 23:59:00\t0',
  u'2016-05-09 23:42:00\t0'],
 [u'::852-YF-008\t',
  u'2016-05-10 00:00:00\t0',
  u'2016-05-09 23:59:00\t0',
  u'2016-05-09 23:42:00\t0']]

My question is how to use a map function to extract the variable name (852-YF-007 and 852-YF-008) as the key, with the timestamp lines (here: 3 lines for each variable) as the value?

Maybe someone can give me a hint how to use map on my RDD. I was thinking of something like this:

df.map(lambda (k, v): (v[0], v[0-vEND]))

PS: The original post on how I created my initial RDD can be found here.

Comment (ccheneson, Jun 30, 2016 at 10:33): Something like this (I don't have PySpark at hand)? df.map(lambda i : (i[0], i[1:]))

ccheneson · Accepted Answer · 2016-06-30 12:11:04Z


Each element of your RDD is a list of items, not a (key, value) tuple, so it cannot be unpacked as (k, v).

Try this:

df.map(lambda i : (i[0], i[1:]))

The i[1:] part is a slice that returns every element after the first. For more detail, look up slicing here.
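To see what that lambda produces, here is a minimal sketch that runs on plain Python lists instead of an RDD, so it needs no Spark installation. The names blocks and to_pair are hypothetical, and stripping the leading '::' and the trailing tab from the variable name is an assumption based on the sample data in the question, not part of the original answer.

```python
# Sample data copied from the question: two blocks, each a list of lines.
blocks = [
    ['::852-YF-007\t',
     '2016-05-10 00:00:00\t0',
     '2016-05-09 23:59:00\t0',
     '2016-05-09 23:42:00\t0'],
    ['::852-YF-008\t',
     '2016-05-10 00:00:00\t0',
     '2016-05-09 23:59:00\t0',
     '2016-05-09 23:42:00\t0'],
]


def to_pair(block):
    """Split one block into (variable name, list of timestamp lines)."""
    # Assumption: the name has a leading '::' and a trailing tab, as in the sample.
    name = block[0].strip().lstrip(':')
    # block[1:] is every line after the first, i.e. the timestamp lines.
    return (name, block[1:])


# The built-in map applies the same per-element logic that RDD.map would.
pairs = list(map(to_pair, blocks))

print(pairs[0][0])       # 852-YF-007
print(len(pairs[0][1]))  # 3
```

On the RDD from the question, the same function could presumably be passed as df.map(to_pair). If the raw first line should stay as the key, the lambda from the answer works unchanged.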
