My initial RDD is a list of blocks, where each block is itself a list of lines. So it looks like

[infos_var1, infos_var2]

and each block is

var_name, var_value1, var_value2, var_value3

The original data looks like this:

[[u'::852-YF-007\t',
  u'2016-05-10 00:00:00\t0',
  u'2016-05-09 23:59:00\t0',
  u'2016-05-09 23:42:00\t0'],
 [u'::852-YF-008\t',
  u'2016-05-10 00:00:00\t0',
  u'2016-05-09 23:59:00\t0',
  u'2016-05-09 23:42:00\t0']]

My question is how to use a map function to extract the variable name (852-YF-007 and 852-YF-008) as the key and the timestamp lines as the value (here: 3 lines for each variable).

Maybe someone can give me a hint on how to use map on my RDD. I was thinking of something like this:

df.map(lambda (k, v): (v[0], v[0-vEND]))

PS: The original post on how I created my initial RDD can be found here.

Comment: Something like this (I don't have PySpark at hand)? df.map(lambda i: (i[0], i[1:])) (commented Jun 30, 2016 at 10:33)

1 Answer

What you have is a list of lists of items, not a list of tuples.

Try this:

df.map(lambda i : (i[0], i[1:]))

For the i[1:] part, look up Python list slicing: i[1:] returns every element from index 1 to the end of the list.
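To make the mapping concrete, here is a minimal sketch that runs without Spark. It applies the same idea to a plain Python list that stands in for the RDD contents shown in the question. The helper name `to_key_value` is hypothetical, and stripping the leading `::` and the trailing tab from the name is an assumption about the desired key, since the lambda above returns the first line unchanged.

```python
# Local stand-in for the RDD contents from the question (no Spark needed).
blocks = [
    [u'::852-YF-007\t',
     u'2016-05-10 00:00:00\t0',
     u'2016-05-09 23:59:00\t0',
     u'2016-05-09 23:42:00\t0'],
    [u'::852-YF-008\t',
     u'2016-05-10 00:00:00\t0',
     u'2016-05-09 23:59:00\t0',
     u'2016-05-09 23:42:00\t0'],
]


def to_key_value(block):
    """Split one block into (variable name, list of timestamp lines)."""
    # Assumption: the name is wrapped in a leading '::' and a trailing tab,
    # as in the sample data; adjust if the real data differs.
    name = block[0].strip().lstrip(':')
    # block[1:] is every line after the first, i.e. the timestamp lines.
    return (name, block[1:])


# Built-in map mirrors what rdd.map(to_key_value) would apply per element.
pairs = list(map(to_key_value, blocks))

for name, lines in pairs:
    print(name, len(lines))
```

On the actual RDD, the same function could presumably be passed as `df.map(to_key_value)`. Note that the `lambda (k, v): ...` form from the question relies on tuple parameter unpacking, which exists only in Python 2.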
