  1. If I have an RDD, how do I tell whether the data is in key:value format? Is there a way to find out, something like how type(object) tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
  2. Let's say I have data like (x,1),(x,2),(y,1),(y,3), and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a key have to be a single value? I noticed that if I use reduceByKey with a sum function to get ((x,3),(y,4)), then it becomes much easier to treat this data as key-value pairs.
  • 1. rdd.first() 2. Please clarify. groupByKey is usually for cases where you eventually need the entire list. Commented Feb 29, 2016 at 15:39
  • 1. Wouldn't rdd.first() return just the first data point? I want to know whether the data is in a key-value format or not. 2. Yes, I have used groupByKey to get the entire data, but I want it in key-value format Commented Feb 29, 2016 at 15:56
  • You want it as a map? What about collectAsMap? If you take the first element you will get a tuple; what do you mean by key-value format? What kind of type do you expect? Commented Mar 1, 2016 at 9:53
  • I couldn't find a good, simple source on collectAsMap. Please share if you have anything. Would it be possible to provide a simple example? Commented Mar 1, 2016 at 13:53
  • Just try it yourself. The output would roughly be {"a" : [1, 2, 3], "b" : [4], ...} Commented Mar 1, 2016 at 14:48
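The shape of output described in that comment can be reproduced locally; this is a plain-Python stand-in for what groupByKey().collectAsMap() would return (no Spark needed; the sample data is illustrative):

```python
pairs = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]  # sample (key, value) records

# Group values by key, mimicking rdd.groupByKey().collectAsMap()
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

print(grouped)  # {'a': [1, 2, 3], 'b': [4]}
```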

1 Answer


Python is a dynamically typed language, and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:

k, v = kv
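For example, a plain two-element tuple satisfies this rule, while a three-element one does not (a local sketch of the unpacking requirement, no Spark involved):

```python
kv = ("foo", 1)  # a typical key-value record
k, v = kv        # unpacks cleanly into a key and a value

try:
    k, v = ("foo", 1, 2)  # three elements cannot unpack into two names
except ValueError as e:
    print(e)              # too many values to unpack
```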

Typically you would use a two-element tuple due to its semantics (an immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:

key_value.py

class KeyValue(object):
    """A minimal class that unpacks like a (key, value) pair."""
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yielding exactly two items makes `k, v = obj` work
        for x in [self.k, self.v]:
            yield x

Then, in the driver (the class lives in a separate module so the workers can unpickle it):

from operator import add
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]

and make an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
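A quick sketch of the namedtuple approach (the class name KV here is arbitrary): since a namedtuple is a tuple, it unpacks into a key and a value automatically, while still giving you named field access:

```python
from collections import namedtuple

# KV behaves exactly like a two-element tuple, so PairRDD operations accept it
KV = namedtuple("KV", ["k", "v"])

record = KV("foo", 1)
k, v = record              # unpacks like any pair
print(k, v)                # foo 1
print(record.k, record.v)  # named access is still available
```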

Also, rdd.take(n) returns a list of length at most n, so type(rdd.take(1)) will always be list, regardless of what the RDD contains.


2 Comments

I am learning from you. But I am still confused about something. Let's say for whatever reason I used groupByKey and I get [('bar', (0)), ('foo', (1,2))]... now can I use something like rdd.map(lambda x: (x[0], len(x[1])))? I know the same can be done using countByKey, but I want to use groupByKey
(0) is not a valid tuple literal; it is just 0. Otherwise, exactly like that.
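Without a cluster, the groupByKey-then-count pattern from that comment can be sketched in plain Python: group first, then map len over each group's values, which is what mapValues(len) would do in Spark:

```python
from collections import defaultdict

pairs = [("bar", 0), ("foo", 1), ("foo", 2)]  # sample (key, value) records

# Step 1: collect values per key, as groupByKey would
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# Step 2: count values per key, as mapValues(len) would
counts = {k: len(vs) for k, vs in grouped.items()}
print(counts)  # {'bar': 1, 'foo': 2}
```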
