  1. If I have an RDD, how do I tell whether the data is in key:value format? Is there a way to find out, something like how type(object) tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
  2. Let's say I have data like (x,1),(x,2),(y,1),(y,3), and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a key have to be a single value? I noticed that if I use reduceByKey with a sum function to get ((x,3),(y,4)), then it becomes much easier to treat this data as key-value pairs.
  • 1. rdd.first() 2. Please clarify. groupByKey is usually for cases where you eventually need the entire list. Commented Feb 29, 2016 at 15:39
  • 1. Wouldn't rdd.first() return just the first data point? I want to know whether the data is in a key-value format or not. 2. Yes, I have used groupByKey to get the entire data, but I want it in key-value format Commented Feb 29, 2016 at 15:56
  • You want it as a map? What about collectAsMap? If you take the first element you will get a tuple; what do you mean by key-value format? What kind of type do you expect? Commented Mar 1, 2016 at 9:53
  • I couldn't find a good, simple source on collectAsMap. Please share if you have anything. Would it be possible to provide a simple example? Commented Mar 1, 2016 at 13:53
  • Just try it yourself. The output would roughly be {"a" : [1, 2, 3], "b" : [4], ...} Commented Mar 1, 2016 at 14:48
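The shape of output described in that comment can be reproduced locally; this is a plain-Python stand-in for what groupByKey().collectAsMap() would return (no Spark needed; the sample data is illustrative):

```python
pairs = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]  # sample (key, value) records

# Group values by key, mimicking rdd.groupByKey().collectAsMap()
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)

print(grouped)  # {'a': [1, 2, 3], 'b': [4]}
```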

1 Answer


Python is a dynamically typed language, and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:

k, v = kv
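For example, a plain two-element tuple satisfies this rule, while a three-element one does not (a local sketch of the unpacking requirement, no Spark involved):

```python
kv = ("foo", 1)  # a typical key-value record
k, v = kv        # unpacks cleanly into a key and a value

try:
    k, v = ("foo", 1, 2)  # three elements cannot unpack into two names
except ValueError as e:
    print(e)              # too many values to unpack
```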

Typically you would use a two-element tuple due to its semantics (an immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:

key_value.py

class KeyValue(object):
    """A minimal class that unpacks like a (key, value) pair."""
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yielding exactly two items makes `k, v = obj` work
        for x in [self.k, self.v]:
            yield x

Then, in the driver (the class lives in a separate module so the workers can unpickle it):

from operator import add
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]

and make an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
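A quick sketch of the namedtuple approach (the class name KV here is arbitrary): since a namedtuple is a tuple, it unpacks into a key and a value automatically, while still giving you named field access:

```python
from collections import namedtuple

# KV behaves exactly like a two-element tuple, so PairRDD operations accept it
KV = namedtuple("KV", ["k", "v"])

record = KV("foo", 1)
k, v = record              # unpacks like any pair
print(k, v)                # foo 1
print(record.k, record.v)  # named access is still available
```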

Also, rdd.take(n) returns a list of length at most n, so type(rdd.take(1)) will always be list, regardless of what the RDD contains.


2 Comments

I am learning from you. But I am still confused about something. Let's say for whatever reason I used groupByKey and I get [('bar', (0)), ('foo', (1,2))]... now can I use something like rdd.map(lambda x: (x[0], len(x[1])))? I know the same can be done using countByKey, but I want to use groupByKey
(0) is not a valid tuple literal; it is just 0. Otherwise, exactly like that.
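Without a cluster, the groupByKey-then-count pattern from that comment can be sketched in plain Python: group first, then map len over each group's values, which is what mapValues(len) would do in Spark:

```python
from collections import defaultdict

pairs = [("bar", 0), ("foo", 1), ("foo", 2)]  # sample (key, value) records

# Step 1: collect values per key, as groupByKey would
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# Step 2: count values per key, as mapValues(len) would
counts = {k: len(vs) for k, vs in grouped.items()}
print(counts)  # {'bar': 1, 'foo': 2}
```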
