- If I have an RDD, how can I tell whether the data is in key:value format? Is there a way to find this out, something like how type(object) tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
- Let's say I have data like (x,1),(x,2),(y,1),(y,3), and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a key have to be a single value? I noticed that if I use reduceByKey with a sum function to get ((x,3),(y,4)), it becomes much easier to define this data as key-value pairs.
1 Answer
Python is a dynamically typed language, and PySpark doesn't use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:
k, v = kv
Typically you would use a two-element tuple due to its semantics (immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:
key_value.py
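class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yield exactly two items so that `k, v = kv` works
        for x in [self.k, self.v]:
            yield x

# In the driver script or shell (assumes an existing SparkContext `sc`):
from operator import add
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]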
This makes an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
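As a rough sketch of those two alternatives, the plain-Python snippet below shows a namedtuple and a class that only defines __len__ and __getitem__, both being unpacked the same way as `k, v = kv`. The names `Pair` and `IndexedPair` are made up for illustration and are not part of PySpark; no Spark context is needed to run it.

```python
from collections import namedtuple

# Hypothetical record type for illustration; the name and fields are assumptions.
Pair = namedtuple("Pair", ["k", "v"])


class IndexedPair(object):
    """Pair-like class that relies only on __len__ and __getitem__."""

    def __init__(self, k, v):
        self._items = (k, v)

    def __len__(self):
        return 2

    def __getitem__(self, i):
        # Raises IndexError for i >= 2, which is what ends iteration
        # when Python falls back to the sequence protocol.
        return self._items[i]


# A namedtuple is a real tuple, so it unpacks and indexes like one.
kv = Pair("foo", 1)
k, v = kv

# Unpacking also works through __getitem__ alone.
pk, pv = IndexedPair("bar", 0)
```

The namedtuple keeps the usual tuple behaviour (immutability, indexing, unpacking) while giving the fields readable names, which is likely why it is the tidier option of the two.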
Also, rdd.take(n) returns a list of up to n elements, so type(rdd.take(1)) will always be list, regardless of what the elements are.
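One way to answer the "is this key-value data?" question in practice is to apply the same unpacking test to a sampled element instead of to the list that take returns. The helper below is a minimal sketch under that assumption; `looks_like_pair` is a hypothetical name, not a PySpark function.

```python
def looks_like_pair(obj):
    """Return True if obj unpacks into exactly two items, as in `k, v = obj`."""
    try:
        k, v = obj
    except (TypeError, ValueError):
        # TypeError: obj is not iterable; ValueError: wrong number of items.
        return False
    return True


simple = looks_like_pair(("x", 1))         # key with a scalar value
grouped = looks_like_pair(("x", (1, 2)))   # key with a tuple of values
scalar = looks_like_pair(5)                # not iterable
triple = looks_like_pair(("x", 1, 2))      # three items, not a pair
```

With an existing RDD this could be called as looks_like_pair(rdd.first()) or looks_like_pair(rdd.take(1)[0]). Note that the check is only structural: a two-character string such as "ab" also passes, and the check says nothing about whether the key is suitable for grouping. The second case illustrates that a value can itself be a tuple such as (1, 2); the element is still a pair.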
2 Comments
groupByKey, I will get [('bar', (0)), ('foo', (1,2))]... now can I use something like rdd.map(lambda x: (x[0], len(x[1])))? I know the same can be done using countByKey, but I want to use groupByKey.
(0) is not a valid tuple literal. It is just 0. Otherwise exactly like this.
rdd.first() 2. Please clarify. groupByKey is usually for cases where you really do eventually need the entire list.
rdd.first() return me just the first datapoint? I want to know whether the data is in a key-value format or not. 2. Yes, I have used groupByKey to get the entire data, but I want it in key-value format.