I am learning Apache Spark. After carefully reading the Spark tutorials, I understand how to pass a plain Python function to Spark to process an RDD. But I still have no idea how Spark works with methods defined inside a class. For example, here is my code:
import numpy as np
import copy
from pyspark import SparkConf, SparkContext

class A():
    def __init__(self, n):
        self.num = n

class B(A):
    ### Copy the item of class A into B.
    def __init__(self, A):
        self.num = copy.deepcopy(A.num)

    ### Print out the item of B.
    def display(self, s):
        print s.num
        return s

def main():
    ### Run the application "test" locally on Spark.
    conf = SparkConf().setAppName("test").setMaster("local[2]")
    ### Create the Spark context from the configuration.
    sc = SparkContext(conf = conf)
    ### "data" is a list of instances of class A.
    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)
    ### "lines" distributes "data" across Spark.
    lines = sc.parallelize(data)
    ### Create a list of instances of class B in parallel using
    ### Spark's "map".
    temp = lines.map(B)
    ### I get the error when the following line runs:
    ### NameError: global name 'display' is not defined.
    temp1 = temp.map(display)

if __name__ == "__main__":
    main()
In fact, I used the above code to generate a list of instances of class B in parallel with temp = lines.map(B). After that, I called temp1 = temp.map(display), because I wanted to print each item in that list of B instances in parallel. But now the error shows up: NameError: global name 'display' is not defined. I am wondering how I can fix this error while still using Spark's parallel computing. I would really appreciate any help.
Comments on the question:

— 1. display is a method, so what you want is lambda x: x.display(). 2. I've already mentioned that, if you're interested in side effects, it is idiomatic to use foreach. 3. Another thing I've already mentioned is that printing won't work as you expect.

— Thanks! I changed it to temp1 = temp.foreach(lambda x: x.display()), but a new error shows up: AttributeError: 'module' object has no attribute 'A'. How can I fix this problem as well? I really appreciate your help!

— print won't work for more or less the same reason you see the error above; I outlined it in the answer to your previous question. Everything involving operations on RDDs happens on the worker nodes. That means the output from print goes there and not to the driver. If you want to use a class, it has to be shipped to the workers as well. One way to do that is to create a module. See here for details.

— I changed sc = SparkContext(conf = conf) to sc = SparkContext(conf = conf, ['/mydir/test.py']), and I also changed temp1 = temp.map(display) to temp1 = temp.map(lambda x: x.display()).reduce(lambda x: x), but I still get the error: 'module' object has no attribute 'A'. Could you please help me find out the reason?
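Putting those comments together, here is a minimal sketch of the module-based approach. The module name my_classes.py and the path /mydir/my_classes.py are placeholders; the two key points are that the class definitions live in their own module, which is shipped to the workers via the pyFiles argument of SparkContext, and that foreach (an action run for its side effects) replaces map(display):

# my_classes.py -- a separate module holding the class definitions
import copy

class A(object):
    def __init__(self, n):
        self.num = n

class B(A):
    def __init__(self, a):
        self.num = copy.deepcopy(a.num)

    def display(self):
        print self.num   # executes on a worker, so output lands in that worker's log

# main.py -- the driver script
from pyspark import SparkConf, SparkContext
from my_classes import A, B

conf = SparkConf().setAppName("test").setMaster("local[2]")
# pyFiles ships the module to every worker so they can unpickle A and B
sc = SparkContext(conf=conf, pyFiles=['/mydir/my_classes.py'])

data = [A(i) for i in range(5)]
lines = sc.parallelize(data)
temp = lines.map(B)
# foreach returns None; it is used purely for the printing side effect
temp.foreach(lambda x: x.display())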
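And since print inside an RDD operation happens on the workers, the usual way to actually see the values on the driver is to bring them back with collect:

result = temp.map(lambda x: x.num).collect()   # a plain Python list on the driver
print result

collect() is fine for a five-element toy dataset like this one; for large data you would use something like take(5) instead, so you don't pull the whole RDD onto the driver.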