I would like to use our spark cluster to run programs in parallel. My idea is to do sth like the following:
def simulate():
#some magic happening in here
return 0
spark = (
SparkSession.builder
.appName('my_simulation')
.enableHiveSupport()
.getOrCreate())
sc = spark.sparkContext
no_parallel_instances = sc.parallelize(xrange(500))
res = no_parallel_instances.map(lambda row: simulate())
print res.collect()
The question i have is whether there's a way to execute simulate() with different parameters. The only way i currently can imagine is to have a dataframe specifying the parameters, so something like this:
parameter_list = [[5,2.3,3], [3,0.2,4]]
no_parallel_instances = sc.parallelize(parameter_list)
res = no_parallel_instances.map(lambda row: simulate(row))
print res.collect()
Is there another, more elegant way to run parallel functions with spark?