
I have some filter functions like this:

def filter1(x):
    if broadcast_variable1.value[x] > 1:
        return False
    return True

def filter2(x):
    if broadcast_variable2.value[x] < 1:
        error_accumulator_variable.add(1)
        return False
    return True

These functions are currently included in my main PySpark script. Now I want to split them out into a module file for easier maintenance (I have two PySpark scripts for different uses, but they share the same filters) and keep the rest in separate files.

How can I share filter functions like these between different PySpark scripts?

Thanks for your generous help!

2 Answers


My main goal is to make the program more readable, so I found a simple way to achieve this.

def x1(x, broadcast_variable1):
    if broadcast_variable1.value[x] > 1:
        return False
    return True

def filter1(x):
    return x1(x, broadcast_variable1)

This can also be rewritten in a currying style (the code below is not Python):

def x1(broadcast_variable)(x) =
    if broadcast_variable.value[x] > 1:
        return False
    return True

def filter1 = x1(broadcast_variable1)
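In actual Python, the same currying idea can be written as a closure (or with functools.partial). This is only a sketch; the names make_filter and check are illustrative and not from the original answer:

def make_filter(broadcast_variable):
    # Bind the broadcast variable once; return a one-argument
    # function that RDD.filter can call on each element.
    def check(x):
        if broadcast_variable.value[x] > 1:
            return False
        return True
    return check

filter1 = make_filter(broadcast_variable1)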

Now I can write the Spark code like this:

RDD.filter(filter1).map(map1)....
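Since x1 no longer refers to a global broadcast variable, it can be moved into a shared module that both scripts import, while each script keeps only its thin wrapper. A minimal sketch, assuming a shared module named filters.py (the file, app, and data names here are illustrative, not from the original post):

# filters.py -- shared module imported by both PySpark scripts
def x1(x, broadcast_variable):
    # Generic check; the broadcast variable is an ordinary argument,
    # so this file has no dependency on any particular driver script.
    if broadcast_variable.value[x] > 1:
        return False
    return True

# driver script (one of the two)
from pyspark import SparkContext
from filters import x1

sc = SparkContext(appName="shared-filters-demo")
broadcast_variable1 = sc.broadcast({"a": 0, "b": 2})  # hypothetical lookup data

def filter1(x):
    # Thin per-script wrapper binding this script's broadcast variable
    return x1(x, broadcast_variable1)

rdd = sc.parallelize(["a", "b"])
print(rdd.filter(filter1).collect())  # ['a'] -- "b" is dropped because its value is > 1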



Pass the broadcast variable as an argument to the functions:

def filter1(x, bv):
    ...

def filter2(x, bv):
    ...

broadcast_variable1 = ...

someRDD.filter(lambda x: filter1(x, broadcast_variable1))
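Written this way, the filter functions themselves can live in a shared module, and each script binds its own broadcast variable (and accumulator) at the call site with a lambda. A rough sketch, assuming a module named shared_filters.py (the name is illustrative):

# shared_filters.py -- importable from both driver scripts
def filter1(x, bv):
    if bv.value[x] > 1:
        return False
    return True

def filter2(x, bv, acc):
    # The accumulator is also passed in, keeping the module script-agnostic
    if bv.value[x] < 1:
        acc.add(1)
        return False
    return True

# in either driver script
from shared_filters import filter1

broadcast_variable1 = sc.broadcast(...)  # per-script data
someRDD.filter(lambda x: filter1(x, broadcast_variable1))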

