
I have a Python class with methods like below:

class Features():
    def __init__(self, json):
        self.json = json

    def email_name_match(self,name):
        #some code
        return result

The pyspark dataframe I have for now looks like below:

 +----------+-----------+
 | raw_json | firstName |
 +----------+-----------+
 |          |           |
 |          |           |
 +----------+-----------+

I am trying to use the `email_name_match` method on this PySpark dataframe to create a new column: the "raw_json" column should be used to initialize a `Features` object, and "firstName" should be passed as the `name` parameter of `email_name_match`.

I did the following:

email_name_match_udf = F.udf(lambda j: NationalRetailFeatures(json.loads(j)).email_name_match())
df = avtk_gold.withColumn('firstname_email_match', F.udf(lambda j: NationalRetailFeatures(json.loads(j)).email_name_match(col("firstName")))("raw_json"))

But it's not working. It shows this error:

AttributeError: 'NoneType' object has no attribute '_jvm'

What should I do? The ideal dataframe looks like this:

 +----------+-----------+------------------+
 | raw_json | firstName | name_email_match |
 +----------+-----------+------------------+
 |          |           |                  |
 |          |           |                  |
 +----------+-----------+------------------+
  • Please provide a minimal code snippet that reproduces the error, including the full traceback. Which line of the code gives the AttributeError? Also, your code is full of undefined variables, so it's impossible to know what you're trying to do. Commented Nov 12, 2020 at 9:05

1 Answer


Your question is missing a lot of information but I guess this might work:

import json

import pyspark.sql.functions as F

class Features():
    def __init__(self, json):
        self.json = json

    def email_name_match(self, name):
        #some code
        return result

def my_udf(j, k):
    # j is the raw JSON string, k is the name to match
    return Features(json.loads(j)).email_name_match(k)

spark_udf = F.udf(my_udf)

df = avtk_gold.withColumn('firstname_email_match',
                          spark_udf(F.col("raw_json"), F.col("firstName"))
                         )

You cannot use a Spark `Column` (or call `pyspark.sql.functions` such as `col`) inside the body of a UDF. The UDF body runs as plain Python on the executors, where there is no active SparkContext, which is exactly what produces the `'NoneType' object has no attribute '_jvm'` error. You can only pass Spark columns *into* the UDF as arguments; that's why `my_udf` takes two parameters, `j` and `k`.
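Since the UDF body is ordinary Python, you can sanity-check the matching logic locally before registering it with Spark. The sketch below is hypothetical: the question never shows the `#some code` inside `email_name_match`, so it assumes the JSON carries an `email` field and does a simple case-insensitive substring match on the local part of the address:

```python
import json

class Features():
    def __init__(self, json):
        self.json = json

    def email_name_match(self, name):
        # Hypothetical logic (the real implementation is not shown in
        # the question): check whether the given name appears in the
        # local part of the email stored in the JSON.
        email = self.json.get("email", "")
        local_part = email.split("@")[0].lower()
        return name.lower() in local_part

# Local check with a sample record -- no Spark session needed
raw = '{"email": "john.doe@example.com"}'
print(Features(json.loads(raw)).email_name_match("John"))   # True
print(Features(json.loads(raw)).email_name_match("Alice"))  # False
```

Once this behaves as expected on a few sample JSON strings, wrapping it with `F.udf` only changes how the inputs arrive, not what the function does.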


