
I have a Python class with methods like below:

class Features():
    def __init__(self, json):
        self.json = json

    def email_name_match(self,name):
        #some code
        return result

The pyspark dataframe I have for now looks like below:

 +----------+-----------+
 | raw_json | firstName |
 +----------+-----------+
 |          |           |
 |          |           |
 +----------+-----------+

I am trying to use the `email_name_match` method on this PySpark dataframe to create a new column: the "raw_json" column should be used to initialize a `Features` object, and "firstName" should be passed as the `name` parameter of `email_name_match`.

I did the following:

email_name_match_udf = F.udf(lambda j: NationalRetailFeatures(json.loads(j)).email_name_match())
df = avtk_gold.withColumn('firstname_email_match', F.udf(lambda j: NationalRetailFeatures(json.loads(j)).email_name_match(col("firstName")))("raw_json"))

But it's not working. It shows this error:

AttributeError: 'NoneType' object has no attribute '_jvm'

What should I do? The ideal dataframe looks like this:

 +----------+-----------+------------------+
 | raw_json | firstName | name_email_match |
 +----------+-----------+------------------+
 |          |           |                  |
 |          |           |                  |
 +----------+-----------+------------------+
  • Please provide a minimal code snippet that reproduces the error, including the full traceback. Which line of the code gives the AttributeError? Also, your code is full of undefined variables, so it's impossible to know what you're trying to do. Commented Nov 12, 2020 at 9:05

1 Answer


Your question is missing a lot of information but I guess this might work:

import json

import pyspark.sql.functions as F

class Features():
    def __init__(self, json):
        self.json = json

    def email_name_match(self, name):
        #some code
        return result

def my_udf(j, k):
    # j is the raw JSON string, k is the name to match
    return Features(json.loads(j)).email_name_match(k)

spark_udf = F.udf(my_udf)

df = avtk_gold.withColumn('firstname_email_match',
                          spark_udf(F.col("raw_json"), F.col("firstName"))
                         )

You cannot use a Spark `Column` (or call `pyspark.sql.functions` such as `col`) inside the body of a UDF. The UDF body runs as plain Python on the executors, where there is no active SparkContext, which is exactly what produces the `'NoneType' object has no attribute '_jvm'` error. You can only pass Spark columns *into* the UDF as arguments; that's why `my_udf` takes two parameters, `j` and `k`.
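Since the UDF body is ordinary Python, you can sanity-check the matching logic locally before registering it with Spark. The sketch below is hypothetical: the question never shows the `#some code` inside `email_name_match`, so it assumes the JSON carries an `email` field and does a simple case-insensitive substring match on the local part of the address:

```python
import json

class Features():
    def __init__(self, json):
        self.json = json

    def email_name_match(self, name):
        # Hypothetical logic (the real implementation is not shown in
        # the question): check whether the given name appears in the
        # local part of the email stored in the JSON.
        email = self.json.get("email", "")
        local_part = email.split("@")[0].lower()
        return name.lower() in local_part

# Local check with a sample record -- no Spark session needed
raw = '{"email": "john.doe@example.com"}'
print(Features(json.loads(raw)).email_name_match("John"))   # True
print(Features(json.loads(raw)).email_name_match("Alice"))  # False
```

Once this behaves as expected on a few sample JSON strings, wrapping it with `F.udf` only changes how the inputs arrive, not what the function does.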


