
I have a PySpark DataFrame with FName (first name) and Middlename columns. The Middlename column contains null values.

customer_df=

FName Middlename 
Avi   null
Chec  Bor-iin
Meg   null
Zen   Cha-gn

I have written a UDF to strip hyphens:

from pyspark.sql.functions import col, udf, upper, lit, when
replacehyphens = udf(lambda string_val: string_val.replace('-',''))
customer_df=customer_df.withColumn('Middlename',
when('Middlename'.isNull,lit('')).otherwise
(replacehyphens(col('Middlename'))))

I am getting AttributeError: 'str' object has no attribute 'isNull'.

What am I missing here?

1 Answer

By writing 'Middlename'.isNull, you are calling isNull on a Python string rather than on a Column object. You need col('Middlename').isNull() or df.Middlename.isNull(). Alternatively, you can use regexp_replace instead of creating a udf:

from pyspark.sql.functions import regexp_replace
df.withColumn('Middlename', regexp_replace(df.Middlename, '-', '')).show()
+-----+----------+
|FName|Middlename|
+-----+----------+
|  Avi|      null|
| Chec|    Boriin|
|  Meg|      null|
|  Zen|     Chagn|
+-----+----------+

To replace null with empty string, use na.fill(''):

df.withColumn('Middlename', regexp_replace(df.Middlename, '-', '')).na.fill('', 'Middlename').show()
+-----+----------+
|FName|Middlename|
+-----+----------+
|  Avi|          |
| Chec|    Boriin|
|  Meg|          |
|  Zen|     Chagn|
+-----+----------+
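
If more than one character has to be stripped, the same regexp_replace plus na.fill approach can take a regex character class. The following is only a sketch: `char_class_pattern` and `strip_with_regexp` are hypothetical helper names, and it assumes that Python's `re.escape` output is also valid in the Java regex syntax Spark uses, which should hold for ordinary punctuation.

```python
import re


def char_class_pattern(chars):
    """Build a regex character class that matches any one character in `chars`."""
    if not chars:
        raise ValueError("chars must contain at least one character")
    # Escape each character so that '-', '^', ']' etc. are taken literally
    return '[' + ''.join(re.escape(c) for c in chars) + ']'


def strip_with_regexp(df, column, chars):
    """Remove `chars` from a string column and turn nulls into ''.

    `df` is assumed to be a PySpark DataFrame with a string column `column`.
    """
    # Imported here so char_class_pattern can be used without pyspark installed
    from pyspark.sql.functions import col, regexp_replace

    cleaned = df.withColumn(
        column, regexp_replace(col(column), char_class_pattern(chars), '')
    )
    # regexp_replace leaves nulls as null, so fill them afterwards
    return cleaned.na.fill('', subset=[column])
```

For example, `strip_with_regexp(df, 'Middlename', "-.'")` would drop hyphens, dots and apostrophes in one pass.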

If you have to use a udf, make sure you do the null check inside the udf to avoid the NoneType error:

replacehyphens = udf(lambda s: s.replace('-', '') if s else '')
df.withColumn('Middlename', replacehyphens('Middlename')).show()
+-----+----------+
|FName|Middlename|
+-----+----------+
|  Avi|          |
| Chec|    Boriin|
|  Meg|          |
|  Zen|     Chagn|
+-----+----------+
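
The same null check can be written as a named function that accepts a configurable set of characters, which may be easier to extend and to test than a lambda. This is a sketch only: `strip_chars` and `make_strip_udf` are hypothetical names, not part of PySpark.

```python
def strip_chars(value, chars='-'):
    """Remove every character in `chars` from `value`; map None to ''."""
    if value is None:
        return ''
    # The third argument of str.maketrans lists characters to delete
    return value.translate(str.maketrans('', '', chars))


def make_strip_udf(chars='-'):
    """Wrap strip_chars in a Spark udf that returns a string column."""
    # Imported here so strip_chars can be tested without pyspark installed
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(lambda value: strip_chars(value, chars), StringType())
```

It would be used like the lambda version, for example `df.withColumn('Middlename', make_strip_udf("-'")('Middlename'))`.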

3 Comments

Hi Psidom, when I use col, as in customer_df = customer_df.withColumn('MiddleName',when(col('MiddleName').isNull,lit('')).otherwise(stripsc(col('MiddleName')))), I get TypeError: condition should be a Column. The reason I am trying to create a udf is that I will be adding more special characters to strip, depending on the incoming dataframe.
Did you miss the parentheses after isNull? Note that it's isNull().
Ahh, got it! Thank you!
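
Putting the two fixes from this thread together (use a Column rather than a string, and call isNull() with parentheses), the when/otherwise expression could look like the sketch below. `blank_null_column` is a hypothetical wrapper name, and `clean_udf` stands for a udf such as `replacehyphens` or `stripsc`.

```python
def blank_null_column(df, clean_udf, column='Middlename'):
    """Return df with nulls in `column` replaced by '' and other values cleaned.

    `df` is assumed to be a PySpark DataFrame with a string column `column`;
    `clean_udf` is assumed to be a udf that takes one string column.
    """
    # Imported here so that defining the helper does not itself require pyspark
    from pyspark.sql.functions import col, lit, when

    target = col(column)  # a Column object, not a plain str
    return df.withColumn(
        column,
        # isNull() needs the parentheses: without them `when` receives a
        # bound method instead of a Column condition
        when(target.isNull(), lit('')).otherwise(clean_udf(target)),
    )
```

Usage would be `customer_df = blank_null_column(customer_df, stripsc, 'MiddleName')`. Spark does not guarantee that a Python udf inside when/otherwise is skipped for the null rows, so `clean_udf` should still handle None itself.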
