I have a DataFrame:

ID      |  program  |  
--------|-----------|
53-8975 |  null     |
53-9875 |  null     |
53A7569 |           | 
53-9456 |  XXXX     |
53-9875 |           |
---------------------

The ID and program columns are both strings. I want to fill the program column with the letter K wherever it is null or "" and the 4th character of the ID is 9. For example:

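Two rows have 9 as the 4th character of the ID and a null or empty program: both have the ID 53-9875, and their program values are null and "" respectively. (53-9456 also has a 9 in that position, but its program is already XXXX, so it should stay unchanged.)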

How can I fill the program column with the letter K when the 4th character of the ID is 9 and the program is null or "", using PySpark?

The resulting DataFrame should be:

ID      |  program  |  
--------|-----------|
53-8975 |  null     |
53-9875 |  K        |
53A7569 |           | 
53-9456 |  XXXX     |
53-9875 |   K       |
---------------------

1 Answer

Starting from your original DataFrame:

df = spark.createDataFrame([("53-8975", None), ("53-9875", None), ("53A7569", ""), ("53-9456", "XXXX"), ("53-9875", "")], ["id", "program"])
df.show()
+-------+-------+
|     id|program|
+-------+-------+
|53-8975|   null|
|53-9875|   null|
|53A7569|       |
|53-9456|   XXXX|
|53-9875|       |
+-------+-------+

We can build a column expression that either keeps program or replaces it with "K", according to your specification, using when().otherwise():

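from pyspark.sql.functions import col, isnull, lit, when

# True when program is an empty string or null
programNullOrEmpty = (col("program") == "") | (isnull(col("program")))
# substr is 1-based: take 1 character starting at position 4
id9 = col("id").substr(4,1) == "9"

# Replace program with "K" only where both conditions hold; otherwise keep it
df.withColumn("program", when(programNullOrEmpty & id9, lit("K"))
                         .otherwise(col("program")))\
    .show()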

+-------+-------+
|     id|program|
+-------+-------+
|53-8975|   null|
|53-9875|      K|
|53A7569|       |
|53-9456|   XXXX|
|53-9875|      K|
+-------+-------+
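
As a sanity check on the rule itself, independent of Spark, here is a minimal plain-Python sketch of the same logic applied to the sample rows. The helper name `fill_program` is hypothetical and only for illustration. Note that Spark's `substr(4, 1)` counts positions from 1, so the same character sits at index 3 in Python.

```python
from typing import Optional

def fill_program(id_value: str, program: Optional[str]) -> Optional[str]:
    """Return "K" when program is null/empty and the 4th character of the ID is "9"."""
    program_null_or_empty = program is None or program == ""
    # Spark's substr(4, 1) is 1-based; the same character is index 3 in Python.
    id9 = len(id_value) >= 4 and id_value[3] == "9"
    return "K" if program_null_or_empty and id9 else program

# Same rows as the sample DataFrame in the question.
rows = [
    ("53-8975", None),
    ("53-9875", None),
    ("53A7569", ""),
    ("53-9456", "XXXX"),
    ("53-9875", ""),
]
result = [(i, fill_program(i, p)) for i, p in rows]
for row in result:
    print(row)
```

Only the two 53-9875 rows should change; 53-9456 keeps XXXX because its program is already set.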

3 Comments

Thank you for your answer. I adapted your solution like this: output = ( df.select( F.col('program'), F.col('ID') .withColumn("program", F.when((F.col("program") == "") | (isnull(F.col("program"))) & (F.col("ID").substr(4,1) == "9"), lit("K")).otherwise(F.col("program"))) ) ) but I got this error: TypeError: 'Column' object is not callable. Could you help, please?
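There were issues with the parentheses in your modification: .withColumn ended up chained onto F.col('ID') inside select() instead of onto the DataFrame, and the two program checks joined by | were not grouped before the &. It should work like this: output = ( df.select( F.col('program'), F.col('ID') ).withColumn("program", F.when(((F.col("program") == "") | (F.isnull(F.col("program")))) & (F.col("ID").substr(4,1) == "9"), F.lit("K")).otherwise(F.col("program"))) )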
If you found the answer useful please accept the answer and optionally upvote :)
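
On the grouping of conditions discussed in the comments above: in Python, `&` binds more tightly than `|`, so `a | b & c` is parsed as `a | (b & c)`. PySpark column expressions use the same operators, so the same precedence applies to them. A small sketch with plain booleans standing in for the column conditions (the variable names are illustrative only):

```python
# Plain booleans standing in for the three column conditions.
program_is_empty = True   # program == ""
program_is_null = False   # program is null
id_has_9 = False          # 4th character of the ID is not "9"

# Without extra parentheses this parses as: program_is_empty | (program_is_null & id_has_9)
without_parens = program_is_empty | program_is_null & id_has_9

# Intended logic: (empty OR null) AND id_has_9
with_parens = (program_is_empty | program_is_null) & id_has_9

print(without_parens, with_parens)
```

With the ungrouped form, a row with an empty program would match even when the ID has no 9 in the 4th position, which is why the corrected expression wraps the two program checks in their own parentheses.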
