I have a PySpark DataFrame with a Gender column whose value counts look like this:

df.groupBy('Gender').count().show()
+------+------+
|Gender| count|
+------+------+
|     F| 44015|
|  null| 42175|
|     M|104423|
|      |     1|
+------+------+

I am replacing its values with regexp_replace:

from pyspark.sql.functions import regexp_replace

#df = df.fillna({'Gender':'missing'})
df = df.withColumn('Gender', regexp_replace('Gender', 'F', 'Female'))
df = df.withColumn('Gender', regexp_replace('Gender', 'M', 'Male'))
df = df.withColumn('Gender', regexp_replace('Gender', ' ', 'missing'))

Instead of calling withColumn on df once per replacement, can this be done in a single line?

1 Answer

If you do not want to use regexp_replace three times, you can use a when/otherwise clause.

from pyspark.sql import functions as F

df.withColumn(
    "Gender",
    F.when(F.col("Gender") == 'F', F.lit("Female"))
     .when(F.col("Gender") == 'M', F.lit("Male"))
     .otherwise(F.lit("missing"))
).show()

+-------+------+
| Gender| count|
+-------+------+
| Female| 44015|
|missing| 42175|
|   Male|104423|
|missing|     1|
+-------+------+
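
Note that the count column above comes from running the expression on the grouped counts rather than on the raw DataFrame; a minimal sketch of that, using gender_counts as a hypothetical name for the grouped result:

from pyspark.sql import functions as F

# Hypothetical name for the grouped counts shown in the question.
gender_counts = df.groupBy('Gender').count()

gender_counts.withColumn(
    "Gender",
    F.when(F.col("Gender") == 'F', F.lit("Female"))
     .when(F.col("Gender") == 'M', F.lit("Male"))
     .otherwise(F.lit("missing"))  # null and the blank value both land here
).show()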

Or you could nest your three regexp_replace calls into one line like this:

from pyspark.sql.functions import regexp_replace
df.withColumn('Gender', regexp_replace(regexp_replace(regexp_replace('Gender', 'F','Female'),'M','Male'),' ','missing')).show()

+-------+------+
| Gender| count|
+-------+------+
| Female| 44015|
|   null| 42175|
|   Male|104423|
|missing|     1|
+-------+------+

I think when/otherwise should outperform the three regexp_replace calls, because regexp_replace leaves the nulls untouched (as shown above), so you would still need a fillna on top of them.
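
For completeness, a minimal sketch of the regexp_replace route with the nulls handled as well, reusing the fillna call that is commented out in the question:

from pyspark.sql.functions import regexp_replace

# regexp_replace leaves nulls as null, so the nested calls still need a
# fillna (as in the commented-out line of the question) to catch them.
df = (
    df.withColumn(
        'Gender',
        regexp_replace(
            regexp_replace(
                regexp_replace('Gender', 'F', 'Female'),
                'M', 'Male'),
            ' ', 'missing'))
    .fillna({'Gender': 'missing'})
)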
