I have a PySpark DataFrame with a Gender column whose value counts look like this:

df.groupBy('Gender').count().show()
+------+------+
|Gender| count|
+------+------+
|     F| 44015|
|  null| 42175|
|     M|104423|
|      |     1|
+------+------+

I am replacing its values with regexp_replace:

from pyspark.sql.functions import regexp_replace

#df = df.fillna({'Gender':'missing'})
df = df.withColumn('Gender', regexp_replace('Gender', 'F', 'Female'))
df = df.withColumn('Gender', regexp_replace('Gender', 'M', 'Male'))
df = df.withColumn('Gender', regexp_replace('Gender', ' ', 'missing'))

Instead of calling withColumn on df once per replacement, can this be done in a single line?

1 Answer

If you do not want to use regexp_replace three times, you can use a when/otherwise clause.

from pyspark.sql import functions as F

df.withColumn(
    "Gender",
    F.when(F.col("Gender") == 'F', F.lit("Female"))
     .when(F.col("Gender") == 'M', F.lit("Male"))
     .otherwise(F.lit("missing"))
).show()

+-------+------+
| Gender| count|
+-------+------+
| Female| 44015|
|missing| 42175|
|   Male|104423|
|missing|     1|
+-------+------+
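
Note that the count column above comes from running the expression on the grouped counts rather than on the raw DataFrame; a minimal sketch of that, using gender_counts as a hypothetical name for the grouped result:

from pyspark.sql import functions as F

# Hypothetical name for the grouped counts shown in the question.
gender_counts = df.groupBy('Gender').count()

gender_counts.withColumn(
    "Gender",
    F.when(F.col("Gender") == 'F', F.lit("Female"))
     .when(F.col("Gender") == 'M', F.lit("Male"))
     .otherwise(F.lit("missing"))  # null and the blank value both land here
).show()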

Or you could nest your three regexp_replace calls into one line like this:

from pyspark.sql.functions import regexp_replace
df.withColumn('Gender', regexp_replace(regexp_replace(regexp_replace('Gender', 'F','Female'),'M','Male'),' ','missing')).show()

+-------+------+
| Gender| count|
+-------+------+
| Female| 44015|
|   null| 42175|
|   Male|104423|
|missing|     1|
+-------+------+

I think when/otherwise should outperform the three regexp_replace calls, because regexp_replace leaves the nulls untouched (as shown above), so you would still need a fillna on top of them.
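
For completeness, a minimal sketch of the regexp_replace route with the nulls handled as well, reusing the fillna call that is commented out in the question:

from pyspark.sql.functions import regexp_replace

# regexp_replace leaves nulls as null, so the nested calls still need a
# fillna (as in the commented-out line of the question) to catch them.
df = (
    df.withColumn(
        'Gender',
        regexp_replace(
            regexp_replace(
                regexp_replace('Gender', 'F', 'Female'),
                'M', 'Male'),
            ' ', 'missing'))
    .fillna({'Gender': 'missing'})
)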
