Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?

Question

Here is the code to create my dataframe:

from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data1 = [
  ('George', datetime(2010, 3, 24, 3, 19, 58), 13),
  ('George', datetime(2020, 9, 24, 3, 19, 6), 8),
  ('George', datetime(2009, 12, 12, 17, 21, 30), 5),
  ('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
  ('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
  ('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
  ('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]
 
df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)

Here is the dataframe:

+-------+-------------------+-------------+
|name   |trial_start_time   |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13           |
|George |2020-09-24 07:19:06|8            |
|George |2009-12-12 22:21:30|5            |
|Micheal|2010-11-22 18:29:40|12           |
|Maggie |2010-02-08 08:31:23|8            |
|Ravi   |2009-01-01 09:19:47|2            |
|Xien   |2010-03-02 09:33:51|3            |
+-------+-------------------+-------------+

Here is a working example to replace one string:

from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()

Here is the output:

+------------+-------------------+-------------+
|        name|   trial_start_time|purchase_time|
+------------+-------------------+-------------+
|      George|2010-03-24 07:19:58|           13|
|      George|2020-09-24 07:19:06|            8|
|      George|2009-12-12 22:21:30|            5|
|     Micheal|2010-11-22 18:29:40|           12|
|      Maggie|2010-02-08 08:31:23|            8|
|Ravi_renamed|2009-01-01 09:19:47|            2|
|        Xien|2010-03-02 09:33:51|            3|
+------------+-------------------+-------------+

In pandas I could replace multiple strings in one line of code with a lambda expression:

df1['name'].apply(lambda x: x.replace('George', 'George_renamed1').replace('Ravi', 'Ravi_renamed2'))

I am not sure if this can be done in pyspark with regexp_replace. Is there perhaps another alternative? From what I have read about using lambda expressions in pyspark, it seems I would have to create UDFs (which seem to get a little long). But I am curious whether I can simply run some type of regex on multiple strings, like above, in one line of code.

What might I be doing wrong here? df1.withColumn("name", regexp_replace( regexp_replace('name', "Ravi", "Ravi_renamed"))('name', "George", "George_renamed")) gives the error: TypeError: regexp_replace() missing 2 required positional arguments: 'pattern' and 'replacement'
– Kierk, Commented Aug 22, 2020 at 15:10
df1.withColumn("name", regexp_replace(regexp_replace('name', "Ravi", "Ravi_renamed"), "George", "George_renamed"))
– Daeho Ro, Commented Aug 22, 2020 at 15:18
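
The nesting in the comment above can be generated for any number of replacements. The following is a minimal sketch, assuming the replacements live in a dict and that the result is passed to `expr()` as a Spark SQL string; `nested_regexp_replace_sql` is a hypothetical helper, not part of PySpark. Note that regexp_replace treats each key as a regex and replaces matching substrings, so 'George' would also match inside a longer name unless the pattern is anchored with ^ and $.

```python
from functools import reduce


def nested_regexp_replace_sql(column, replacements):
    """Build a nested regexp_replace(...) Spark SQL expression string.

    Hypothetical helper for illustration only. Keys are used as regex
    patterns and applied in dict order, innermost first.
    """
    for text in (column, *replacements.keys(), *replacements.values()):
        # Assumption: plain values only. Anything that would need SQL
        # escaping is rejected instead of being escaped.
        if "'" in text or "\\" in text:
            raise ValueError(f"unsupported character in {text!r}")
    return reduce(
        lambda inner, pair: f"regexp_replace({inner}, '{pair[0]}', '{pair[1]}')",
        replacements.items(),
        column,
    )


sql = nested_regexp_replace_sql(
    "name", {"Ravi": "Ravi_renamed", "George": "George_renamed"}
)
print(sql)
# regexp_replace(regexp_replace(name, 'Ravi', 'Ravi_renamed'), 'George', 'George_renamed')

# With a live DataFrame (requires a Spark session):
# from pyspark.sql.functions import expr
# df1.withColumn("name", expr(sql)).show()
```

Because the replacements are applied one after another, a later pattern can also match text produced by an earlier replacement, so the order of the dict may matter.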

Topde · Accepted Answer · 2020-08-27 20:43:07Z

This is what you're looking for:

Using `when()` (most readable)

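from pyspark.sql.functions import col, when

# Exact-match replacement; any other name is kept unchanged
df1.withColumn('name', 
               when(col('name') == 'George', 'George_renamed1')
               .when(col('name') == 'Ravi', 'Ravi_renamed2')
               .otherwise(col('name'))
              )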
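
If the list of names grows, the same `when()` chain can be built from a dict instead of being typed out. This is a rough sketch under that assumption; `rename_with_when` is a hypothetical helper name, and the pyspark import sits inside the function only so that the definition loads without a Spark installation.

```python
from functools import reduce


def rename_with_when(column, renames):
    """Return a Column that maps exact matches of `column` via `renames`.

    Hypothetical helper for illustration; values not present in
    `renames` are passed through unchanged.
    """
    # Imported lazily so that defining the helper does not need Spark.
    from pyspark.sql.functions import col, when

    pairs = list(renames.items())
    if not pairs:
        return col(column)

    (first_old, first_new), rest = pairs[0], pairs[1:]
    # Start the chain with when(...), then append one .when(...) per pair.
    chained = reduce(
        lambda acc, pair: acc.when(col(column) == pair[0], pair[1]),
        rest,
        when(col(column) == first_old, first_new),
    )
    return chained.otherwise(col(column))


# With a live DataFrame (requires a Spark session):
# renames = {"George": "George_renamed1", "Ravi": "Ravi_renamed2"}
# df1.withColumn("name", rename_with_when("name", renames)).show()
```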

With a mapping expr (less explicit, but handy if there are many values to replace)

from pyspark.sql import functions as F

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))

Or, if you already have a list to use, e.g. name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']:

from pyspark.sql.functions import expr

# str()[1:-1] to convert list to string and remove [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
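
To see what the `str()[1:-1]` trick actually produces, the expression string can be built and printed without Spark. The sketch below assumes the starting point is a dict (`renames` is a hypothetical name) that is first flattened into the alternating key/value list that `map()` expects. It relies on Python's list repr for quoting, so values containing quotes or backslashes may not come through as intended.

```python
from itertools import chain

# Hypothetical starting point: a dict of old name -> new name.
renames = {"George": "George_renamed1", "Ravi": "Ravi_renamed2"}

# Flatten to [key1, value1, key2, value2, ...], the order map() expects.
name_changes = list(chain.from_iterable(renames.items()))

# str() of a list looks like "['a', 'b']"; slicing off the brackets
# leaves quoted, comma-separated literals.
map_args = str(name_changes)[1:-1]
map_sql = f"coalesce(map({map_args})[name], name)"
print(map_sql)
# coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)
```

The printed string is the same expression as the hand-written one above, so it can be passed to `expr()` in the same way.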

The same as above, but using only functions imported from pyspark:

from pyspark.sql.functions import coalesce, create_map, lit

mapping_expr = create_map([lit(x) for x in name_changes])

df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))

Result

df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
|           name|   trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58|           13|
|George_renamed1|2020-09-24 03:19:06|            8|
|George_renamed1|2009-12-12 17:21:30|            5|
|        Micheal|2010-11-22 13:29:40|           12|
|         Maggie|2010-02-08 03:31:23|            8|
|  Ravi_renamed2|2009-01-01 04:19:47|            2|
|           Xien|2010-03-02 04:33:51|            3|
+---------------+-------------------+-------------+

edited Aug 27, 2020 at 20:43

answered Aug 22, 2020 at 16:34


9 Comments

Kierk Over a year ago

I like the simple expr approach. However, it changes some of the names to null. Is there a way to leave those names alone?
| null|2010-11-22 18:29:40| 12|
| null|2010-02-08 08:31:23| 8|

Topde Over a year ago

Updated the answer with a coalesce step for when there's no match, i.e. null entries are replaced with the original column value.

Kierk Over a year ago

Certainly. Away from laptop for weekend.

Kierk Over a year ago

I accepted the answer before testing it, as I was away. I ran df1 = df1.withColumn('name', expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name])"), name) but it throws an exception:

NameError                                 Traceback (most recent call last)
<ipython-input-21-d4be9725afe1> in <module>
----> 1 df1 = df1.withColumn('name', expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name])"), name)

NameError: name 'name' is not defined

I have been troubleshooting to no avail so far. Any insight?

Kierk Over a year ago

all set. Thx @Dee
