Can I use regexp_replace or some equivalent to replace multiple values in a pyspark dataframe column with one line of code?

Here is the code to create my dataframe:

from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data1 = [
  ('George', datetime(2010, 3, 24, 3, 19, 58), 13),
  ('George', datetime(2020, 9, 24, 3, 19, 6), 8),
  ('George', datetime(2009, 12, 12, 17, 21, 30), 5),
  ('Micheal', datetime(2010, 11, 22, 13, 29, 40), 12),
  ('Maggie', datetime(2010, 2, 8, 3, 31, 23), 8),
  ('Ravi', datetime(2009, 1, 1, 4, 19, 47), 2),
  ('Xien', datetime(2010, 3, 2, 4, 33, 51), 3),
]
 
df1 = sqlContext.createDataFrame(data1, ['name', 'trial_start_time', 'purchase_time'])
df1.show(truncate=False)

Here is the dataframe:

+-------+-------------------+-------------+
|name   |trial_start_time   |purchase_time|
+-------+-------------------+-------------+
|George |2010-03-24 07:19:58|13           |
|George |2020-09-24 07:19:06|8            |
|George |2009-12-12 22:21:30|5            |
|Micheal|2010-11-22 18:29:40|12           |
|Maggie |2010-02-08 08:31:23|8            |
|Ravi   |2009-01-01 09:19:47|2            |
|Xien   |2010-03-02 09:33:51|3            |
+-------+-------------------+-------------+

Here is a working example to replace one string:

from pyspark.sql.functions import regexp_replace, regexp_extract, col
df1.withColumn("name", regexp_replace('name', "Ravi", "Ravi_renamed")).show()

Here is the output:

+------------+-------------------+-------------+
|        name|   trial_start_time|purchase_time|
+------------+-------------------+-------------+
|      George|2010-03-24 07:19:58|           13|
|      George|2020-09-24 07:19:06|            8|
|      George|2009-12-12 22:21:30|            5|
|     Micheal|2010-11-22 18:29:40|           12|
|      Maggie|2010-02-08 08:31:23|            8|
|Ravi_renamed|2009-01-01 09:19:47|            2|
|        Xien|2010-03-02 09:33:51|            3|
+------------+-------------------+-------------+

In pandas I could replace multiple strings in one line of code with a lambda expression:

df1['name'].apply(lambda x: x.replace('George', 'George_renamed1').replace('Ravi', 'Ravi_renamed2'))

I am not sure whether this can be done in PySpark with regexp_replace, or whether there is another alternative. From what I have read, using lambda expressions in PySpark requires creating UDFs, which seem to get a little long. I would like to know whether I can simply run some kind of regex replacement on multiple strings, as above, in one line of code.

  • regexp_replace(regexp_replace(...)) Commented Aug 22, 2020 at 14:35
  • What might I be doing wrong here? df1.withColumn("name", regexp_replace( regexp_replace('name', "Ravi", "Ravi_renamed"))('name', "George", "George_renamed")) gives the error: TypeError: regexp_replace() missing 2 required positional arguments: 'pattern' and 'replacement' Commented Aug 22, 2020 at 15:10
  • df1.withColumn("name", regexp_replace( regexp_replace('name', "Ravi", "Ravi_renamed"), "George", "George_renamed")) Commented Aug 22, 2020 at 15:18
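
The nested call in the last comment can be generalized to any number of replacements by folding over a dictionary. The following is a minimal sketch and not part of the original thread: `chain_replacements`, `render`, and `renames` are hypothetical names, and `render` is only a text stand-in for `regexp_replace` so that the nesting can be inspected without a Spark session.

```python
from functools import reduce


def chain_replacements(column, replacements, replace_fn):
    """Nest one replace_fn call per (pattern, replacement) pair, left to right."""
    return reduce(
        lambda acc, pair: replace_fn(acc, pair[0], pair[1]),
        replacements.items(),
        column,
    )


# Stand-in for regexp_replace: it only renders the nested call as text.
def render(column, pattern, replacement):
    return f'regexp_replace({column}, "{pattern}", "{replacement}")'


renames = {"Ravi": "Ravi_renamed", "George": "George_renamed"}
nested = chain_replacements("name", renames, render)
print(nested)
```

With PySpark available, passing `regexp_replace` itself as `replace_fn` should build the same nested column expression as the comment above. Keep in mind that `regexp_replace` matches substrings, so a pattern such as "George" also rewrites part of any longer value that contains it.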

1 Answer

This is what you're looking for:

Using when() (most readable)

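from pyspark.sql.functions import when, col

# Rows that match neither condition keep their original name
df1.withColumn('name',
               when(col('name') == 'George', 'George_renamed1')
               .when(col('name') == 'Ravi', 'Ravi_renamed2')
               .otherwise(col('name'))
              )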

With a mapping expr (less explicit, but handy if there are many values to replace)

import pyspark.sql.functions as F

df1 = df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name], name)"))

Or, if you already have a list to use, e.g. name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']:

from pyspark.sql.functions import expr

# str(name_changes)[1:-1] converts the list to a string and strips the surrounding [ ]
df1 = df1.withColumn('name', expr(f'coalesce(map({str(name_changes)[1:-1]})[name], name)'))
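
The `str(name_changes)[1:-1]` trick depends on how Python happens to print a list. As a hedged alternative, the expression string can be assembled explicitly. The sketch below is an illustration only: `build_map_expr` is a hypothetical helper, and it deliberately rejects values containing quotes or backslashes instead of assuming any particular SQL escaping rule.

```python
def build_map_expr(pairs, column):
    """Return the text coalesce(map(k1, v1, ...)[column], column)."""
    if len(pairs) % 2 != 0:
        raise ValueError("expected alternating old/new values")
    literals = []
    for text in pairs:
        # Refuse quotes and backslashes rather than guess at escaping rules
        if "'" in text or "\\" in text:
            raise ValueError(f"unsupported character in {text!r}")
        literals.append(f"'{text}'")
    return f"coalesce(map({', '.join(literals)})[{column}], {column})"


name_changes = ['George', 'George_renamed1', 'Ravi', 'Ravi_renamed2']
sql = build_map_expr(name_changes, 'name')
print(sql)
```

The resulting string can then be passed to `expr(...)` in the same way as the hand-written expression earlier in this answer.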

The same list-based mapping, but using only functions imported from pyspark:

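from pyspark.sql.functions import coalesce, create_map, lit

# create_map expects alternating key, value columns
mapping_expr = create_map([lit(x) for x in name_changes])

# Fall back to the original name when the map has no entry for it
df1 = df1.withColumn('name', coalesce(mapping_expr[df1['name']], 'name'))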
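
If the replacements are stored as a dictionary rather than a flat list, they can be flattened into the alternating key, value order that `create_map` expects. This is a small sketch under that assumption; `name_map` is a hypothetical variable that does not appear in the original answer.

```python
from itertools import chain

name_map = {'George': 'George_renamed1', 'Ravi': 'Ravi_renamed2'}

# Flatten {k1: v1, k2: v2} into [k1, v1, k2, v2]
name_changes = list(chain.from_iterable(name_map.items()))
print(name_changes)
```

The flattened `name_changes` list can then be fed to `create_map([lit(x) for x in name_changes])` exactly as in the snippet above.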

Result

df1.withColumn('name', F.expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name],name)")).show()
+---------------+-------------------+-------------+
|           name|   trial_start_time|purchase_time|
+---------------+-------------------+-------------+
|George_renamed1|2010-03-24 03:19:58|           13|
|George_renamed1|2020-09-24 03:19:06|            8|
|George_renamed1|2009-12-12 17:21:30|            5|
|        Micheal|2010-11-22 13:29:40|           12|
|         Maggie|2010-02-08 03:31:23|            8|
|  Ravi_renamed2|2009-01-01 04:19:47|            2|
|           Xien|2010-03-02 04:33:51|            3|
+---------------+-------------------+-------------+
9 Comments

I like the simple expr approach. However, it changes some of the names to null. Is there a way to leave those names alone? `| null|2010-11-22 18:29:40| 12|` `| null|2010-02-08 08:31:23| 8|`
Updated the answer with a coalesce step for when there's no match, i.e. null entries are replaced with the original column value.
Certainly. Away from laptop for weekend.
I accepted the answer before testing it as I was away. I ran df1 = df1.withColumn('name', expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name])"), name) but it throws exception NameError Traceback (most recent call last) <ipython-input-21-d4be9725afe1> in <module> ----> 1 df1 = df1.withColumn('name', expr("coalesce(map('George', 'George_renamed1', 'Ravi', 'Ravi_renamed2')[name])"), name) NameError: name 'name' is not defined I am troubleshooting to no avail yet. Any insight?
all set. Thx @Dee