pyspark replace column values

Question

We have the sample DataFrame below:

+-----------+---+---------+
|customer_id|age|post_code|
+-----------+---+---------+
|       1001| 50| BS32 0HW|
+-----------+---+---------+

Then we get a string like this

useful_info = 'Customer [customer_id] is [age] years old and lives at [post_code].'

This is just one example string; it could be any string that contains column names in square brackets. I need to replace those column names with the actual values.

Now I need to add a useful_info column in which the column names are replaced by the column values, i.e. the expected DataFrame would be:

[Row(customer_id='1001', age=50, post_code='BS32 0HW', useful_info='Customer 1001 is 50 years old and lives at BS32 0HW.')]

Does anyone know how to do this?

Is the first output you have a DataFrame or a string? If it's a DataFrame, what are those <BR> characters doing?
– Gaurang Shah, Commented Feb 12, 2020 at 3:21
Do you have the same message for all rows ("Customer [customer_id] is [age] years old and lives at [post_code].") or will it change?
– NIKHIL SUTHAR, Commented Feb 12, 2020 at 5:41
Thanks for the response, but the useful_info string can be different, and the number of columns referenced inside the string can be different as well.
– DataNoob, Commented Feb 12, 2020 at 6:33
Gaurang Shah, BR is for a line break; I just used it for formatting.
– DataNoob, Commented Feb 12, 2020 at 6:46

blackbishop · Accepted Answer · 2020-02-12 10:10:02Z

Here is one way using the regexp_replace function. List the columns whose placeholders should be replaced in the useful_info string column, then build a nested replacement expression like this:

from pyspark.sql.functions import expr, lit

# assumes an existing SparkSession named `spark`
df = spark.createDataFrame([(1001, 50, "BS32 0HW")], ["customer_id", "age", "post_code"])

list_columns_replace = ["customer_id", "age", "post_code"]

# replace first column in the string
to_replace = f"\\\\[{list_columns_replace[0]}\\\\]"
replace_expr = f"regexp_replace(useful_info, '{to_replace}', {list_columns_replace[0]})"

# loop through other columns to replace and update replacement expression
for c in list_columns_replace[1:]:
    to_replace = f"\\\\[{c}\\\\]"
    replace_expr = f"regexp_replace({replace_expr}, '{to_replace}', {c})"

# add new column 
df.withColumn("useful_info", lit("Customer [customer_id] is [age] years old and lives at [post_code].")) \
  .withColumn("useful_info", expr(replace_expr)) \
  .show(1, False)

#+-----------+---+---------+----------------------------------------------------+
#|customer_id|age|post_code|useful_info                                         |
#+-----------+---+---------+----------------------------------------------------+
#|1001       |50 |BS32 0HW |Customer 1001 is 50 years old and lives at BS32 0HW.|
#+-----------+---+---------+----------------------------------------------------+
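
The loop above only does string building, so it can be checked without a Spark session. Here is a minimal sketch of the same idea as a small function. The name build_replace_expr and its parameters are hypothetical (they are not part of PySpark), and the escaping assumes Spark's default handling of SQL string literals:

```python
from functools import reduce


def build_replace_expr(template_col, columns):
    # Hypothetical helper: returns a Spark SQL expression string that nests
    # one regexp_replace call per column name, innermost call first.
    def wrap(inner, col_name):
        # The f-string below yields two backslashes before each bracket.
        # With Spark's default SQL literal escaping, the parser reduces them
        # to one, so the regex matches a literal "[col_name]".
        pattern = f"\\\\[{col_name}\\\\]"
        return f"regexp_replace({inner}, '{pattern}', {col_name})"

    # Start from the template column and wrap it once per column name.
    return reduce(wrap, columns, template_col)


replace_expr = build_replace_expr("useful_info", ["customer_id", "age", "post_code"])
print(replace_expr)
```

The resulting string can then be passed to expr(...) exactly as replace_expr is in the answer above.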

NIKHIL SUTHAR · Accepted Answer · 2020-02-12 09:49:53Z

You can use the approach below, which evaluates the column values dynamically for each row.

Note:

(1) I have written a UDF that uses a regex. The regex accepts letters, digits, underscores (_) and spaces in column names; if your column names contain any other special characters, include those in the regex as well.

(2) All of the logic is based on the pattern that Info contains column names as [column name]. Please update the regex if you use any other pattern.

>>> from pyspark.sql.functions import *
>>> import re
>>> df.show(10,False)
+-----------+---+---------+----------------------------------------------------------------------+
|customer_id|age|post_code|Info                                                                  |
+-----------+---+---------+----------------------------------------------------------------------+
|1001       |50 |BS32 0HW | Customer [customer_id] is [age] years old and lives at [post_code].  |
|1002       |39 |AQ74 0TH | Age of Customer '[customer_id]' is [age] and he lives at [post_code].|
|1003       |25 |RT23 0YJ | Customer [customer_id] lives at [post_code]. He is [age] years old.  |
+-----------+---+---------+----------------------------------------------------------------------+

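>>> def evaluateExpr(Info,data):
...     # find every [column name] placeholder in the template
...     matchpattern = re.findall(r"\[([A-Za-z0-9_ ]+)\]", Info)
...     out = Info
...     for x in matchpattern:
...         # str() so that non-string columns (e.g. an integer age) can be substituted
...         out = out.replace("[" + x + "]", str(data[x]))
...     return out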
... 
>>> evalExprUDF = udf(evaluateExpr)
>>> df.withColumn("Info", evalExprUDF(col("Info"),struct([df[x] for x in df.columns]))).show(10,False)
+-----------+---+---------+-------------------------------------------------------+
|customer_id|age|post_code|Info                                                   |
+-----------+---+---------+-------------------------------------------------------+
|1001       |50 |BS32 0HW | Customer 1001 is 50 years old and lives at BS32 0HW.  |
|1002       |39 |AQ74 0TH | Age of Customer '1002' is 39 and he lives at AQ74 0TH.|
|1003       |25 |RT23 0YJ | Customer 1003 lives at RT23 0YJ. He is 25 years old.  |
+-----------+---+---------+-------------------------------------------------------+
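
The substitution logic inside the UDF is plain Python, so it can be tried out without Spark. Below is a minimal sketch of a single-pass variant using re.sub with a callback. The names fill_placeholders and PLACEHOLDER are hypothetical, a dict stands in for the row struct, and placeholders that do not match any column are left untouched (an assumption; the UDF above would raise an error instead):

```python
import re

# Same placeholder pattern as in the UDF above: [column name]
PLACEHOLDER = re.compile(r"\[([A-Za-z0-9_ ]+)\]")


def fill_placeholders(template, values):
    # Hypothetical helper: replace each [name] in template with values[name].
    def substitute(match):
        name = match.group(1)
        if name in values:
            # str() so that integer columns can be substituted as well
            return str(values[name])
        # Unknown placeholder: keep the original text unchanged
        return match.group(0)

    return PLACEHOLDER.sub(substitute, template)


row = {"customer_id": 1001, "age": 50, "post_code": "BS32 0HW"}
template = "Customer [customer_id] is [age] years old and lives at [post_code]."
print(fill_placeholders(template, row))
# Customer 1001 is 50 years old and lives at BS32 0HW.
```

Inside a UDF, the struct argument arrives as a Row; converting it with data.asDict() should allow the same function to be reused there.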
