We have the sample DataFrame below:

+-----------+---+---------+
|customer_id|age|post_code|
+-----------+---+---------+
|       1001| 50| BS32 0HW|
+-----------+---+---------+

Then we get a string like this

useful_info = 'Customer [customer_id] is [age] years old and lives at [post_code].'

This is just one example; the string could be anything that contains column names in square brackets. I just need to replace those column names with the actual values.

Now I need to add a useful_info column in which the placeholders are replaced by the column values, i.e. the expected DataFrame would be:

[Row(customer_id='1001', age=50, post_code='BS32 0HW', useful_info='Customer 1001 is 50 years old and lives at BS32 0HW.')]

Does anyone know how to do this?

  • the first output you have, is it dataframe? or string? if it's dataframe, what's those <BR> character doing? Commented Feb 12, 2020 at 3:21
  • do you have same message for all "Customer [customer_id] is [age] years old and lives at [post_code]." or will it change ? Commented Feb 12, 2020 at 5:41
  • Thanks for the response but the useful info string can be different and number of columns referring inside string can be different as well. Commented Feb 12, 2020 at 6:33
  • Gaurang Shah, BR is for line break, I just used for formatting stuff Commented Feb 12, 2020 at 6:46

2 Answers

Here is one way, using the regexp_replace function. List the columns whose placeholders should be replaced in the useful_info string column, then build a nested replacement expression like this:

from pyspark.sql.functions import expr, lit

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([(1001, 50, "BS32 0HW")], ["customer_id", "age", "post_code"])

list_columns_replace = ["customer_id", "age", "post_code"]

# replace first column in the string
to_replace = f"\\\\[{list_columns_replace[0]}\\\\]"
replace_expr = f"regexp_replace(useful_info, '{to_replace}', {list_columns_replace[0]})"

# loop through other columns to replace and update replacement expression
for c in list_columns_replace[1:]:
    to_replace = f"\\\\[{c}\\\\]"
    replace_expr = f"regexp_replace({replace_expr}, '{to_replace}', {c})"

# add new column 
df.withColumn("useful_info", lit("Customer [customer_id] is [age] years old and lives at [post_code].")) \
  .withColumn("useful_info", expr(replace_expr)) \
  .show(1, False)

#+-----------+---+---------+----------------------------------------------------+
#|customer_id|age|post_code|useful_info                                         |
#+-----------+---+---------+----------------------------------------------------+
#|1001       |50 |BS32 0HW |Customer 1001 is 50 years old and lives at BS32 0HW.|
#+-----------+---+---------+----------------------------------------------------+
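
To see what the loop above actually hands to expr, the same string building can be run without Spark. This is only a sketch: build_replace_expr is a hypothetical helper name, not part of PySpark, and it simply folds the first-column case into the loop.

```python
def build_replace_expr(columns, source_col="useful_info"):
    """Nest one regexp_replace call per column, innermost first."""
    replace_expr = source_col
    for c in columns:
        # Four backslashes in Python source leave two in the string.
        # Assumption: the Spark SQL parser (default settings) then reduces
        # them to one, so the regex Spark sees is \[c\].
        to_replace = f"\\\\[{c}\\\\]"
        replace_expr = f"regexp_replace({replace_expr}, '{to_replace}', {c})"
    return replace_expr


print(build_replace_expr(["customer_id", "age"]))
# regexp_replace(regexp_replace(useful_info, '\\[customer_id\\]', customer_id), '\\[age\\]', age)
```

The column names are interpolated into the SQL text unchanged, so a name containing spaces or other special characters would presumably need backtick quoting before it is passed to expr.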

You can use the approach below, which evaluates the column values dynamically.

Note:

(1) I have written a UDF that uses a regex. The pattern covers letters, digits, underscores and spaces; if your column names contain any other special characters, include those in the regex as well.

(2) All the logic relies on Info referring to columns as [column name]. Please update the regex if your strings use any other pattern.
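
As a quick check of the two notes above, the pattern can be tried on plain strings before involving Spark. A small sketch; the hyphenated column name and the WIDER pattern are made up for illustration.

```python
import re

# Pattern from the answer: letters, digits, underscore and space inside [ ].
PLACEHOLDER = re.compile(r"\[([A-Za-z0-9_ ]+)\]")

template = "Customer [customer_id] is [age] years old and lives at [post_code]."
names = PLACEHOLDER.findall(template)
print(names)  # ['customer_id', 'age', 'post_code']

# A hyphenated name is not matched until '-' is added to the character class.
print(PLACEHOLDER.findall("Lives at [post-code]"))  # []
WIDER = re.compile(r"\[([A-Za-z0-9_ \-]+)\]")
print(WIDER.findall("Lives at [post-code]"))  # ['post-code']
```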

>>> from pyspark.sql.functions import col, struct, udf
>>> import re
>>> df.show(10,False)
+-----------+---+---------+----------------------------------------------------------------------+
|customer_id|age|post_code|Info                                                                  |
+-----------+---+---------+----------------------------------------------------------------------+
|1001       |50 |BS32 0HW | Customer [customer_id] is [age] years old and lives at [post_code].  |
|1002       |39 |AQ74 0TH | Age of Customer '[customer_id]' is [age] and he lives at [post_code].|
|1003       |25 |RT23 0YJ | Customer [customer_id] lives at [post_code]. He is [age] years old.  |
+-----------+---+---------+----------------------------------------------------------------------+

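>>> def evaluateExpr(Info, data):
...     # collect every column name that appears as [column name]
...     matchpattern = re.findall(r"\[([A-Za-z0-9_ ]+)\]", Info)
...     out = Info
...     for x in matchpattern:
...         # str() so that non-string columns such as age do not raise a TypeError
...         out = out.replace("[" + x + "]", str(data[x]))
...     return out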
... 
>>> evalExprUDF = udf(evaluateExpr)
>>> df.withColumn("Info", evalExprUDF(col("Info"),struct([df[x] for x in df.columns]))).show(10,False)
+-----------+---+---------+-------------------------------------------------------+
|customer_id|age|post_code|Info                                                   |
+-----------+---+---------+-------------------------------------------------------+
|1001       |50 |BS32 0HW | Customer 1001 is 50 years old and lives at BS32 0HW.  |
|1002       |39 |AQ74 0TH | Age of Customer '1002' is 39 and he lives at AQ74 0TH.|
|1003       |25 |RT23 0YJ | Customer 1003 lives at RT23 0YJ. He is 25 years old.  |
+-----------+---+---------+-------------------------------------------------------+
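
The substitution itself can also be tested outside Spark. The following is a minimal sketch of a variant, not the answer's exact UDF: it uses re.sub with a callback instead of repeated str.replace, takes a plain dict, and leaves unknown or null placeholders untouched. fill_template is a hypothetical name.

```python
import re

PLACEHOLDER = re.compile(r"\[([A-Za-z0-9_ ]+)\]")


def fill_template(template, row):
    """Replace each [name] with str(row[name]); leave unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        if name in row and row[name] is not None:
            return str(row[name])
        # No such column (or a null value): keep the placeholder text.
        return match.group(0)
    return PLACEHOLDER.sub(substitute, template)


row = {"customer_id": 1001, "age": 50, "post_code": "BS32 0HW"}
print(fill_template("Customer [customer_id] is [age] years old and lives at [post_code].", row))
# Customer 1001 is 50 years old and lives at BS32 0HW.
print(fill_template("Unknown [nickname] stays put", row))
# Unknown [nickname] stays put
```

Inside a UDF the struct arrives as a Row rather than a dict, and `in` on a Row checks values, not field names, so the Row would need to be converted first, for example with data.asDict().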
