How to replace string values in one column with actual column values from other columns in the same dataframe? Part 2

Question

I have a column of string values, and I would like to replace the substrings in that column with the values from other columns in the same row, and replace all the plus signs with spaces (as shown below).

I have these List[String] mappings, which are passed in dynamically; mapFrom and mapTo correspond by index.

Description values: mapFrom: ["Child", "ChildAge", "ChildState"]

Column names: mapTo: ["name", "age", "state"]

Input example:

name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child

Expected result:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

How can I achieve this using Spark Scala?

When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?

The output becomes

name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler

For the second row, ChildState + Child + Child, how do you know which one is the age and which is the name?
– koiralo, Commented Aug 5, 2019 at 13:51
I am assuming there is also a typo here: tyler, 12, ohio, ChildAge + ChildState + ChildName, and that this should be tyler, 12, ohio, ChildAge + ChildState + Child. Is that correct?
– Jonathan Myers, Commented Aug 5, 2019 at 15:40
It looks like ChildName in mapFrom is actually Child, and every ChildName in the input is actually just Child. I edited the question to reflect this; please tell me if that's wrong.
– Shaido, Commented Aug 6, 2019 at 3:02

Gelerion · Accepted Answer · 2019-08-05 15:44:51Z

I would use map over the rows instead of the built-in Spark functions.
It is not the cleanest approach, but it works.

import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell; needed for toDF

val data = Seq(
  ("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
  ("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
  ("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")

Define the schema for encoder conversions

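import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("state", StringType),
  StructField("description", StringType)
))
// Encoder that lets data.map return generic Row objects with this schema
val encoder = RowEncoder(schema)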

The logic itself

import org.apache.spark.sql.Row

val res = data.map(row => {
  // Split e.g. "ChildName + ChildAge + ChildState" into its tokens
  val tokens = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
  val sb = new StringBuilder()

  // Append the column value for each token, in the order the tokens appear.
  // Each value is followed by a space, so the result keeps a trailing space.
  // The tokens follow the sample data above, which uses ChildName for the name.
  tokens.foreach {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
})(encoder)

Results

res.show(false)
+-------+---+----------+---------------------+
|name   |age|state     |description          |
+-------+---+----------+---------------------+
|tiffany|10 |virginia  |tiffany 10 virginia  |
|andrew |11 |california|california andrew 11 |
|tyler  |12 |ohio      |12 ohio tyler        |
+-------+---+----------+---------------------+
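
Since the question passes mapFrom and mapTo in dynamically, the hard-coded match could be replaced by a lookup built from the two lists. The following is only a sketch in plain Scala under stated assumptions: resolveDescription is a hypothetical helper, and a Map keyed by column name stands in for a Spark Row.

```scala
// Hypothetical helper: resolve each token of a description through a
// token -> column-name lookup built from the two dynamic lists.
def resolveDescription(
    description: String,
    mapFrom: Seq[String],
    mapTo: Seq[String],
    row: Map[String, Any] // stand-in for a Spark Row, keyed by column name
): String = {
  val tokenToColumn = mapFrom.zip(mapTo).toMap
  description
    .split("\\+")
    .map(_.trim)
    .map { token =>
      // Unknown tokens are kept as-is instead of throwing a MatchError
      tokenToColumn.get(token)
        .flatMap(column => row.get(column))
        .map(_.toString)
        .getOrElse(token)
    }
    .mkString(" ")
}

val mapFrom = Seq("Child", "ChildAge", "ChildState")
val mapTo = Seq("name", "age", "state")
val andrew = Map[String, Any]("name" -> "andrew", "age" -> 11, "state" -> "california")

val resolved = resolveDescription("ChildState + Child + ChildAge", mapFrom, mapTo, andrew)
```

Because each token is looked up as a whole, Child cannot match inside ChildAge or ChildState. Inside data.map, the same lookup could read the values with row.getAs instead of a Map.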

Shaido · Accepted Answer · 2019-08-06 03:10:03Z

The problem is that Child is a prefix of both ChildAge and ChildState. The patterns are applied one after another as regexes, and Child comes first, so it also matches inside ChildAge and ChildState. This produces strange outputs such as tiffanyAge and tiffanyState (the Child part has been replaced by the name).
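
The effect can be reproduced outside Spark. This is a minimal sketch in plain Scala, where String.replaceAll stands in for the chained regexp_replace calls and the values of the first sample row are hard-coded for illustration:

```scala
// Patterns in the original order, paired with the first row's values
val replacements = Seq(
  "Child" -> "tiffany",
  "ChildAge" -> "10",
  "ChildState" -> "virginia"
)

// Apply the replacements one after another, as the fold over withColumn does
val afterColumns = replacements.foldLeft("Child + ChildAge + ChildState") {
  case (text, (pattern, value)) => text.replaceAll(pattern, value)
}

// Finally turn each " + " separator into a single space
val clobbered = afterColumns.replaceAll(" \\+ ", " ")
```

After the first step the string no longer contains ChildAge or ChildState, so the remaining two patterns have nothing left to match.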

There are two simple solutions in this case without changing the input:

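Change the regex for Child to use a lookahead:
```
val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "
```
This matches Child only when it is followed by a space or by the end of the string, so it no longer matches inside ChildAge or ChildState.

Alternatively, put Child last in the list so that ChildAge and ChildState are replaced first. mapTo must be reordered in the same way so that the two lists still correspond by index:
```
val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "
val mapTo = List("age", "state", "name").map(col) :+ lit(" ")
```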
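
Because the question passes mapFrom and mapTo in dynamically, the second alternative could also be applied automatically by sorting the pairs so that longer patterns come first. This is a minimal sketch in plain Scala; String.replaceAll stands in for regexp_replace, and the row map holding the third sample row is a hypothetical stand-in for the DataFrame columns.

```scala
val mapFrom = List("Child", "ChildAge", "ChildState")
val mapTo = List("name", "age", "state")

// Pair the lists before sorting so that each pattern keeps its column,
// then order by descending pattern length: a prefix such as Child comes last.
val (orderedFrom, orderedTo) =
  mapFrom.zip(mapTo).sortBy { case (from, _) => -from.length }.unzip

// Hypothetical stand-in for the column values of the third sample row
val row = Map("name" -> "tyler", "age" -> "12", "state" -> "ohio")

val replaced = orderedFrom.zip(orderedTo).foldLeft("ChildAge + ChildState + Child") {
  case (text, (from, to)) => text.replaceAll(from, row(to))
}
val result = replaced.replaceAll(" \\+ ", " ")
```

The sorted lists can then be used in place of the hand-ordered ones. The sketch assumes that no column value itself contains one of the remaining patterns; a value that did would be matched again by a later replacement.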

Full solution with the first alternative:

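import org.apache.spark.sql.functions.{col, lit, regexp_replace}
import spark.implicits._ // assumes a SparkSession named spark; enables the $"..." syntax

// (?= |$): Child followed by a space or by the end of the string, so a trailing Child also matches.
// The final entry replaces each " + " separator with a single space.
val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

// df is the input DataFrame from the question; apply one regexp_replace per (pattern, column) pair
val df2 = mapToFrom.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumn("description", regexp_replace($"description", lit(from), to))
}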
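
Spark's regexp_replace uses Java regular expressions, so a pattern can be checked on plain strings before it is used in the fold. The sketch below compares a lookahead that requires a following space with one that also accepts the end of the string, using the three sample descriptions. NAME is only an illustrative placeholder for the name column.

```scala
val descriptions = List(
  "Child + ChildAge + ChildState",
  "ChildState + Child + ChildAge",
  "ChildAge + ChildState + Child"
)

// Lookahead that requires a space after Child
val spaceOnly = descriptions.map(_.replaceAll("Child(?= )", "NAME"))

// Lookahead that accepts a space or the end of the string after Child
val spaceOrEnd = descriptions.map(_.replaceAll("Child(?= |$)", "NAME"))

spaceOnly.zip(spaceOrEnd).foreach { case (a, b) => println(s"$a | $b") }
```

The two patterns differ only on the third description, where Child is the last token: with the space-only lookahead that trailing Child is left unreplaced.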
