1

I have some string values in one column and I would like to replace the substrings in that column with values in other columns, and replace all the plus signs with spaces (like below).

I have these List[String] mappings which are passed in dynamically where the mapFrom and mapTo should correlate in index.

Description values: mapFrom: ["Child", "ChildAge", "ChildState"]

Column names: mapTo: ["name", "age", "state"]

Input example:

name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child

Expected result:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

How can I achieve this using Spark Scala?

When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?

The output becomes

name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler
4
  • For thw second Row how do you knowChildState + Child + Child which one is age and name, How do you know it? Commented Aug 5, 2019 at 13:51
  • that was a typo sorry Commented Aug 5, 2019 at 15:32
  • 1
    I am assuming there is also a typo here: tyler, 12, ohio, ChildAge + ChildState + ChildName and that this should be tyler, 12, ohio, ChildAge + ChildState + Child, is that correct? Commented Aug 5, 2019 at 15:40
  • 1
    It looks like ChildName in mapFrom are actually Child while all ChildName in the input are actually only Child. I edited the question to reflect this, please tell me if it's wrong. Commented Aug 6, 2019 at 3:02

2 Answers 2

1

I would use map instead of built-in Spark functions.
Not the cleanest, but the working solution

val data = Seq(
  ("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
  ("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
  ("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")

Define the schema for encoder conversions

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("state", StringType),
  StructField("description", StringType)
))
val encoder = RowEncoder(schema)

The logic itself

val res = data.map(row => {
  val desc = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
  val sb = new StringBuilder()
  val map = desc.zipWithIndex.toMap.map(_.swap)

  map(0) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  map(1) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  map(2) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
}) (encoder)

Results

res.show(false)
+-------+---+----------+---------------------+
|name   |age|state     |description          | 
+-------+---+----------+---------------------+
|tiffany|10 |virginia  |tiffany 10 virginia  |
|andrew |11 |california|california andrew 11 |
|tyler  |12 |ohio      |12 ohio tyler        |
+-------+---+----------+---------------------+
Sign up to request clarification or add additional context in comments.

Comments

1

The problem here is due to the description containing Child. This is a subsequence of ChildAge and ChildState. Since a regex is used this means that the Child part will be replaced by the names resulting in strange outputs such as tiffanyAge and tiffanyState (note that the Child part here is replaced by the name).

There are two simple solutions in this case without changing the input:

  1. Change the regex for Child to use lookahead:

    val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
    

    This will only match Child when there is a space afterwards.

  2. Put Child last in the list. This means that ChildAge and ChildState will be matched first:

    val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "
    

Full solution with the first alternative:

val mapFrom = List("Child(?= )", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) => 
  df.withColumn("description", regexp_replace($"description", lit(from), to))
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.