
In my case, how do I split a column of StringType with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1', '2', '3']?

2 Answers


This is relatively simple with a UDF:

import spark.implicits._  // required for toDF and the $-column syntax
import org.apache.spark.sql.functions.udf

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")

// Split on spaces, then take the part before each '-' as an Int
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))

df.withColumn("newCol", extractFirst($"input"))
  .show()

gives

+--------------------+---------+
|               input|   newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+

I could not find an easy solution with Spark internals (other than using split in combination with explode etc. and then re-aggregating).
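For completeness, the split + explode + re-aggregate route mentioned above could look roughly like the sketch below. It assumes each input row gets a synthetic id to group back on (via monotonically_increasing_id, a hypothetical choice here), and note that collect_list does not guarantee element order in general:

```scala
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")

df.withColumn("id", monotonically_increasing_id())     // tag each row
  .withColumn("pair", explode(split($"input", " ")))   // one row per "k-v" pair
  .withColumn("key", split($"pair", "-")(0).cast("int")) // keep the part before '-'
  .groupBy($"id", $"input")
  .agg(collect_list($"key").as("newCol"))              // re-aggregate into an array
```

This is clearly more machinery than the UDF, which is why the UDF reads as the simpler option here.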




You can split the string into an array using the split function and then transform the array using the Higher Order Function TRANSFORM (available since Spark 2.4) together with substring_index:

import spark.implicits._  // required for toDF and the $-column syntax
import org.apache.spark.sql.functions.{split, expr}

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))

Notice that this is a native approach; no UDF is applied.
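One difference from the UDF answer: TRANSFORM with substring_index yields an array of strings, not integers. If you need integers, a cast inside the lambda should do it; this is a sketch assuming Spark 2.4+:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{split, expr}

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")

df.withColumn("array", split($"stringCol", " "))
  .withColumn("result",
    // cast each extracted prefix to INT to get array<int> instead of array<string>
    expr("TRANSFORM(array, x -> CAST(substring_index(x, '-', 1) AS INT))"))
```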

Comments

Incorrect? What about val df = Seq("1-1235.0 55-1248.0 3-7895.2").toDF("stringCol"), i.e. a value greater than 9?
@thebluephantom Thanks for pointing out multi-digit values. I edited the answer, replacing substring with substring_index.
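To illustrate the point of that comment exchange: substring_index takes everything before the first '-', so it handles multi-digit keys that a fixed-length substring would truncate. A quick sketch (assuming Spark 2.4+):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

// "55" has two digits; substring_index(x, '-', 1) still extracts it whole
Seq("1-1235.0 55-1248.0 3-7895.2").toDF("stringCol")
  .withColumn("result",
    expr("TRANSFORM(split(stringCol, ' '), x -> substring_index(x, '-', 1))"))
  .show(false)
// result column: [1, 55, 3]
```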
