In my case how to split a column contain StringType with a format '1-1235.0 2-1248.0 3-7895.2' to another column with ArrayType contains ['1','2','3']
2 Answers
this is relatively simple with UDF:
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))
df.withColumn("newCol", extractFirst($"input"))
.show()
gives
+--------------------+---------+
| input| newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
I could not find an easy soluton with spark internals (other than using split in combination with explode etc and then re-aggregating)
Comments
You can split the string to an array using split function and then you can transform the array using Higher Order Function TRANSFORM (it is available since Sark 2.4) together with substring_index:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
df.withColumn("array", split($"stringCol", " "))
.withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is native approach, no UDF applied.
2 Comments
Ged
Incorrect? What if val df = Seq("1-1235.0 55-1248.0 3-7895.2").toDF("stringCol"), e.g. value greater than 9?
David Vrba
@thebluephantom Thanks for pointing out more digit values. I edited the answer by replacing substring to substring_index.