I have a number of columns in a Spark DataFrame that I want to combine into one column, with a separator character between the values. I don't want to combine all of the columns, just some of them. In this example, I would like to put a pipe between the values of every column except the first two.
Here is an example input:
+---+--------+----------+----------+---------+
|id | detail | context  | col3     | col4    |
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+
The expected output would be something like this:
+---+--------+----------+----------+---------+--------------------------------+
|id | detail | context  | col3     | col4    | data                           |
+---+--------+----------+----------+---------+--------------------------------+
| 1 | {blah} | service  | null     | null    | service||                      |
| 2 | { blah | """ blah | """blah} | service | """blah|"""blah}|service       |
| 3 | { blah | """blah} | service  | null    | """blah}|service|              |
+---+--------+----------+----------+---------+--------------------------------+
Currently, I have something like the following:
val columns = df.columns.filterNot(_ == "id").filterNot(_ =="detail")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col):_*) as "data")
The above combines the columns, but doesn't add the separator character. I tried the following variations, but I'm obviously not doing it right:
scala> val combined = nonulls.select($"id", concat(columns.map(col):_|*) as "data")
scala> val combined = nonulls.select($"id", concat(columns.map(col):_*, lit('|')) as "data")
scala> val combined = nonulls.select($"id", concat(columns.map(col):_*|) as "data")
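For reference, the `: _*` varargs expansion has to be the last argument of a call, which is why the attempts above do not compile. Spark's built-in `concat_ws` takes the separator as its first argument, so it may be the simplest fit here. Below is a minimal sketch, assuming a local SparkSession and a small DataFrame that mirrors the example input; the object name `ConcatWithSeparator` and the helper `addData` are hypothetical names used only for illustration. Because `concat_ws` skips nulls, each column is wrapped in `coalesce(..., lit(""))` so that empty slots still get their pipe.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}

object ConcatWithSeparator {
  // Hypothetical helper: joins every column not listed in `skip` with "|".
  def addData(df: DataFrame, skip: Set[String]): DataFrame = {
    val columns = df.columns.filterNot(skip.contains)
    // concat_ws drops nulls, so replace them with "" to keep the empty slots.
    val parts = columns.map(c => coalesce(col(c), lit("")))
    df.withColumn("data", concat_ws("|", parts: _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concat-ws-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the example input (all values are strings or null).
    val rows: Seq[(Int, String, String, String, String)] = Seq(
      (1, "{blah}", "service", null, null),
      (2, "{ blah", "\"\"\" blah", "\"\"\"blah}", "service"),
      (3, "{ blah", "\"\"\"blah}", "service", null)
    )
    val df = rows.toDF("id", "detail", "context", "col3", "col4")

    addData(df, Set("id", "detail")).show(false)

    spark.stop()
  }
}
```

With this approach the original columns keep their nulls, since the `coalesce` is applied only inside the new `data` column rather than through `df.na.fill("")`. If the trailing pipes for null values are not wanted, dropping the `coalesce` wrapper and passing `columns.map(col): _*` directly should let `concat_ws` skip the nulls instead.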
Any suggestions would be much appreciated! :) Thanks!
`""" blah`: is this supposed to be one string?