
I have a number of columns in a Spark DataFrame that I want to combine into one column, with a separating character between each value. I don't want to combine all of the columns, just some of them. In this example, I would like to add a pipe between the values of every column besides the first two.

Here is an example input:

+---+--------+----------+----------+---------+
|id | detail | context  |     col3 |     col4|
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+

The expected output would be something like this:

+---+--------+----------+----------+---------+--------------------------+
|id | detail | context  |     col3 |     col4| data                     |
+---+--------+----------+----------+---------+--------------------------+
| 1 | {blah} | service  | null     | null    | service||                |
| 2 | { blah | """ blah | """blah} | service | """blah|"""blah}|service |
| 3 | { blah | """blah} | service  | null    | """blah}|service|        |
+---+--------+----------+----------+---------+--------------------------+

Currently, I have something like the following:

val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col):_*) as "data")

The above combines the columns together, but doesn't add in the separating character. I tried these possibilities, but I'm obviously not doing it right:

scala> val combined = nonulls.select($"id", concat(columns.map(col):_|*) as "data")

scala> val combined = nonulls.select($"id", concat(columns.map(col):_*, lit('|')) as "data")

scala> val combined = nonulls.select($"id", concat(columns.map(col):_*|) as "data")

Any suggestions would be much appreciated! :) Thanks!

  • Is """ blah supposed to be one string? Commented Nov 30, 2017 at 21:00
  • @mtoto Yes, it's one string; I'm just showing that some of the column values could contain spaces as well. Commented Nov 30, 2017 at 21:05

2 Answers


This should do the trick:

val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
// interleave a "|" literal after every column, then drop the trailing separator
val columnsWithPipe = columns.flatMap(colname => Seq(col(colname), lit("|"))).dropRight(1)
val combined = nonulls.select($"id", concat(columnsWithPipe: _*) as "data")

1 Comment

This works! Why do you turn it into a flatMap and use the dropRight?
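To answer the comment: flatMap interleaves a pipe literal after every column, and dropRight(1) removes the one trailing separator that leaves behind. The same pattern is easiest to see on plain Scala collections (the column names here are just placeholder strings, not Spark columns):

```scala
// Interleave-then-drop, on an ordinary Seq instead of Spark Column objects
val columns = Seq("context", "col3", "col4")

// flatMap emits each element followed by a "|":
// Seq("context", "|", "col3", "|", "col4", "|")
val withPipes = columns.flatMap(c => Seq(c, "|"))

// dropRight(1) removes the trailing "|" so the last value isn't followed by one
val trimmed = withPipes.dropRight(1)

println(trimmed.mkString) // context|col3|col4
```

In the Spark version, `"|"` becomes `lit("|")` and the strings become `col(...)` expressions, but the interleaving logic is identical.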

Just use the concat_ws function; it concatenates columns with a separator of your choice.

It's imported as import org.apache.spark.sql.functions.concat_ws
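A minimal self-contained sketch of this approach, assuming a local SparkSession; the data and column names mirror the question, and `df` / `nonulls` are rebuilt here so the example stands on its own. One caveat worth noting: concat_ws skips null inputs entirely, so the question's na.fill("") step still matters if you want empty slots (like "service||") preserved in the output.

```scala
// Sketch assuming a local SparkSession; ConcatWsDemo is a hypothetical wrapper object.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

object ConcatWsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concat_ws-demo")
      .getOrCreate()
    import spark.implicits._

    // Rebuild a small DataFrame shaped like the question's example
    val df = Seq(
      (1, "{blah}", "service", Option.empty[String], Option.empty[String]),
      (2, "{ blah", "\"\"\" blah", Option("\"\"\"blah}"), Option("service"))
    ).toDF("id", "detail", "context", "col3", "col4")

    // concat_ws skips nulls, so fill them with "" first to keep the
    // empty slots (e.g. "service||" rather than just "service")
    val nonulls = df.na.fill("")
    val columns = df.columns.filterNot(Set("id", "detail"))

    // concat_ws takes the separator first, then the columns to join
    val combined = nonulls.select($"id", concat_ws("|", columns.map(col): _*) as "data")
    combined.show(false)

    spark.stop()
  }
}
```

This removes the need for the flatMap/dropRight interleaving entirely, since the separator is handled by the function itself.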
