8

How can I combine columns in spark as a nested array?

val inputSmall = Seq(
    ("A", 0.3, "B", 0.25),
    ("A", 0.3, "g", 0.4),
    ("d", 0.0, "f", 0.1),
    ("d", 0.0, "d", 0.7),
    ("A", 0.3, "d", 0.7),
    ("d", 0.0, "g", 0.4),
    ("c", 0.2, "B", 0.25)).toDF("column1", "transformedCol1", "column2", "transformedCol2")

To something similar as

+-------+---------------+---------------+------- +
|column1|transformedCol1|transformedCol2|combined|
+-------+---------------+---------------+------ -+
|      A|            0.3|            0.3[0.3, 0.3]|
+-------+---------------+---------------+-------+

2 Answers 2

29

If you want to combine multiple columns into a new column of ArrayType, you can use the array function:

import org.apache.spark.sql.functions._
val result = inputSmall.withColumn("combined", array($"transformedCol1", $"transformedCol2"))
result.show()

+-------+---------------+-------+---------------+-----------+
|column1|transformedCol1|column2|transformedCol2|   combined|
+-------+---------------+-------+---------------+-----------+
|      A|            0.3|      B|           0.25|[0.3, 0.25]|
|      A|            0.3|      g|            0.4| [0.3, 0.4]|
|      d|            0.0|      f|            0.1| [0.0, 0.1]|
|      d|            0.0|      d|            0.7| [0.0, 0.7]|
|      A|            0.3|      d|            0.7| [0.3, 0.7]|
|      d|            0.0|      g|            0.4| [0.0, 0.4]|
|      c|            0.2|      B|           0.25|[0.2, 0.25]|
+-------+---------------+-------+---------------+-----------+
Sign up to request clarification or add additional context in comments.

2 Comments

What if I have an array of columns to be combined? val names=Seq("foo", "bar") followed by withColumn("combined", array(names:_*)) is unsupported. That means that if the names change dynamically there seems to be no way of achieving this.
Aha, it will accept multiple columns, as opposed to multiple strings, so this works: val names = Seq("foo", "bar"); frame.withColumn("combined", array(names.map(frame(_)):_*))
0

I would rather use the below, will handle column changes

inputSmall.withColumn("combined", F.array("*"))

1 Comment

This doesn't work. Throws pyspark.sql.utils.AnalysisException: "cannot resolve 'array( ...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.