6

I have a Spark dataset like this one:

key id val1 val2 val3
1   a  a1   a2   a3
2   a  a4   a5   a6
3   b  b1   b2   b3
4   b  b4   b5   b6
5   b  b7   b8   b9
6   c  c1   c2   c3

I would like to group all rows by id in a list or array like this:

(a, ([1   a  a1   a2   a3], [2   a  a4   a5   a6]) ),
(b, ([3   b  b1   b2   b3], [4   b  b4   b5   b6], [5   b  b7   b8   b9]) ),
(c, ([6   c  c1   c2   c3]) )

I have used map to output key/value pairs with the right key, but I am having trouble building the final key/array pairs.

Can anybody help with that?

3 Answers

8

How about this:

import org.apache.spark.sql.functions._
df.withColumn("combined", array("key", "id", "val1", "val2", "val3")).groupBy("id").agg(collect_list(col("combined")))

The array function combines the listed columns into a single array column; after that, it is a simple groupBy with collect_list.
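
For reference, here is a minimal self-contained sketch of this approach that can be run locally. The object name GroupRowsById, the helper groupRowsById, the local SparkSession settings, and the output column name rows are my own assumptions and not part of the answer. key is stored as a string here so that all array elements share one type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, collect_list}

object GroupRowsById {
  // Hypothetical helper: packs every column into one array column,
  // then collects those arrays per id.
  def groupRowsById(df: DataFrame): DataFrame =
    df.withColumn("combined", array(df.columns.map(col): _*))
      .groupBy("id")
      .agg(collect_list(col("combined")).as("rows"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-rows-by-id")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, with key kept as a string.
    val df = Seq(
      ("1", "a", "a1", "a2", "a3"),
      ("2", "a", "a4", "a5", "a6"),
      ("3", "b", "b1", "b2", "b3"),
      ("4", "b", "b4", "b5", "b6"),
      ("5", "b", "b7", "b8", "b9"),
      ("6", "c", "c1", "c2", "c3")
    ).toDF("key", "id", "val1", "val2", "val3")

    groupRowsById(df).show(false)
    spark.stop()
  }
}
```

Note that collect_list does not guarantee the order of the collected elements, so the rows inside each group may come back in a different order than in the input.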

2 Comments

This looks good. I have changed the code to select all columns: df.withColumn("combined", array(df.columns map col: _*)).groupBy("id").agg(collect_list($"combined")) Is there a more concise way? Thanks a lot!
On top of giving a good answer, Assaf deserves a shout out for being the rare breed of JVM user who actually posts answers with the exact import needed!
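
Regarding the question about a more concise form: one possible variant (a sketch, not something taken from the answer above) is to skip the intermediate column and collect a struct of all columns directly. struct keeps each column's own type, whereas array needs a common element type for all of its inputs. The names CollectStructs, collectRows, and rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object CollectStructs {
  // Hypothetical helper: one output row per id, with all original rows
  // collected as structs and no intermediate "combined" column.
  def collectRows(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(struct(df.columns.map(col): _*)).as("rows"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-structs")
      .getOrCreate()
    import spark.implicits._

    // key is an Int here; struct preserves it next to the string columns.
    val df = Seq(
      (1, "a", "a1", "a2", "a3"),
      (2, "a", "a4", "a5", "a6"),
      (3, "b", "b1", "b2", "b3"),
      (4, "b", "b4", "b5", "b6"),
      (5, "b", "b7", "b8", "b9"),
      (6, "c", "c1", "c2", "c3")
    ).toDF("key", "id", "val1", "val2", "val3")

    val grouped = collectRows(df)
    grouped.printSchema()
    grouped.show(false)
    spark.stop()
  }
}
```

This also scales to wide datasets, since df.columns is used instead of listing the column names by hand.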
0

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._

// Note: VectorAssembler accepts only numeric, boolean, and vector input columns.
val assembler = new VectorAssembler()
  .setInputCols(Array("key", "id", "val1", "val2", "val3"))
  .setOutputCol("combined")

val dfRes = assembler.transform(df).groupBy("id").agg(collect_list(col("combined")))

3 Comments

I like this solution, but it looks like VectorAssembler does not work with StringType.
Yes, indeed it does not. Assaf's answer is the easiest solution; alternatively, you can use org.apache.spark.ml.feature.{IndexToString, StringIndexer}, but I guess that would be too much hassle here.
Thanks. I liked your solution more because there is no need for the additional column and my dataset has ~150 columns.
0

Content of my xzy.txt file:

key id val1 val2 val3
1   a  a1   a2   a3
2   a  a4   a5   a6
3   b  b1   b2   b3
4   b  b4   b5   b6
5   b  b7   b8   b9
6   c  c1   c2   c3

Code with required output: (screenshot, not preserved)

Input file content: (screenshot, not preserved)

1 Comment

This would be much easier to read if you copied and pasted your code here instead of posting a screenshot.
