I have a dataset containing data like the following:

| c1 | c2 |
|----|----|
| 1  | a  |
| 1  | b  |
| 1  | c  |
| 2  | a  |
| 2  | b  |

...

Now, I want to get the data grouped like the following (col1: String Key, col2: List):

| c1 | c2    |
|----|-------|
| 1  | a,b,c |
| 2  | a,b   |
...

I thought that groupByKey would be a sufficient solution, but I can't find any example of how to use it.

Can anyone help me find a solution, using groupByKey or any other combination of transformations and actions, that produces this output with Datasets rather than RDDs?

3 Answers

Here is a Java example for Spark 2.0 using a Dataset.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

import scala.Tuple2;

public class SparkSample {
    public static void main(String[] args) {
        // SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local")
                .getOrCreate();

        // input data
        List<Tuple2<Integer, String>> inputList = new ArrayList<Tuple2<Integer, String>>();
        inputList.add(new Tuple2<Integer, String>(1, "a"));
        inputList.add(new Tuple2<Integer, String>(1, "b"));
        inputList.add(new Tuple2<Integer, String>(1, "c"));
        inputList.add(new Tuple2<Integer, String>(2, "a"));
        inputList.add(new Tuple2<Integer, String>(2, "b"));

        // dataset with named columns c1 and c2
        Dataset<Row> dataSet = spark.createDataset(inputList, Encoders.tuple(Encoders.INT(), Encoders.STRING())).toDF("c1", "c2");
        dataSet.show();

        // group by c1 and collect the c2 values of each group into a list
        Dataset<Row> dataSet1 = dataSet.groupBy("c1").agg(functions.collect_list("c2")).toDF("c1", "c2");
        dataSet1.show();

        // stop
        spark.stop();
    }
}
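
Since the question asks specifically about groupByKey, here is a minimal sketch of the typed alternative, assuming the Spark 2.x Java API (Dataset.groupByKey followed by KeyValueGroupedDataset.mapGroups). The class name GroupByKeySample and the helper groupValues are hypothetical names chosen for illustration. The values of each group are joined into a comma-separated string to mirror the output in the question; the order of values within a group is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.KeyValueGroupedDataset;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

// Hypothetical example class, not part of the original answer.
public class GroupByKeySample {

    // Groups (c1, c2) pairs by c1 and joins the c2 values of each group with commas.
    static Dataset<Tuple2<Integer, String>> groupValues(Dataset<Tuple2<Integer, String>> pairs) {
        Encoder<Tuple2<Integer, String>> pairEncoder =
                Encoders.tuple(Encoders.INT(), Encoders.STRING());

        // The explicit casts select the Java-friendly overloads of groupByKey and mapGroups.
        KeyValueGroupedDataset<Integer, Tuple2<Integer, String>> grouped = pairs.groupByKey(
                (MapFunction<Tuple2<Integer, String>, Integer>) pair -> pair._1(),
                Encoders.INT());

        return grouped.mapGroups(
                (MapGroupsFunction<Integer, Tuple2<Integer, String>, Tuple2<Integer, String>>)
                        (key, rows) -> {
                            // Collect the c2 value of every row in this group.
                            List<String> values = new ArrayList<>();
                            while (rows.hasNext()) {
                                values.add(rows.next()._2());
                            }
                            return new Tuple2<>(key, String.join(",", values));
                        },
                pairEncoder);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GroupByKeySample")
                .master("local")
                .getOrCreate();

        List<Tuple2<Integer, String>> input = Arrays.asList(
                new Tuple2<>(1, "a"),
                new Tuple2<>(1, "b"),
                new Tuple2<>(1, "c"),
                new Tuple2<>(2, "a"),
                new Tuple2<>(2, "b"));

        Dataset<Tuple2<Integer, String>> pairs =
                spark.createDataset(input, Encoders.tuple(Encoders.INT(), Encoders.STRING()));

        groupValues(pairs).toDF("c1", "c2").show();

        spark.stop();
    }
}
```

For a plain list aggregation like this one, groupBy with collect_list as shown above is usually the simpler choice; groupByKey with mapGroups is mainly worth it when each group needs custom per-group logic.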

With a DataFrame in Spark 2.0:

scala> val data = List((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")
data: org.apache.spark.sql.DataFrame = [c1: int, c2: string]
scala> data.groupBy("c1").agg(collect_list("c2")).collect.foreach(println)
[1,WrappedArray(a, b, c)]
[2,WrappedArray(a, b)]

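Assuming the table has already been read into the dataset variable and org.apache.spark.sql.functions is imported, group by c1 and collect the c2 values into a list:

Dataset<Row> datasetNew = dataset.groupBy("c1").agg(functions.collect_list("c2"));
datasetNew.show();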
