How to create Scala trait which stores data from other columns in dataset and then create new dataset with column storing the trait in Scala?

Question

I am new to Scala and am currently studying Datasets in Scala and Spark. Based on my input dataset below, I am trying to create a new dataset (see below). In the new dataset, I aim to have a new column that contains a Scala trait, Seq[order_summary]. The trait should store the corresponding Name, Ticket Number, and Seat Number taken from the input dataset.

I have used input_dataset.groupBy("Name") to organise the dataset and have tried df.withColumn("NewColumn", struct(df("a"), df("b"))) to combine different columns. However, I would like to use a Scala trait instead, and I am also stuck on matching each name to its ticket numbers. Would anyone know how to resolve this or point me in the right direction?

Input dataset: input_dataset

Name Type is String. Ticket Number Type is Int

+----+---------------+-------------+
|Name| Ticket Number | Seat Number |
+----+---------------+-------------+
|Adam|      123      |     AB      |
|Adam|      456      |     AC      |
|Adam|      789      |     AD      |
|Bob |     1234      |     BA      |
|Bob |     5678      |     BB      |
|Sam |      987      |     CA      |
|Sam |      654      |     CB      |
|Sam |      321      |     CC      |
|Sam |      876      |     CD      |
+----+---------------+-------------+

Output dataset

Name Type is String. Purchase Order Summary is a trait, Seq[order_summary]

+----+-----------------------------------------------------+
|Name| Purchase Order Summary                              |
+----+-----------------------------------------------------+
|Adam|((Adam,123,AB),(Adam,456,AC),(Adam,789,AD))          | 
|Bob |((Bob,1234,BA),(Bob,5678,BB))                        |
|Sam |((Sam,987,CA),(Sam,654,CB),(Sam,321,CC),(Sam,876,CD))|
+----+-----------------------------------------------------+

A Scala trait is just an interface; it doesn't store data per se. You're looking for an implementation of the interface.
– Dasph, Commented Feb 7, 2023 at 10:01
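
To illustrate the comment above, here is a minimal sketch in plain Scala collections (no Spark involved). The names OrderSummary and TicketOrder are hypothetical and not from the question; the snippet is meant for the REPL or a Scala 3 source file. The trait only declares the fields, and a case class is the implementation that actually holds the data.

```scala
// Hypothetical names, chosen for illustration only.
// The trait declares what an order summary exposes; it holds no data itself.
sealed trait OrderSummary {
  def name: String
  def ticketNum: Int
  def seatNum: String
}

// The case class is the concrete implementation that stores the values.
final case class TicketOrder(name: String, ticketNum: Int, seatNum: String)
    extends OrderSummary

// A few rows from the question's input table.
val orders: Seq[OrderSummary] = Seq(
  TicketOrder("Adam", 123, "AB"),
  TicketOrder("Adam", 456, "AC"),
  TicketOrder("Bob", 1234, "BA")
)

// Group the rows by name, mirroring the shape of the desired output.
val byName: Map[String, Seq[OrderSummary]] = orders.groupBy(_.name)
```

Here byName("Adam") holds both of Adam's orders as a Seq[OrderSummary], which is the per-name structure the question asks for.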

Dasph · Accepted Answer · 2023-02-07 10:07:20Z


Spark's Dataset API has a map method.

So you could just create a case class

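case class PurchaseOrderSummary(name: String, ticketNum: Int, seatNum: String)

and instantiate it inside a map over your DataFrame, then collect the result into a list. The map needs an implicit Encoder in scope, which import spark.implicits._ provides.

// Column order: Name (String), Ticket Number (Int), Seat Number (String)
df.map(row => PurchaseOrderSummary(row.getString(0), row.getInt(1), row.getString(2))).collectAsList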

collectAsList retrieves the data from the Dataset to the driver and returns it as a java.util.List[PurchaseOrderSummary]. Use collect instead if you want a Scala Array[PurchaseOrderSummary].
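
If the goal is the grouped output dataset from the question rather than a flat list on the driver, one possible sketch is to combine groupBy with collect_list over a struct and then convert the result to a typed Dataset. Everything named here is an assumption for illustration: the case classes Ticket and PurchaseOrders, the column names name, ticketNum and seatNum (the question's columns contain spaces and would need renaming first), and the local SparkSession settings.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical case classes; field names must match the column names used below.
case class Ticket(name: String, ticketNum: Int, seatNum: String)
case class PurchaseOrders(name: String, orders: Seq[Ticket])

object GroupTickets {
  def main(args: Array[String]): Unit = {
    // Assumed local session, for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-tickets")
      .getOrCreate()
    import spark.implicits._

    // A few rows from the question's input table.
    val input: Dataset[Ticket] = Seq(
      Ticket("Adam", 123, "AB"),
      Ticket("Adam", 456, "AC"),
      Ticket("Bob", 1234, "BA"),
      Ticket("Sam", 987, "CA")
    ).toDS()

    // Pack each row into a struct, gather the structs per name,
    // then view the result as a typed Dataset[PurchaseOrders].
    val grouped: Dataset[PurchaseOrders] = input
      .groupBy(col("name"))
      .agg(
        collect_list(struct(col("name"), col("ticketNum"), col("seatNum")))
          .as("orders")
      )
      .as[PurchaseOrders]

    grouped.show(false)
    spark.stop()
  }
}
```

Note that collect_list does not guarantee the order of elements within each group, so the orders for a given name may not appear in input order.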

