I am new to Scala and am currently studying Datasets in Scala and Spark. From the input dataset below, I am trying to create a new dataset (also shown below). The new dataset should have a new column of type Seq[order_summary], where order_summary is a Scala trait. Each order_summary should hold the corresponding Name, Ticket Number, and Seat Number taken from the input dataset.

I have used input_dataset.groupBy("Name") to organise the dataset and have tried df.withColumn("NewColumn", struct(df("a"), df("b"))) to combine columns. However, I would like to use a Scala trait instead, and I am also stuck on matching each name to its ticket numbers. Would anyone know how to resolve this or point me in the right direction?

Input dataset: input_dataset

Name Type is String. Ticket Number Type is Int

+----+---------------+-------------+
|Name| Ticket Number | Seat Number |
+----+---------------+-------------+
|Adam|      123      |     AB      |
|Adam|      456      |     AC      |
|Adam|      789      |     AD      |
|Bob |     1234      |     BA      |
|Bob |     5678      |     BB      |
|Sam |      987      |     CA      |
|Sam |      654      |     CB      |
|Sam |      321      |     CC      |
|Sam |      876      |     CD      |
+----+---------------+-------------+

Output dataset

Name Type is String. Purchase Order Summary Type is Seq[order_summary], where order_summary is a trait

+----+-----------------------------------------------------+
|Name| Purchase Order Summary                              |
+----+-----------------------------------------------------+
|Adam|((Adam,123,AB),(Adam,456,AC),(Adam,789,AD))          | 
|Bob |((Bob,1234,BA),(Bob,5678,BB))                        |
|Sam |((Sam,987,CA),(Sam,654,CB),(Sam,321,CC),(Sam,876,CD))|
+----+-----------------------------------------------------+
  • A Scala trait is just an interface; it doesn't store data per se. You're looking for an implementation of that interface. Commented Feb 7, 2023 at 10:01
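
To illustrate the comment above, here is a minimal sketch in plain Scala (no Spark). The names OrderSummary, PurchaseOrder, and TraitExample are made up for this example. The trait only declares which fields exist; the case class that extends it is what actually holds the values. Note that Spark's built-in encoders are derived for case classes rather than for arbitrary traits, so a Dataset column would normally be typed with the case class.

```scala
// Hypothetical names throughout: nothing here comes from the question's code.

// The trait only describes the shape of an order; it holds no data itself.
trait OrderSummary {
  def name: String
  def ticketNumber: Int
  def seatNumber: String
}

// The case class is the concrete implementation that stores the values.
final case class PurchaseOrder(name: String, ticketNumber: Int, seatNumber: String)
    extends OrderSummary

object TraitExample {
  // A Seq typed by the trait, filled with case class instances.
  val orders: Seq[OrderSummary] = Seq(
    PurchaseOrder("Adam", 123, "AB"),
    PurchaseOrder("Adam", 456, "AC")
  )

  def main(args: Array[String]): Unit =
    orders.foreach(o => println(s"${o.name} ${o.ticketNumber} ${o.seatNumber}"))
}
```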

1 Answer

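Spark Datasets have a map method.

So you could create a case class whose field types match your columns (Ticket Number is an Int and Seat Number is a String such as "AB")

case class PurchaseOrderSummary(name: String, ticketNum: Int, seatNum: String)

and instantiate it inside a map over your DF, then collect the result into a list. The map needs an encoder in scope, which import spark.implicits._ provides.

df.map(row => PurchaseOrderSummary(row.getString(0), row.getInt(1), row.getString(2))).collectAsList

collectAsList brings the rows back to the driver as a java.util.List[PurchaseOrderSummary]; use collect().toList instead if you want a Scala List[PurchaseOrderSummary]. Note that this gives one flat list of all orders. It does not yet group them by name.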
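
To get one row per name, as in the desired output, a possible approach is to group by name and gather the rows with collect_list(struct(...)), then convert the result to a typed Dataset. This is only a sketch under assumptions: the column names (name, ticketNumber, seatNumber), the case classes OrderSummary and CustomerOrders, and the GroupOrders object are hypothetical, and the input is rebuilt in memory from the table in the question.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical case classes; field names must match the column names used below.
final case class OrderSummary(name: String, ticketNumber: Int, seatNumber: String)
final case class CustomerOrders(name: String, orders: Seq[OrderSummary])

object GroupOrders {
  // Groups rows by name and collects each group's rows into a Seq[OrderSummary].
  def summarise(spark: SparkSession, input: DataFrame): Dataset[CustomerOrders] = {
    import spark.implicits._ // encoders for the case classes
    input
      .groupBy(col("name"))
      .agg(
        collect_list(struct(col("name"), col("ticketNumber"), col("seatNumber")))
          .as("orders")
      )
      .as[CustomerOrders]
  }

  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("group-orders").getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's input table.
    val input = Seq(
      ("Adam", 123, "AB"), ("Adam", 456, "AC"), ("Adam", 789, "AD"),
      ("Bob", 1234, "BA"), ("Bob", 5678, "BB"),
      ("Sam", 987, "CA"), ("Sam", 654, "CB"), ("Sam", 321, "CC"), ("Sam", 876, "CD")
    ).toDF("name", "ticketNumber", "seatNumber")

    summarise(spark, input).show(false)
    spark.stop()
  }
}
```

collect_list does not guarantee the order of elements within each group, so sort the Seq afterwards if the order matters. A fully typed alternative would be groupByKey followed by mapGroups on a Dataset[OrderSummary].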
