Spark - Creating a Dataframe from the cartesian product of two arrays in Scala

Question

I have two arrays:

val customers = Array("Alice", "Bob", "Mike","Charly")
val customersLen = customers.length 
val items = Array("milk", "bread", "butter", "apples", "oranges")
val itemsLen = items.length 
val size = (customersLen*itemsLen)-1

I can build the cartesian product of these two arrays like this:

val cartesianProd= for(i <- 0 to size) yield (customers(i % customersLen ),items(i % itemsLen ))

The output would be:

cartesianProd: scala.collection.immutable.IndexedSeq[(String, String)] = Vector((Alice,milk), (Bob,bread), (Mike,butter), (Charly,apples), (Alice,oranges), (Bob,milk), (Mike,bread), (Charly,butter), (Alice,apples), (Bob,oranges), (Mike,milk), (Charly,bread), (Alice,butter), (Bob,apples), (Mike,oranges), (Charly,milk), (Alice,bread), (Bob,butter), (Mike,apples), (Charly,oranges))
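One caveat about the index-modulo construction above: it covers every pair here only because the two lengths (4 and 5) share no common factor. Below is a minimal sketch, assuming a plain Scala REPL or spark-shell session, of a nested for comprehension that pairs every customer with every item whatever the lengths are. The names allPairs, moduloPairs, xs, ys and repeated are hypothetical, introduced only for illustration.

```scala
val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")

// Nested generators: each customer is combined with each item,
// independent of the array lengths.
val allPairs: Array[(String, String)] =
  for (c <- customers; i <- items) yield (c, i)

// The index-modulo construction from the question, generalised to any two
// arrays (hypothetical helper, for comparison only).
def moduloPairs[A, B](xs: Array[A], ys: Array[B]): Seq[(A, B)] =
  for (i <- 0 until xs.length * ys.length)
    yield (xs(i % xs.length), ys(i % ys.length))

// With lengths 2 and 4 (common factor 2) the modulo version repeats pairs:
// it yields 8 tuples, of which only 4 are distinct.
val repeated = moduloPairs(Array("a", "b"), Array(1, 2, 3, 4))
```

The nested version lists the pairs customer by customer rather than in the interleaved order shown above, which should not matter if the goal is only to have every combination as a row.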

Now I would like to generate a DataFrame from this array. Reusing the previous logic, I wrote:

val dfCustItem = Seq(for(i <- 0 to size ) yield(customers (i % customersLen ),items(i % itemsLen ))).toDF("customer","item")

But I get the following error:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): value
New column names (2): name, item
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.Dataset.toDF(Dataset.scala:397)
  at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:44)
  ... 48 elided

As I understand it, this happens because yield returns each pair (customer, item) in a single column named value, while toDF expects two independent columns (I am not sure whether the names of these columns are relevant).

How can I solve this issue, that is, make the loop yield its output as two independent columns?

Why not convert each array into a DataFrame and then do a cartesian product of those DataFrames? – Constantine, commented Jun 21, 2018 at 1:46
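The approach suggested in this comment could look roughly like the following. This is a sketch, assuming Spark 2.1 or later (where Dataset.crossJoin is available) and a local SparkSession. In spark-shell the session and its implicits already exist, so the builder lines are only there to make the snippet self-contained. The names dfCustomers, dfItems and dfCross, and the app name, are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cartesian-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")

// One single-column DataFrame per array.
val dfCustomers = customers.toSeq.toDF("customer")
val dfItems = items.toSeq.toDF("item")

// crossJoin pairs every row of the left side with every row of the right side.
val dfCross = dfCustomers.crossJoin(dfItems)
dfCross.show()
```

This lets Spark compute the product itself instead of building it on the driver first, which may be preferable when the inputs are already DataFrames or are large.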

Ignacio Alorre · Accepted Answer · 2018-06-20 11:25:57Z

You have an extra Seq; removing it should work:

val dfCustItem = (for(i <- 0 to size ) yield(customers (i % customersLen ),items(i % itemsLen ))).toDF("customer","item")

Explanation:

The expression for(i <- 0 to size) yield (customers(i % customersLen ),items(i % itemsLen )) is already a scala.collection.immutable.IndexedSeq[(String, String)]. Wrapping it in Seq produces a Seq[scala.collection.immutable.IndexedSeq[(String, String)]], that is, a sequence with a single element, whereas toDF with two column names needs a sequence whose elements are Tuple2 values.
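The difference between the two shapes can be checked directly. This is a sketch, assuming a local SparkSession (spark-shell already provides spark and its implicits). The names pairs, wrapped, dfWrapped and dfFixed, and the app name, are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("todf-shape-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")
val size = customers.length * items.length - 1

// IndexedSeq[(String, String)]: one tuple per index.
val pairs = for (i <- 0 to size)
  yield (customers(i % customers.length), items(i % items.length))

// Wrapping in Seq gives a collection with ONE element (the whole vector),
// so Spark sees a single row with a single column.
val wrapped = Seq(pairs)
val dfWrapped = wrapped.toDF() // no names passed, so no column-count check fails
dfWrapped.printSchema()

// Without the wrapper, each tuple becomes a row and each tuple field a column.
val dfFixed = pairs.toDF("customer", "item")
dfFixed.printSchema()
```

Comparing the two printed schemas should make it clear why passing two column names to the wrapped version fails the column-count requirement.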

edited Jun 20, 2018 at 11:25 by Ignacio Alorre

answered Jun 20, 2018 at 11:19 by Anahcolus
