I have two arrays:
val customers = Array("Alice", "Bob", "Mike", "Charly")
val customersLen = customers.length
val items = Array("milk", "bread", "butter", "apples", "oranges")
val itemsLen = items.length
val size = (customersLen*itemsLen)-1
I can build the Cartesian product of these two arrays as a sequence of pairs like this:
val cartesianProd = for (i <- 0 to size) yield (customers(i % customersLen), items(i % itemsLen))
The output would be:
cartesianProd: scala.collection.immutable.IndexedSeq[(String, String)] = Vector((Alice,milk), (Bob,bread), (Mike,butter), (Charly,apples), (Alice,oranges), (Bob,milk), (Mike,bread), (Charly,butter), (Alice,apples), (Bob,oranges), (Mike,milk), (Charly,bread), (Alice,butter), (Bob,apples), (Mike,oranges), (Charly,milk), (Alice,bread), (Bob,butter), (Mike,apples), (Charly,oranges))
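A side note on the index trick: it covers all 20 pairs only because the two lengths (4 and 5) share no common factor. With lengths such as 4 and 6, some pairs would repeat and others would never appear. A minimal sketch of a length-independent alternative, using nested generators in a for-comprehension (the name `pairs` is just illustrative):

```scala
val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")

// Nested generators visit every (customer, item) combination exactly once,
// regardless of the lengths of the two arrays.
val pairs: Array[(String, String)] = for {
  customer <- customers
  item     <- items
} yield (customer, item)
```

The pairs come out in a different order than with the modulo version (all items for Alice first, then Bob, and so on), which should not matter if only the set of combinations is needed.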
Now I would like to generate a DataFrame from this sequence, reusing the previous logic, so I wrote:
val dfCustItem = Seq(for (i <- 0 to size) yield (customers(i % customersLen), items(i % itemsLen))).toDF("customer", "item")
But I get the following error:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): value
New column names (2): name, item
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.Dataset.toDF(Dataset.scala:397)
  at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:44)
  ... 48 elided
As I understand it, this is because yield returns each (customer, item) pair in a single column named value, whereas toDF expects two separate columns (I am not sure whether the names of these columns matter).
How can I solve this issue, that is, get the output of the loop into two separate columns?
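To narrow down where the single `value` column might come from, the shape of the expression passed to `toDF` can be inspected in plain Scala, without Spark. This is only a diagnostic sketch; the names `pairs` and `wrapped` are illustrative and not part of the code above:

```scala
val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")
val size = customers.length * items.length - 1

// The for/yield on its own already produces a sequence of 20 tuples.
val pairs = for (i <- 0 to size)
  yield (customers(i % customers.length), items(i % items.length))

// Wrapping it in Seq(...) produces a sequence with ONE element,
// and that element is the whole collection of pairs.
val wrapped = Seq(pairs)

println(pairs.length)   // number of tuples
println(wrapped.length) // number of elements after wrapping

// Assumption, not run here: with a SparkSession in scope and
// `import spark.implicits._`, calling toDF on the unwrapped sequence,
//   pairs.toDF("customer", "item")
// would map each Tuple2 to two columns.
```

If that reading is right, the extra `Seq(...)` wrapper, rather than `yield` itself, is what collapses everything into one column.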