I have two arrays:

val customers = Array("Alice", "Bob", "Mike","Charly")
val customersLen = customers.length 
val items = Array("milk", "bread", "butter", "apples", "oranges")
val itemsLen = items.length 
val size = (customersLen*itemsLen)-1

I can build the Cartesian product of these two arrays as a collection like this:

val cartesianProd = for (i <- 0 to size) yield (customers(i % customersLen), items(i % itemsLen))

The output would be:

cartesianProd: scala.collection.immutable.IndexedSeq[(String, String)] = Vector((Alice,milk), (Bob,bread), (Mike,butter), (Charly,apples), (Alice,oranges), (Bob,milk), (Mike,bread), (Charly,butter), (Alice,apples), (Bob,oranges), (Mike,milk), (Charly,bread), (Alice,butter), (Bob,apples), (Mike,oranges), (Charly,milk), (Alice,bread), (Bob,butter), (Mike,apples), (Charly,oranges))
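As an aside, the modulo indexing above appears to cover every pair only because the two lengths (4 and 5) share no common factor. A minimal sketch of a more general alternative, using nested generators in a for-comprehension, is shown below. The names nestedProd, moduloProd, sixItems and moduloSix are hypothetical and introduced only for illustration; the snippet is meant to be pasted into the Scala REPL or spark-shell.

```scala
val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")

// Nested generators visit every (customer, item) combination exactly once,
// whatever the array lengths are.
val nestedProd = for {
  customer <- customers
  item     <- items
} yield (customer, item)

// The modulo approach from the question, for comparison.
val moduloProd =
  for (i <- 0 until customers.length * items.length)
    yield (customers(i % customers.length), items(i % items.length))

println(nestedProd.length)                  // 20
println(moduloProd.toSet == nestedProd.toSet) // true (4 and 5 are coprime)

// With 6 items the lengths share a factor of 2, so the modulo approach
// repeats pairs: only 12 of the 24 combinations are produced.
val sixItems = items :+ "eggs"
val moduloSix =
  for (i <- 0 until customers.length * sixItems.length)
    yield (customers(i % customers.length), sixItems(i % sixItems.length))

println(moduloSix.distinct.length)          // 12
```

The nested form yields an Array[(String, String)] here; calling .toSeq on it gives a sequence of tuples.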

Now I would like to generate a DataFrame from this collection. Reusing the previous logic, I wrote:

val dfCustItem = Seq(for (i <- 0 to size) yield (customers(i % customersLen), items(i % itemsLen))).toDF("customer", "item")

But I get the following error:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): value
New column names (2): name, item
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.Dataset.toDF(Dataset.scala:397)
  at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:44)
  ... 48 elided

As I understand it, this is because yield returns each (customer, item) pair in a single column named value, whereas toDF expects two independent columns (I am not sure whether the names of these columns are relevant).

How can I solve this issue, that is, yield the output of the loop as two independent columns?

  • Why not convert each array into a DataFrame and then take the Cartesian product of those DataFrames? Commented Jun 21, 2018 at 1:46
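A rough sketch of what this comment suggests, assuming Spark 2.1 or later (where Dataset.crossJoin is available) and a local SparkSession. The names dfCustomers, dfItems and dfCross, as well as the app name, are hypothetical; in spark-shell the existing spark session would be reused by getOrCreate.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cross-join-sketch")
  .getOrCreate()
import spark.implicits._

val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")

// toDF is provided for Seq, so each Array is converted with toSeq first.
val dfCustomers = customers.toSeq.toDF("customer")
val dfItems = items.toSeq.toDF("item")

// crossJoin pairs every customer row with every item row.
val dfCross = dfCustomers.crossJoin(dfItems)
dfCross.show()
```

This lets Spark compute the product itself instead of building all pairs on the driver first, which may matter if the inputs grow.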

1 Answer

You have an extra Seq, so just removing it should work:

val dfCustItem = (for (i <- 0 to size) yield (customers(i % customersLen), items(i % itemsLen))).toDF("customer", "item")

Explanation:

As you can see, for (i <- 0 to size) yield (customers(i % customersLen), items(i % itemsLen)) already has type scala.collection.immutable.IndexedSeq[(String, String)]. Wrapping it in Seq produces Seq[scala.collection.immutable.IndexedSeq[(String, String)]], a sequence with a single element, which is why Spark reports one column (value) instead of two. What toDF needs is a Seq whose elements are the Tuple2 values themselves.
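The type difference can be checked without Spark. The following is a small sketch in plain Scala; the names pairs and wrapped are hypothetical and used only for illustration.

```scala
val customers = Array("Alice", "Bob", "Mike", "Charly")
val items = Array("milk", "bread", "butter", "apples", "oranges")
val size = customers.length * items.length - 1

// The for/yield over a Range already produces a sequence of tuples.
val pairs: IndexedSeq[(String, String)] =
  for (i <- 0 to size) yield (customers(i % customers.length), items(i % items.length))

// Wrapping it in Seq(...) creates a sequence holding ONE element:
// the whole collection of pairs.
val wrapped: Seq[IndexedSeq[(String, String)]] = Seq(pairs)

println(pairs.length)           // 20
println(wrapped.length)         // 1
println(wrapped.flatten.length) // 20
```

If the outer Seq cannot be avoided for some reason, flatten recovers the sequence of tuples, but dropping the wrapper is simpler.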
