How to create a sample dataframe in Scala / Spark

Question

I'm trying to create a simple DataFrame as follows:

import sqlContext.implicits._

val lookup = Array("one", "two", "three", "four", "five")

val theRow = Array("1",Array(1,2,3), Array(0.1,0.4,0.5))

val theRdd = sc.makeRDD(theRow)

case class X(id: String, indices: Array[Integer], weights: Array[Float] )

val df = theRdd.map{
    case Array(s0,s1,s2) =>    X(s0.asInstanceOf[String],s1.asInstanceOf[Array[Integer]],s2.asInstanceOf[Array[Float]])
}.toDF()

df.show()

df is defined as

df: org.apache.spark.sql.DataFrame = [id: string, indices: array<int>, weights: array<float>]

which is what I want.

Upon executing, I get

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 1 times, most recent failure: Lost task 1.0 in stage 13.0 (TID 50, localhost): scala.MatchError: 1 (of class java.lang.String)

Where is this MatchError coming from? And is there a simpler way to create sample DataFrames programmatically?

Vijay Anand Pandian · Accepted Answer · 2020-08-24 08:13:43Z

5

Here is another example you can refer to:

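// Assumes spark-shell (Spark 2.x or later), where `spark` and `sc` are predefined
import spark.implicits._ // brings toDF into scope
// Pre-2.0 entry point; not used by the code below
val sqlContext = new org.apache.spark.sql.SQLContext(sc)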

val columns=Array("id", "first", "last", "year")
val df1=sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "Ive", "Fish", 1990),
  (4, "John", "Wayne", 1995)
)).toDF(columns: _*)

val df2=sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "IveNew", "Fish", 1990),
  (3, "San", "Simon", 1974)
)).toDF(columns: _*)
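The same Seq-of-tuples pattern can be applied to the schema from the question. This is a minimal sketch, not part of the original answer: the local SparkSession setup, the app name, and the variable name `sampleDf` are assumptions for illustration (in spark-shell, `getOrCreate()` simply returns the existing session).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-df") // hypothetical app name
  .getOrCreate()
import spark.implicits._ // brings toDF into scope

// One tuple per row: (String, Array[Int], Array[Float]).
// The `f` suffix makes the literals Floats, so the column becomes array<float>.
val sampleDf = Seq(
  ("1", Array(1, 2, 3), Array(0.1f, 0.4f, 0.5f))
).toDF("id", "indices", "weights")

sampleDf.printSchema()
sampleDf.show()
```

No case class, casts, or pattern match are needed here, because the column types are derived from the tuple's static types.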


Christian Hirsch · Accepted Answer · 2016-02-13 20:21:06Z

4

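First, theRow should be a Row and not an Array. sc.makeRDD(theRow) treats each element of the array as a separate record, so the first record is the string "1", which does not match case Array(s0,s1,s2); that is where the MatchError comes from. Wrap the single Row in a collection so that the RDD contains one record. Second, the element types must be compatible between Java and Scala: Array(1,2,3) is an Array[Int], which cannot be cast to Array[Integer], and Array(0.1,0.4,0.5) is an Array[Double], not an Array[Float]. With those changes your example will work: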

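import org.apache.spark.sql.Row
import sqlContext.implicits._ // needed for toDF

// Boxed java.lang.Integer elements, so the cast to Array[Integer] below succeeds
val theRow = Row("1", Array[java.lang.Integer](1, 2, 3), Array[Double](0.1, 0.4, 0.5))
// Wrap the Row in an Array so the RDD holds one record, not one record per field
val theRdd = sc.makeRDD(Array(theRow))
case class X(id: String, indices: Array[Integer], weights: Array[Double])
val df = theRdd.map {
  case Row(s0, s1, s2) =>
    X(s0.asInstanceOf[String], s1.asInstanceOf[Array[Integer]], s2.asInstanceOf[Array[Double]])
}.toDF()
df.show()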

//+---+---------+---------------+
//| id|  indices|        weights|
//+---+---------+---------------+
//|  1|[1, 2, 3]|[0.1, 0.4, 0.5]|
//+---+---------+---------------+
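To see where the MatchError in the question comes from, the per-element behavior can be reproduced without Spark. The sketch below is an illustration only: a plain Scala array stands in for the RDD, and the name `attempt` is hypothetical.

```scala
import scala.util.Try

// Plain Scala stand-in for the RDD: each element of the outer array is one "record".
val theRow: Array[Any] = Array("1", Array(1, 2, 3), Array(0.1, 0.4, 0.5))

// Same pattern as in the question, applied record by record.
// Try captures the exception so it can be inspected instead of aborting.
val attempt = Try {
  theRow.map { case Array(s0, s1, s2) => (s0, s1, s2) }
}

// The first record is the String "1", which is not an Array, so the match fails.
println(attempt) // prints: Failure(scala.MatchError: 1 (of class java.lang.String))
```

The message is the same one Spark reported: the pattern was written for the whole row, but it was applied to a single field.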


1 Comment

shakedzy Over a year ago

Note that you need to import sqlContext.implicits._ in order to use toDF
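If you would rather not depend on the implicits import, a Row-based DataFrame can also be built with an explicit schema through createDataFrame. This is a sketch under assumptions, not something proposed in the answers: it uses the Spark 2.x+ SparkSession entry point, and the names `schema`, `rows`, and `rowDf` are made up for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("row-df") // hypothetical app name
  .getOrCreate()

// The schema is spelled out by hand, so no case class or toDF is involved.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = false),
  StructField("weights", ArrayType(FloatType, containsNull = false), nullable = false)
))

// Array columns are passed as Seq values inside the Row.
val rows = Seq(Row("1", Seq(1, 2, 3), Seq(0.1f, 0.4f, 0.5f)))

val rowDf = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
rowDf.show()
```

The trade-off is verbosity: the schema is written out explicitly, and type mismatches between the Row values and the schema only surface at runtime.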
