
I have an array called arraylist, which looks like this:

arraylist: Array[(String, Any)] = Array((id,772914), (x4,2), (x5,24), (x6,1), (x7,77491.25), (x8,17911.77778), (x9,225711), (x10,17), (x12,6), (x14,5), (x16,5), (x18,5.0), (x19,8.0), (x20,7959.0), (x21,676.0), (x22,228.5068871), (x23,195.0), (x24,109.6015511), (x25,965.0), (x26,1017.79043), (x27,2.0), (Target,1), (x29,13), (x30,735255.5), (x31,332998.432), (x32,38168.75), (x33,107957.5278), (x34,13), (x35,13), (x36,13), (x37,13), (x38,13), (x39,13), (x40,13), (x41,7), (x42,13), (x43,13), (x44,13), (x45,13), (x46,13), (x47,13), (x48,13), (x49,14.0), (x50,2.588435821), (x51,617127.5), (x52,414663.9738), (x53,39900.0), (x54,16743.15781), (x55,105000.0), (x56,52842.29076), (x57,25750.46154), (x58,8532.045819), (x64,13), (x66,13), (x67,13), (x68,13), (x69,13), (x70,13), (x71,13), (x73,13), (...

I want to convert it to a DataFrame with two columns, "ID" and "value". For this, the code I am using is:

val df = sc.parallelize(arraylist).toDF("Names","Values")

However, I am getting an error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

How can I overcome this problem?

2 Answers


The message tells you everything :) Any is not supported as a column type in a DataFrame. The Any type can be caused by nulls in the second element of a tuple.

Change the type of arraylist to a concrete one such as Array[(String, Double)] (if you can do it manually; if the type is inferred by Scala, check for nulls and invalid values in the second element), or create the schema manually:

import org.apache.spark.sql.types._
import org.apache.spark.sql._

val arraylist: Array[(String, Any)] = Array(("id", 772914), ("x4", 2.0), ("x5", 24.0))

// Explicit schema: a string column and a double column, both non-nullable
val schema = StructType(
    StructField("Names", StringType, false) ::
    StructField("Values", DoubleType, false) :: Nil)
// Convert each tuple to a Row, turning every numeric value into a Double
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))

val df = sqlContext.createDataFrame(rdd, schema)

df.show()

Note: createDataFrame requires an RDD[Row], so I'm converting the RDD of tuples to an RDD of Rows.
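The suggestion to check for nulls and invalid values in the second element can be sketched in plain Scala, before Spark is involved at all. This is only an illustration: the sample array and the names sample, cleaned and rejected are hypothetical and not part of the question's data.

```scala
// Hypothetical sample; the null and the string are deliberately invalid entries.
val sample: Array[(String, Any)] =
  Array(("id", 772914), ("x4", 2.0), ("x5", null), ("x6", "n/a"))

// Keep only entries whose value is a java.lang.Number, converted to Double.
// A null never matches a type pattern, so it is skipped here.
val cleaned: Array[(String, Double)] = sample.collect {
  case (name, n: Number) => (name, n.doubleValue())
}

// Names of the entries that would have broken the asInstanceOf[Number] cast.
val rejected: Array[String] = sample.collect {
  case (name, v) if !v.isInstanceOf[Number] => name
}

println(cleaned.mkString(", "))
println(rejected.mkString(", "))
```

Inspecting rejected first shows which keys need fixing; a direct asInstanceOf[Number] cast would instead fail on those entries at runtime.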


2 Comments

Finally after hours of head banging. Will be eternally grateful :)
@RajarshiBhadra Mapping to RDD[Row] is tricky; I also forgot about it at first ;) Only after a while did I check my code and see the difference.

The problem (as stated) is that Any is not a legal type for a DataFrame column. In general, legal types are primitive types (byte, int, boolean, string, double, etc.), structs of legal types, arrays of legal types, and maps of legal types.

In your case, it seems you used both integers and doubles as the second value of the tuples. If you use only doubles instead, it should work properly.

You can do this in two ways:
1. Make sure the original array has only doubles (e.g. by adding .0 to the end of each integer when you create it, or by doing a cast).
2. Enforce the schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// StructType is immutable: add returns a new StructType, so chain the calls
val schema = new StructType()
  .add(StructField("names", StringType))
  .add(StructField("values", DoubleType))
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
val df = spark.createDataFrame(rdd, schema)
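The first option (making every value a Double through a cast) could look roughly like the sketch below. It is plain Scala and assumes every value really is numeric; the names mixed and doubles and the sample values are hypothetical.

```scala
// Hypothetical sample mixing Int and Double values, typed as in the question.
val mixed: Array[(String, Any)] =
  Array(("id", 772914), ("x4", 2), ("x7", 77491.25))

// Cast each value through java.lang.Number so Int and Double both become Double.
// This throws at runtime if a value is null or not numeric.
val doubles: Array[(String, Double)] =
  mixed.map { case (name, value) => (name, value.asInstanceOf[Number].doubleValue()) }

println(doubles.mkString(", "))
```

Because doubles is an Array[(String, Double)], the question's original sc.parallelize(...).toDF("Names","Values") call should then be able to derive a schema, provided the session's implicits are imported.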

6 Comments

I am getting this error import org.apache.spark.sql.types._ <console>:39: error: overloaded method value apply with alternatives: (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and> (fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and> (fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType cannot be applied to () val schema = StructType() ^
@RajarshiBhadra Second argument of StructField should not have () - see my answer
This assumes you are running spark-shell on Spark 2.0.0 or higher. If you are running an older version, replace spark with sqlContext.
With this I got this error: (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame cannot be applied to (org.apache.spark.rdd.RDD[(String, Any)], org.apache.spark.sql.types.StructType) val df = sqlContext.createDataFrame(rdd,schema)
@RajarshiBhadra I've posted full, working, tested code