val schema = df.schema
val x = df.flatMap(r =>
  (0 until schema.length).map { idx =>
    ((idx, r.get(idx)), 1L)
  }
)

This produces the error

java.lang.ClassNotFoundException: scala.Any

I am not sure why. Any help?

  • Can you please try to rebuild your project? This seems to be more of an indexing issue in your editor Commented Dec 13, 2018 at 17:14
  • This is executed on databricks spark engine, there is no "rebuild" @ChaitanyaWaikar Commented Dec 13, 2018 at 17:16
  • 1
    The Row.get method returns a value of type Any since it doesn't know the type, but Any is not serializable and not a valid Spark structured type. You could use r.getString(idx) If you are expecting each record to be a String Commented Dec 13, 2018 at 17:23
  • I need each type to come as expected in the schema, is there no way to do that @TomLous? Commented Dec 13, 2018 at 17:39
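As the comments note, Row.get returns Any, for which Spark has no Encoder, hence the failure. One alternative to casting every column up front (a sketch, untested against a cluster; assumes spark.implicits._ is in scope) is to stringify each value at extraction time:

```scala
import spark.implicits._

// Convert each cell to a String as it is extracted, so the element type
// becomes ((Int, String), Long), which Spark can encode.
// String.valueOf also tolerates nulls, unlike calling .toString directly.
val x = df.flatMap { r =>
  (0 until r.length).map { idx =>
    ((idx, String.valueOf(r.get(idx))), 1L)
  }
}
// x: Dataset[((Int, String), Long)]
```

This keeps the original DataFrame's schema intact; only the extracted pairs are strings.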

1 Answer


One way is to cast all columns to String. Note that I'm changing r.get(idx) to r.getString(idx) in your code. The following works:

scala> val df = Seq(("ServiceCent4","AP-1-IOO-PPP","241.206.155.172","06-12-18:17:42:34",162,53,1544098354885L)).toDF("COL1","COL2","COL3","EventTime","COL4","COL5","COL6")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 5 more fields]

scala> df.show(1,false)
+------------+------------+---------------+-----------------+----+----+-------------+
|COL1        |COL2        |COL3           |EventTime        |COL4|COL5|COL6         |
+------------+------------+---------------+-----------------+----+----+-------------+
|ServiceCent4|AP-1-IOO-PPP|241.206.155.172|06-12-18:17:42:34|162 |53  |1544098354885|
+------------+------------+---------------+-----------------+----+----+-------------+
only showing top 1 row

scala> df.printSchema
root
 |-- COL1: string (nullable = true)
 |-- COL2: string (nullable = true)
 |-- COL3: string (nullable = true)
 |-- EventTime: string (nullable = true)
 |-- COL4: integer (nullable = false)
 |-- COL5: integer (nullable = false)
 |-- COL6: long (nullable = false)


scala> val schema = df.schema
schema: org.apache.spark.sql.types.StructType = StructType(StructField(COL1,StringType,true), StructField(COL2,StringType,true), StructField(COL3,StringType,true), StructField(EventTime,StringType,true), StructField(COL4,IntegerType,false), StructField(COL5,IntegerType,false), StructField(COL6,LongType,false))

scala> val df2 = df.columns.foldLeft(df){ (acc,r) => acc.withColumn(r,col(r).cast("string")) }
df2: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 5 more fields]

scala> df2.printSchema
root
 |-- COL1: string (nullable = true)
 |-- COL2: string (nullable = true)
 |-- COL3: string (nullable = true)
 |-- EventTime: string (nullable = true)
 |-- COL4: string (nullable = false)
 |-- COL5: string (nullable = false)
 |-- COL6: string (nullable = false)


scala> val x = df2.flatMap(r => (0 until schema.length).map { idx => ((idx, r.getString(idx)), 1L) } )
x: org.apache.spark.sql.Dataset[((Int, String), Long)] = [_1: struct<_1: int, _2: string>, _2: bigint]

scala> x.show(5,false)
+---------------------+---+
|_1                   |_2 |
+---------------------+---+
|[0,ServiceCent4]     |1  |
|[1,AP-1-IOO-PPP]     |1  |
|[2,241.206.155.172]  |1  |
|[3,06-12-18:17:42:34]|1  |
|[4,162]              |1  |
+---------------------+---+
only showing top 5 rows


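If the ((idx, value), 1L) pairs are meant to feed a per-column value count (an assumption based on the word-count shape of the data; not stated in the question), the resulting Dataset can be aggregated directly, for example:

```scala
// Sketch: count occurrences of each (columnIndex, value) pair,
// assuming `x` from above and spark.implicits._ in scope.
val counts = x.groupByKey { case (key, _) => key }.count()
// counts: Dataset[((Int, String), Long)]
```

Since every element carries a weight of 1L, counting group members is equivalent to summing the weights.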