
I have a CSV stored in an S3 location which has data like this:

| column1 | column2 |
|---------+---------|
| adsf    | 2000.0  |
| fff     | 232.34  |

I have an AWS Glue job in Scala which reads this file into a DataFrame:

var srcDF = glueContext.getCatalogSource(database = "",
                                         tableName = "",
                                         redshiftTmpDir = "",
                                         transformationContext = "").getDynamicFrame().toDF()

When I print the schema, it is inferred like this:

srcDF.printSchema()

 |-- column1: string
 |-- column2: struct (double, string)

And the dataframe looks like this:

| column1 | column2    |
|---------+------------|
| adsf    | [2000.0,]  |
| fff     | [232.34,]  |

When I try to save the dataframe to CSV, it complains:

org.apache.spark.sql.AnalysisException: CSV data source does not support struct<double:double,string:string> data type.

How do I convert the dataframe so that only the columns of struct type (if any exist) are converted to decimal type? The output should look like this:

| column1 | column2 |
|---------+---------|
| adsf    | 2000.0  |
| fff     | 232.34  |

Edit:

Thanks for the responses. I have tried the following code:

df.select($"column2._1".alias("column2")).show()

But I got the same error for both approaches:

org.apache.spark.sql.AnalysisException: No such struct field _1 in double, string;

Edit 2:

It seems that Spark flattened the struct and named its fields "double" and "string".

So, this solution worked for me:

df.select($"column2.double").show()
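For the general case asked above (convert only the struct columns, if any exist, to decimal), one approach is to walk the schema and rewrite struct columns dynamically. This is a hedged sketch, not tested against your catalog table: the field name `double` matches the schema Spark inferred here, and `DecimalType(18, 2)` is an arbitrary precision/scale you would adjust to your data.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, StructType}

// For every struct column, select its first field and cast it to decimal;
// pass every other column through unchanged.
val flattened = srcDF.schema.fields.map { f =>
  f.dataType match {
    case s: StructType =>
      // f.name is e.g. "column2"; s.fields.head.name is e.g. "double"
      col(s"${f.name}.${s.fields.head.name}").cast(DecimalType(18, 2)).alias(f.name)
    case _ =>
      col(f.name)
  }
}

val outDF = srcDF.select(flattened: _*)
```

After this, `outDF` contains only primitive columns and can be written to CSV without the `AnalysisException`.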

2 Answers


You can extract fields from a struct using getItem. The code can be something like this:

import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(
  ("adsf", (2000.0, "")),
  ("fff", (232.34, ""))
).toDF("A", "B")
df.show()
df.select(col("A"), col("B").getItem("_1").as("B")).show()

It will print:

before select:
+----+----------+
|   A|         B|
+----+----------+
|adsf|[2000.0, ]|
| fff|[232.34, ]|
+----+----------+

after select:
+----+------+
|   A|     B|
+----+------+
|adsf|2000.0|
| fff|232.34|
+----+------+



You can also use the dot notation column2._1 to get the struct field by name:

val df = Seq(
  ("adsf", (2000.0,"")),
  ("fff", (232.34,""))
).toDF("column1", "column2")

df.show
+-------+----------+
|column1|   column2|
+-------+----------+
|   adsf|[2000.0, ]|
|    fff|[232.34, ]|
+-------+----------+

val df2 = df.select($"column1", $"column2._1".alias("column2"))

df2.show
+-------+-------+
|column1|column2|
+-------+-------+
|   adsf| 2000.0|
|    fff| 232.34|
+-------+-------+

df2.coalesce(1).write.option("header", "true").csv("output")

and your CSV file will be in the output/ folder:

column1,column2
adsf,2000.0
fff,232.34
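Since the question asked for a decimal type specifically (rather than keeping the field's double type), you could also cast while selecting. A small sketch: `DecimalType(10, 2)` is an arbitrary precision/scale chosen here for illustration; pick values that fit your data.

```scala
import org.apache.spark.sql.types.DecimalType

// Extract the struct field by name and cast it to decimal in one step.
val df3 = df.select(
  $"column1",
  $"column2._1".cast(DecimalType(10, 2)).alias("column2")
)

df3.coalesce(1).write.option("header", "true").csv("output")
```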

