
I have a CSV stored in an S3 location which has data like this:

| column1 | column2 |
|---------+---------|
| adsf    | 2000.0  |
| fff     | 232.34  |

I have an AWS Glue job in Scala which reads this file into a DataFrame:

var srcDF = glueContext.getCatalogSource(database = "",
                                         tableName = "",
                                         redshiftTmpDir = "",
                                         transformationContext = "").getDynamicFrame().toDF()

When I print the schema, it is inferred like this:

srcDF.printSchema()

 |-- column1: string
 |-- column2: struct (double, string)

And the dataframe looks like this:

| column1 | column2    |
|---------+------------|
| adsf    | [2000.0,]  |
| fff     | [232.34,]  |

When I try to save the dataframe to CSV, it complains:

org.apache.spark.sql.AnalysisException: CSV data source does not support struct<double:double,string:string> data type.

How do I convert the dataframe so that only the columns of struct type (if any exist) are converted to decimal type? The output should look like this:

| column1 | column2 |
|---------+---------|
| adsf    | 2000.0  |
| fff     | 232.34  |

Edit:

Thanks for the responses. I have tried the following code:

df.select($"column2._1".alias("column2")).show()

But I got the same error for both approaches:

org.apache.spark.sql.AnalysisException: No such struct field _1 in double, string;

Edit 2:

It seems that Spark flattened the struct and named its fields "double" and "string".

So, this solution worked for me:

df.select($"column2.double").show()
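For the general case asked above (convert only the struct columns, if any exist, to decimal), one approach is to walk the schema and rewrite struct columns dynamically. This is a hedged sketch, not tested against your catalog table: the field name `double` matches the schema Spark inferred here, and `DecimalType(18, 2)` is an arbitrary precision/scale you would adjust to your data.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, StructType}

// For every struct column, select its first field and cast it to decimal;
// pass every other column through unchanged.
val flattened = srcDF.schema.fields.map { f =>
  f.dataType match {
    case s: StructType =>
      // f.name is e.g. "column2"; s.fields.head.name is e.g. "double"
      col(s"${f.name}.${s.fields.head.name}").cast(DecimalType(18, 2)).alias(f.name)
    case _ =>
      col(f.name)
  }
}

val outDF = srcDF.select(flattened: _*)
```

After this, `outDF` contains only primitive columns and can be written to CSV without the `AnalysisException`.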

2 Answers


You can extract fields from a struct using getItem. The code can be something like this:

import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(
  ("adsf", (2000.0, "")),
  ("fff", (232.34, ""))
).toDF("A", "B")
df.show()
df.select(col("A"), col("B").getItem("_1").as("B")).show()

It will print:

before select:
+----+----------+
|   A|         B|
+----+----------+
|adsf|[2000.0, ]|
| fff|[232.34, ]|
+----+----------+

after select:
+----+------+
|   A|     B|
+----+------+
|adsf|2000.0|
| fff|232.34|
+----+------+



You can also use the dot notation column2._1 to get the struct field by name:

val df = Seq(
  ("adsf", (2000.0,"")),
  ("fff", (232.34,""))
).toDF("column1", "column2")

df.show
+-------+----------+
|column1|   column2|
+-------+----------+
|   adsf|[2000.0, ]|
|    fff|[232.34, ]|
+-------+----------+

val df2 = df.select($"column1", $"column2._1".alias("column2"))

df2.show
+-------+-------+
|column1|column2|
+-------+-------+
|   adsf| 2000.0|
|    fff| 232.34|
+-------+-------+

df2.coalesce(1).write.option("header", "true").csv("output")

and your CSV file will be in the output/ folder:

column1,column2
adsf,2000.0
fff,232.34
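Since the question asked for a decimal type specifically (rather than keeping the field's double type), you could also cast while selecting. A small sketch: `DecimalType(10, 2)` is an arbitrary precision/scale chosen here for illustration; pick values that fit your data.

```scala
import org.apache.spark.sql.types.DecimalType

// Extract the struct field by name and cast it to decimal in one step.
val df3 = df.select(
  $"column1",
  $"column2._1".cast(DecimalType(10, 2)).alias("column2")
)

df3.coalesce(1).write.option("header", "true").csv("output")
```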

