
I deleted two of my earlier questions because they were too big and I could not explain them neatly.

So I am trying to keep it simple this time.

I have a complex nested XML. I am parsing it in Spark (Scala), and I have to save all the data from the XML into a text file.

NOTE: I need to save the data into text files because later I have to join this data with another file which is in text format. Also, can I join my CSV file with a JSON or Parquet file? If yes, then I may not need to convert my XML into a text file.
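To illustrate the side question, this is the kind of cross-format join I have in mind, reusing the sqlContext from my code below. A sketch only: the paths and the "id" join key are placeholders, not from my real data:

// Sketch only: paths and the "id" join key are placeholders.
// Once both sources are loaded as DataFrames, their on-disk formats
// (CSV, JSON, Parquet) no longer matter for the join itself.
val csvDf = sqlContext.read
  .format("csv")
  .option("header", "true")
  .load("C://Users//u6034690//Desktop//SPARK//left.csv") // placeholder path

val parquetDf = sqlContext.read
  .parquet("C://Users//u6034690//Desktop//SPARK//right.parquet") // placeholder path

val joined = csvDf.join(parquetDf, Seq("id")) // inner join on the shared key
joined.show()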

This is my code, where I am trying to save the XML into a CSV file, but since CSV does not allow saving array-type columns, I am getting an error.

I am looking for a solution that lets me extract all the elements of the array and save them into a text file.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.explode

object XmlToText {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("XML").setMaster("local")
    val sc = new SparkContext(conf) // creating the Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Read the nested XML with spark-xml, treating each env:Body as one row
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "env:Body")
      .load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")

    // Explode env:ContentItem into one row per item and flatten its fields
    val resDf = df.withColumn("FlatType", explode(df("env:ContentItem"))).select("FlatType.*")

    resDf.repartition(1).write
      .format("csv") // this fails: CSV does not support array-type columns
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("nullValue", "")
      .option("delimiter", "\t")
      .option("quote", "\u0000")
      .option("header", "true")
      .save("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//output")

    // val resDf = df.withColumn("FlatType", when(df("env:ContentItem").isNotNull, explode(df("env:ContentItem"))))
  }
}

This produces the output below before saving:

+---------+--------------------+
|  _action|            env:Data|
+---------+--------------------+
|   Insert|[fun:FundamentalD...|
|Overwrite|[sr:FinancialSour...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
+---------+--------------------+

For each unique env:Data I expect a separate file, which could be done with partitioning, but how can I save it as a text file?

I have to save all the elements from the array, i.e. all the columns.

I hope my question is clear this time.

If required, I can also share the schema.
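To show what I mean by extracting all the elements, this is roughly the direction I imagine, reusing resDf from my code above. A sketch only: it assumes env:Data is an array of structs (the flattening would have to be repeated if those structs nest further), and the output path is a placeholder:

import org.apache.spark.sql.functions.explode

// Explode the array so each element becomes its own row, then expand the
// struct fields into top-level columns that CSV/text can represent.
val flattened = resDf
  .withColumn("elem", explode(resDf("env:Data"))) // one row per array element
  .drop("env:Data")
  .select("_action", "elem.*")                    // assumes elem is a struct

flattened.write
  // .partitionBy("someColumn") before .csv(...) could give one folder per value
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//flatOutput") // placeholder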

  • Why do you want to save these as CSV? CSV is fundamentally meant to be a flat data format. Why not use JSON? Commented Feb 5, 2018 at 13:05
  • @SarveshKumarSingh Later I have to perform a join with files that are in CSV. Commented Feb 6, 2018 at 3:42
  • Is this question related to stackoverflow.com/questions/48987566/… ? Commented Mar 1, 2018 at 7:25
  • @RameshMaharjan Yes sir, you have already answered that. I cannot delete this question because it has an answer. Commented Mar 1, 2018 at 7:32

1 Answer


Spark SQL has a direct write-to-CSV option. Why not use that?

Here is the syntax:

resDf.write.option("your options").csv("output file path")

This should save your file directly in CSV format.
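If the array column is what blocks the CSV writer, one workaround (a sketch of mine, not tested against your schema; struct columns would need the same treatment) is to serialize any array columns to plain strings first:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.ArrayType

// Cast every array-typed column to a string so the CSV writer accepts it.
// This turns e.g. [a, b, c] into one text cell and loses the structure.
val csvSafe = resDf.schema.fields.foldLeft(resDf) { (df, field) =>
  field.dataType match {
    case _: ArrayType => df.withColumn(field.name, col(field.name).cast("string"))
    case _            => df
  }
}

csvSafe.write.option("header", "true").csv("output file path")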


Comments

CSV does not support the array data type.
You are writing a DataFrame to a CSV file, right? That's what I understood from the last part of your code.
Yes, but we cannot do that because CSV does not allow the array type. So my question is: how can we convert this kind of XML into text or CSV and then write it into a text file?
I'm a bit confused here. You created a DataFrame df, then applied some transformations and created a new DataFrame resDf. And in the last part, you are writing the resDf DataFrame to CSV. Right?
Where exactly is it failing? While converting to a DataFrame, or while writing to CSV? Also, for the join, I suggest loading both files as DataFrames, creating views over them using registerTempTable, and then you can use an SQL join query directly.
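A sketch of the temp-view join suggested in the last comment; the paths, table names, and the id/extraCol columns are placeholders (on Spark 2.x, createOrReplaceTempView supersedes the deprecated registerTempTable):

// Load both files as DataFrames, expose them as temp views, and join in SQL.
val dfA = sqlContext.read.format("csv")
  .option("header", "true").option("delimiter", "\t")
  .load("fileA.txt") // placeholder path
val dfB = sqlContext.read.format("csv")
  .option("header", "true").option("delimiter", "\t")
  .load("fileB.txt") // placeholder path

dfA.registerTempTable("tableA")
dfB.registerTempTable("tableB")

val joined = sqlContext.sql(
  "SELECT a.*, b.extraCol FROM tableA a JOIN tableB b ON a.id = b.id") // placeholder columns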
