
I deleted two of my earlier questions because they were too big and I could not explain them neatly.

So I am trying to keep it simple this time.

I have a complex nested XML. I am parsing it in Spark (Scala), and I have to save all the data from the XML into a text file.

NOTE: I need to save the data into text files because later I have to join this data with another file which is in text format. Also, can I join my CSV file with a JSON or Parquet file? If yes, then I may not need to convert my XML into a text file.
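To illustrate the side question, this is the kind of cross-format join I have in mind, reusing the sqlContext from my code below. A sketch only: the paths and the "id" join key are placeholders, not from my real data:

// Sketch only: paths and the "id" join key are placeholders.
// Once both sources are loaded as DataFrames, their on-disk formats
// (CSV, JSON, Parquet) no longer matter for the join itself.
val csvDf = sqlContext.read
  .format("csv")
  .option("header", "true")
  .load("C://Users//u6034690//Desktop//SPARK//left.csv") // placeholder path

val parquetDf = sqlContext.read
  .parquet("C://Users//u6034690//Desktop//SPARK//right.parquet") // placeholder path

val joined = csvDf.join(parquetDf, Seq("id")) // inner join on the shared key
joined.show()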

This is my code, where I am trying to save the XML into a CSV file, but since CSV does not allow saving array-type columns, I am getting an error.

I am looking for a solution that lets me extract all the elements of the array and save them into a text file.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.explode

object XmlToText {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("XML").setMaster("local")
    val sc = new SparkContext(conf) // creating the Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Read the nested XML with spark-xml, treating each env:Body as one row
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "env:Body")
      .load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")

    // Explode env:ContentItem into one row per item and flatten its fields
    val resDf = df.withColumn("FlatType", explode(df("env:ContentItem"))).select("FlatType.*")

    resDf.repartition(1).write
      .format("csv") // this fails: CSV does not support array-type columns
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("nullValue", "")
      .option("delimiter", "\t")
      .option("quote", "\u0000")
      .option("header", "true")
      .save("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//output")

    // val resDf = df.withColumn("FlatType", when(df("env:ContentItem").isNotNull, explode(df("env:ContentItem"))))
  }
}

This produces the output below before saving:

+---------+--------------------+
|  _action|            env:Data|
+---------+--------------------+
|   Insert|[fun:FundamentalD...|
|Overwrite|[sr:FinancialSour...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
+---------+--------------------+

For each unique env:Data I expect a separate file, which could be done with partitioning, but how can I save it as a text file?

I have to save all the elements from the array, i.e. all the columns.

I hope my question is clear this time.

If required, I can also share the schema.
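To show what I mean by extracting all the elements, this is roughly the direction I imagine, reusing resDf from my code above. A sketch only: it assumes env:Data is an array of structs (the flattening would have to be repeated if those structs nest further), and the output path is a placeholder:

import org.apache.spark.sql.functions.explode

// Explode the array so each element becomes its own row, then expand the
// struct fields into top-level columns that CSV/text can represent.
val flattened = resDf
  .withColumn("elem", explode(resDf("env:Data"))) // one row per array element
  .drop("env:Data")
  .select("_action", "elem.*")                    // assumes elem is a struct

flattened.write
  // .partitionBy("someColumn") before .csv(...) could give one folder per value
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//flatOutput") // placeholder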

  • Why do you want to save these as CSV? CSV is fundamentally meant to be a flat data format. Why not use JSON? Commented Feb 5, 2018 at 13:05
  • @SarveshKumarSingh Later I have to perform a join with files that are in CSV. Commented Feb 6, 2018 at 3:42
  • Is this question related to stackoverflow.com/questions/48987566/… ? Commented Mar 1, 2018 at 7:25
  • @RameshMaharjan Yes sir, you have already answered that. I cannot delete this question because it has an answer. Commented Mar 1, 2018 at 7:32

1 Answer


Spark SQL has a direct write-to-CSV option. Why not use that?

Here is the syntax:

resDf.write.option("your options").csv("output file path")

This should save your file directly in CSV format.
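If the array column is what blocks the CSV writer, one workaround (a sketch of mine, not tested against your schema; struct columns would need the same treatment) is to serialize any array columns to plain strings first:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.ArrayType

// Cast every array-typed column to a string so the CSV writer accepts it.
// This turns e.g. [a, b, c] into one text cell and loses the structure.
val csvSafe = resDf.schema.fields.foldLeft(resDf) { (df, field) =>
  field.dataType match {
    case _: ArrayType => df.withColumn(field.name, col(field.name).cast("string"))
    case _            => df
  }
}

csvSafe.write.option("header", "true").csv("output file path")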


Comments

CSV does not support the array data type.
You are writing a DataFrame to a CSV file, right? That's what I understood from the last part of your code.
Yes, but we cannot do that because CSV does not allow the array type. So my question is: how can we convert this kind of XML into text or CSV and then write it into a text file?
I'm a bit confused here. You created a DataFrame df, then applied some transformations and created a new DataFrame resDf. And in the last part, you are writing the resDf DataFrame to CSV. Right?
Where exactly is it failing? While converting to a DataFrame, or while writing to CSV? Also, for the join, I suggest loading both files as DataFrames, creating views over them using registerTempTable, and then you can use an SQL join query directly.
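A sketch of the temp-view join suggested in the last comment; the paths, table names, and the id/extraCol columns are placeholders (on Spark 2.x, createOrReplaceTempView supersedes the deprecated registerTempTable):

// Load both files as DataFrames, expose them as temp views, and join in SQL.
val dfA = sqlContext.read.format("csv")
  .option("header", "true").option("delimiter", "\t")
  .load("fileA.txt") // placeholder path
val dfB = sqlContext.read.format("csv")
  .option("header", "true").option("delimiter", "\t")
  .load("fileB.txt") // placeholder path

dfA.registerTempTable("tableA")
dfB.registerTempTable("tableB")

val joined = sqlContext.sql(
  "SELECT a.*, b.extraCol FROM tableA a JOIN tableB b ON a.id = b.id") // placeholder columns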
