I have a data frame that looks like this:

+---------------------------------------------------------------------+
|value                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+

From the above two rows I want to create strings in this format:

"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"

I want this to be dynamic: if a row contains a third value, the resulting string should have one more comma-separated value.

How can I do this in Scala?

This is what I am doing in order to create the data frame:

    val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
    val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"
    import sqlContext.implicits._

    // Read the descriptor XML with spark-xml, using FlatFileDescriptor as the row tag
    val dfDiscriptor = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "FlatFileDescriptor")
      .load(discriptorFileLOcation)
    dfDiscriptor.printSchema()

    val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
    val FirstColumnOfHeaderFile = firstColumn
      .select(explode($"FFField"))
      .as("ColumnsDetails")
      .select(explode($"col"))
      .first.get(0).toString().split(",")(5)
    println(FirstColumnOfHeaderFile)

    val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
    primaryKeyColumnsFinancialLineItem.show(false)

Here is the full schema:

 root
 |-- FFColumnDelimiter: string (nullable = true)
 |-- FFContentItem: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _ffMajVers: long (nullable = true)
 |    |-- _ffMinVers: double (nullable = true)
 |-- FFFileEncoding: string (nullable = true)
 |-- FFFileType: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FFPhysicalFile: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- FFFileName: string (nullable = true)
 |    |    |    |    |-- FFRowCount: long (nullable = true)
 |    |    |-- FFRecord: struct (nullable = true)
 |    |    |    |-- FFField: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- FFColumnNumber: long (nullable = true)
 |    |    |    |    |    |-- FFDataType: string (nullable = true)
 |    |    |    |    |    |-- FFFacets: struct (nullable = true)
 |    |    |    |    |    |    |-- FFMaxLength: long (nullable = true)
 |    |    |    |    |    |    |-- FFTotalDigits: long (nullable = true)
 |    |    |    |    |    |-- FFFieldIsOptional: boolean (nullable = true)
 |    |    |    |    |    |-- FFFieldName: string (nullable = true)
 |    |    |    |    |    |-- FFForKey: struct (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyCol: string (nullable = true)
 |    |    |    |    |    |    |-- FFForKeyRecord: string (nullable = true)
 |    |    |    |-- FFPrimKey: struct (nullable = true)
 |    |    |    |    |-- FFPrimKeyCol: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- FFRecordType: string (nullable = true)
 |-- FFHeaderRow: boolean (nullable = true)
 |-- FFId: string (nullable = true)
 |-- FFRowDelimiter: string (nullable = true)
 |-- FFTimeStamp: string (nullable = true)
 |-- _env: string (nullable = true)
 |-- _ffMajVers: long (nullable = true)
 |-- _ffMinVers: double (nullable = true)
 |-- _ffPubstyle: string (nullable = true)
 |-- _schemaLocation: string (nullable = true)
 |-- _sr: string (nullable = true)
 |-- _xmlns: string (nullable = true)
 |-- _xsi: string (nullable = true)

1 Answer

Looking at your given dataframe,

+---------------------------------------------------------------------+
|value                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)]         |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+

it must have the following schema:

 |-- value: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

If the above assumptions are true, then you can write a udf function as

import org.apache.spark.sql.functions._

// Flatten the nested array of column names and join them with ", "
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))

And use it in the dataframe as

df.withColumn("value", arrayToString($"value"))

And you should have

+-----------------------------------------------------+
|value                                                |
+-----------------------------------------------------+
|LineItem_organizationId, LineItem_lineItemId         |
|OrganizationId, LineItemId, SegmentSequence_segmentId|
+-----------------------------------------------------+

with the schema

 |-- value: string (nullable = true)
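The flatten-and-join logic the udf applies per row can be sketched on plain Scala collections, independent of Spark. This is just an illustration of the transformation (the object name `FlattenSketch` is made up for this example); in the actual job the same expression runs inside the udf above.

```scala
// Sketch of the per-row transformation: each row holds a nested sequence of
// column names, which we flatten into one comma-separated string.
object FlattenSketch {
  def arrayToString(arr: Seq[Seq[String]]): String =
    arr.flatten.mkString(", ")

  def main(args: Array[String]): Unit = {
    // The two rows from the question, as nested sequences
    val rows = Seq(
      Seq(Seq("LineItem_organizationId", "LineItem_lineItemId")),
      Seq(Seq("OrganizationId", "LineItemId", "SegmentSequence_segmentId"))
    )
    rows.map(arrayToString).foreach(println)
    // prints:
    // LineItem_organizationId, LineItem_lineItemId
    // OrganizationId, LineItemId, SegmentSequence_segmentId
  }
}
```

Because `mkString` joins however many elements the row has, a row with a third (or tenth) column name automatically yields one more comma-separated value, which covers the "dynamic" requirement in the question.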