I have the following XML:

<TABLES>
    <TABLE attrname="Red">
        <ROWDATA>
            <ROW Type="solid" track="0" Unit="0"/>
        </ROWDATA>
    </TABLE>
    <TABLE attrname="Blue">
        <ROWDATA>
            <ROW Type="light" track="0" Unit="0"/>
            <ROW Type="solid" track="0" Unit="0"/>
            <ROW Type="solid" track="0" Unit="0"/>
        </ROWDATA>
    </TABLE>
</TABLES>

I am using Spark and Scala. I want to read every attribute of each ROW tag and tell the rows apart by the attrname of the TABLE they belong to. The code below reads all the values inside the ROW tags, but it loses which TABLE (attrname) each row came from.

val df = session.read
  .option("rowTag", "ROW")
  .xml(filePath)

df.show(10)
df.printSchema()

Thanks in advance.

  • Can you add the expected output? Commented May 22, 2021 at 0:28

1 Answer

Check the code below.

import com.databricks.spark.xml._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("xml").getOrCreate()
import spark.implicits._

// Use TABLE as the row tag so each record keeps its attrname attribute.
val xmlDF = spark.read
  .option("rowTag", "TABLE")
  .xml(xmlPath)
  // One output row per ROW element; XML attributes get a leading underscore.
  .select(explode_outer($"ROWDATA.ROW").as("row"), $"_attrname".as("attrname"))
  .select(
    $"row._Type".as("type"),
    $"row._VALUE".as("value"),
    $"row._Unit".as("unit"),
    $"row._track".as("track"),
    $"attrname"
  )

xmlDF.printSchema()
xmlDF.show(false)

Schema

root
 |-- type: string (nullable = true)
 |-- value: string (nullable = true)
 |-- unit: long (nullable = true)
 |-- track: long (nullable = true)
 |-- attrname: string (nullable = true)

Sample Data

+-----+-----+----+-----+--------+
|type |value|unit|track|attrname|
+-----+-----+----+-----+--------+
|solid|null |0   |0    |Red     |
|light|null |0   |0    |Blue    |
|solid|null |0   |0    |Blue    |
|solid|null |0   |0    |Blue    |
+-----+-----+----+-----+--------+
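
Once attrname is a regular column, the rows can be told apart with ordinary DataFrame operations. The following is only a minimal sketch, assuming a DataFrame shaped like the sample data above. The rows are rebuilt inline (rather than read from XML) so the snippet does not need spark-xml, and the names `AttrnameDemo` and `rowsForTable` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AttrnameDemo {
  // Hypothetical helper: keep only the rows that came from the TABLE
  // whose attrname equals `name`.
  def rowsForTable(df: DataFrame, name: String): DataFrame =
    df.filter(col("attrname") === name)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("attrname-demo")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for xmlDF: same columns as the sample data, minus the all-null `value`.
    val rows = Seq(
      ("solid", 0L, 0L, "Red"),
      ("light", 0L, 0L, "Blue"),
      ("solid", 0L, 0L, "Blue"),
      ("solid", 0L, 0L, "Blue")
    ).toDF("type", "unit", "track", "attrname")

    // Rows belonging to <TABLE attrname="Blue"> only.
    rowsForTable(rows, "Blue").show(false)

    // Number of rows per table and type.
    rows.groupBy("attrname", "type").count().show(false)

    spark.stop()
  }
}
```

With the real data, `rowsForTable(xmlDF, "Blue")` would be the equivalent call.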

2 Comments

Thanks, this is the desired output. I tried the code, but I get the following errors on line 4, the select statement: "Cannot resolve overloaded method 'select'" and "Type mismatch. Required: Column, found: UnresolvedAttribute". Could you please post your import list as well?
Updated, check now.
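
If the `$"..."` interpolator is what fails to resolve to a `Column` (an assumption; the thread does not confirm the cause of the error), the same selection can be written with `col(...)`, which needs only `org.apache.spark.sql.functions` and no `spark.implicits._`. This is a sketch rather than the answerer's code: it assumes the spark-xml package is on the classpath, the object name `ReadTablesWithCol` and method `readRows` are hypothetical, and the `value` column is left out because the ROW elements carry no text.

```scala
import com.databricks.spark.xml._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode_outer}

object ReadTablesWithCol {
  // Reads one record per TABLE, then emits one output row per ROW element,
  // carrying the TABLE's attrname alongside the ROW attributes.
  def readRows(spark: SparkSession, xmlPath: String): DataFrame =
    spark.read
      .option("rowTag", "TABLE")
      .xml(xmlPath)
      .select(
        explode_outer(col("ROWDATA.ROW")).as("row"),
        col("_attrname").as("attrname")
      )
      .select(
        col("row._Type").as("type"),
        col("row._Unit").as("unit"),
        col("row._track").as("track"),
        col("attrname")
      )
}
```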
