How can I build a Spark DataFrame from a string that contains XML?

I can easily do it if the XML is saved in a file:

dfXml = (sqlContext.read.format("xml")
           .options(rowTag='my_row_tag')
           .load(xml_file_name))

However, as said, I have to build the DataFrame from a string that contains regular XML.

Thank you

Mauro

2 Answers


You can parse the XML string without the spark-xml connector. Using the UDF below, you can convert the XML string into JSON and then apply your transformations to that.

I have taken a sample XML string and stored it in a catalog.xml file.

/tmp> cat catalog.xml
<?xml version="1.0"?><catalog><book id="bk101"><author>Gambardella, Matthew</author><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price><publish_date>2000-10-01</publish_date><description>An in-depth look at creating applications with XML.</description></book></catalog>
<?xml version="1.0"?><catalog><book id="bk102"><author>Ralls, Kim</author><title>Midnight Rain</title><genre>Fantasy</genre><price>5.95</price><publish_date>2000-12-16</publish_date><description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description></book></catalog>


Please note the code below is in Scala; it should help you implement the same logic in Python.

scala> val df = spark.read.textFile("/tmp/catalog.xml")
df: org.apache.spark.sql.Dataset[String] = [value: string]

scala> import org.json4s.Xml.toJson
import org.json4s.Xml.toJson

scala> import org.json4s.jackson.JsonMethods.{compact, parse}
import org.json4s.jackson.JsonMethods.{compact, parse}

scala> :paste
// Entering paste mode (ctrl-D to finish)

implicit class XmlToJson(data: String) {
    def json(root: String) = compact {
      toJson(scala.xml.XML.loadString(data)).transformField {
        case (field,value) => (field.toLowerCase,value)
      } \ root.toLowerCase
    }
    def json = compact(parse(data))
  }

val parseUDF = udf { (data: String,xmlRoot: String) => data.json(xmlRoot.toLowerCase)}


// Exiting paste mode, now interpreting.

defined class XmlToJson
parseUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))

scala> val json = df.withColumn("value",parseUDF($"value",lit("catalog")))
json: org.apache.spark.sql.DataFrame = [value: string]

scala> val json = df.withColumn("value",parseUDF($"value",lit("catalog"))).select("value").map(_.getString(0))
json: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val bookDF = spark.read.json(json).select("book.*")
bookDF: org.apache.spark.sql.DataFrame = [author: string, description: string ... 5 more fields]

scala> bookDF.printSchema
root
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- title: string (nullable = true)


scala> bookDF.show(false)
+--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
|author              |description                                                                                                         |genre   |id   |price|publish_date|title                |
+--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
|Gambardella, Matthew|An in-depth look at creating applications with XML.                                                                 |Computer|bk101|44.95|2000-10-01  |XML Developer's Guide|
|Ralls, Kim          |A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.|Fantasy |bk102|5.95 |2000-12-16  |Midnight Rain        |
+--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
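The same idea can be sketched in Python without Spark: a small stdlib helper turns one XML document into a JSON string, standing in for the json4s conversion above. This is a minimal sketch (it lowercases names to mirror the `transformField` step, and does not handle repeated child tags), not a full converter:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_json(data: str, root_tag: str) -> str:
    """Turn one XML document into a JSON string rooted below `root_tag`.

    Element and attribute names are lowercased, mirroring the
    `transformField` step in the Scala version. Repeated child tags
    are not handled -- this is a sketch, not a full converter.
    """
    def to_dict(elem):
        # Attributes become plain fields, like json4s does with XML attributes.
        d = {k.lower(): v for k, v in elem.attrib.items()}
        for child in elem:
            # Leaf elements collapse to their text; nested ones recurse.
            d[child.tag.lower()] = (
                to_dict(child) if (len(child) or child.attrib) else child.text
            )
        return d if d else elem.text

    root = ET.fromstring(data)
    # Strip the requested root tag, like the `\ root.toLowerCase` path selection.
    if root.tag.lower() != root_tag.lower():
        raise ValueError(f"expected root <{root_tag}>, got <{root.tag}>")
    return json.dumps(to_dict(root))
```

From there you could map `xml_to_json` over your strings (or register it as a PySpark UDF) and hand the resulting JSON lines to `spark.read.json`, as in the Scala session above.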



In Scala, the class `XmlReader` (from the spark-xml package) can be used to convert an RDD[String] to a DataFrame:

    val result = new XmlReader().xmlRdd(spark, rdd)

If you have a DataFrame as input, it can easily be converted to an RDD[String].

