
I'm trying to get an academic POC working that relies on pyspark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML format (https://archive.org/details/stackexchange) into a pyspark df.

It works like a charm with correctly formatted XML with proper tags, but fails with the Stack Exchange dump, which looks like this:

<users>
  <row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>

Depending on whether I use the root tag or the row tag as rowTag, I get either an empty schema or... something:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _AboutMe: string (nullable = true)
 |    |    |-- _AccountId: long (nullable = true)
 |    |    |-- _CreationDate: string (nullable = true)
 |    |    |-- _DisplayName: string (nullable = true)
 |    |    |-- _DownVotes: long (nullable = true)
 |    |    |-- _Id: long (nullable = true)
 |    |    |-- _LastAccessDate: string (nullable = true)
 |    |    |-- _Location: string (nullable = true)
 |    |    |-- _ProfileImageUrl: string (nullable = true)
 |    |    |-- _Reputation: long (nullable = true)
 |    |    |-- _UpVotes: long (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _Views: long (nullable = true)
 |    |    |-- _WebsiteUrl: string (nullable = true)

+--------------------+
|                 row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Spark     : 1.6.0
Python    : 2.7.15
spark-xml : com.databricks:spark-xml_2.10:0.4.1

I would be extremely grateful for any advice.

Kind Regards, P.

  • Just to clarify: everything goes into the column row, so defining a customSchema with StructType will not work (all columns come out null).

1 Answer


I tried the same method (spark-xml on Stack Overflow dump files) some time ago and failed, mostly because the DF is seen as an array of structures and the processing performance was really bad. Instead, I recommend using the standard text reader and mapping the Key="Value" pairs in every line with a UDF, like this:

import re

# match every Key="Value" attribute pair inside a <row .../> line
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}

You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches the dumps from March 2017).
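Putting the pieces together, here is a minimal end-to-end sketch of that approach (assuming the Spark 1.6 SQLContext setup and the ./tmp/test/Users.xml path from the question; the filter on '<row ' is an assumption about the dump layout, and every attribute comes back as a string):

import re

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# same regex/UDF as above
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}

# read the dump as plain text and keep only the <row .../> element lines,
# skipping the XML declaration and the <users>/</users> wrapper
rows = (sc.textFile('./tmp/test/Users.xml')
          .filter(lambda line: '<row ' in line)
          .map(parse_line))

# schema inference from dicts is deprecated but still works on 1.6;
# samplingRatio=1.0 scans all rows so optional attributes are not missed.
# All values are strings here -- cast columns afterwards as needed.
df = sqlContext.createDataFrame(rows, samplingRatio=1.0)
df.printSchema()
df.show()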


Comments

Hi Mariusz. I've managed to load it via com.databricks.spark.xml using udf and explode. However, the performance is worse than bad. I've tried your idea of converting the XML to Parquet, but failed on the first stackexchange-convert notebook cell with the error "ImportError: No module named session". Any idea what's wrong?
Ah, I just realized you're using Spark 1.6. The code was prepared for the Spark 2.x branch, but it should be compatible with 1.6 if you replace the initialization of the Spark context with: from pyspark.sql import SQLContext; from pyspark import SparkContext; spark = SQLContext.getOrCreate(SparkContext.getOrCreate()). But anyway, Spark 1.6 is pretty old, so if you're starting a new project I strongly encourage you to try 2.4.
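For readability, that Spark 1.6 replacement from the comment above, formatted as a block:

from pyspark.sql import SQLContext
from pyspark import SparkContext

# Spark 1.6 stand-in for the notebook's Spark 2.x session initialization
spark = SQLContext.getOrCreate(SparkContext.getOrCreate())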
It works like a charm! I've already voted for your answer, but it's not publicly visible since I'm not eligible (I'm one point short of the commentator badge). I'll get back to this at my earliest convenience. Anyway, I owe you at least a beer; I'll be looking for you at meetups and events :)
Hi Mariusz. Can you explain what (' ([A-Za-z]+)="([^"]*)"') does?
This regexp matches all key="value" pairs in the XML rows. The key contains only A-Za-z characters, but the value can contain any character except ". The lambda then uses the matches to build a map of key-value pairs for each row.
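To illustrate, a quick demo of the regex on a shortened version of the <row> line from the question:

import re

pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')

# attribute row from the question, shortened for readability
line = '  <row Id="-1" Reputation="1" DisplayName="Community" />'

print(pattern.findall(line))
# [('Id', '-1'), ('Reputation', '1'), ('DisplayName', 'Community')]

print({key: value for key, value in pattern.findall(line)})
# {'Id': '-1', 'Reputation': '1', 'DisplayName': 'Community'} (key order may vary)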
