
I'm trying to get an academic POC working that relies on pyspark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML format (https://archive.org/details/stackexchange) into a pyspark df.

It works like a charm with correctly formatted XML with proper tags, but fails with the Stack Exchange dump, which looks like this:

<users>
  <row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>

Depending on whether I use the root tag or the row tag as rowTag, I get either an empty schema or... something:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _AboutMe: string (nullable = true)
 |    |    |-- _AccountId: long (nullable = true)
 |    |    |-- _CreationDate: string (nullable = true)
 |    |    |-- _DisplayName: string (nullable = true)
 |    |    |-- _DownVotes: long (nullable = true)
 |    |    |-- _Id: long (nullable = true)
 |    |    |-- _LastAccessDate: string (nullable = true)
 |    |    |-- _Location: string (nullable = true)
 |    |    |-- _ProfileImageUrl: string (nullable = true)
 |    |    |-- _Reputation: long (nullable = true)
 |    |    |-- _UpVotes: long (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _Views: long (nullable = true)
 |    |    |-- _WebsiteUrl: string (nullable = true)

+--------------------+
|                 row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Spark     : 1.6.0
Python    : 2.7.15
spark-xml : com.databricks:spark-xml_2.10:0.4.1

I would be extremely grateful for any advice.

Kind Regards, P.

  • Just to clarify: everything goes into the column row, so defining a customSchema with StructType will not work (all columns come out null).

1 Answer


I tried the same method (spark-xml on Stack Overflow dump files) some time ago and failed, mostly because the DF is seen as an array of structures and the processing performance was really bad. Instead, I recommend using the standard text reader and mapping the Key="Value" pairs in every line with a UDF, like this:

import re

# match every Key="Value" attribute pair inside a <row .../> line
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}

You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches the dumps from March 2017).
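Putting the pieces together, here is a minimal end-to-end sketch of that approach (assuming the Spark 1.6 SQLContext setup and the ./tmp/test/Users.xml path from the question; the filter on '<row ' is an assumption about the dump layout, and every attribute comes back as a string):

import re

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# same regex/UDF as above
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}

# read the dump as plain text and keep only the <row .../> element lines,
# skipping the XML declaration and the <users>/</users> wrapper
rows = (sc.textFile('./tmp/test/Users.xml')
          .filter(lambda line: '<row ' in line)
          .map(parse_line))

# schema inference from dicts is deprecated but still works on 1.6;
# samplingRatio=1.0 scans all rows so optional attributes are not missed.
# All values are strings here -- cast columns afterwards as needed.
df = sqlContext.createDataFrame(rows, samplingRatio=1.0)
df.printSchema()
df.show()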


Comments

Hi Mariusz. I've managed to load it via com.databricks.spark.xml using udf and explode. However, the performance is worse than bad. I've tried your idea of converting the XML to Parquet, but failed on the first stackexchange-convert notebook cell with the error "ImportError: No module named session". Any idea what's wrong?
Ah, I just realized you're using Spark 1.6. The code was prepared for the Spark 2.x branch, but it should be compatible with 1.6 if you replace the initialization of the Spark context with: from pyspark.sql import SQLContext; from pyspark import SparkContext; spark = SQLContext.getOrCreate(SparkContext.getOrCreate()). But anyway, Spark 1.6 is pretty old, so if you're starting a new project I strongly encourage you to try 2.4.
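For readability, that Spark 1.6 replacement from the comment above, formatted as a block:

from pyspark.sql import SQLContext
from pyspark import SparkContext

# Spark 1.6 stand-in for the notebook's Spark 2.x session initialization
spark = SQLContext.getOrCreate(SparkContext.getOrCreate())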
It works like a charm! I've already voted for your answer, but it's not publicly visible since I'm not eligible (I'm one point short of the commentator badge). I'll get back to this at my earliest convenience. Anyway, I owe you at least a beer; I'll be looking for you at meetups and events :)
Hi Mariusz. Can you explain what (' ([A-Za-z]+)="([^"]*)"') does?
This regexp matches all key="value" pairs in the XML rows. The key contains only A-Za-z characters, but the value can contain any character except ". The lambda then uses the matches to build a map of key-value pairs for each row.
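To illustrate, a quick demo of the regex on a shortened version of the <row> line from the question:

import re

pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')

# attribute row from the question, shortened for readability
line = '  <row Id="-1" Reputation="1" DisplayName="Community" />'

print(pattern.findall(line))
# [('Id', '-1'), ('Reputation', '1'), ('DisplayName', 'Community')]

print({key: value for key, value in pattern.findall(line)})
# {'Id': '-1', 'Reputation': '1', 'DisplayName': 'Community'} (key order may vary)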
