0

I am using spark 2.3.2 with python 3.7 to parse xml.

In an xml file (sample), I have appended 2 xmls.

When I parse it with:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.7.0 pyspark-shell'

conf = pyspark.SparkConf()
sc = SparkSession.builder.config(conf=conf).getOrCreate()
spark = SQLContext(sc)

dfSample = (spark.read.format("xml").option("rowTag", "xocs:doc")
.load(r"sample.xml"))

I see 2 xmls' data:

enter image description here

However, what I need is to extract the info under "ref-info" tag (along with their corresponding key eids), so my code is:

(dfSample.
 withColumn("metaExp", F.explode(F.array("xocs:meta"))).
 withColumn("eid", F.col("metaExp.xocs:eid")).
 select("eid","xocs:item").
 withColumn("xocs:itemExp", F.explode(F.array("xocs:item"))).
 withColumn("item", F.col("xocs:itemExp.item")).
 
 withColumn("itemExp", F.explode(F.array("item"))).
 withColumn("bibrecord", F.col("item.bibrecord")).
 
 withColumn("bibrecordExp", F.explode(F.array("bibrecord"))).
 withColumn("tail", F.col("bibrecord.tail")).
 
 withColumn("tailExp", F.explode(F.array("tail"))).
 withColumn("bibliography", F.col("tail.bibliography")).
 
 withColumn("bibliographyExp", F.explode(F.array("bibliography"))).
 withColumn("reference", F.col("bibliography.reference")).
 
 withColumn("referenceExp", F.explode(F.array("reference"))).
 withColumn("ref-infoExp", F.explode(F.col("reference.ref-info"))).
 
withColumn("authors", F.explode(F.col("ref-infoExp.ref-authors.author"))).
withColumn("py", (F.col("ref-infoExp.ref-publicationyear._first"))).
withColumn("so", (F.col("ref-infoExp.ref-sourcetitle"))).
withColumn("ti", (F.col("ref-infoExp.ref-title"))).
 drop("xocs:item", "xocs:itemExp", "item", "itemExp", "bibrecord", "bibrecordExp", "tail", "tailExp", "bibliography", 
      "bibliographyExp", "reference", "referenceExp").show())

This extracts the info only from the xml with eid = 85082880163

When I delete this one and only kept the one with eid = 85082880158, it works.

My file is an xml file containing those 2 lines in the link. I have also tried to merge those 2 into one xml but could not manage.

What is wrong with my data/approach? (My ultimate plan is to create such a file containing thousands of different xmls to be parsed)

2
  • 1
    Try to set up a minimal reproducible example, so that others might help you repeating your process :) Commented Nov 14, 2022 at 15:59
  • 1
    Please share your sample XML file contents. Commented Nov 15, 2022 at 6:08

1 Answer 1

2
+50

Your trouble lies in this piece of XML:

<xocs:item>
  <item>
    <bibrecord>
      <head>
        <abstracts>
          <abstract original="y" xml:lang="eng">
            <ce:para>
              Apple is often affected by [...] intercellular CO <inf>2</inf> concentration [...]
                                                                ^^^^^^^^^^^^

While it is correct XML (mixed content), this embedded tag <inf>2</inf> seems to be breaking spark-xml parser. If you remove it (or convert to corresponding HTML entities) you will get correct results.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.