pyspark does not parse an xml from a file containing multiple xmls

Question

I am using Spark 2.3.2 with Python 3.7 to parse XML.

I have appended two XML documents into a single XML file (sample).

When I parse it with:

import os
from pyspark.sql import SparkSession

# Must be set before the session is created so that spark-xml is loaded
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.7.0 pyspark-shell'

# SparkSession already exposes the reader; no separate SQLContext is needed
spark = SparkSession.builder.getOrCreate()

dfSample = (spark.read.format("xml").option("rowTag", "xocs:doc")
            .load(r"sample.xml"))

I can see the data from both XMLs.

However, what I need is to extract the info under the "ref-info" tag (along with the corresponding eid keys), so my code is:

from pyspark.sql import functions as F

(dfSample
 .withColumn("metaExp", F.explode(F.array("xocs:meta")))
 .withColumn("eid", F.col("metaExp.xocs:eid"))
 .select("eid", "xocs:item")
 # Walk down the nesting one level at a time: item -> bibrecord -> tail -> bibliography -> reference
 .withColumn("xocs:itemExp", F.explode(F.array("xocs:item")))
 .withColumn("item", F.col("xocs:itemExp.item"))
 .withColumn("itemExp", F.explode(F.array("item")))
 .withColumn("bibrecord", F.col("item.bibrecord"))
 .withColumn("bibrecordExp", F.explode(F.array("bibrecord")))
 .withColumn("tail", F.col("bibrecord.tail"))
 .withColumn("tailExp", F.explode(F.array("tail")))
 .withColumn("bibliography", F.col("tail.bibliography"))
 .withColumn("bibliographyExp", F.explode(F.array("bibliography")))
 .withColumn("reference", F.col("bibliography.reference"))
 .withColumn("referenceExp", F.explode(F.array("reference")))
 # One row per ref-info entry, then one row per author within it
 .withColumn("ref-infoExp", F.explode(F.col("reference.ref-info")))
 .withColumn("authors", F.explode(F.col("ref-infoExp.ref-authors.author")))
 .withColumn("py", F.col("ref-infoExp.ref-publicationyear._first"))
 .withColumn("so", F.col("ref-infoExp.ref-sourcetitle"))
 .withColumn("ti", F.col("ref-infoExp.ref-title"))
 # Drop the intermediate helper columns
 .drop("xocs:item", "xocs:itemExp", "item", "itemExp", "bibrecord", "bibrecordExp",
       "tail", "tailExp", "bibliography", "bibliographyExp", "reference", "referenceExp")
 .show())

This extracts the info only from the XML with eid = 85082880163.

When I delete that one and keep only the XML with eid = 85082880158, it works.

My file is an XML file containing the 2 lines from the link. I have also tried to merge those 2 into one XML, but could not manage it.

What is wrong with my data or my approach? (My ultimate plan is to create such a file containing thousands of different XMLs to be parsed.)

Try to set up a minimal reproducible example, so that others can help you by repeating your process :)
– Luca Soato, commented Nov 14, 2022 at 15:59

Kombajn zbożowy · Accepted Answer · 2022-11-19 00:14:17Z

Your trouble lies in this piece of XML:

<xocs:item>
  <item>
    <bibrecord>
      <head>
        <abstracts>
          <abstract original="y" xml:lang="eng">
            <ce:para>
              Apple is often affected by [...] intercellular CO <inf>2</inf> concentration [...]
                                                                ^^^^^^^^^^^^

While this is valid XML (mixed content), the embedded tag <inf>2</inf> seems to break the spark-xml parser. If you remove it (or convert it to the corresponding HTML entities), you will get correct results.
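
As a rough illustration of the "remove it" option, the inline tags could be stripped from the raw text before the file is handed to spark-xml. This is only a sketch: the helper name `unwrap_inline_tags` is made up for this example, and the assumption that `inf` is the only offending inline tag comes from the single sample above. Other inline tags may show up in a larger data set and would need to be added to the list.

```python
import re


def unwrap_inline_tags(xml_text, tag_names=("inf",)):
    """Remove the opening/closing tags named in tag_names, keeping their text.

    Hypothetical helper for illustration; tag_names is an assumption based on
    the one sample shown above.
    """
    for name in tag_names:
        # Matches <name>, <name attr="...">, and </name>, but not <nameX> or <ref-name>
        pattern = re.compile(r"</?{0}(?:\s[^>]*)?>".format(re.escape(name)))
        xml_text = pattern.sub("", xml_text)
    return xml_text


sample = "<ce:para>intercellular CO <inf>2</inf> concentration</ce:para>"
cleaned = unwrap_inline_tags(sample)
print(cleaned)
```

The cleaned text would then be written to a new file and loaded with spark.read.format("xml") as before. A regular expression is used here instead of a full XML parser so that each document line stays otherwise untouched; whether that is robust enough for thousands of documents would need to be checked against the real data.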
