Reading XML in pyspark with same root and row tags

This is part of my XML file, showing the full nesting depth:

<?xml version="1.0" encoding="UTF-8" ?>
<Taxonomy>
    <TaxonomyNode>
        <Entity>BUSINESS</Entity>
        <Description>Business News</Description>
        <TaxonomyNode>
            <Entity>COS</Entity>
            <Description>Company News</Description>
            <TaxonomyNode>
                <Entity>ANA</Entity>
                <Description>Analyst Ratings &amp; Commentary</Description>
                <TaxonomyNode>
                    <Entity>ANABUY</Entity>
                    <Description>Analyst Ratings - Buys</Description>
                    <TaxonomyNode>
                        <Entity>ANABEVT</Entity>
                        <Description>Analyst Ratings Events, Announcements - Buys</Description>
                    </TaxonomyNode>
                    <TaxonomyNode>
                        <Entity>BMRANABUY</Entity>
                        <Description>Analyst Ratings - Buys</Description>
                        <TaxonomyNode>
                            <Entity>ANRACC</Entity>
                            <Description>ANR Accumulate</Description>
                        </TaxonomyNode>
                    </TaxonomyNode>
                </TaxonomyNode>
            </TaxonomyNode>
        </TaxonomyNode>
    </TaxonomyNode>
</Taxonomy>

As you can see, there are multiple nested elements with the same name. Reading this with Spark in the conventional way, spark.read.format("com.databricks.spark.xml").option("rowTag", "TaxonomyNode").load(completeXMLFilePath), does not work. It returns a DataFrame that looks like this:

and that has a schema like this:

I would be thankful if anybody has an idea of how to make this work.

asked Jun 2, 2020 at 13:57

Arrajj


Your XML is nested, not flat. It's being read correctly. Do you want to flatten it? If so, why?

– Dave, Jun 2, 2020 at 14:43

Yes, that's true. I want to flatten it so that I can get at my data properly.

– Arrajj, Jun 3, 2020 at 7:00

Take a look at this answer: stackoverflow.com/a/49672982/6030951

– Dave, Jun 4, 2020 at 14:53
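
For illustration, here is a minimal sketch of one way to flatten nested nodes like these into one row per TaxonomyNode. It is not the approach from the linked answer and it does not use spark-xml: it parses the XML on the driver with Python's standard xml.etree.ElementTree, which assumes the file is small enough to fit in memory. The function name flatten_taxonomy, the parent and depth columns, and the shortened SAMPLE document are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (assumption: same structure).
SAMPLE = """<Taxonomy>
  <TaxonomyNode>
    <Entity>BUSINESS</Entity>
    <Description>Business News</Description>
    <TaxonomyNode>
      <Entity>COS</Entity>
      <Description>Company News</Description>
      <TaxonomyNode>
        <Entity>ANA</Entity>
        <Description>Analyst Ratings &amp; Commentary</Description>
      </TaxonomyNode>
    </TaxonomyNode>
  </TaxonomyNode>
</Taxonomy>"""


def flatten_taxonomy(xml_text, row_tag="TaxonomyNode"):
    """Return one (entity, description, parent, depth) tuple per node.

    Hypothetical helper: 'parent' is the Entity of the enclosing node
    (None at the top level) and 'depth' counts nesting levels from 0.
    """
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node, parent_entity, depth):
        # findtext with a bare tag name only looks at direct children,
        # so nested nodes do not leak into the parent's values.
        entity = node.findtext("Entity")
        description = node.findtext("Description")
        rows.append((entity, description, parent_entity, depth))
        # findall with a bare tag name also matches direct children only.
        for child in node.findall(row_tag):
            walk(child, entity, depth + 1)

    for top in root.findall(row_tag):
        walk(top, None, 0)
    return rows


rows = flatten_taxonomy(SAMPLE)
for row in rows:
    print(row)
```

If a Spark DataFrame is still needed afterwards, the list of tuples could probably be handed to spark.createDataFrame(rows, ["entity", "description", "parent", "depth"]). Passing an explicit schema may be safer there, since the parent column contains None values.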
