Parsing an XML document in Perl

Question

I have a bizarre XML document arranged in the following manner:

<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
    ...
  </b>
  <b>
    ....
  </b>
  <e/>
</a>

I want to extract the values of d2, d4, d5 for all the c nodes within all the b nodes.

I tried using XML::Simple and ran into a lot of difficulties with array referencing. I tried using XML::DOM, but considering my XML file is 500MB in size, it does not seem to be a good option. Please suggest a good approach as I'm new to Perl.

Could you be more specific about the problems you had with XML::Simple?
– beresfordt, Commented Jul 6, 2012 at 14:53
In some of the XML files the 'b' nodes are empty, which means I get an array referencing error every time this kind of XML file is processed.
– pratz, Commented Jul 6, 2012 at 14:59
You could add a conditional before attempting to access it: do an if( defined {path to node}){ do whatever }
– beresfordt, Commented Jul 6, 2012 at 15:02
500MB isn't a lot of in-memory data for current machines, so XML::DOM would be a valid choice. The main choice really depends on whether you prefer DOM or XPath, or something non-standard like XML::Twig. By the way, with "Please suggest a good approach as I'm new to Perl" are you suggesting we should reserve our poor suggestions for experienced Perl programmers?!
– Borodin, Commented Jul 6, 2012 at 17:12
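The conditional suggested in the comments above might look roughly like the following. This is only a sketch: it assumes XML::Simple is installed, uses a shortened copy of the question's data with one empty b element, and the variable names (@rows, $b_node and so on) are made up for illustration. XML::Simple still loads the whole document into memory, so this addresses the array referencing error, not the 500MB file size.

```perl
use strict;
use warnings;

use XML::Simple qw(XMLin);

# Shortened sample modelled on the question, including an empty <b> element.
my $xml = <<'XML';
<a>
  <b>
    <c c1="blah" c2="blah">
      <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
    </c>
  </b>
  <b>
  </b>
  <e/>
</a>
XML

# ForceArray makes b, c and d array references even when only one is present;
# KeyAttr => [] stops XML::Simple from folding arrays into hashes.
my $data = XMLin($xml, ForceArray => [qw(b c d)], KeyAttr => []);

my @rows;
for my $b_node (@{ $data->{b} || [] }) {
    # An empty <b> has no 'c' key, so skip it instead of dereferencing undef.
    next unless ref $b_node eq 'HASH' && $b_node->{c};
    for my $c_node (@{ $b_node->{c} }) {
        for my $d_node (@{ $c_node->{d} || [] }) {
            # Hash slice: pick the three wanted attributes in a fixed order.
            push @rows, [ @{$d_node}{qw(d2 d4 d5)} ];
        }
    }
}

print join(' ', @$_), "\n" for @rows;
```

Forcing b, c and d to always be arrays is what removes the "sometimes a hash, sometimes an array" guesswork; the `|| []` fallbacks then cover elements that are missing altogether.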

mirod (edited by Borodin) · Accepted Answer · 2012-07-06 17:14:18Z

2

Your question is a bit confusing: you want the attributes of the d elements, not of the c elements. Or maybe you want the values of those attributes no matter what the element under c is.

In any case, especially if the file is big, this looks like a good match for XML::Twig:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new( twig_handlers => { 'b/c/*' => \&get_atts })
         ->parse( \*DATA); # replace by parsefile( 'my.xml') 

sub get_atts
  { my( $t, $elt)= @_;
    foreach my $att ( qw( d2 d4 d5))
      { print "$att: ", $elt->att( $att), " "; }
    print "\n";
    $t->purge; # this frees the memory so you keep at most 1 d element 
  }

__DATA__
<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
  </b>
  <b>
  </b>
  <e/>
</a>

If the attributes are always in d elements, replace 'b/c/*' with 'b/c/d', which will be more efficient.
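If the values are needed in a Perl data structure instead of being printed straight away, the same handler can collect them. The following is a minimal sketch using the narrower 'b/c/d' trigger; the @rows array and the shortened sample data are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;

use XML::Twig;

# Shortened sample modelled on the question's data.
my $xml = <<'XML';
<a>
  <b>
    <c c1="blah" c2="blah">
      <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
      <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
    </c>
  </b>
  <b>
  </b>
  <e/>
</a>
XML

my @rows;    # one hash reference per <d> element

XML::Twig->new(
    twig_handlers => {
        'b/c/d' => sub {
            my ($twig, $elt) = @_;
            # Keep only the three attributes of interest.
            push @rows, { map { ($_ => $elt->att($_)) } qw(d2 d4 d5) };
            $twig->purge;    # free everything parsed so far
        },
    },
)->parse($xml);    # use parsefile('my.xml') for a file on disk

printf "d2=%s d4=%s d5=%s\n", @{$_}{qw(d2 d4 d5)} for @rows;
```

Memory use then depends mostly on how much is kept in @rows. For a very large file it may be better to process each row inside the handler than to accumulate them all.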

edited Jul 6, 2012 at 17:14

Borodin


answered Jul 6, 2012 at 15:25

mirod


1 Comment

Borodin Over a year ago

@mirod: I'm not sure 500MB counts as "big" but XML::Twig is a fine choice.

Borodin · Accepted Answer · 2012-07-06 17:02:57Z

1

There are many XML modules in CPAN that will help you with this, but in this case my money is on XML::XPath, which allows you to succinctly describe the data you want to extract from the XML.

This program uses your sample data and provides the output I think you want (although strictly there are no d="xx" attributes on any <c> nodes; they belong to the <d> nodes).

use strict;
use warnings;

use feature 'say';

use XML::XPath;

my $xml = XML::XPath->new(ioref => \*DATA);

for my $cnode ($xml->find('//b/c/d')->get_nodelist) {
  for ($cnode->find('@d2|@d4|@d5')->get_nodelist) {
    print $_->getData, "\n";
  }
}

__DATA__
<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
    ...
  </b>
  <e/>
</a>

output

blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14
blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14
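The output above loses track of which attribute each value came from. A small variation, sketched here under the assumption that the attribute nodes returned by XML::XPath provide getName alongside getData, prints one labelled line per d element. The @lines array and the shortened sample are illustrative only.

```perl
use strict;
use warnings;

use XML::XPath;

# Shortened sample modelled on the question's data.
my $xml = <<'XML';
<a>
  <b>
    <c c1="blah" c2="blah">
      <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
      <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
    </c>
  </b>
  <e/>
</a>
XML

my $xp = XML::XPath->new(xml => $xml);

my @lines;
for my $d_node ($xp->find('//b/c/d')->get_nodelist) {
    # Each attribute node carries its own name and value.
    my @pairs = map { $_->getName . '=' . $_->getData }
                $d_node->find('@d2|@d4|@d5')->get_nodelist;
    push @lines, join ' ', @pairs;
}

print "$_\n" for @lines;
```

Like the program above, this builds the whole tree in memory, so it is suitable only for documents that fit comfortably in RAM.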

answered Jul 6, 2012 at 17:02

Borodin


3 Comments

mirod Over a year ago

500MB in XML::XPath would probably take about 5G, so I don't think that would be such a great idea

Borodin Over a year ago

@mirod: you're quite right. I have generated a 500MB file containing data similar to the OP's and both XML::DOM and XML::XPath reached 12GB on my 64-bit Perl v5.14 before dying with Out of memory! while trying to read it all in.

mirod Over a year ago

The last time I checked expansion factors I was using 32-bit perls. It makes sense that 64-bit would require even more memory.

choroba · Accepted Answer · 2012-07-12 02:41:15Z

1

Using xsh:

for a/b/c/d ls (@d2 | @d4 | @d5);

Update (for mirod): Using XML::XSH2 from Perl is less elegant, but it can still work:

#!/usr/bin/perl
use strict;
use warnings;

use XML::XSH2;

xsh q{
    open 1.xml ;
    for /a/b/c/d {
        for my $attr in (@d2 | @d4 | @d5) {
            perl { push @ar, $attr }
        }
    }
};

printf "%s:%s\n", $_->name, $_->value for @XML::XSH2::Map::ar;

Or, let Perl write the xsh code for you:

#!/usr/bin/perl
use warnings;
use strict;

use XML::XSH2;

xsh 'open 1.xml';
xsh '$attributes = (' . join('|', map 'a/b/c/d/@d' . $_, 2, 4, 5) . ')';  # d2, d4, d5 on the d elements
for (@$XML::XSH2::Map::attributes) {
    print $_->name, '=', $_->value, "\n";
}

edited Jul 12, 2012 at 2:41

answered Jul 6, 2012 at 15:42

choroba


3 Comments

mirod Over a year ago

I often see your answers using xsh. They are usually very elegant. How easy is it to use xsh from within code though? Usually the goal is not to just print the results, but to do something with them, so if you could show how to get the results in a data structure within Perl code, that would be great (something beyond using a pipe would be better ;--)

mirod Over a year ago

That's interesting. It's a clever way to pass data from xsh to Perl. Thanks a lot.

choroba Over a year ago

@mirod: Another update (sorry for the delay, I am quite busy lately...)
