1

I have a bizarre XML document arranged in the following manner:

<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
    ...
  </b>
  <b>
    ....
  </b>
  <e/>
</a>

I want to extract the values of d2, d4, d5 for all the c nodes within all the b nodes.

I tried using XML::Simple and ran into a lot of difficulties with array referencing. I tried using XML::DOM, but considering my XML file is 500MB in size, it does not seem to be a good option. Please suggest a good approach, as I'm new to Perl.

  • Could you be more specific about the problems you had with XML::Simple? Commented Jul 6, 2012 at 14:53
  • In some of the XML files the 'b' nodes are empty, which means I get an array-referencing error every time such a file is processed. Commented Jul 6, 2012 at 14:59
  • You could add a conditional before attempting to access it, something like if( defined {path to node}){ do whatever } Commented Jul 6, 2012 at 15:02
  • 500MB isn't a lot of in-memory data for current machines, so XML::DOM would be a valid choice. The main choice really depends on whether you prefer DOM or XPath, or something non-standard like XML::Twig. By the way, with "Please suggest a good approach as I'm new to Perl" are you suggesting we should reserve our poor suggestions for experienced Perl programmers?! Commented Jul 6, 2012 at 17:12
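
The conditional suggested in the comments could be sketched roughly as follows. This is only an illustration: the $data structure is hand-built and merely assumed to resemble what XML::Simple might return for the sample document, and as_list is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical structure, hand-built to resemble parsed output;
# the real shape depends on the XML::Simple options in use.
my $data = {
    b => [
        { c => [ { d => [ { d2 => 'blah1', d4 => 'blah3', d5 => 'blah4' } ] } ] },
        {},    # an empty <b> element: no 'c' key at all
    ],
};

# Hypothetical helper: turn "missing", "single item" and
# "array of items" into a plain list that is always safe to loop over.
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

my @rows;
for my $b_node ( as_list( $data->{b} ) ) {
    next unless ref $b_node eq 'HASH';    # skip anything that is not a hash
    for my $c_node ( as_list( $b_node->{c} ) ) {
        next unless ref $c_node eq 'HASH';
        for my $d_node ( as_list( $c_node->{d} ) ) {
            next unless ref $d_node eq 'HASH';
            push @rows, join ' ', map { "$_=$d_node->{$_}" } qw( d2 d4 d5 );
        }
    }
}

print "$_\n" for @rows;
```

Because every level goes through as_list, an empty 'b' node simply contributes nothing instead of triggering a dereferencing error.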

3 Answers

2

Your question is a bit confusing: you want the attributes of the d elements, not of the c elements. Or maybe you want the attribute values no matter what the element under c is.

In any case, especially if the file is big, this looks like a good match for XML::Twig:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new( twig_handlers => { 'b/c/*' => \&get_atts })
         ->parse( \*DATA); # replace by parsefile( 'my.xml') 

sub get_atts
  { my( $t, $elt)= @_;
    foreach my $att ( qw( d2 d4 d5))
      { print "$att: ", $elt->att( $att), " "; }
    print "\n";
    $t->purge; # this frees the memory so you keep at most 1 d element 
  }

__DATA__
<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
  </b>
  <b>
  </b>
  <e/>
</a>

If the attributes are always in d elements, replace 'b/c/*' with 'b/c/d', which will be more efficient.
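
If you need the values in a Perl data structure rather than printed, a minimal sketch along the same lines might look like this. It assumes the attributes live on d elements, uses the 'b/c/d' trigger, and parses a shortened inline copy of the sample; the @rows array and the $xml variable are names made up for the illustration.

```perl
use strict;
use warnings;

use XML::Twig;

# Shortened copy of the sample data, kept inline for the sketch.
my $xml = <<'XML';
<a>
  <b>
    <c c1="blah" c2="blah">
      <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
      <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
    </c>
  </b>
  <b>
  </b>
  <e/>
</a>
XML

my @rows;    # one hash ref per d element

my $twig = XML::Twig->new(
    twig_handlers => {
        'b/c/d' => sub {
            my ( $t, $d ) = @_;
            my %row;
            $row{$_} = $d->att($_) for qw( d2 d4 d5 );
            push @rows, \%row;
            $t->purge;    # release what has been parsed so far
        },
    },
);
$twig->parse($xml);    # for a real file, parsefile( 'my.xml') instead

for my $row (@rows) {
    print join( ' ', map { "$_=$row->{$_}" } qw( d2 d4 d5 ) ), "\n";
}
```

Only the three wanted values per d element are kept, so memory use grows with the number of rows collected rather than with the size of the file; the empty b element just never fires the handler.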


1 Comment

@mirod: I'm not sure 500MB counts as "big" but XML::Twig is a fine choice.
1

There are many XML modules in CPAN that will help you with this, but in this case my money is on XML::XPath, which allows you to succinctly describe the data you want to extract from the XML.

This program uses your sample data and produces the output I think you want (although strictly speaking there are no d="xx" attributes on any <c> nodes).

use strict;
use warnings;

use feature 'say';

use XML::XPath;

my $xml = XML::XPath->new(ioref => \*DATA);

# Each node matched here is a <d> element, so name the variable accordingly
for my $dnode ($xml->find('//b/c/d')->get_nodelist) {
  for ($dnode->find('@d2|@d4|@d5')->get_nodelist) {
    say $_->getData;
  }
}

__DATA__
<a>
   <b>
     <c c1="blah" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
     <c c1="blahc" c2="blah">
        <d d1="blah0" d2="blah1" d3="blah2" d4="blah3" d5="blah4" />
        <d d1="blah5" d2="blah6" d3="blah7" d4="blah8" d5="blah9" />
        <d d1="blah10" d2="blah11" d3="blah12" d4="blah13" d5="blah14" />
     </c>
    ...
  </b>
  <e/>
</a>

output

blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14
blah1
blah3
blah4
blah6
blah8
blah9
blah11
blah13
blah14

3 Comments

500MB in XML::XPath would probably take about 5GB, so I don't think that would be such a great idea.
@mirod: you're quite right. I have generated a 500MB file containing data similar to the OP's and both XML::DOM and XML::XPath reached 12GB on my 64-bit Perl v5.14 before dying with Out of memory! while trying to read it all in.
The last time I checked expansion factors I was using 32-bit perls. It makes sense that 64-bit would require even more memory.
1

Using xsh:

for a/b/c/d ls (@d2 | @d4 | @d5);

Update (for mirod): Using XML::XSH2 from Perl is less elegant, but can still work:

#!/usr/bin/perl
use strict;
use warnings;

use XML::XSH2;

xsh q{
    open 1.xml ;
    for /a/b/c/d {
        for my $attr in (@d2 | @d4 | @d5) {
            perl { push @ar, $attr }
        }
    }
};

printf "%s:%s\n", $_->name, $_->value for @XML::XSH2::Map::ar;

Or, let Perl write the xsh code for you:

#!/usr/bin/perl
use warnings;
use strict;

use XML::XSH2;

xsh 'open 1.xml';
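# Build the XPath union a/b/c/d/@d2|a/b/c/d/@d4|a/b/c/d/@d5 (the attributes the question asks for)
xsh '$attributes = (' . join('|', map 'a/b/c/d/@d' . $_, 2, 4, 5) . ')';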
for (@$XML::XSH2::Map::attributes) {
    print $_->name, '=', $_->value, "\n";
}

3 Comments

I often see your answers using xsh. They are usually very elegant. How easy is it to use xsh from within code though? Usually the goal is not to just print the results, but to do something with them, so if you could show how to get the results in a data structure within Perl code, that would be great (something beyond using a pipe would be better ;--)
That's interesting. It's a clever way to pass data from xsh to Perl. Thanks a lot.
@mirod: Another update (sorry for the delay, I am quite busy lately...)
