How to parse XML to get specific nodes having a particular value for an attribute

Question

In the XML below, using Perl or Python (whichever is faster), I want a way to get all nodes (or node names) that have attribute1 set to "characters" and attribute2 either not set to "chr" or missing altogether. Please keep in mind that my XML can have 500 nodes, so kindly suggest a fast way to get all matching nodes.

<NODE attribute1="characters" attribute2="chr" name="node1">
  <content>
    value1
  </content>
</NODE>

<NODE attribute1="camera"  name="node2">
  <content>
    value2
  </content>
</NODE>

<NODE attribute1="camera" attribute2="car" name="node3">
  <content>
    value2
  </content>
</NODE>

Hello. Users are usually asked to at least show a sample of what they have tried so far in solving a problem.
– Sobrique, Commented Jan 28, 2015 at 15:12

Kent · Accepted Answer · 2015-01-28 13:51:22Z

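What you are looking for is an XPath expression. The one below selects NODE elements whose attribute1 is "characters" and whose attribute2 is either absent or equal to "chr"; to exclude "chr" instead, as the question words it, change = to != in the last test: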

//NODE[@attribute1="characters" and ( not(@attribute2) or @attribute2="chr")]

A quick test with xmllint:

kent$  cat f.xml
<root>
<NODE attribute1="characters" attribute2="chr" name="node1">
  <content>
    value1
  </content>
</NODE>

<NODE attribute1="camera"  name="node2">
  <content>
    value2
  </content>
</NODE>

<NODE attribute1="camera" attribute2="car" name="node3">
  <content>
    value2
  </content>
</NODE>
</root>

kent$  xmllint --xpath '//NODE[@attribute1="characters" and ( not(@attribute2) or @attribute2="chr")]' f.xml
<NODE attribute1="characters" attribute2="chr" name="node1">
  <content>
    value1
  </content>
</NODE>

UPDATE

If you only want to extract the value of the name attribute, you can use this XPath:

//NODE[@attribute1="characters" and ( not(@attribute2) or @attribute2="chr")]/@name

or, to get the plain string value (of the first match only):

string(//NODE[@attribute1="characters" and ( not(@attribute2) or @attribute2="chr")]/@name)

Tested again with xmllint:

kent$  xmllint --xpath '//NODE[@attribute1="characters" and ( not(@attribute2) or @attribute2="chr")]/@name' f.xml
 name="node1"

kent$  xmllint --xpath 'string(//NODE[@attribute1="characters" and ( not(@attribute2) or @attribute2="chr")]/@name)' f.xml
node1
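For readers without xmllint, here is a minimal Python sketch of the same selection using only the standard library. ElementTree's limited XPath subset has no not() or boolean operators, so the attribute tests are written as plain Python. The function name matching_names is made up for illustration, and the predicate mirrors the XPath exactly as written above (attribute2 absent or equal to "chr"); flip the comparison to != if you want to exclude "chr".

```python
import xml.etree.ElementTree as ET

# Same sample document as f.xml above, condensed.
XML = """<root>
<NODE attribute1="characters" attribute2="chr" name="node1"><content>value1</content></NODE>
<NODE attribute1="camera" name="node2"><content>value2</content></NODE>
<NODE attribute1="camera" attribute2="car" name="node3"><content>value2</content></NODE>
</root>"""


def matching_names(xml_text):
    """Return the name attribute of every NODE passing the same test as the XPath above."""
    root = ET.fromstring(xml_text)
    names = []
    for node in root.iter("NODE"):
        attr2 = node.get("attribute2")  # None when the attribute is absent
        if node.get("attribute1") == "characters" and (attr2 is None or attr2 == "chr"):
            names.append(node.get("name"))
    return names


print(matching_names(XML))  # prints ['node1']
```

Unlike string(...), this collects the name of every matching node rather than only the first.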

edited Jan 28, 2015 at 13:51

answered Jan 28, 2015 at 13:09


2 Comments

Dilip Over a year ago

Can you please give me an XPath that prints only the names of the nodes?

Kent Over a year ago

@Dilip I updated the answer; the XPath now extracts only the value of the name attribute.

Sobrique · Accepted Answer · 2015-01-28 15:28:21Z

As you've tagged this as perl/python, I shall offer a perlish approach.

Perl has a nice library called XML::Twig which I really like for parsing XML.

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

my $parser = XML::Twig->new();

#would probably use parsefile instead.
#e.g.:
# my $parser = XML::Twig -> new -> parsefile ( 'your_file.xml' );
{
    local $/;
    $parser->parse(<DATA>);
}


#iterate over all the child elements of the root.
foreach my $element ( $parser->root()->children() ) {

    #test your conditions: attribute1 must be 'characters', and
    #attribute2 must be either absent or different from 'chr'
    #(none of the three sample nodes below satisfies both).
    if ($element->att('attribute1') eq 'characters'
        and ( not defined $element->att('attribute2')
                       or $element->att('attribute2') ne 'chr' )
        )
    {
        #extract name if condition matches
        print $element->att('name'), "\n";
    }
}


__DATA__
<DATA>
  <NODE attribute1="characters" attribute2="chr" name="node1">
    <content>
      value1
    </content>
  </NODE>

  <NODE attribute1="camera"  name="node2">
    <content>
      value2
    </content>  
  </NODE>

  <NODE attribute1="camera" attribute2="car" name="node3">
    <content>
      value2
    </content>
  </NODE>
</DATA>

Vivek Sable · Accepted Answer · 2015-01-28 13:41:54Z

Use the lxml module.

content = """
<body>
<NODE attribute1="characters" attribute2="chr" name="node1">
  <content>
    value1
  </content>
</NODE>

<NODE attribute1="camera"  name="node2">
  <content>
    value2
  </content>
</NODE>

<NODE attribute1="camera" attribute2="car" name="node3">
  <content>
    value2
  </content>
</NODE>

<NODE attribute1="characters" attribute2="car" name="node3">
  <content>
    value2
  </content>
</NODE>

<NODE attribute1="characters" name="node3">
  <content>
    value2
  </content>
</NODE>

</body>
"""

from lxml import etree
root = etree.fromstring(content)
l = root.xpath('//*[@attribute1="characters" and ( not(@attribute2) or @attribute2!="chr") ]')
for i in l:
    print(i.tag, i.attrib)

Output (Python 3):

$ python test.py
NODE {'attribute1': 'characters', 'attribute2': 'car', 'name': 'node3'}
NODE {'attribute1': 'characters', 'name': 'node3'}
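Since the question asks about speed, here is a hedged sketch of a streaming variant using the standard library's xml.etree.ElementTree.iterparse, which handles each NODE as soon as its closing tag is read instead of building the whole tree first. For a file of about 500 nodes this is unlikely to matter much; it is shown only as an option for larger inputs. The function name stream_matching_names and the node names a to d are made up for illustration, and the condition is the one from the question (attribute2 absent or different from "chr").

```python
import io
import xml.etree.ElementTree as ET


def stream_matching_names(source):
    """Yield names of NODE elements with attribute1="characters"
    and attribute2 either absent or not equal to "chr"."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "NODE":
            continue
        # get() returns None for a missing attribute, and None != "chr" is True.
        if elem.get("attribute1") == "characters" and elem.get("attribute2") != "chr":
            yield elem.get("name")
        elem.clear()  # discard the finished element's content to save memory


# Hypothetical sample input; a real script would pass a file name or file object.
sample = io.BytesIO(b"""<body>
<NODE attribute1="characters" attribute2="chr" name="a"/>
<NODE attribute1="characters" attribute2="car" name="b"/>
<NODE attribute1="characters" name="c"/>
<NODE attribute1="camera" name="d"/>
</body>""")

print(list(stream_matching_names(sample)))  # prints ['b', 'c']
```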
