Extract attributes and values from XML file in perl

Question

This is part of the XML output I get from Stanford CoreNLP:

<collapsed-ccprocessed-dependencies>  
      <dep type="nn">
        <governor idx="25">Mullen</governor>
        <dependent idx="24">Ms.</dependent>
      </dep>
      <dep type="nsubj">
        <governor idx="26">said</governor>
        <dependent idx="25">Mullen</dependent>
      </dep>
    </collapsed-ccprocessed-dependencies>
  </sentence>
</sentences>
<coreference>
  <coreference>
    <mention representative="true">
      <sentence>1</sentence>
      <start>1</start>
      <end>2</end>
      <head>1</head>
    </mention>
    <mention>
      <sentence>1</sentence>
      <start>33</start>
      <end>34</end>
      <head>33</head>
    </mention>
  </coreference>
  <coreference>
    <mention representative="true">
      <sentence>1</sentence>
      <start>6</start>
      <end>9</end>
      <head>8</head>
    </mention>
    <mention>
      <sentence>1</sentence>
      <start>10</start>
      <end>11</end>
      <head>10</head>
    </mention>
  </coreference>
</coreference>

How do I parse it using Perl so that I get something like this:

1. sentence 1, head 1
   sentence 1, head 33
2. sentence 1, head 8
   sentence 1, head 10

I have tried XML::Simple, but the output is not easy to understand. Here is what I did:

use XML::Simple;
use Data::Dumper;

$outfile = $filename.".xml";
$xml = new XML::Simple;

$data = $xml -> XMLin($outfile);
print Dumper($data);

You're going to have to show what you've tried so far.

– kjprice, Commented Apr 8, 2013 at 21:24

ikegami · Accepted Answer · 2013-04-08 21:28:19Z

4

XML::Simple has the hardest interface to use. You could use something like

use XML::LibXML qw( );

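# $xml must hold the XML document itself as a string;
# for a file, use $parser->parse_file($qfn) instead.
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);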

my $coref_count;
for my $coref_node ($doc->findnodes('//coreference/coreference')) {
   ++$coref_count;

   my $mention_count;
   for my $mention_node ($coref_node->findnodes('mention')) {
      ++$mention_count;

      my $sentence = $mention_node->findvalue('sentence/text()');
      my $head     = $mention_node->findvalue('head/text()');

      # Number only the first mention of each chain; indent the rest.
      my $prefix = "$coref_count.";
      $prefix = ' ' x length($prefix) if $mention_count > 1;

      print "$prefix sentence $sentence, head $head\n";
   }
}

answered Apr 8, 2013 at 21:28


3 Comments

user2154731 Over a year ago

Thanks. I keep getting a syntax error saying that the '<' tag is missing. So, I guess I am making a mistake in providing the input. I provided the filename as input. Could you guide me as to where I am going wrong?

ikegami Over a year ago

You can use ->parse_file($qfn) if you have a file

user2154731 Over a year ago

Oh, I was calling it as ->parsefile, which is probably why it didn't work. Thanks again!
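The difference discussed in these comments can be sketched as follows. This is only an illustration, assuming XML::LibXML is installed; the sample XML and the temporary file are made up for the demonstration and are not the asker's data.

```perl
use strict;
use warnings;

use File::Temp qw( tempfile );
use XML::LibXML qw( );

# Hypothetical sample data, trimmed down from the question's excerpt.
my $xml = '<root><coreference><coreference>'
        . '<mention><sentence>1</sentence><head>8</head></mention>'
        . '</coreference></coreference></root>';

# Write the sample to a temporary file so both entry points can be compared.
my ($fh, $qfn) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} $xml;
close($fh) or die "Can't close $qfn: $!";

my $parser = XML::LibXML->new();

my $doc_from_string = $parser->parse_string($xml);   # argument is XML text
my $doc_from_file   = $parser->parse_file($qfn);     # argument is a file name

# Handing a file name to parse_string() throws, because the name is not XML.
my $name_parsed_as_xml = eval { $parser->parse_string($qfn); 1 } ? 1 : 0;

my $head_from_string = $doc_from_string->findvalue('//mention/head');
my $head_from_file   = $doc_from_file->findvalue('//mention/head');

print "head from string: $head_from_string\n";
print "head from file:   $head_from_file\n";
print "file name accepted as XML: ", ($name_parsed_as_xml ? 'yes' : 'no'), "\n";
```

Both calls yield the same document; the only thing that changes is whether the argument is the XML text or the path to it.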

Borodin · Accepted Answer · 2013-04-08 22:39:01Z

2

Regrettably, XML::Simple was first to stake its claim for the Simple namespace. It is perhaps simple in implementation but not so simple in use except in the most trivial of cases. If you want something similar, then XML::Smart offers a nested data-structure API but does it a lot better.
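For comparison, here is a rough sketch of what walking this data with XML::Simple tends to involve. It assumes a made-up wrapper element named document around the coreference block (the real CoreNLP file nests it more deeply, so the hash path would differ), and it relies on the ForceArray and KeyAttr options to keep the structure predictable.

```perl
use strict;
use warnings;

use XML::Simple qw( :strict );

# Hypothetical, cut-down document; only sentence and head are kept.
my $xml = <<'XML';
<document>
  <coreference>
    <coreference>
      <mention representative="true"><sentence>1</sentence><head>1</head></mention>
      <mention><sentence>1</sentence><head>33</head></mention>
    </coreference>
    <coreference>
      <mention representative="true"><sentence>1</sentence><head>8</head></mention>
      <mention><sentence>1</sentence><head>10</head></mention>
    </coreference>
  </coreference>
</document>
XML

# ForceArray makes these elements array refs even when they occur once;
# KeyAttr => [] stops XML::Simple folding arrays into hashes by attribute.
my $data = XMLin($xml, ForceArray => [ 'coreference', 'mention' ], KeyAttr => []);

my @lines;
my $chain_no = 0;
for my $group (@{ $data->{coreference} }) {               # outer <coreference>
  for my $chain (@{ $group->{coreference} || [] }) {      # inner <coreference>
    my $prefix = ++$chain_no . '.';
    for my $mention (@{ $chain->{mention} || [] }) {
      push @lines, "$prefix sentence $mention->{sentence}, head $mention->{head}";
      $prefix = ' ' x length($prefix);                    # indent later mentions
    }
  }
}
print "$_\n" for @lines;
```

Without those two options the shape of the returned structure changes depending on how many children each element happens to have, which is much of what makes the Dumper output hard to follow.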

Thankfully there is a wide choice of excellent Perl XML modules. XML::Twig is one of these; it lets you specify callback subroutines that are executed when specific elements are encountered during parsing.

This program uses XML::Twig, and sets a callback on coreference[mention], i.e. coreference elements that have at least one mention child.

The code in the handler subroutine makes no checks and assumes that there will always be at least two mention child elements, each with a sentence and a head element. The text values of these nodes are output in the format you have described.

use strict;
use warnings;

use XML::Twig;

my $n = 0;  # running count of coreference chains, shared with the handler

my $twig = XML::Twig->new(twig_handlers => {
  'coreference[mention]' => \&handle_coreference
});
$twig->parsefile('myxml.xml');

sub handle_coreference {

  my ($twig, $elt) = @_;

  my @mentions = $elt->children('mention');

  for my $i (0 .. $#mentions) {
    printf "%s sentence %d, head %d\n",
      $i == 0 ? sprintf '%3d.', ++$n : '    ',
      map $mentions[$i]->first_child_trimmed_text($_), qw/ sentence head /;
  }
}

output

  1. sentence 1, head 1
     sentence 1, head 33
  2. sentence 1, head 8
     sentence 1, head 10
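Since the handler above makes no checks, a more defensive variant might look like the sketch below. It is an illustration rather than part of the original answer: it attaches the handler to plain coreference elements and filters inside the callback, it parses an inline sample string (made up here, with one deliberately incomplete mention), and it collects the lines in an array before printing.

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample: the second mention has no <head> element.
my $xml = <<'XML';
<root>
  <coreference>
    <coreference>
      <mention representative="true"><sentence>1</sentence><head>1</head></mention>
      <mention><sentence>1</sentence></mention>
      <mention><sentence>1</sentence><head>33</head></mention>
    </coreference>
  </coreference>
</root>
XML

my $n = 0;     # running count of coreference chains
my @lines;     # formatted output, one entry per usable mention

my $twig = XML::Twig->new(twig_handlers => {
  coreference => sub {
    my ($t, $elt) = @_;
    my $first = 1;
    # The outer <coreference> has no direct mention children, so it adds nothing.
    for my $mention ($elt->children('mention')) {
      my $sentence = $mention->first_child('sentence');
      my $head     = $mention->first_child('head');
      if (!$sentence || !$head) {
        warn "Skipping mention without sentence or head\n";
        next;
      }
      my $prefix = $first ? sprintf('%3d.', ++$n) : '    ';
      $first = 0;
      push @lines, sprintf '%s sentence %d, head %d',
        $prefix, $sentence->trimmed_text, $head->trimmed_text;
    }
  },
});
$twig->parse($xml);

print "$_\n" for @lines;
```

Incomplete mentions are reported on STDERR and skipped instead of producing warnings about undefined values in the output.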

edited Apr 8, 2013 at 22:39

answered Apr 8, 2013 at 22:28


3 Comments

user2154731 Over a year ago

Thanks a lot! It works. I was stuck with this problem for so many days. Really appreciate your help.

ikegami Over a year ago

I doubt there's always going to be exactly two mentions.

Borodin Over a year ago

@ikegami: Unfortunately we know nothing at all about this data. Nevertheless I have generalised the solution as it is also neater that way.

runrig · Accepted Answer · 2013-04-08 21:52:24Z

0

Something like:

use strict;
use warnings;

use XML::Rules;

my $mention_cnt;
my $ref_cnt = 1;
my @rules = (
  coreference => sub {
    $ref_cnt++ if $mention_cnt;
    $mention_cnt = 0;
  },
  mention => sub {
    my $d = $_[1];
    my $str = $mention_cnt++ ? " " x 6 : sprintf("%-6s", "$ref_cnt.");
    print "$str sentence: $d->{sentence} head: $d->{head}\n";
  },
  'sentence,head' => 'content',
);

my $xr = XML::Rules->new(
  rules => \@rules,
);
$xr->parse($xml);

edited Apr 8, 2013 at 21:52

answered Apr 8, 2013 at 21:28


5 Comments

ikegami Over a year ago

Note the desired output now that proper formatting has been applied.

runrig Over a year ago

@ikegami - Oh, well, he did say 'something like...' :-) Maybe I'll update, maybe I won't...or anyone w/editing powers can update this answer...

user2154731 Over a year ago

I am new to Perl and still trying to understand the steps involved in parsing a file, in addition to a lot of other things. I would really appreciate it if you would update it.

user2154731 Over a year ago

Thanks a lot. One quick question, where is the input provided? Is it the $xml variable? I keep getting a syntax error on that line.

runrig Over a year ago

Yes, assuming $xml is a string of XML, as the XML can't just spontaneously create itself. Or it could come from a filehandle, or using parse_file() instead, from a file.
