This is part of the XML output file I get from Stanford CoreNLP:

<collapsed-ccprocessed-dependencies>  
      <dep type="nn">
        <governor idx="25">Mullen</governor>
        <dependent idx="24">Ms.</dependent>
      </dep>
      <dep type="nsubj">
        <governor idx="26">said</governor>
        <dependent idx="25">Mullen</dependent>
      </dep>
    </collapsed-ccprocessed-dependencies>
  </sentence>
</sentences>
<coreference>
  <coreference>
    <mention representative="true">
      <sentence>1</sentence>
      <start>1</start>
      <end>2</end>
      <head>1</head>
    </mention>
    <mention>
      <sentence>1</sentence>
      <start>33</start>
      <end>34</end>
      <head>33</head>
    </mention>
  </coreference>
  <coreference>
    <mention representative="true">
      <sentence>1</sentence>
      <start>6</start>
      <end>9</end>
      <head>8</head>
    </mention>
    <mention>
      <sentence>1</sentence>
      <start>10</start>
      <end>11</end>
      <head>10</head>
    </mention>
  </coreference>
</coreference>

How do I parse it using Perl so that I get something like this:

1. sentence 1, head 1
   sentence 1, head 33
2. sentence 1, head 8
   sentence 1, head 10

I have tried XML::Simple, but the output is not easy to understand. Here is what I did:

use XML::Simple;
use Data::Dumper;

$outfile = $filename.".xml";
$xml = new XML::Simple;

$data = $xml -> XMLin($outfile);
print Dumper($data);
    You're going to have to show what you've tried so far. Commented Apr 8, 2013 at 21:24

3 Answers


XML::Simple has the hardest interface to use. You could use something like the following, where $xml holds the XML document as a string:

use XML::LibXML qw( );

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xml);

my $coref_count;
for my $coref_node ($doc->findnodes('//coreference/coreference')) {
   ++$coref_count;

   my $mention_count;
   for my $mention_node ($coref_node->findnodes('mention')) {
      ++$mention_count;

      my $sentence = $mention_node->findvalue('sentence/text()');
      my $head     = $mention_node->findvalue('head/text()');

      my $prefix = "$coref_count.";
      # Blank out the number on every mention after the first in a group
      $prefix = ' ' x length($prefix) if $mention_count > 1;

      print "$prefix sentence $sentence, head $head\n";
   }
}
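
For a self-contained run, here is a minimal sketch of the same approach with the input spelled out. The inline sample XML, the root wrapper element, and the coref_lines helper are assumptions made for illustration; they are not part of the CoreNLP output or of XML::LibXML. With a real file, XML::LibXML->load_xml(location => $qfn) would replace the string => call.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical, trimmed-down stand-in for the CoreNLP output.
my $xml = <<'XML';
<root>
  <coreference>
    <coreference>
      <mention representative="true"><sentence>1</sentence><head>1</head></mention>
      <mention><sentence>1</sentence><head>33</head></mention>
    </coreference>
    <coreference>
      <mention representative="true"><sentence>1</sentence><head>8</head></mention>
      <mention><sentence>1</sentence><head>10</head></mention>
    </coreference>
  </coreference>
</root>
XML

# Hypothetical helper: returns one formatted line per mention.
sub coref_lines {
    my ($doc) = @_;
    my @lines;
    my $coref_count = 0;
    for my $coref_node ($doc->findnodes('//coreference/coreference')) {
        ++$coref_count;
        my $number = "$coref_count.";
        my $first  = 1;
        for my $mention_node ($coref_node->findnodes('mention')) {
            # Number only the first mention of each group; pad the rest.
            my $prefix = $first ? $number : ' ' x length($number);
            $first = 0;
            push @lines, sprintf '%s sentence %s, head %s',
                $prefix,
                $mention_node->findvalue('sentence'),
                $mention_node->findvalue('head');
        }
    }
    return @lines;
}

my $doc = XML::LibXML->load_xml(string => $xml);
print "$_\n" for coref_lines($doc);
```

Returning the lines instead of printing them inside the loop keeps the formatting easy to check separately from the file handling.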

3 Comments

Thanks. I keep getting a syntax error saying that the '<' tag is missing. So, I guess I am making a mistake in providing the input. I provided the filename as input. Could you guide me as to where I am going wrong?
You can use ->parse_file($qfn) if you have a file
Oh, I was calling it as ->parsefile, which is probably why it didn't work. Thanks again!

Regrettably, XML::Simple was first to stake its claim for the Simple namespace. It is perhaps simple in implementation but not so simple in use except in the most trivial of cases. If you want something similar, then XML::Smart offers a nested data-structure API but does it a lot better.

Thankfully there is a wide choice of excellent Perl XML modules. XML::Twig is one of these, and it allows you to specify callback subroutines that will be executed when specific elements within the XML data are encountered during parsing.

This program uses XML::Twig, and sets a callback on coreference[mention], i.e. coreference elements that have at least one mention child.

The code in the handler subroutine makes no checks and assumes that every mention child element has both a sentence and a head element. The text values of these nodes are output in the format you have described.

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new(twig_handlers => {
  'coreference[mention]' => \&handle_coreference
});
$twig->parsefile('myxml.xml');

my $n;
sub handle_coreference {

  my ($twig, $elt) = @_;

  my @mentions = $elt->children('mention');

  for my $i (0 .. $#mentions) {
    printf "%s sentence %d, head %d\n",
      $i == 0 ? sprintf '%3d.', ++$n : '    ',
      map $mentions[$i]->first_child_trimmed_text($_), qw/ sentence head /;
  }
}

output

  1. sentence 1, head 1
     sentence 1, head 33
  2. sentence 1, head 8
     sentence 1, head 10

3 Comments

Thanks a lot! It works. I was stuck with this problem for so many days. Really appreciate your help.
I doubt there's always going to be exactly two mentions.
@ikegami: Unfortunately we know nothing at all about this data. Nevertheless I have generalised the solution as it is also neater that way.

Something like:

use strict;
use warnings;

use XML::Rules;

my $mention_cnt;
my $ref_cnt = 1;
my @rules = (
  coreference => sub {
    $ref_cnt++ if $mention_cnt;
    $mention_cnt = 0;
  },
  mention => sub {
    my $d = $_[1];
    my $str = $mention_cnt++ ? " " x 6 : sprintf("%-6s", "$ref_cnt.");
    print "$str sentence: $d->{sentence} head: $d->{head}\n";
  },
  'sentence,head' => 'content',
);

my $xr = XML::Rules->new(
  rules => \@rules,
);
$xr->parse($xml);

5 Comments

Note the desired output now that proper formatting has been applied.
@ikegami - Oh, well, he did say 'something like...' :-) Maybe I'll update, maybe I won't...or anyone w/editing powers can update this answer...
I am new to perl and still trying to understand the steps involved in parsing a file in addition to a lot of other things. Would really appreciate if you would update it.
Thanks a lot. One quick question, where is the input provided? Is it the $xml variable? I keep getting a syntax error on that line.
Yes, assuming $xml is a string of XML, as the XML can't just spontaneously create itself. Or it could come from a filehandle, or using parse_file() instead, from a file.
