The input text file contains the following:

....    
    ponies B-pro        
    were I-pro        
    used I-pro    
    A O        
    report O        
    of O    
    indirect B-cd        
    were O
    . O    
...

Desired output XML file:

<sen> 
 <base id="pro">
  <w id="1">ponies</w>
  <w id="2">were</w>
  <w id="3">used</w>
 </base>A report of 
 <base id="cd">indirect</base> were 
</sen>

I want to build an XML file by reading the text file. B- marks the beginning of a tag, I- marks a word that continues inside that tag, and "O" marks a word outside any base tag, which means it exists only in the <sen> tag.

I tried the following code:

#!/usr/local/bin/perl -w    
open(my $f, "input.txt") or die "Can't";    
open(my $o, ">output.xml") or die "Can't";    
my $c;   

# Read one line and return the markup for its word, or nothing at "." or EOF.
sub read_line {
  my $fh = shift;
  if ($fh and my $line = <$fh>) {
    chomp($line);
    my @words = split(/\t/, $line);
    my $word = $words[0];
    my $group = $words[1];
    if ($word eq ".") {
      return;
    }
    else {
      if ($group ne 'O') {
        my @b = split(/\-/, $group);
        if ($b[0] eq 'B') {
          my $e = "<e id=\"";
          $e .= $b[1] . "\">";
          $e .= $word . "</e>";
          return $e;
        }
        if ($b[0] eq 'I') {
          my $w = "<w id=\"";
          $w .= $c . "\">";
          $w .= $word . "</w>";
          $c++;
          return $w;
        }
      }
      else {
        $c = 2;
        return $word;
      }
    }
  }
  return;
}

sub get_text(){
  my $txt = "";
  my $r = read_line($f);
  while ($r) {
    if ($r =~ m/[[:punct:]]/) {
      chop($txt);
      $txt .= " " . $r . " ";
    }
    else {
      $txt .= $r . " ";
    }
    $r = read_line($f);
  }
  chop($txt);
  return "<sen>" . $txt . ".</sen>";
}

Instead, I'm getting this output:

<sen> 
 <base id="pro"> ponies </base>
  <w id="2">were</w>
  <w id="3">were</w>
 A report of 
 <base id="cd">indirect</base> were 
</sen>

I would really appreciate any help.

Thanks

  • Don't try to generate XML by bashing strings together. Use a proper XML module. Commented Dec 6, 2010 at 23:05
  • There's a bunch of ambiguities in your question -- is indirect really supposed to be text directly inside the <base id="cd"> instead of getting a <w>? Do <w> IDs just increment globally? (XML forbids reusing an ID). What happens if we see blah I-foo immediately after blah B-bar (the base ID doesn't match)? I have some working code but I can't really say that it's right without answers to these questions. Commented Dec 7, 2010 at 2:40
  • Please show it to me; maybe I can get some ideas from it. Thanks. Commented Dec 7, 2010 at 10:08
  • Word IDs increment only when the word following a 'B-' word has 'I-', and they reset to 2 if a word has 'O' in the second column. Whenever I find a 'B-', I should reset the counter immediately. Commented Dec 7, 2010 at 10:53

2 Answers

Writing XML "by hand" will only get you in trouble. Use a module from CPAN.

In your case, I would first put the data into a proper Perl data structure (maybe a hash containing some arrays, or something similar) and then use a module (e.g. XML::Simple for starters) to write it out to a file.
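
As a minimal sketch of that first step, the word/tag pairs could be grouped into a list of chunks before any XML is written. The sub name group_tokens and the chunk layout (a hash with either text, or id plus words) are hypothetical choices for illustration, and the pairs are hard-coded here rather than read from input.txt. It also assumes an I- tag that does not match the open span should fall back to plain text, which the question does not specify.

```perl
use strict;
use warnings;

# Group [word, tag] pairs into chunks. A chunk is either plain text
# ({ text => ... }) or a tagged span ({ id => ..., words => [...] }).
# group_tokens and this layout are hypothetical, not from the question.
sub group_tokens {
    my @pairs = @_;
    my @chunks;
    for my $pair (@pairs) {
        my ($word, $tag) = @$pair;
        if ($tag =~ /^B-(.+)$/) {
            # B- always opens a new span
            push @chunks, { id => $1, words => [$word] };
        }
        elsif ($tag =~ /^I-(.+)$/
            && @chunks
            && defined $chunks[-1]{id}
            && $chunks[-1]{id} eq $1)
        {
            # I- with the same id continues the span that is still open
            push @{ $chunks[-1]{words} }, $word;
        }
        else {
            # 'O', or an I- with no matching open span (assumption): plain text
            push @chunks, { text => $word };
        }
    }
    return @chunks;
}

my @pairs = (
    ['ponies', 'B-pro'], ['were', 'I-pro'], ['used', 'I-pro'],
    ['A', 'O'], ['report', 'O'], ['of', 'O'],
    ['indirect', 'B-cd'], ['were', 'O'],
);
my @chunks = group_tokens(@pairs);

for my $chunk (@chunks) {
    if (defined $chunk->{id}) {
        print "span $chunk->{id}: @{ $chunk->{words} }\n";
    }
    else {
        print "text: $chunk->{text}\n";
    }
}
```

Keeping the grouping separate from the output step means the rules for B-, I- and O live in one place, and whichever XML module is chosen only has to walk the finished list.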

1 Comment

XML::Simple would not work in this case as the output includes mixed content

As Javs said, you want to use a module rather than do this by hand. For your purposes, since you have mixed content, I recommend XML::LibXML. Here is an example I made to test that you can indeed do mixed content like you've got:

use XML::LibXML;

my $doc = XML::LibXML::Document->new();

my $root = $doc->createElement('html');
$doc->setDocumentElement($root);
my $body = $doc->createElement('body');
$root->appendChild($body);

my $link = $doc->createElement('a');
$link->setAttribute('href', 'http://google.com');
$link->appendText('Google');
$body->appendChild($link);

$body->appendText('Inline Text');

print $doc->toString;
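
The same calls could be applied to the structure from the question. This is only a sketch: the @chunks list is a hypothetical, hard-coded stand-in for the parsed input. It assumes that a one-word span holds its text directly (as in the desired output for "indirect") and that <w> ids restart at 1 inside each <base>; the comments on the question leave both points open.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical chunk list standing in for the parsed input file.
my @chunks = (
    { id => 'pro', words => ['ponies', 'were', 'used'] },
    { text => 'A report of' },
    { id => 'cd', words => ['indirect'] },
    { text => 'were' },
);

my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $sen = $doc->createElement('sen');
$doc->setDocumentElement($sen);

for my $chunk (@chunks) {
    if (defined $chunk->{id}) {
        my $base = $doc->createElement('base');
        $base->setAttribute('id', $chunk->{id});
        my @words = @{ $chunk->{words} };
        if (@words == 1) {
            # Assumption: a one-word span holds its text directly
            $base->appendText($words[0]);
        }
        else {
            # Assumption: <w> ids restart at 1 inside each <base>
            my $n = 1;
            for my $word (@words) {
                my $w = $doc->createElement('w');
                $w->setAttribute('id', $n++);
                $w->appendText($word);
                $base->appendChild($w);
            }
        }
        $sen->appendChild($base);
    }
    else {
        # Plain text becomes a text node between elements (mixed content)
        $sen->appendText(" $chunk->{text} ");
    }
}

my $xml = $sen->toString;
print "$xml\n";
```

Because the library builds the tree, closing tags and escaping of characters such as & or < in the words are handled for you, which is the main risk when concatenating strings by hand.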

2 Comments

Thanks a lot, this really helps. Do you have an idea how I can detect whether the next word in the text file has an I- tag or an 'O'?
You might try using Regular Expressions with a lookahead.
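
A different approach from the lookahead suggested above, sketched under the assumption that the whole file fits in memory: read every row into an array first, then peek at the next index. The helper name next_tag and the hard-coded @rows are hypothetical, and treating the end of input as 'O' is an assumption.

```perl
use strict;
use warnings;

# Hypothetical rows, as if already read from the file and split into [word, tag].
my @rows = (
    ['ponies', 'B-pro'], ['were', 'I-pro'], ['used', 'I-pro'], ['A', 'O'],
);

# Return the tag of the row after index $i, or 'O' at the end of the input.
sub next_tag {
    my ($rows, $i) = @_;
    return $i + 1 < @$rows ? $rows->[$i + 1][1] : 'O';
}

for my $i (0 .. $#rows) {
    my ($word, $tag) = @{ $rows[$i] };
    my $continues = next_tag(\@rows, $i) =~ /^I-/ ? 'yes' : 'no';
    print "$word\t$tag\tnext is I-: $continues\n";
}
```

With the next tag available, the code can decide at a B- word whether to open a multi-word span or emit a single-word one, without having to undo output that was already written.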
