
I need help extracting data from XML/text files. My XML/TXT files contain a large amount of data in the format shown below.

<authorList>
<author>
<fullName>Oliver LA</fullName>
<firstName>L A</firstName>
<lastName>Oliver</lastName>
<initials>LA</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>University of Liverpool, Liverpool, UK. Electronic address: [email protected].</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Hutton DP</fullName>
<firstName>D P</firstName>
<lastName>Hutton</lastName>
<initials>DP</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>North West Radiotherapy Operational Delivery Network, The Christie Hospital, Manchester, UK; University of Liverpool, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Hall T</fullName>
<firstName>T</firstName>
<lastName>Hall</lastName>
<initials>T</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>North West Radiotherapy Operational Delivery Network, The Christie Hospital, Manchester, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Cain M</fullName>
<firstName>M</firstName>
<lastName>Cain</lastName>
<initials>M</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Bates M</fullName>
<firstName>M</firstName>
<lastName>Bates</lastName>
<initials>M</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>East of England Radiotherapy Network, Norfolk &amp; Norwich University Hospital, Norwich, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Cree A</fullName>
<firstName>A</firstName>
<lastName>Cree</lastName>
<initials>A</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Mullen E</fullName>
<firstName>E</firstName>
<lastName>Mullen</lastName>
<initials>E</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
</authorList>

I need the output in the format email,firstName,lastName,affiliation, and it should be written to a text file.

Using Perl, I have written the code below.

#!/usr/bin/perl
use strict;
use warnings;
open(FILEHANDLE, "<data.xml")|| die "Can't open";
my @line;
my @affi;

my @lines;
my $ct =1 ;
print "Enter the start position:-";

my $start= <STDIN>;
print "Enter the end position:-";


my $end = <STDIN>;

print "Processing your data...\n";
my $i =0;
my $t =0;
while(<FILEHANDLE>)
{
    if($ct>$end)
    {
       close(FILEHANDLE);
       exit;
       
    }
    if($ct>=$start)
    {
       $lines[$t] = $_;
       $t++;
     }
     
     if($ct == $end)
     {
    my $i = 0;
    my $j = 0;
    my @last;
    my @first;
    my $l = @lines;
    my $s = 0;

while($j<$l)
{
    if ($lines[$j] =~m/@/)
    {
        $line[$i] = $lines[$j];
        $s = $j-3;
        $first[$i]=$lines[$s]; 
        $s--;
        $last[$i] = $lines[$s];
        #$j = $j+3;
        #$last[$i]= $lines[$j];
        #$j++;
        #$first[$i] = $lines[$j];
        $i++;
    }
$j++;
}
my $k = 0;
foreach(@line)
{
  $line[$k] =~ s/<.*>(.* )(.*@.*)<.*>/$2/;
  $affi[$k] = $1;
  $line[$k] = $2;
    $line[$k] =~ s/\.$//;
    
    
    $k++;
  }

my $u = 0;
foreach(@first)
{
  $first[$u] =~s/<firstName>(.*)<.*>/$1/;
  $first[$u]=$l;  
  $u++
  }
my $m = 0;
foreach(@last)
{
  $last[$m] =~s/<lastName>(.*)<.*>/$1/;
  $last[$m] = $1;    
  $m++
  }
my $q=@line;
open(FILE,">RAVI.txt")|| die "can't open";
my $p;

for($p =0; $p<$q; $p++)
{  
  print FILE "$line[$p],$first[$p],$last[$p],$affi[$p]\n";
} 

close(FILE);
     }
     
  
  $ct++;
  }

With this code I get output in the format email, ,lastName,affiliation: the first name is missing.

I cannot work out why the code does not pick up the firstName from the given data. I am new to Perl. Could you please help me fix the mistakes in my code? Thank you in advance.

3 Answers

4

As I said in my comment, it is better to use an established XML parser. One of them is XML::XPath:

#!/usr/bin/perl
use strict; use warnings;
use feature qw/say/;
use XML::XPath;

my $file = shift or die "Usage: $0 xml_file\n";
my $xp = XML::XPath->new(filename => $file);

my $nodeset = $xp->find('/authorList//author');

foreach my $node ($nodeset->get_nodelist) {
    my @contact;
    push @contact, $node->findvalue('./firstName');
    push @contact, $node->findvalue('./lastName');
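    # Keep the affiliation in a lexical variable instead of clobbering $_,
    # and capture the address explicitly instead of relying on $&.
    my $affiliation = $node->findvalue('.//authorAffiliation/affiliation');
    push @contact, $1 if $affiliation =~ m/\b(\S+\@\S+)/;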
    say join ", ", @contact;
}

Output

L A, Oliver, [email protected].
D P, Hutton
T, Hall
M, Cain
M, Bates
A, Cree
E, Mullen

Usage

./XML::XPath.pl file.xml | tee new_file.txt
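
If you would rather have the script write the file itself than pipe through tee, a minimal sketch could look like the following. The file name new_file.txt is taken from the usage line above; the sample rows are hypothetical stand-ins for the @contact lists built inside the loop, so treat this as an illustration rather than a drop-in replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Output file name, borrowed from the usage line (adjust as needed).
my $outfile = 'new_file.txt';

# Hypothetical sample rows standing in for the @contact lists
# that the XML::XPath loop builds for each author.
my @rows = (
    [ 'T', 'Hall' ],
    [ 'M', 'Cain' ],
);

# Three-argument open with a lexical filehandle; die with the OS error on failure.
open(my $out, '>', $outfile) or die "Can't open $outfile for writing: $!";
print {$out} join(', ', @$_), "\n" for @rows;
close($out) or die "Can't close $outfile: $!";
```

In the real script, the print would go inside the foreach loop in place of say, with the open before the loop and the close after it.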

2 Comments

Thank you for your valuable answer, Sir. The output should be in the format L A, Oliver, [email protected]. and it should be saved to a .txt file. Could you please help me by adding a few more lines to the code?
Post edited accordingly...
2

Your mistake was to try and write your own XML parser. That's a very hard thing to get right. Far better to use one that has already been written.

I always reach for XML::LibXML (it has terrible documentation, but there's a great tutorial online).

A first attempt at your program would look something like this:

#!/usr/bin/perl

use strict;
use warnings;

use feature 'say';

use XML::LibXML;

my $infile = shift
  or die "Usage: $0 xml_file\n";

my $dom = XML::LibXML->load_xml(location => $infile);

my @nodes = qw[ firstName lastName
                authorAffiliationDetailsList/authorAffiliation/affiliation ];

for my $author ($dom->findnodes('//author')) {
  my @data = map { $author->findvalue($_) } @nodes;

  say join ',', map { qq["$_"] } @data;
}

Note that I've put all of your output into quotes - that's because the affiliation node contains embedded commas.

In reality, you'd need to process the affiliation data a little more to extract the email address. But I hope this gets you most of the way to a solution.
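
As a rough idea of what that extra processing might look like, here is a sketch that splits an affiliation string into the email address and the remaining text, using only core Perl. It assumes the address always follows the phrase "Electronic address:" at the end of the string, as in the first author of the sample data. The helper names split_affiliation and csv_field and the address someone@example.org are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an affiliation string into (email, remaining affiliation text).
# Assumes the address follows "Electronic address:" at the end of the string.
# Returns an empty email when no address is present.
sub split_affiliation {
    my ($text) = @_;
    my $email = '';
    if ($text =~ s/\s*Electronic address:\s*(\S+\@\S+)\s*$//) {
        ($email = $1) =~ s/\.$//;    # drop the sentence-ending full stop
    }
    return ($email, $text);
}

# Quote a field for CSV output, doubling any embedded double quotes.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq["$value"];
}

# Placeholder input: the real address is not visible in the post.
my ($email, $affiliation) = split_affiliation(
    'University of Liverpool, Liverpool, UK. Electronic address: someone@example.org.'
);

# Field order requested in the question: email, firstName, lastName, affiliation.
print join(',', map { csv_field($_) } $email, 'L A', 'Oliver', $affiliation), "\n";
```

Authors without an address simply get an empty first field, so you may want to skip those rows if only contactable authors are needed.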

5 Comments

Thank you for your valuable answer, Sir. But the output should be in the format L A, Oliver, [email protected]. and it should be saved to a .txt file. Could you please help me by adding a few more lines to the code?
@ravi.gteja: Honestly, not really. I've pointed you in the right direction and (to be honest) have done all the interesting work. I'm not interested in producing a complete solution for you. I've given you the framework and a pointer to a really good tutorial - the rest is up to you.
@GillesQuenot: I suspect that's pretty much always true of Perl code :-)
TMTOWTDI........
0

While I, too, would recommend using an XML parser rather than regexes (unless you're that Damian!), we haven't told you what is wrong with your code.

Dump your data, using, say, Data::Dumper.

...
    $k++;
  }
use Data::Dumper;
warn Data::Dumper->new([\@line,\@first,\@last],[qw(*line *first *last)])->Deepcopy(1)->Indent(1)->Maxdepth(3)->Sortkeys(1)->Dump(),q{ };

my $u = 0;
foreach(@first)
{
...

And you will find that

@line = (
  '<affiliation>University of Liverpool, Liverpool, UK. Electronic address: [email protected].</affiliation>
'
);
@first = (
  '<initials>LA</initials>
'
);
@last = (
  '<lastName>Oliver</lastName>
'
);

@first isn't '<firstName>...</firstName>' but '<initials>LA</initials>' which is why your first name regexp never returns the expected value.
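
The cause is the offset arithmetic. Counting back from the <affiliation> line at $j, $j-3 is <initials>, $j-4 is <lastName> and $j-5 is <firstName>. Your loop also assigns $l (the line count) to $first[$u], where $1 was presumably intended. A minimal sketch of the corrected lookup, assuming every author record has exactly the layout shown in the question (one element per line, in the same order) and using a placeholder address:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One <author> record, one element per line, as in the question's data.
# The address is a placeholder; the real one is not shown in the post.
my @lines = (
    "<author>\n",
    "<fullName>Oliver LA</fullName>\n",
    "<firstName>L A</firstName>\n",
    "<lastName>Oliver</lastName>\n",
    "<initials>LA</initials>\n",
    "<authorAffiliationDetailsList>\n",
    "<authorAffiliation>\n",
    "<affiliation>University of Liverpool, Liverpool, UK. Electronic address: someone\@example.org.</affiliation>\n",
    "</authorAffiliation>\n",
);

my (@first, @last);
for my $j (0 .. $#lines) {
    next unless $lines[$j] =~ /\@/;    # only affiliation lines carrying an address

    # Counting back from <affiliation>:
    # -3 is <initials>, -4 is <lastName>, -5 is <firstName>.
    my ($first) = $lines[$j - 5] =~ m{<firstName>(.*)</firstName>};
    my ($last)  = $lines[$j - 4] =~ m{<lastName>(.*)</lastName>};

    push @first, $first;
    push @last,  $last;
}

print "$first[0],$last[0]\n";
```

This only works while the line layout never changes, which is exactly why the other answers recommend a real XML parser.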
