Extracting data from XML / Text file using Perl software

Question

I need help extracting data from XML/text files. My XML/TXT files contain a large amount of data in the format shown below.

<authorList>
<author>
<fullName>Oliver LA</fullName>
<firstName>L A</firstName>
<lastName>Oliver</lastName>
<initials>LA</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>University of Liverpool, Liverpool, UK. Electronic address: [email protected].</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Hutton DP</fullName>
<firstName>D P</firstName>
<lastName>Hutton</lastName>
<initials>DP</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>North West Radiotherapy Operational Delivery Network, The Christie Hospital, Manchester, UK; University of Liverpool, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Hall T</fullName>
<firstName>T</firstName>
<lastName>Hall</lastName>
<initials>T</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>North West Radiotherapy Operational Delivery Network, The Christie Hospital, Manchester, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Cain M</fullName>
<firstName>M</firstName>
<lastName>Cain</lastName>
<initials>M</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Bates M</fullName>
<firstName>M</firstName>
<lastName>Bates</lastName>
<initials>M</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>East of England Radiotherapy Network, Norfolk &amp; Norwich University Hospital, Norwich, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Cree A</fullName>
<firstName>A</firstName>
<lastName>Cree</lastName>
<initials>A</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
<author>
<fullName>Mullen E</fullName>
<firstName>E</firstName>
<lastName>Mullen</lastName>
<initials>E</initials>
<authorAffiliationDetailsList>
<authorAffiliation>
<affiliation>Clatterbridge Cancer Centre, Liverpool, UK.</affiliation>
</authorAffiliation>
</authorAffiliationDetailsList>
</author>
</authorList>

I need the output in the format Email,firstName,lastName,affiliation, and it should be written to a text file.

I have written the following Perl code:

#!usr/bin/perl
use strict;
use warnings;
open(FILEHANDLE, "<data.xml")|| die "Can't open";
my @line;
my @affi;

my @lines;
my $ct =1 ;
print "Enter the start position:-";

my $start= <STDIN>;
print "Enter the end position:-";


my $end = <STDIN>;

print "Processing your data...\n";
my $i =0;
my $t =0;
while(<FILEHANDLE>)
{
    if($ct>$end)
    {
       close(FILEHANDLE);
       exit;
       
    }
    if($ct>=$start)
    {
       $lines[$t] = $_;
       $t++;
     }
     
     if($ct == $end)
     {
    my $i = 0;
    my $j = 0;
    my @last;
    my @first;
    my $l = @lines;
    my $s = 0;

while($j<$l)
{
    if ($lines[$j] =~m/@/)
    {
        $line[$i] = $lines[$j];
        $s = $j-3;
        $first[$i]=$lines[$s]; 
        $s--;
        $last[$i] = $lines[$s];
        #$j = $j+3;
        #$last[$i]= $lines[$j];
        #$j++;
        #$first[$i] = $lines[$j];
        $i++;
    }
$j++;
}
my $k = 0;
foreach(@line)
{
  $line[$k] =~ s/<.*>(.* )(.*@.*)<.*>/$2/;
  $affi[$k] = $1;
  $line[$k] = $2;
    $line[$k] =~ s/\.$//;
    
    
    $k++;
  }

my $u = 0;
foreach(@first)
{
  $first[$u] =~s/<firstName>(.*)<.*>/$1/;
  $first[$u]=$l;  
  $u++
  }
my $m = 0;
foreach(@last)
{
  $last[$m] =~s/<lastName>(.*)<.*>/$1/;
  $last[$m] = $1;    
  $m++
  }
my $q=@line;
open(FILE,">RAVI.txt")|| die "can't open";
my $p;

for($p =0; $p<$q; $p++)
{  
  print FILE "$line[$p],$first[$p],$last[$p],$affi[$p]\n";
} 

close(FILE);
     }
     
  
  $ct++;
  }

With this code I get output in the format email, ,lastName,affiliation.

I am not able to get the firstName from the given data with this code. I am new to Perl. Could you please help me fix the mistakes in my code? Thank you in advance.

Better use metacpan.org/pod/XML::XPath

– Gilles Quénot, Commented Dec 29, 2022 at 10:51

Gilles Quénot · Accepted Answer · 2023-01-21 11:48:18Z

4

As I said in my comment, it is better to use an established XML parser. One of them is XML::XPath:

#!/usr/bin/perl
use strict; use warnings;
use feature qw/say/;
use XML::XPath;

# Take the XML file name from the command line.
my $file = shift or die "Usage: $0 xml_file\n";
my $xp = XML::XPath->new(filename => $file);

# Select every <author> element under <authorList>.
my $nodeset = $xp->find('/authorList//author');

foreach my $node ($nodeset->get_nodelist) {
    my @contact;
    push @contact, $node->findvalue('./firstName');
    push @contact, $node->findvalue('./lastName');
    # The email, if there is one, is embedded in the affiliation text.
    my $affiliation = $node->findvalue('.//authorAffiliation/affiliation');
    push @contact, $1 if $affiliation =~ m/(\b\S+\@\S+)/;
    say join ", ", @contact;
}

Output

L A, Oliver, [email protected].
D P, Hutton
T, Hall
M, Cain
M, Bates
A, Cree
E, Mullen

Usage

./XML::XPath.pl file.xml | tee new_file.txt
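As an alternative to piping through tee, the script could write the file itself. The following is only a minimal sketch: the @rows data is a hypothetical stand-in for the @contact lists built in the loop above, and the output file name is simply reused from the tee example.

```perl
use strict;
use warnings;

# Hypothetical rows standing in for the @contact lists built per author.
my @rows = ( [ 'T', 'Hall' ], [ 'M', 'Cain' ] );

# Same file name as in the tee example; change as needed.
my $outfile = 'new_file.txt';

# Three-argument open with a lexical filehandle; die with the OS error on failure.
open my $out, '>', $outfile or die "Can't write $outfile: $!";
print {$out} join(', ', @$_), "\n" for @rows;
close $out or die "Can't close $outfile: $!";
```

In the real script, the print would go inside the foreach loop in place of say.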

edited Jan 21, 2023 at 11:48

answered Dec 29, 2022 at 11:46

Gilles Quénot


2 Comments

ravi.g teja Over a year ago

Thank you for your valuable answer, Sir. The output should be in the format L A, Oliver, [email protected]. and should be saved to a .txt file. Could you please help me by adding a few more lines to the code?

Gilles Quénot Over a year ago

Post edited accordingly...

Dave Cross · Accepted Answer · 2022-12-29 12:32:41Z

2

Your mistake was to try and write your own XML parser. That's a very hard thing to get right. Far better to use one that has already been written.

I always reach for XML::LibXML (it has terrible documentation, but there's a great tutorial online).

A first attempt at your program would look something like this:

#!/usr/bin/perl

use strict;
use warnings;

use feature 'say';

use XML::LibXML;

my $infile = shift
  or die "Usage: $0 xml_file\n";

my $dom = XML::LibXML->load_xml(location => $infile);

my @nodes = qw[ firstName lastName
                authorAffiliationDetailsList/authorAffiliation/affiliation ];

for my $author ($dom->findnodes('//author')) {
  my @data = map { $author->findvalue($_) } @nodes;

  say join ',', map { qq["$_"] } @data;
}

Note that I've put all of your output into quotes - that's because the affiliation node contains embedded commas.

In reality, you'd need to process the affiliation data a little more to extract the email address. But I hope this gets you most of the way to a solution.
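The extra processing mentioned above might look roughly like the sketch below. It assumes the email, when present, always follows the literal text "Electronic address:" at the end of the affiliation, as in the sample data. The helper name split_affiliation and the example address are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split an affiliation string into (email, affiliation).
# Assumes the email, when present, is introduced by "Electronic address:"
# and sits at the end of the string.
sub split_affiliation {
    my ($text) = @_;
    my $email = '';
    if ($text =~ s/\s*Electronic address:\s*(\S+\@\S+)\s*$//) {
        $email = $1;
        $email =~ s/\.$//;    # drop the sentence-ending full stop
    }
    return ($email, $text);
}

# Example address is made up for illustration.
my ($email, $affiliation) = split_affiliation(
    'University of Liverpool, Liverpool, UK. Electronic address: someone@example.org.'
);
print "$email|$affiliation\n";
```

Authors without an email get an empty string in the first position, so every output row keeps the same number of fields.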

edited Dec 29, 2022 at 12:32

Gilles Quénot


answered Dec 29, 2022 at 12:26

Dave Cross


5 Comments

ravi.g teja Over a year ago

Thank you for your valuable answer, Sir. But the output should be in the format L A, Oliver, [email protected]. and should be saved to a .txt file. Could you please help me by adding a few more lines to the code?

Dave Cross Over a year ago

@ravi.gteja: Honestly, not really. I've pointed you in the right direction and (to be honest) have done all the interesting work. I'm not interested in producing a complete solution for you. I've given you the framework and a pointer to a really good tutorial - the rest is up to you.

Gilles Quénot Over a year ago

Can be shorter: hastebin.com/gacedebati.pl

Dave Cross Over a year ago

@GillesQuenot: I suspect that's pretty much always true of Perl code :-)

Gilles Quénot Over a year ago

TMTOWTDI........

user3343917 · Accepted Answer · 2023-08-01 16:48:58Z

While I, too, would recommend using an XML parser and not using Regexes (unless you're that Damian!) we haven't told you what is wrong with your code.

Dump your data using, say, Data::Dumper.

...
    $k++;
  }
use Data::Dumper;
warn Data::Dumper->new([\@line,\@first,\@last],[qw(*line *first *last)])->Deepcopy(1)->Indent(1)->Maxdepth(3)->Sortkeys(1)->Dump(),q{ };

my $u = 0;
foreach(@first)
{
...

And you will find that

@line = (
  '<affiliation>University of Liverpool, Liverpool, UK. Electronic address: [email protected].</affiliation>
'
);
@first = (
  '<initials>LA</initials>
'
);
@last = (
  '<lastName>Oliver</lastName>
'
);

@first isn't '<firstName>...</firstName>' but '<initials>LA</initials>' which is why your first name regexp never returns the expected value.
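In the sample data, the affiliation line sits five lines below <firstName> and four lines below <lastName>, so the offsets in the question's code ($j-3 and then one less) land on <initials> and <lastName>. The question's code also assigns $l (the line count) to $first[$u], where $1 was presumably intended. The following is only a minimal sketch of the corrected offsets, not a recommendation over a real parser: it assumes one element per line in exactly the order shown in the question, and the email address in the sample is made up.

```perl
use strict;
use warnings;

# Hypothetical sample: one <author> record, one element per line.
my @lines = (
    '<author>',
    '<fullName>Oliver LA</fullName>',
    '<firstName>L A</firstName>',
    '<lastName>Oliver</lastName>',
    '<initials>LA</initials>',
    '<authorAffiliationDetailsList>',
    '<authorAffiliation>',
    '<affiliation>University of Liverpool, Liverpool, UK. Electronic address: someone@example.org.</affiliation>',
    '</authorAffiliation>',
    '</authorAffiliationDetailsList>',
    '</author>',
);

my @records;
for my $j (0 .. $#lines) {
    next unless $lines[$j] =~ /\@/;    # only affiliation lines with an email
    next if $j < 5;                    # not enough preceding lines to look back
    # <firstName> is 5 lines above the affiliation, <lastName> is 4 above.
    my ($first) = $lines[$j - 5] =~ m{<firstName>(.*?)</firstName>};
    my ($last)  = $lines[$j - 4] =~ m{<lastName>(.*?)</lastName>};
    push @records, [ $first // '', $last // '' ];
}

print join(',', @$_), "\n" for @records;
```

Fixed offsets like these break as soon as an element is missing or the layout changes, which is the practical argument for the parser-based answers.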
