How can I use Perl regular expressions to parse XML data?

Question

I have a fairly long piece of XML that I want to parse. I want to remove everything except the subclass-code and city, so that I am left with something like the example below.

EXAMPLE

TEST SUBCLASS|MIAMI

CODE

<?xml version="1.0" standalone="no"?>  
<web-export>  
<run-date>06/01/2010  
<pub-code>TEST  
<ad-type>TEST  
<cat-code>Real Estate</cat-code>  
<class-code>TEST</class-code>  
<subclass-code>TEST SUBCLASS</subclass-code>  
<placement-description></placement-description>  
<position-description>Town House</position-description>  
<subclass3-code></subclass3-code>  
<subclass4-code></subclass4-code>  
<ad-number>0000284708-01</ad-number>  
<start-date>05/28/2010</start-date>  
<end-date>06/09/2010</end-date>  
<line-count>6</line-count>  
<run-count>13</run-count>  
<customer-type>Private Party</customer-type>  
<account-number>100099237</account-number>  
<account-name>DOE, JOHN</account-name>  
<addr-1>207 CLARENCE STREET</addr-1>  
<addr-2> </addr-2>  
<city>MIAMI</city>  
<state>FL</state>  
<postal-code>02910</postal-code>  
<country>USA</country>  
<phone-number>4014612880</phone-number>  
<fax-number></fax-number>  
<url-addr> </url-addr>  
<email-addr>[email protected]</email-addr>  
<pay-flag>N</pay-flag>  
<ad-description>DEANESTATES2BEDS2BATHSAPPLIANCED</ad-description>  
<order-source>Import</order-source>  
<order-status>Live</order-status>  
<payor-acct>100099237</payor-acct>  
<agency-flag>N</agency-flag>  
<rate-note></rate-note>  
<ad-content> MIAMI&#47;Dean Estates&#58; 2 
beds&#44; 2 baths&#46; Applianced&#46; Central air&#46; Carpets&#46; Laundry&#46; 2 decks&#46; Pool&#46; Parking&#46; Close to everything&#46;No smoking&#46; No utilities&#46; &#36;1275 mo&#46; 401&#45;578&#45;1501&#46;  </ad-content>  
</ad-type>  
</pub-code>  
</run-date>  
</web-export>

PERL

So what I want to do is open an existing file, read the contents, and then use regular expressions to eliminate the unnecessary XML tags.

open(READFILE, "FILENAME");  
while(<READFILE>)  
{  
$_ =~ s/<\?xml version="(.*)" standalone="(.*)"\?>\n.*//g;  
    $_ =~ s/<subclass-code>//g;  
    $_ =~ s/<\/subclass-code>\n.*/|/g;  
    $_ =~ s/(.*)PJ RER Houses /PJ RER Houses/g;  
    $_ =~ s/\G //g;  
    $_ =~ s/<city>//g; 
    $_ =~ s/<\/city>\n.*//g;  
    $_ =~ s/<(\/?)web-export>(.*)\n.*//g;  
    $_ =~ s/<(\/?)run-date>(.*)\n.*//g;  
    $_ =~ s/<(\/?)pub-code>(.*)\n.*//g;  
    $_ =~ s/<(\/?)ad-type>(.*)\n.*//g;  
    $_ =~ s/<(\/?)cat-code>(.*)<(\/?)cat-code>\n.*//g;  
    $_ =~ s/<(\/?)class-code>(.*)<(\/?)class-code>\n.*//g;  
    $_ =~ s/<(\/?)placement-description>(.*)<(\/?)placement-description>\n.*//g;  
    $_ =~ s/<(\/?)position-description>(.*)<(\/?)position-description>\n.*//g;  
    $_ =~ s/<(\/?)subclass3-code>(.*)<(\/?)subclass3-code>\n.*//g;  
    $_ =~ s/<(\/?)subclass4-code>(.*)<(\/?)subclass4-code>\n.*//g;  
    $_ =~ s/<(\/?)ad-number>(.*)<(\/?)ad-number>\n.*//g;  
    $_ =~ s/<(\/?)start-date>(.*)<(\/?)start-date>\n.*//g;  
    $_ =~ s/<(\/?)end-date>(.*)<(\/?)end-date>\n.*//g;  
    $_ =~ s/<(\/?)line-count>(.*)<(\/?)line-count>\n.*//g;  
    $_ =~ s/<(\/?)run-count>(.*)<(\/?)run-count>\n.*//g;  
    $_ =~ s/<(\/?)customer-type>(.*)<(\/?)customer-type>\n.*//g;  
    $_ =~ s/<(\/?)account-number>(.*)<(\/?)account-number>\n.*//g;  
    $_ =~ s/<(\/?)account-name>(.*)<(\/?)account-name>\n.*//g;  
    $_ =~ s/<(\/?)addr-1>(.*)<(\/?)addr-1>\n.*//g;  
    $_ =~ s/<(\/?)addr-2>(.*)<(\/?)addr-2>\n.*//g;  
    $_ =~ s/<(\/?)state>(.*)<(\/?)state>\n.*//g;  
    $_ =~ s/<(\/?)postal-code>(.*)<(\/?)postal-code>\n.*//g;  
    $_ =~ s/<(\/?)country>(.*)<(\/?)country>\n.*//g;  
    $_ =~ s/<(\/?)phone-number>(.*)<(\/?)phone-number>\n.*//g;  
    $_ =~ s/<(\/?)fax-number>(.*)<(\/?)fax-number>\n.*//g;  
    $_ =~ s/<(\/?)url-addr>(.*)<(\/?)url-addr>\n.*//g;  
    $_ =~ s/<(\/?)email-addr>(.*)<(\/?)email-addr>\n.*//g;  
    $_ =~ s/<(\/?)pay-flag>(.*)<(\/?)pay-flag>\n.*//g;  
    $_ =~ s/<(\/?)ad-description>(.*)<(\/?)ad-description>\n.*//g;  
    $_ =~ s/<(\/?)order-source>(.*)<(\/?)order-source>\n.*//g;  
    $_ =~ s/<(\/?)order-status>(.*)<(\/?)order-status>\n.*//g;  
    $_ =~ s/<(\/?)payor-acct>(.*)<(\/?)payor-acct>\n.*//g;  
    $_ =~ s/<(\/?)agency-flag>(.*)<(\/?)agency-flag>\n.*//g;  
    $_ =~ s/<(\/?)rate-note>(.*)<(\/?)rate-note>\n.*//g;  
    $_ =~ s/<ad-content>(.*)\n.*//g;  
    $_ =~ s/\t(.*)\n.*//g;  
    $_ =~ s/<\/ad-content>(.*)\n.*//g;  
}  
close( READFILE );

Is there an easier way of doing this? I don't want to use any modules. I know that it might make this easier but the file I am reading has a lot of data in it.

Also, s/// binds to $_ by default, so the $_ =~ is pure noise.
– Evan Carroll, Commented Jun 1, 2010 at 16:48
First of all, regular expressions can't parse XML. Second, using a module has nothing to do with the amount of data you want to process; you will most likely get much better performance using a module instead of rolling your own.
– Svante, Commented Jun 1, 2010 at 19:15
Why are people so scared of using modules? They are most likely better tested and optimized than any code you can write yourself.
– Matteo Riva, Commented Jun 2, 2010 at 17:41
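
The point about $_ in the first comment can be sketched as follows. This is only an illustration: the helper name strip_city_tags is hypothetical and not part of the question's script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): strips <city> tags from one line.
sub strip_city_tags {
    my ($line) = @_;
    for ($line) {         # aliases $_ to $line for the duration of the block
        s/<city>//g;      # equivalent to: $_ =~ s/<city>//g;
        s/<\/city>//g;    # no explicit binding needed here either
    }
    return $line;
}

print strip_city_tags("<city>MIAMI</city>"), "\n";    # prints MIAMI
```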

Community · Accepted Answer · 2017-05-23 12:02:23Z

12

This is horrible (sorry). Regular expressions are not necessarily faster even if you have a lot of data.

Why not use XSLT?

Your stylesheet would basically look like this (if you have only one subclass-code and city element):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text" />  

    <xsl:template match="/">
        <xsl:apply-templates select="//subclass-code|//city" />
    </xsl:template>

    <xsl:template match="subclass-code">
       <xsl:value-of select="." /><xsl:text> | </xsl:text>
    </xsl:template>

    <xsl:template match="city">
       <xsl:value-of select="." /><xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>

(Updated the code to work with multiple elements. Might not be the best solution ;))

edited May 23, 2017 at 12:02

CommunityBot

answered Jun 1, 2010 at 14:42

Felix Kling

3 Comments

BlairHippo Over a year ago

This. Why do you not want to use modules? Re-inventing the wheel is not only a lot of work, it tends to result in a really suck-ass wheel.

DVK Over a year ago

@Blair - I suggest you quickly file a patent on an idea of a "suck-ass wheel" before Apple or IBM beats you.

BlairHippo Over a year ago

@DVK: They can both offer so many examples of prior art that it scarcely seems worth the effort.

3 revs, 2 users 83% · Accepted Answer · 2010-06-03 14:55:29Z

7

Why wouldn't you use a library when someone has already written an efficient (and, dare I say, feature-rich) module like XML::Twig to parse XML?

use XML::Twig;

die "Usage: give-me-the-elements.pl <xml_file>\n" unless ($ARGV[0]);

my $twig = XML::Twig->new( twig_handlers => 
                             { 'subclass-code' => sub { print $_->text, "|"; }, 
                               'city' => sub { print $_->text, "\n"; }, 
                             },
                           pretty_print  => 'indented');

$twig->parsefile($ARGV[0]); 
$twig->purge;

edited Jun 3, 2010 at 14:55

community wiki

3 revs, 2 users 83%
Susheel Javadi

leonbloy · Accepted Answer · 2010-06-01 15:31:17Z

5

If you need a general XML parsing method, don't use regex. If you just need what you said (remove everything except for the subclass-code and city) and if you are sure that those two tags will appear with no "strange" things inside (xml entities, CDATA sections) and that those tags will not appear inside other CDATA fragments, etc, you can simply do:

$/ = undef; # slurp mode
open(READFILE, "FILENAME");
$t = <READFILE>;
close READFILE;
$t =~ s#^.*<subclass-code>(.*?)</subclass-code>.*<city>(.*?)</city>.*$#$1 - $2#s;
# in case the tags could appear in distinct order - uncomment the following
# $t =~ s#^.*<city>(.*?)</city>.*<subclass-code>(.*?)</subclass-code>.*$#$2 - $1#s;
print $t;

Edit: A little more (ahem) powerful, following poster's requirements:

while( $t =~ m#<pub-code>([^<\s]*).*?<subclass-code>(.*?)</subclass-code>.*?<city>(.*?)</city>#sg) {
  print "$1 : $2 | $3 \n";
}

But please stop here and don't go further, this way leads to hell...

edited Jun 1, 2010 at 15:31

answered Jun 1, 2010 at 14:43

leonbloy

3 Comments

Luke Over a year ago

What if the file I am reading has more than one <pub-code> element and I want to display all of the results? <pub-code>PJ Projo.com <subclass-code>TEST SUBCLASS</subclass-code> <city>MIAMI</city> </pub-code> <pub-code>PJ Projo.com <subclass-code>TEST SUBCLASS</subclass-code> <city>ORLANDO</city> </pub-code> RESULT: TEST SUBCLASS - MIAMI, TEST SUBCLASS - ORLANDO

leonbloy Over a year ago

Ahh... can we at least assume that <subclass-code> and <city> always appear (in pairs, in that order)? The more general the problem, the less adequate a regex solution will be.

Luke Over a year ago

Yes, they always appear just like the original example. So it's <subclass-code> then <city>, but all of the other elements within <pub-code> would still appear. Does that make sense?
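
Under the assumptions agreed in these comments (each <subclass-code> is followed by its <city>, with no CDATA sections or nested markup inside them), the multi-record case could be sketched with a global match loop. The helper name extract_pairs is hypothetical, and this remains a regex shortcut, not a real XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "SUBCLASS - CITY" string per pair found.
# Assumes <subclass-code> always precedes its <city> and neither contains markup.
sub extract_pairs {
    my ($xml) = @_;
    my @pairs;
    # /s lets . match newlines; /g resumes after the previous match each time
    while ( $xml =~ m{<subclass-code>(.*?)</subclass-code>.*?<city>(.*?)</city>}sg ) {
        push @pairs, "$1 - $2";
    }
    return @pairs;
}

my $xml = <<'XML';
<pub-code>PJ Projo.com
<subclass-code>TEST SUBCLASS</subclass-code>
<city>MIAMI</city>
</pub-code>
<pub-code>PJ Projo.com
<subclass-code>TEST SUBCLASS</subclass-code>
<city>ORLANDO</city>
</pub-code>
XML

print "$_\n" for extract_pairs($xml);
```

The non-greedy .*? between the two elements is what keeps each city paired with the nearest preceding subclass-code; a greedy .* would jump to the last city in the file.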

Evan Carroll · Accepted Answer · 2010-06-01 16:08:32Z

5

The easy way of doing this would be to use XML::Simple in conjunction with a dumper (I like XXX; most use Data::Dumper). This will load the XML into a Perl data structure where you can cherry-pick the elements you want (or don't want, if you prefer to explicitly delete).

Using the toolset I just suggested you can see a running example of what you want:

use strict;
use warnings;
use XML::Simple;

my $data = XMLin( \*DATA );    # reads the XML placed after a __DATA__ marker
my $sub = $data->{'run-date'}{'pub-code'}{'ad-type'};

foreach my $k ( keys %$sub ) {
  delete $sub->{$k}
    unless $k =~ /subclass-code|city/
  ; 
} 

use XXX;
XXX $data;

edited Jun 1, 2010 at 16:08

answered Jun 1, 2010 at 15:56

Evan Carroll

Kavet Kerek · Accepted Answer · 2010-06-02 17:38:07Z

1

Pay attention to what the other posters said: it is highly recommended to stay away from regexes when parsing markup languages.

However, a pure-Perl way of accomplishing what you want without any modules, assuming the aforementioned tags do exist, is:

my $reg_subclass = '<subclass\d*-code>';
my $reg_city     = '<city>';

open my $in,  '<', "input file"  or die "Cannot open input file: $!";
open my $out, '>', "output file" or die "Cannot open output file: $!";
while ( my $line = <$in> ) {
    if ( $line =~ /$reg_subclass|$reg_city/ ) {
        print $out $line;
    }
}
close $in;
close $out;
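
The filter above keeps whole lines, tags included. A variant that produces the SUBCLASS|CITY format from the question could look like the sketch below. It assumes each element sits on a single line and that subclass-code precedes city; the helper name subclass_city_records is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines, returns "SUBCLASS|CITY" records.
# Assumes one element per line and that <subclass-code> comes before <city>.
sub subclass_city_records {
    my @lines = @_;
    my ( @records, $subclass );
    for my $line (@lines) {
        if ( $line =~ m{<subclass-code>(.*?)</subclass-code>} ) {
            $subclass = $1;                    # remember until the city shows up
        }
        elsif ( defined $subclass && $line =~ m{<city>(.*?)</city>} ) {
            push @records, "$subclass|$1";
            undef $subclass;                   # start fresh for the next record
        }
    }
    return @records;
}

print "$_\n" for subclass_city_records(
    "<subclass-code>TEST SUBCLASS</subclass-code>\n",
    "<subclass3-code></subclass3-code>\n",
    "<city>MIAMI</city>\n",
);
```

In a script, the lines would come from a filehandle, for example subclass_city_records(<$in>).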

answered Jun 2, 2010 at 17:38

Kavet Kerek

BradTrim · Accepted Answer · 2010-06-01 19:26:46Z

0

I'm not an expert on what Perl supports, but generically, I think you want to use XPath here. (This might be what the Twig library above uses, I'm not sure).

Pseudo-Perl example (please excuse the crudeness; it's been a while since I really used Perl extensively):

# xml_dom_parse and xpath_evaluate are placeholders, not real functions
$subclassExpr = "//subclass-code/text()";
$cityExpr = "//city/text()";

$domObject = xml_dom_parse( $xml_doc );

$subClass = xpath_evaluate( $domObject, $subclassExpr );
$city = xpath_evaluate( $domObject, $cityExpr );
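
For a runnable version of the XPath idea, one option is the XML::LibXML module from CPAN. That choice is an assumption of this sketch (the thread does not mention it, and it is not a core module), and the helper name subclass_city_xpath is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, assumed to be installed

# Hypothetical helper: returns "SUBCLASS|CITY" for each <ad-type> element.
sub subclass_city_xpath {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    my @records;
    for my $ad ( $doc->findnodes('//ad-type') ) {
        # findvalue evaluates the XPath relative to $ad and returns its text
        push @records,
            $ad->findvalue('subclass-code') . '|' . $ad->findvalue('city');
    }
    return @records;
}

my $xml = <<'XML';
<web-export><run-date>06/01/2010<pub-code>TEST<ad-type>TEST
<subclass-code>TEST SUBCLASS</subclass-code>
<city>MIAMI</city>
</ad-type></pub-code></run-date></web-export>
XML

print "$_\n" for subclass_city_xpath($xml);
```

Looping over //ad-type and evaluating relative paths keeps each city paired with the subclass-code from the same record, which two independent absolute expressions would not guarantee.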

answered Jun 1, 2010 at 19:26

BradTrim
