
I have a pretty long piece of XML that I want to parse. I want to remove everything except the subclass-code and city, so that I am left with something like the example below.

EXAMPLE

TEST SUBCLASS|MIAMI

CODE

<?xml version="1.0" standalone="no"?>  
<web-export>  
<run-date>06/01/2010  
<pub-code>TEST  
<ad-type>TEST  
<cat-code>Real Estate</cat-code>  
<class-code>TEST</class-code>  
<subclass-code>TEST SUBCLASS</subclass-code>  
<placement-description></placement-description>  
<position-description>Town House</position-description>  
<subclass3-code></subclass3-code>  
<subclass4-code></subclass4-code>  
<ad-number>0000284708-01</ad-number>  
<start-date>05/28/2010</start-date>  
<end-date>06/09/2010</end-date>  
<line-count>6</line-count>  
<run-count>13</run-count>  
<customer-type>Private Party</customer-type>  
<account-number>100099237</account-number>  
<account-name>DOE, JOHN</account-name>  
<addr-1>207 CLARENCE STREET</addr-1>  
<addr-2> </addr-2>  
<city>MIAMI</city>  
<state>FL</state>  
<postal-code>02910</postal-code>  
<country>USA</country>  
<phone-number>4014612880</phone-number>  
<fax-number></fax-number>  
<url-addr> </url-addr>  
<email-addr>[email protected]</email-addr>  
<pay-flag>N</pay-flag>  
<ad-description>DEANESTATES2BEDS2BATHSAPPLIANCED</ad-description>  
<order-source>Import</order-source>  
<order-status>Live</order-status>  
<payor-acct>100099237</payor-acct>  
<agency-flag>N</agency-flag>  
<rate-note></rate-note>  
<ad-content> MIAMI&#47;Dean Estates&#58; 2 
beds&#44; 2 baths&#46; Applianced&#46; Central air&#46; Carpets&#46; Laundry&#46; 2 decks&#46; Pool&#46; Parking&#46; Close to everything&#46;No smoking&#46; No utilities&#46; &#36;1275 mo&#46; 401&#45;578&#45;1501&#46;  </ad-content>  
</ad-type>  
</pub-code>  
</run-date>  
</web-export>  

PERL

So what I want to do is open an existing file, read the contents, and then use regular expressions to eliminate the unnecessary XML tags.

open(READFILE, "FILENAME");  
while(<READFILE>)  
{  
$_ =~ s/<\?xml version="(.*)" standalone="(.*)"\?>\n.*//g;  
    $_ =~ s/<subclass-code>//g;  
    $_ =~ s/<\/subclass-code>\n.*/|/g;  
    $_ =~ s/(.*)PJ RER Houses /PJ RER Houses/g;  
    $_ =~ s/\G //g;  
    $_ =~ s/<city>//g; 
    $_ =~ s/<\/city>\n.*//g;  
    $_ =~ s/<(\/?)web-export>(.*)\n.*//g;  
    $_ =~ s/<(\/?)run-date>(.*)\n.*//g;  
    $_ =~ s/<(\/?)pub-code>(.*)\n.*//g;  
    $_ =~ s/<(\/?)ad-type>(.*)\n.*//g;  
    $_ =~ s/<(\/?)cat-code>(.*)<(\/?)cat-code>\n.*//g;  
    $_ =~ s/<(\/?)class-code>(.*)<(\/?)class-code>\n.*//g;  
    $_ =~ s/<(\/?)placement-description>(.*)<(\/?)placement-description>\n.*//g;  
    $_ =~ s/<(\/?)position-description>(.*)<(\/?)position-description>\n.*//g;  
    $_ =~ s/<(\/?)subclass3-code>(.*)<(\/?)subclass3-code>\n.*//g;  
    $_ =~ s/<(\/?)subclass4-code>(.*)<(\/?)subclass4-code>\n.*//g;  
    $_ =~ s/<(\/?)ad-number>(.*)<(\/?)ad-number>\n.*//g;  
    $_ =~ s/<(\/?)start-date>(.*)<(\/?)start-date>\n.*//g;  
    $_ =~ s/<(\/?)end-date>(.*)<(\/?)end-date>\n.*//g;  
    $_ =~ s/<(\/?)line-count>(.*)<(\/?)line-count>\n.*//g;  
    $_ =~ s/<(\/?)run-count>(.*)<(\/?)run-count>\n.*//g;  
    $_ =~ s/<(\/?)customer-type>(.*)<(\/?)customer-type>\n.*//g;  
    $_ =~ s/<(\/?)account-number>(.*)<(\/?)account-number>\n.*//g;  
    $_ =~ s/<(\/?)account-name>(.*)<(\/?)account-name>\n.*//g;  
    $_ =~ s/<(\/?)addr-1>(.*)<(\/?)addr-1>\n.*//g;  
    $_ =~ s/<(\/?)addr-2>(.*)<(\/?)addr-2>\n.*//g;  
    $_ =~ s/<(\/?)state>(.*)<(\/?)state>\n.*//g;  
    $_ =~ s/<(\/?)postal-code>(.*)<(\/?)postal-code>\n.*//g;  
    $_ =~ s/<(\/?)country>(.*)<(\/?)country>\n.*//g;  
    $_ =~ s/<(\/?)phone-number>(.*)<(\/?)phone-number>\n.*//g;  
    $_ =~ s/<(\/?)fax-number>(.*)<(\/?)fax-number>\n.*//g;  
    $_ =~ s/<(\/?)url-addr>(.*)<(\/?)url-addr>\n.*//g;  
    $_ =~ s/<(\/?)email-addr>(.*)<(\/?)email-addr>\n.*//g;  
    $_ =~ s/<(\/?)pay-flag>(.*)<(\/?)pay-flag>\n.*//g;  
    $_ =~ s/<(\/?)ad-description>(.*)<(\/?)ad-description>\n.*//g;  
    $_ =~ s/<(\/?)order-source>(.*)<(\/?)order-source>\n.*//g;  
    $_ =~ s/<(\/?)order-status>(.*)<(\/?)order-status>\n.*//g;  
    $_ =~ s/<(\/?)payor-acct>(.*)<(\/?)payor-acct>\n.*//g;  
    $_ =~ s/<(\/?)agency-flag>(.*)<(\/?)agency-flag>\n.*//g;  
    $_ =~ s/<(\/?)rate-note>(.*)<(\/?)rate-note>\n.*//g;  
    $_ =~ s/<ad-content>(.*)\n.*//g;  
    $_ =~ s/\t(.*)\n.*//g;  
    $_ =~ s/<\/ad-content>(.*)\n.*//g;  
}  
close( READFILE );

Is there an easier way of doing this? I don't want to use any modules. I know they might make this easier, but the file I am reading has a lot of data in it.

  • 16
    Nooooooo! stackoverflow.com/questions/1732348/… Commented Jun 1, 2010 at 14:56
  • 1
    I just realized that sometimes code can make my eyes bleed. Commented Jun 1, 2010 at 16:20
  • 1
    also, s/// binds to $_ by default, the $_ =~ is totally noise. Commented Jun 1, 2010 at 16:48
  • First of all, regular expressions can't parse XML. Second, using a module has nothing to do with the amount of data you want to process; you will most likely get much better performance using a module instead of rolling your own. Commented Jun 1, 2010 at 19:15
  • 2
    Why are people so scared of using modules? They are most likely better tested and better optimized than any code you can write yourself. Commented Jun 2, 2010 at 17:41
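To illustrate the comment above about `$_ =~` being noise: `s///` operates on `$_` when no target is bound, so the explicit binding can be dropped. The following is only a minimal sketch; the `strip_tag` helper and the sample lines are made up for illustration and are not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the opening and closing form of one tag
# from each line it is given and returns the modified copies.
sub strip_tag {
    my ($tag, @lines) = @_;
    for (@lines) {               # each element is aliased to $_
        s/<\/?\Q$tag\E>//g;      # same as: $_ =~ s/<\/?\Q$tag\E>//g;
    }
    return @lines;
}

my @out = strip_tag('city', '<city>MIAMI</city>', '<state>FL</state>');
print "$_\n" for @out;
```

Only the `city` tags are stripped here; the `state` line passes through untouched.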

6 Answers

12

This is horrible (sorry). Regular expressions are not necessarily faster even if you have a lot of data.

Why not use XSLT?


Your stylesheet would basically look like this (if you have only one subclass-code and city element):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text" />  

    <xsl:template match="/">
        <xsl:apply-templates select="//subclass-code|//city" />
    </xsl:template>

    <xsl:template match="subclass-code">
       <xsl:value-of select="." /><xsl:text> | </xsl:text>
    </xsl:template>

    <xsl:template match="city">
       <xsl:value-of select="." /><xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>

(Updated the code to work with multiple elements. Might not be the best solution ;))


3 Comments

This. Why do you not want to use modules? Re-inventing the wheel is not only a lot of work, it tends to result in a really suck-ass wheel.
@Blair - I suggest you quickly file a patent on an idea of a "suck-ass wheel" before Apple or IBM beats you.
@DVK: They can both offer so many examples of prior art that it scarcely seems worth the effort.
7

Why wouldn't you use a library when someone has already written an efficient (and, dare I say, feature-rich) module like XML::Twig to parse XML?

use XML::Twig;

die "Usage: give-me-the-elements.pl <xml_file>\n" unless ($ARGV[0]);

my $twig = XML::Twig->new( twig_handlers => 
                             { 'subclass-code' => sub { print $_->text, "|"; }, 
                               'city' => sub { print $_->text, "\n"; }, 
                             },
                           pretty_print  => 'indented');

$twig->parsefile($ARGV[0]); 
$twig->purge;


5

If you need a general XML parsing method, don't use regex. If you just need what you said (remove everything except the subclass-code and city), and you are sure that those two tags will appear with no "strange" things inside (XML entities, CDATA sections) and will not appear inside other CDATA fragments, etc., you can simply do:

$/ = undef; # slurp mode: read the whole file in one go
open(READFILE, "FILENAME") or die "Cannot open FILENAME: $!";
$t = <READFILE>;
close READFILE;
# replace the whole document with the two captured values; /s lets . match newlines
$t =~ s#^.*<subclass-code>(.*?)</subclass-code>.*<city>(.*?)</city>.*$#$1 - $2#s;
# in case the tags could appear in the opposite order - uncomment the following
# $t =~ s#^.*<city>(.*?)</city>.*<subclass-code>(.*?)</subclass-code>.*$#$2 - $1#s;
print $t;

Edit: A little more (ahem) powerful, following poster's requirements:

while( $t =~ m#<pub-code>([^<\s]*).*?<subclass-code>(.*?)</subclass-code>.*?<city>(.*?)</city>#sg) {
  print "$1 : $2 | $3 \n";
}

But please stop here and don't go further, this way leads to hell...
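For the multi-record case raised in the comments, the same loop can be wrapped in a function that returns records in the `SUBCLASS|CITY` format from the question. This is only a sketch under the answer's own assumptions (each `<subclass-code>` is followed by its `<city>`, and neither contains markup, entities, or CDATA); the `extract_pairs` name and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "SUBCLASS|CITY" string per pair found in $xml.
sub extract_pairs {
    my ($xml) = @_;
    my @pairs;
    # /s lets . cross newlines; /g resumes after the previous match on each pass
    while ( $xml =~ m#<subclass-code>(.*?)</subclass-code>.*?<city>(.*?)</city>#sg ) {
        push @pairs, "$1|$2";
    }
    return @pairs;
}

# Sample data modelled on the comment thread, not the real export.
my $xml = <<'XML';
<pub-code>PJ Projo.com
<subclass-code>TEST SUBCLASS</subclass-code>
<city>MIAMI</city>
</pub-code>
<pub-code>PJ Projo.com
<subclass-code>TEST SUBCLASS</subclass-code>
<city>ORLANDO</city>
</pub-code>
XML

print "$_\n" for extract_pairs($xml);
```

Returning a list instead of printing inside the loop keeps the fragile regex in one place, which makes it easier to swap in a real parser later.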

3 Comments

What if the file I am reading has more than one <pub-code> element and I want to display all of the results?
<pub-code>PJ Projo.com <subclass-code>TEST SUBCLASS</subclass-code> <city>MIAMI</city> </pub-code>
<pub-code>PJ Projo.com <subclass-code>TEST SUBCLASS</subclass-code> <city>ORLANDO</city> </pub-code>
RESULT
TEST SUBCLASS - MIAMI
TEST SUBCLASS - ORLANDO
Ahh... can we at least assume that <subclass-code> and <city> always appear (in pairs, in that order)? The more general the problem, the less adequate a regex solution will be.
Yes, they always appear just like in the original example. So it's <subclass-code> then <city>, but all of the other elements within <pub-code> would still appear. Does that make sense?
5

The easy way of doing this would be to use XML::Simple in conjunction with a dumper (I like XXX; most use Data::Dumper). This will load the XML into a Perl data structure where you can cherry-pick the elements you want (or explicitly delete the ones you don't).

Using the toolset I just suggested you can see a running example of what you want:

use strict;
use warnings;
use XML::Simple;

my $data = XML::Simple::parse_fh( \*DATA );       
my $sub = $data->{'run-date'}{'pub-code'}{'ad-type'};

foreach my $k ( keys %$sub ) {
  delete $sub->{$k}
    unless $k =~ /subclass-code|city/
  ; 
} 

use XXX;
XXX $data;


1

Pay attention to what the other posters said: it is highly recommended to stay away from regex when parsing markup languages.

However, a pure-Perl way of accomplishing what you want without any modules, assuming the aforementioned tags do exist, is:

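# Patterns for the two opening tags we want to keep.
# (subclass3-code and subclass4-code are deliberately not matched.)
my $reg_subclass = '<subclass-code>';
my $reg_city     = '<city>';

open my $in,  '<', "input file"  or die "Cannot open input file: $!";
open my $out, '>', "output file" or die "Cannot open output file: $!";
while ( my $line = <$in> ) {
    # copy only the lines that contain one of the wanted tags
    if ( $line =~ /$reg_subclass|$reg_city/ ) {
        print $out $line;
    }
}
close $in;
close $out;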
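The loop above keeps the matching lines with their tags still attached. A possible next step, sketched below, strips the tags and joins each pair into the `SUBCLASS|CITY` format from the question. It assumes each element sits on a single line and that `<subclass-code>` comes before its `<city>`; the `pairs_from_handle` name and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reads lines from a filehandle and returns
# "SUBCLASS|CITY" records, one per subclass-code/city pair.
sub pairs_from_handle {
    my ($fh) = @_;
    my ( @records, $subclass );
    while ( my $line = <$fh> ) {
        if ( $line =~ m{<subclass-code>(.*?)</subclass-code>} ) {
            $subclass = $1;                  # remember it until the city shows up
        }
        elsif ( defined $subclass && $line =~ m{<city>(.*?)</city>} ) {
            push @records, "$subclass|$1";
            undef $subclass;                 # start fresh for the next ad
        }
    }
    return @records;
}

# Demo on an in-memory file (sample data, not the real export).
my $sample = "<subclass-code>TEST SUBCLASS</subclass-code>\n"
           . "<state>FL</state>\n"
           . "<city>MIAMI</city>\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for pairs_from_handle($in);
close $in;
```

Because it reads one line at a time, this variant never holds the whole file in memory, which may matter for the large input the question mentions.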


0

I'm not an expert on what Perl supports, but generically, I think you want to use XPath here. (This might be what the Twig library above uses, I'm not sure).

Pseudo-Perl example (please excuse the crudeness; it's been a while since I really used Perl extensively):

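# The elements are nested several levels below <web-export>, so search with //
$subclassExpr = "//subclass-code/text()";
$cityExpr = "//city/text()";

# xml_dom_parse and xpath_evaluate are placeholders, not real functions
$domObject = xml_dom_parse( $xml_doc );

$subClass = xpath_evaluate( $domObject, $subclassExpr );
$city = xpath_evaluate( $domObject, $cityExpr );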
