XML regex complete and working needs optimization in Perl

Question

I have the regex below to parse XML tags inside HTML code blocks. I cannot use XML libraries. It works as expected with some tests, but I would like some experts to optimize it if needed: I will use it to parse many blocks of code to build a whole template, so it may run about 50 times per template on average, and every clock tick will count for me.

The regex I used for the XML tags is:

(<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)>([^<]*)(<\!\[CDATA\[(.*?)\]\]>)?(</vars>)?)

Then I parse the attributes with this regex:

([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')

Here is the Perl test code:

use strict;
use warnings;
no warnings 'uninitialized';

my $text = <<"END_HTML";
<vars type="var" name="selfopened" content="tag without closing slash" size="30" width="200px" >
<vars type="plug" name="selfclosed" content="self closed tag" size="30" width="200px" />
<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>
<vars id="left-part" width="400px" height="300px"><![CDATA[ 
    cdata start here is may have html tags and 'single' and "double" qoutes
    another cdata line
]]></vars>
<vars name="singlelinecdata" width="400px" height="300px"><![CDATA[cdata start here is may have html tags and 'single' and "double" qoutes]]></vars>
</vars>
END_HTML

while ( $text =~ m{
    (<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)>([^<]*)(<\!\[CDATA\[(.*?)\]\]>)?(</vars>)?)
}sxgi ) {
    my ($match, $attrs, $value, $cdata, $cdata_content, $closing) = ( $1, $2, $3, $4, $5, $6 );
    print "match: $match, attrs: $attrs, value: $value, cdata: $cdata, closing: $closing\n\n";

    # parse attributes to key, value pairs
    while ( $attrs =~ m{
        ([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')
    }sxg ) {
        my $key = $1;
        my $val = ( $2 ? $3 : $4 );
        print "attr: $key=$val\n";
    }
    print "\n";
}

I am sorry to have to say this, but your solution doesn't work as well as you think it does. For a start, your subexpression (?:"[^"]*"|'[^']*'|[^"'<>])* may as well be just [^>]* for all the good it's doing. Regular expressions are just about excusable for parsing extremely simple bits of XML, but this code contains several bugs waiting to reveal themselves. Please explain your reasons for avoiding an XML library.
– Borodin, Commented Jun 1, 2014 at 19:24
The code works now after Miller removed the extra lines I inserted to make it readable. As for the attrs regex, attribute values may be single quoted, double quoted, or not quoted at all, and it works.
– daliaessam, Commented Jun 1, 2014 at 20:09
The functionality of your code hasn't changed. Miller edited it just to make it readable. I assure you that it doesn't work as you think it does, and it is just a matter of luck that it seems to function with the data you have used. You will get errors if you use this code on live data, and you may not notice them because it could silently ignore some of the data. Please explain your reasons for avoiding an XML library.
– Borodin, Commented Jun 1, 2014 at 21:10
@Miller: Re "also added closing tag to data so it will actually work", the tag was just HTML before and it was there in the code. Please chip in with your thoughts on this guy's question; it's clear he doesn't believe me.
– Borodin, Commented Jun 1, 2014 at 21:13
@daliaessam: first rule of optimization: do not optimize until you know it isn't fast enough.
– ysth, Commented Jun 1, 2014 at 21:54
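
The disagreement in the comments above about unquoted attribute values can be checked directly. The following is a minimal sketch, assuming a made-up attribute string in which size is unquoted; it runs the attribute regex from the question and reports which keys were captured. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical attribute string: size is unquoted, the other two are quoted.
my $attrs = q{type="var" size=30 name='demo'};

my %seen;
while ( $attrs =~ m{([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')}sxg ) {
    # $2 is set only when the double-quoted branch matched
    $seen{$1} = $2 ? $3 : $4;
}

for my $key ( sort keys %seen ) {
    print "$key=$seen{$key}\n";
}
print "size was silently skipped\n" unless exists $seen{size};
```

Because both branches of the pattern require an opening quote, the unquoted attribute produces no match and no error, which is the kind of silent data loss the comments warn about.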

Miller · Accepted Answer · 2014-06-02 01:14:35Z

I would strongly advise you to use a prebuilt framework for this. Not knowing your exact use cases, I can't advise you fully, but perhaps Template::Toolkit would work.

If you insist on trying to solve this using regular expressions, I'd advise you to make it more restrictive:

Don't allow self-closing tags
Don't allow nested vars tags (maybe you already don't?)

Also, from a parsing perspective, I'd advise some design changes:

When pulling potential matches, be as permissive as possible.
Then when parsing the found matches, be as restrictive as possible with good error reporting

Something like the following is a reworking of your script:

use strict;
use warnings;

use Data::Dump;

my $text = do {local $/; <DATA>};

while ( $text =~ m{<vars\b(.*?)\s*>(.*?)</vars>}sxgi ) {
    my ($attr, $content) = ($1, $2);

    # Separate and validate Attributes:
    my %attr;
    while ($attr =~ /\G(?:\s+([^=]+)=(?:"([^"]*)"|'([^']*)'|(\S+))|(.+))/sg) {
        if (defined $5) {
            die "Invalid attribute found: <$5> in $attr";
        }
        $attr{$1} = $2 // $3 // $4;
    }

    # Do any processing of content here including anything for CDATA

    # Done:
    dd {
        attr => \%attr,
        content => $content,
    }
}

__DATA__
<vars type="var" name="selfopened" content="tag without closing slash" size="30" width="200px" ></vars>
<vars type="plug" name="selfclosed" content="self closed tag" size="30" width="200px" ></vars>
<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>
<vars id="left-part" width="400px" height="300px"><![CDATA[ 
    cdata start here is may have html tags and 'single' and "double" qoutes
    another cdata line
]]></vars>
<vars name="singlelinecdata" width="400px" height="300px"><![CDATA[cdata start here is may have html tags and 'single' and "double" qoutes]]></vars>

Outputs:

{
  attr => {
    content => "tag without closing slash",
    name    => "selfopened",
    size    => 30,
    type    => "var",
    width   => "200px",
  },
  content => "",
}
{
  attr => {
    content => "self closed tag",
    name    => "selfclosed",
    size    => 30,
    type    => "plug",
    width   => "200px",
  },
  content => "",
}
{
  attr => { height => "300px", name => "hasclosing", type => "var", width => "400px" },
  content => "content of tag with closing",
}
{
  attr => { height => "300px", id => "left-part", width => "400px" },
  content => "<![CDATA[ \n    cdata start here is may have html tags and 'single' and \"double\" qoutes\n    another cdata line\n]]>",
}
{
  attr => { height => "300px", name => "singlelinecdata", width => "400px" },
  content => "<![CDATA[cdata start here is may have html tags and 'single' and \"double\" qoutes]]>",
}

mirod · Accepted Answer · 2014-06-02 07:51:23Z

2 comments:

From your question it looks like you're building a templating system. Why? There are many already written that work well, some with caching (Template::Toolkit, for example, as already mentioned).

Then if you really want to build your own, why use a semi-XML format?

If you want an XML syntax, then use a real parser. The XML spec is tiny as standards go, but it's still 32 pages long, and I am not sure implementing all of it is worth it for your templating system! For example, I don't think your parser deals with more than one CDATA section in the content, which XML allows, and which is in fact required if you want to include the string ']]>' in your CDATA. So essentially you are implementing a subset of XML that's not defined, except by the implementation of the parser. I am not sure that's going to be great for users, or for you when you have to debug your regexp or extend it to accept more XML features. For reference, have a look at a shallow XML parser in Perl: http://www.cs.sfu.ca/~cameron/REX.html#AppA. It looks like it takes a bit more than a single regexp, and I don't think it even deals with entities.
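
To illustrate the CDATA point, here is a small sketch under stated assumptions: the sample content is made up, and the substitution only unwraps CDATA sections (it does not decode entities or handle any other XML feature). The text "a ]]> b" has to be split across two adjacent CDATA sections, so code that reads only one section would lose part of it.

```perl
use strict;
use warnings;

# Hypothetical element content: the literal text "a ]]> b" is split across
# two CDATA sections, because "]]>" cannot appear inside a single one.
my $content = '<![CDATA[a ]]]]><![CDATA[> b]]>';

# Replace every CDATA section with its raw text, not just the first one.
( my $text = $content ) =~ s/<!\[CDATA\[(.*?)\]\]>/$1/gs;

print "$text\n";
```

The /g flag is what makes every section count; a pattern with a single optional CDATA group would stop after the first.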

Alternatively, you could use a format that is not XML, with a simple, unambiguous description that's less than 32 pages. It would be easier to parse and at the very least you would avoid comments telling you to use a proper parser ;--)

daliaessam · Accepted Answer · 2014-06-03 14:54:15Z

I reduced the XML tags to two forms: one self-closing, and one with a closing tag.

Self-closing XML tag example:

<vars type="mod" name="weather" city="knox" countery="usa" />

XML tag with a closing tag, example:

<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>

Then I used one regex to match each of these two tags:

(<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)/>)

(<vars(\s*[^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)>(.*?)<\/vars>)

Below is the Perl test code.

use strict;
use warnings;
no warnings 'uninitialized';

my ($match, $attrs, $value);

my $text = <<HTML;
{first_name} <vars name="first_name" /> {first_name_notes}
{last_name} <vars name="last_name" /> {last_name_notes}
{email} <vars type="var" name='email' /> {email_notes}
{website} <vars type="var" name="website" /> {website_notes}
<vars type="mod" name="weather" city="knox" countery="usa" />
<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>
<vars type="action" name="Stats::Users->active" />

<vars id="left-part" width="400px" height="300px"><![CDATA[ 
    cdata start here is may have html tags and 'single' and "double" qoutes
    another cdata line
]]></vars>
<vars name="singlelinecdata" width="400px" height="300px" content="ahmed<b>class/subclass"><![CDATA[cdata start here is may have html tags and 'single' and "double" qoutes]]></vars>

HTML

while ( $text =~ m{
    (<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)/>)
    |
    (<vars(\s*[^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)>(.*?)<\/vars>)
}sxgi ) {

    if ( defined $1 ) {
        # self-closing tag: there is no content
        ($match, $attrs, $value) = ( $1, $2, undef );
    }
    else {
        # tag with a closing </vars>
        ($match, $attrs, $value) = ( $3, $4, $5 );
    }
    print "match:\n$match \nattrs: $attrs\nvalue: $value\n";

    # parse attributes into key and value pairs
    while ( $attrs =~ m{([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')}sxg ) {
        my $key = $1;
        my $val = ( $2 ? $3 : $4 );
        print "attr: $key=$val\n";
    }

    print "\n";
}

I would still like an expert to optimize these regexes if possible.
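
Before tuning either pattern, it may help to measure how long matching actually takes. The following is a rough sketch using the core Benchmark module; the template text, the pass count of 200, and the helper name count_tags are made up for illustration, and real templates will behave differently.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Hypothetical template: 50 self-closing tags and 50 tags with a closing tag.
my $template = '';
for my $n ( 1 .. 50 ) {
    $template .= qq{<p>{label_$n} <vars type="var" name="field_$n" /></p>\n};
    $template .= qq{<vars name="block_$n" width="400px">content $n</vars>\n};
}

# The combined alternation from this answer, compiled once.
my $combined = qr{
    (<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)/>)
  | (<vars(\s*[^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)>(.*?)<\/vars>)
}sxi;

# Count how many tags the pattern finds in one pass over the text.
sub count_tags {
    my ($text) = @_;
    my $n = 0;
    $n++ while $text =~ /$combined/g;
    return $n;
}

# Time 200 full passes over the template.
my $t = timeit( 200, sub { count_tags($template) } );

print "tags per pass: ", count_tags($template), "\n";
print "200 passes: ", timestr($t), "\n";
```

If the reported time is already small compared with the rest of the page build, the regex is probably not the part worth optimizing.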
