Background


Consider the following input:

<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>

After processing, I need it to look like the following:

<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>

Implementation


This is all I need to do for no more than 5 files, so using anything other than a regular expression seems like overkill. Anyway, I came up with the following (Perl) regular expression to accomplish this:

$data =~ s/(<\s*Foo)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;

Problems


This works well; however, there is one obvious problem. The pattern is "dumb": if CustomAttribute has already been added, the operation outlined above will blindly append another CustomAttribute=... to the tag.

A simple solution, of course, is to write a secondary expression that will attempt to match for CustomAttribute prior to running the replacement operation.

Questions


Since I'm rather new to the scripting language and regular expression worlds, I'm wondering whether it's possible to solve this problem without introducing any host language constructs (i.e., an if-statement in Perl), and simply use a more "intelligent" version of what I wrote above?

  • omg another HTML + regex comment, HTML is not a regular language, use a parser for that Commented Dec 11, 2009 at 17:59
  • If it's only for five files, make the change manually. Commented Dec 11, 2009 at 18:17
  • It's not so much your unfamiliarity with the language that is creating a problem, but not understanding that this "dumb" solution solves the only problem you posed. When you present a simple example and don't explain the principle behind the transformation, you're going to get solutions that "solve" the simple example, but might not "solve" the general application. So you really want all your tags to have a CustomAttribute="TRUE" attribute? Commented Dec 11, 2009 at 19:34
  • @Ether: five files doesn't mean he has only five substitutions. Maybe each file has 1,000,000 records :) Commented Dec 11, 2009 at 19:43
  • Don't use regexes because you think your problem is small. Use them when they are the tool that will give you the right answer. Commented Dec 11, 2009 at 19:44

4 Answers

5

I won't beat you over the head with how you should not use a regex for this. I mean, you shouldn't, but you obviously know that from what you said in your question, so moving on...

Something that will accomplish what you're asking for is called a negative lookahead assertion (usually (?!...)), which basically says that you don't want the match to apply if the pattern inside the assertion is found ahead of this point. In your example, you don't want it to apply if CustomAttribute is already present, so:

$data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
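
As a quick check of the idea, here is a minimal sketch that runs the lookahead-guarded substitution twice over the sample input from the question. The helper name add_custom_attribute is made up for this illustration, and the replacement keeps a space before the new attribute so the result stays well-formed when the tag is written on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): wraps the
# lookahead-guarded substitution so it can be applied repeatedly.
sub add_custom_attribute {
    my ($data) = @_;
    # (?!...) rejects a <Foo whose tag already contains CustomAttribute=
    $data =~ s/(<\s*Foo)(?![^>]*\bCustomAttribute=)(.*?)>/$1$2 CustomAttribute="TRUE">/sig;
    return $data;
}

# Sample input from the question, one attribute per line.
my $input = qq{<Foo\n    Bar="bar"\n    Baz="1"\n    Bax="bax"\n>};

my $once  = add_custom_attribute($input);   # attribute gets added
my $twice = add_custom_attribute($once);    # lookahead blocks a second copy

print $once, "\n";
print $once eq $twice
    ? "second pass changed nothing\n"
    : "second pass changed the data\n";
```

This only shows that the guard makes the substitution safe to re-run on simple input. It still assumes no '>' appears inside an attribute value.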

3 Comments

Awesome, thanks for actually answering the question and not lecturing about RegEx + XML.
XML allows plain > in attribute values, so <Foo Bar=">"> would fail. And <Foo Bar="CustomAttribute="> would also fail.
@Gumbo: Yes! Thank you for pointing out why regexp for HTML is inherently fragile. Without descent parsing one is bound to find these sorts of anomalies; probably months or years after the fragile code has been in use by users who expected it to be more robust.
5

This sounds like it might be a job for XML::Twig, which can process XML and change parts of it as it runs into them, including adding attributes to tags. I suspect you'd spend as much time getting used to Twig as you would finding a regex solution that only mostly worked. And, at the end, you'd know enough Twig to use it on the next project. :)

4

Time for a lecture I guess ;--)

I am not sure why you think using a full-blown XML processor is overkill. It is actually easier to write the code using the proper tool. A regexp will be more complex and will rely on unwritten assumptions about the data, which is dangerous. Some of those assumptions are likely to be: no '>' in attribute values, no CDATA sections, no non-ASCII characters in tag or attribute names, consistent attribute value quoting...

The only thing a regexp will give you is the assurance that the output keeps the original format of the data (in your case, the fact that the attributes are each on a separate line). But if your format is consistent, a parser can reproduce it on output, and if it is not, the format should not matter, unless you keep your XML in a line-oriented revision control system.

Here is an example with XML::Twig. It assumes you have enough memory to keep any entire Foo element in memory, and it works even on the admittedly contrived bit of XML in the DATA section. It would probably be just as easy to do with XML::LibXML (read the XML in memory, select all Foo elements, add attribute to each of them, output, that's 5 easy to understand lines by my count).

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my( $tag, $att, $val)= ( 'Foo', 'CustomAttribute', 'TRUE');

XML::Twig->new(                 # only process those elements
                twig_roots => { $tag => sub { 
                                              # add/set attribute
                                              $_->set_att( $att => $val); 
                                              # output and free memory
                                              $_->flush;
                                            }
                              },
                twig_print_outside_roots => 1, # output everything else
                pretty_print => 'cvs',         # seems to be the right format
              )
         ->parse( \*DATA)  # use parsefile( $file) if parsing... a file
         ->flush;          # not needed in XML::Twig 3.33
__DATA__
<doc>
  <Foo
      Bar="bar"
      Baz="1"
      Bax="bax"
  >
  here is some text
  </Foo>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
  <Foo><![CDATA[<Foo no_att="1" CustomAttribute="TRUE">tricked?</Foo>]]></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="bax"
  ></Foo>
  <Foo
      Bar=">"
      Baz="1"
      Bax="bax"
      CustomAttribute="TRUE"
  ></Foo>
  <Foo
      Bar="
>"
      Baz="1"
      Bax="b
ax"
      CustomAttribute="TR
UE"
  ></Foo>
</doc>
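
The XML::LibXML route mentioned above (read the XML into memory, select all Foo elements, add the attribute, output) could be sketched roughly as follows. This is an illustration only: the inline sample document is a shortened variant of the DATA section, and it assumes XML::LibXML is installed and the whole document fits in memory.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened sample, including a '>' in an attribute value and a CDATA decoy.
my $xml = <<'XML';
<doc>
  <Foo Bar=">" Baz="1"/>
  <Foo CustomAttribute="TRUE"><Foo no_att="1"/></Foo>
  <bar><![CDATA[<Foo no_att="1">tricked?</Foo>]]></bar>
</doc>
XML

# Parse the whole document into memory.
my $doc = XML::LibXML->load_xml( string => $xml );

# Select every Foo element and add (or overwrite) the attribute.
for my $foo ( $doc->findnodes('//Foo') ) {
    $foo->setAttribute( CustomAttribute => 'TRUE' );
}

print $doc->toString;
```

Unlike the Twig version, this sketch makes no attempt to keep the one-attribute-per-line layout, since the serializer writes each start tag in its own style.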

1

You can send your matches through a function with the 'e' modifier for more processing.

my $str = qq`
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
    CustomAttribute="TRUE"
>
<Foo
    Bar="bar"
    Baz="1"
    Bax="bax"
>
`;

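# Append the attribute unless the tag's contents already mention it.
sub foo {
    my $guts = shift;
    $guts .= qq` CustomAttribute="TRUE"` if $guts !~ m/CustomAttribute/;
    return $guts;
}
# /e evaluates the replacement as code, so foo() runs on each tag's contents.
# \b rather than a literal space lets "<Foo" be followed by a newline.
$str =~ s/(<Foo\b)([^>]*)(>)/$1.foo($2).$3/sge;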

1 Comment

What about <Foo Bar="CustomAttribute">?
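
One way to address that case while staying with a regex is to treat quoted attribute values as units: skip over them when looking for the end of the tag, and blank them out before checking for the attribute name. The sketch below is an illustration under those assumptions, not part of the original answer. The helper name add_attribute is hypothetical, and it still knows nothing about CDATA sections or XML comments.

```perl
use strict;
use warnings;

# One attribute value: a double- or single-quoted string (may contain '>').
my $quoted = qr/"[^"]*"|'[^']*'/;

# Hypothetical helper: add the attribute to every <Foo ...> start tag lacking it.
sub add_attribute {
    my ($xml) = @_;
    $xml =~ s{(<Foo\b)((?:$quoted|[^>"'])*?)(/?>)}{
        my ( $open, $guts, $close ) = ( $1, $2, $3 );
        # Blank out quoted values so only attribute names are searched.
        ( my $names = $guts ) =~ s/$quoted/""/g;
        $guts .= ' CustomAttribute="TRUE"'
            unless $names =~ /\bCustomAttribute\s*=/;
        "$open$guts$close";
    }ge;
    return $xml;
}

# The awkward cases raised in the comments, plus an already-tagged element.
my $in = join "\n",
    '<Foo Bar=">">',
    '<Foo Bar="CustomAttribute">',
    '<Foo Bar="x" CustomAttribute="TRUE">';

my $out = add_attribute($in);
print "$out\n";
```

The first two tags gain the attribute and the third is left alone. A real parser remains the safer choice once CDATA sections or comments can contain text that looks like a Foo tag.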
