Find and replace characters between XML tags

Question

I have an XML file that is not line-oriented. The content between the tags <tag1> and </tag1> holds some trashed variables from the code that generated it (I am not able to correct that right now). I would like to change the characters within these tags to correct them. The characters are sometimes special characters.

I have this Perl one-liner to show me the contents between the tags, but now I want to be able to replace in the file what it has found.

perl -0777 -ne 'while (/(?<=perform_cnt).*?(?=\<\/perform_cnt)/s) {print $& . "\n";      s/perform_cnt.*?\<\/perform_cnt//s}' output_error.txt

Here's an example of the XML. Notice the junk characters between the perform_cnt tags.

<text1>120105728</text1><perform_cnt>ÈPm=</perform_cnt>
<text1>120106394</text1><perform_cnt>†AQ;4K\_Ô23{YYÔ@Nx</perform_cnt>

I need to replace these with something like a 0.

Please update your question with a sample of the input file that you need to process.
– Ωmega, Commented Apr 17, 2012 at 12:52

brian d foy · Accepted Answer · 2012-04-17 15:09:34Z

8

I love XML::Twig for these sorts of things. It takes a little getting used to, but once you understand the design (and a little about DOM processing), many things become extremely easy:

use v5.10;       # enables say, which the handler below uses
use XML::Twig;

my $xml = <<'HERE';
<root>
<text1>120105728</text1><perform_cnt>ÈPm=</perform_cnt>
<text1>120106394</text1><perform_cnt>†AQ;4K\_Ô23{YYÔ@Nx</perform_cnt>
</root>
HERE

my $twig = XML::Twig->new(   
    twig_handlers => { 
        perform_cnt   => sub { 
            say "Text is " => $_->text;  # get the current text

            $_->set_text( 'Buster' );    # set the new text
            },
      },
    pretty_print => 'indented',
    );

$twig->parse( $xml );
$twig->flush;

With indented pretty printing, I get:

<root>
  <text1>120105728</text1>
  <perform_cnt>Buster</perform_cnt>
  <text1>120106394</text1>
  <perform_cnt>Buster</perform_cnt>
</root>

edited Apr 17, 2012 at 15:09

answered Apr 17, 2012 at 14:43

brian d foy

Ωmega · Answer · 2012-04-17 14:01 (edited 2020-06-20 09:12:55Z)

0

It is bad practice to use a regex for XML parsing.

Anyway, the code is:

#!/usr/bin/perl

use strict;
use warnings;

my $tag = 'perform_cnt';

open my $fh, '<', 'file.txt' or die "Cannot open file.txt: $!";
foreach (<$fh>) {
  s/(<$tag>)(.*?)(<\/$tag>)/$1$3/g;
  print "$_";
}
close $fh;

And output is:

<text1>120105728</text1><perform_cnt></perform_cnt>
<text1>120106394</text1><perform_cnt></perform_cnt>
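
A variation on the same regex idea, as a sketch rather than something tested against the asker's real file: the question says the file is not line-oriented and asks for a 0 inside the tags, so this version works on the whole text at once and uses /s so that . also matches newlines. The sub name replace_tag_content and the sample input (ASCII junk instead of the original characters) are made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Replace the content of every <$tag>...</$tag> element with $replacement.
# /s lets . match newlines, so elements that span several lines are handled;
# \Q...\E quotes the tag name in case it contains regex metacharacters.
sub replace_tag_content {
    my ($xml, $tag, $replacement) = @_;
    $xml =~ s{(<\Q$tag\E>).*?(</\Q$tag\E>)}{$1$replacement$2}gs;
    return $xml;
}

# Hypothetical sample: the second element deliberately contains a newline.
my $input = "<text1>120105728</text1><perform_cnt>junk=</perform_cnt>\n"
          . "<text1>120106394</text1><perform_cnt>more{junk\n</perform_cnt>";

print replace_tag_content($input, 'perform_cnt', '0'), "\n";
```

The same substitution should also work as a one-liner that edits the file in place and keeps a .bak backup: perl -0777 -pi.bak -e 's{(<perform_cnt>).*?(</perform_cnt>)}{${1}0$2}gs' output_error.txt (the braces in ${1}0 are needed, otherwise Perl reads it as $10). It is still a regex and not a parser, so it assumes the tags carry no attributes and are never nested.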

edited Jun 20, 2020 at 9:12

CommunityBot

answered Apr 17, 2012 at 14:01

Ωmega

9 Comments

Ωmega Over a year ago

If you wanna eliminate <perform_cnt></perform_cnt> from output, then replace in code /$1$3/ with //.

gaussblurinc Over a year ago

Also, print "$_" is not the best way to output; use print;

Ωmega Over a year ago

@loldop - If you are looking for short code, then maybe. Otherwise I don't see a reason for that. Short code then can look like s/(<$tag>)(.*?)(<\/$tag>)/$1$3/g && print for <$fh>; replacing the entire foreach loop.

gaussblurinc Over a year ago

It's the same. If you want, use print; print "\n"; or print "$_\n"; but ordinarily I use a say function: say{ return (@_,"\n");}

Ωmega Over a year ago

@loldop - I know what that is, but it is just not standard usage, and say is only available from Perl 5.10 on, I believe, so not every Perl has it.
