2

I'm currently using a Perl script with XML::LibXML to process a given XML file. This works reasonably well, but I struggle when a node contains both child nodes and free text. An example input would be:

<Errors>
    <Error>
        this node works fine
    </Error>
    <Error>
        some text <testTag>with a node</testTag> in between
    </Error>
</Errors>

Expected output:

<Errors>
    <Error>
        this node works fine
    </Error>
    <Error>
        some text HELLOwith a nodeHELLO in between
    </Error>
</Errors>

I tried replaceChild("HELLO", $testTagNode); to replace the nodes with a string, which I could then (if needed) process further with a simple search-replace, but I only run into the "not a blessed reference" error. (I feel like that would have been pretty dirty if it actually worked that way.)
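For reference, a minimal sketch of a replaceChild call that does go through. replaceChild takes node objects for both arguments, which is likely why a plain string triggers the "not a blessed reference" error. The inline sample document and the variable names are assumptions made for illustration, and textContent flattens whatever is inside testTag to plain text, so nested elements would be lost.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed sample input: a single Error element taken from the example above.
my $doc = XML::LibXML->load_xml(string =>
    '<Error>some text <testTag>with a node</testTag> in between</Error>');

my ($tag) = $doc->findnodes('//testTag');

# Wrap the replacement string in a text node owned by the same document.
my $text = $doc->createTextNode('HELLO' . $tag->textContent . 'HELLO');

# replaceChild(new_node, old_node) is called on the parent of the old node.
$tag->parentNode->replaceChild($text, $tag);

my $result = $doc->documentElement->toString;
print "$result\n";
```

The replacement lands at the exact position the testTag element occupied, so the surrounding text stays in order.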

If I try to run a simple search-replace directly on the parent node like this

$error=~s/\</HELLO/g;

it will simply never trigger (no matter if I escape the < or not), because LibXML seems to ignore every tag that I don't specifically ask for; if I try to print out the second Error it will also give me just

some text with a node in between

which is actually a very nice functionality for the rest of the file, but not in this instance.

I can however do

$error->removeChild($testTagNode);

which shows me that it actually does get found, but doesn't help me further. I could theoretically remove the node, save the content, and then just insert the content back into the parent; the problem being that it needs to be at the exact location where it was before. The only thing that I could probably do is read in the entire file as a string, let the basic search-replace run over it BEFORE feeding it into LibXML, but that could create a pretty big overhead and isn't really a nice solution.

I feel like I'm overlooking something substantial, as this looks like a pretty basic task, but I can't seem to find anything. Maybe I'm just looking in the wrong direction, and there is a completely different approach available. Any help is appreciated.

1
  • Why are you trying to turn an XML Element into plain text in the first place? This feels like an XY Problem. Commented Jul 28, 2015 at 13:32

4 Answers

1

Removing the testTag element would remove all of its children too, so we must move the children of each testTag element into the parent of the testTag element before deleting the testTag element. In XML::LibXML, this is done as follows: (Tested)

for my $node ($doc->findnodes('/Errors/Error//testTag')) {
   my $parent = $node->parentNode();

   for my $child_node (
      XML::LibXML::Text->new("HELLO"),
      $node->childNodes(),
      XML::LibXML::Text->new("HELLO"),
   ) {
      $parent->insertBefore($child_node, $node);
   }

   $node->unbindNode();
}

Notes:

  • Handles testTag elements with any number of text and element children.
  • Handles testTag elements that aren't direct children of Error elements. Even handles nested testTag elements. (Use /Errors/Error/testTag instead of /Errors/Error//testTag if you only want to handle direct children of Error elements.)

5 Comments

Hmm, so effectively creating a new #text element wrapping that childnode? Neater than my approach.
@Sobrique, No. I don't wrap the children with a text node. That doesn't even make any sense since text nodes can't contain other nodes.
OK. I'll have to stare at it a little longer to figure out what's going on then.
@Sobrique, Removing the testTag element would remove all of its children too, so we gotta move the children out first. In array terms, we're doing splice(@$parent, $idx_of_node, 1, "HELLO", @$node, "HELLO"). The code moves the children of the testTag element to the testTag element's parent, positioning them just before the testTag element. Along with the children, the two requested text nodes are created there too. Finally, the now-empty testTag is removed.
The combination of LibXML::Text and insertBefore is exactly what I was looking for, and it works like a charm now. Beautiful little piece of code.
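The splice analogy from the comments can be tried on plain arrays. This is only a sketch: the array contents are made-up stand-ins for the DOM children, and no XML library is involved.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: @node holds the children of testTag,
# @parent holds the children of Error, with testTag at index 1.
my @node        = ('with a node');
my @parent      = ('some text ', \@node, ' in between');
my $idx_of_node = 1;

# Replace the one testTag slot with "HELLO", its children, and "HELLO".
splice(@parent, $idx_of_node, 1, 'HELLO', @node, 'HELLO');

my $result = join '', @parent;
print "$result\n";
```

The DOM version has to do this in steps (insertBefore for each piece, then unbindNode) because the child list of a node cannot be spliced directly.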
1

In XML::XSH2 which is just a wrapper around XML::LibXML, the following seems to work:

for //testTag/text() {
    insert text 'HELLO' prepend . ;
    insert text 'HELLO' append . ;
    move . replace .. ;
}

Translation back to XML::LibXML is left as an exercise for the reader.

2 Comments

I'm not sure it's acceptable to assume the children of testTag are only going to be text nodes.
Thanks for giving a different approach, but I will go with the ones that don't need more packages than I already have running.
1

First off - I don't think what you're trying to do is necessarily particularly useful. However, I'll note that when you process a node with nested content, like in your second example, you actually get 3 'nodes', two of which are designated as #PCDATA.
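A quick way to see those three nodes is to list the child node names. This sketch uses XML::LibXML rather than XML::Twig, so text nodes are reported as #text instead of #PCDATA; the inline sample document is an assumption for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed sample input: the second Error element from the question.
my $doc = XML::LibXML->load_xml(string =>
    '<Error>some text <testTag>with a node</testTag> in between</Error>');

# Collect the node name of every direct child of <Error>.
my @names   = map { $_->nodeName } $doc->documentElement->childNodes;
my $summary = join ', ', @names;
print "$summary\n";
```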

So you could do something like this:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my $twig = XML::Twig->new( pretty_print => 'indented_a' )->parse( \*DATA );
foreach my $error ( $twig->get_xpath('//Error') ) {
    my $replace_text;
    foreach my $child ( $error->children ) {
        my $tag = $child->tag;
        print "Child: $tag ", $child->trimmed_text, "\n";
        $tag = '' if $tag eq "#PCDATA";
        $replace_text .= $tag . $child->trimmed_text . $tag;
    }

    $error->set_text($replace_text);
    print $error ->trimmed_text, "\n";
}
print $twig->sprint;

__DATA__
<Errors>
    <Error>
        this node works fine
    </Error>
    <Error>
        some text <testTag>with a node</testTag> in between
    </Error>
</Errors>

This turns it into:

<Errors>
  <Error>this node works fine</Error>
  <Error>some texttestTagwith a nodetestTagin between</Error>
</Errors>

Obviously, you can then rename testTag to whatever you like.

(Bear with me - I'll have a look at how to do that in LibXML - unfortunately it doesn't install easily on my Windows box).

OK, so with XML::LibXML:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;


my $xml = XML::LibXML->load_xml( IO => \*DATA );
foreach my $error ( $xml -> findnodes ( '//Error' ) ) {
   my $replace_text; 
   foreach my $child ( $error -> childNodes ) {
      my $tag = $child -> nodeName;
      $tag = '' if $tag eq '#text';
      $replace_text .= $tag . $child -> textContent . $tag; 
      $error -> removeChild($child);
   } 
   $error -> appendTextNode($replace_text); 
}

print $xml -> toString;

__DATA__
<Errors>
    <Error>
        this node works fine
    </Error>
    <Error>
        some text <testTag>with a node</testTag> in between
    </Error>
</Errors>

3 Comments

I'm not sure it's acceptable to assume the children of testTag are only going to be text nodes.
The LibXML solution does work, though there are 3 points which need to be corrected in your code: $err (twice) has to be $error, and in the last line, the $replace has to be $replace_text. (Just leaving this here for future onlookers.) Other than that, works fine. I will accept @ikegami's answer though, as his can deal with nested tags. Currently, your assumption with only text in the testTag nodes holds true, but maybe that changes, and being future-safe is always a good thing. Thank you very much for your help.
Yeah, transcription error - LibXML doesn't install nicely on my Windows box.
-1

This should work

$error='<Errors>
<Error>
    this node works fine
</Error>
<Error>
    some text <testTag>with a node</testTag> in between
</Error>
</Errors>';

$error=~ s/<testTag>/HELLO/gs;
$error=~ s/<\/testTag>/HELLO/gs;
