Extract text from HTML/XML tags in Perl

Question

I have an HTTPS response like this:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>

The key names are static, and I need to use a variable to grab specific values.
I'm using decode_entities to convert the text to HTML.
Sometimes the key is posted twice in the response, but it's the same value.

XML::LibXML doesn't help much here since it's not a well-formed XML file/string.

I tried to use a regex to get it, like this:

sub get_key {
    my $start = '<key name="'.$_[0].'">\n<value>';
    print $_[1];
    my $end = "</value>";
    print " [*] Trying to get $_[0]\n";
    print "Start: $start  --- End $end";
    if($_[1] =~ /\b$start\b(.*?)\b$end\b/s){
        my $result = $1;
        print $result, "\n\n";
        return $result;
    }
}

get_key("string_to_search", $string_from_response);

I need to extract the text inside <value> for a given key:

<key name="variable">
 <value>Grab me</value>
</key>

Do you have entities &lt;value&gt; (shown in code) or <value> (referred to in text)?
– zdim, Commented Jun 27, 2019 at 22:37
@zdim I have &lt;value&gt; in the original response. Using decode_entities, it transforms into <value>.
– Jose CastilLo Stronghold, Commented Jun 27, 2019 at 23:00
I haven't swallowed the Mojo pill yet --which is to say I haven't even looked at it yet-- and I'm a bigger fan of XPath selectors than CSS selectors, but I'd definitely start with Mojo::DOM. The only other lax parser I know is HTML::Parser, and it's dated.
– ikegami, Commented Jun 27, 2019 at 23:03
(Of course, I assume you meant XML::LibXML->new->parse_html_string failed to handle it)
– ikegami, Commented Jun 27, 2019 at 23:05

ikegami · Accepted Answer · 2019-06-28 20:58:26Z

5

Once you've extracted the embedded XML document, you should use a proper XML parser.

use XML::LibXML qw( );

my $xml_doc = XML::LibXML->new->parse_string($xml);

for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
   my $key = $key_node->getAttribute("name");
   my $val = $key_node->findvalue("value/text()");
   say "$key: $val";
}
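
Since the question asks for the value of one specific key chosen through a variable, the loop above can be turned into a lookup. This is only a minimal sketch: the helper name get_key is borrowed from the question, its signature is my own assumption, and the sample XML is inlined so the snippet runs on its own.

```perl
use strict;
use warnings;
use feature 'say';
use XML::LibXML qw( );

# Hypothetical helper (not part of XML::LibXML): return the <value> text of
# the first <key> whose name attribute equals $name, or nothing if no such
# key exists. Stopping at the first match also covers duplicated keys.
sub get_key {
    my ($xml_doc, $name) = @_;
    for my $key_node ($xml_doc->findnodes('/localconfig/key')) {
        my $key = $key_node->getAttribute('name');
        next if !defined($key) || $key ne $name;
        return $key_node->findvalue('value/text()');
    }
    return;
}

# Sample data taken from the question, already extracted and decoded.
my $xml = <<'XML';
<localconfig>
  <key name="ssl_default">
    <value>sha256</value>
  </key>
  <key name="some variable">
    <value>1024</value>
  </key>
</localconfig>
XML

my $xml_doc = XML::LibXML->new->parse_string($xml);
my $wanted  = 'some variable';
say get_key($xml_doc, $wanted);
```

Comparing the attribute in Perl rather than interpolating $name into the XPath expression avoids quoting problems when a key name contains quotes.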

So that leaves us with the question of how to extract the XML document.

Option 1: XML::LibXML

You could use XML::LibXML and simply tell it to ignore the error (the spurious </p> tag).

use Encode qw( encode_utf8 );

my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);  # $html is an open file handle here
my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );

Option 2: Regex Match

You could probably get away with using a regex pattern match.

use HTML::Entities qw( decode_entities );

my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );

Option 3: Mojo::DOM

You could use Mojo::DOM to extract the embedded XML document.

use Encode    qw( decode encode_utf8 );
use Mojo::DOM qw( );

my $decoded_html = decode($encoding, $html);
my $html_doc = Mojo::DOM->new($decoded_html);    
my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );

The problem with Mojo::DOM is that you need to know the encoding of the document before you pass the document to the parser (because you must pass it decoded), but you need to parse the document in order to extract the encoding of the document from the document.
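
One way around that chicken-and-egg problem, when the HTTP response headers are still available, is to take the charset from the Content-Type header instead of from the document. The following is only a sketch under that assumption: charset_from_content_type is a hypothetical helper, the header value is hard-coded for illustration, and falling back to UTF-8 is my own choice.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding );

# Hypothetical helper: pick the charset out of a Content-Type header value,
# falling back to UTF-8 when none is given or Encode does not recognise it.
sub charset_from_content_type {
    my ($content_type) = @_;
    my ($charset) = ( $content_type // '' ) =~ /;\s*charset\s*=\s*"?([^\s";]+)/i;
    return 'UTF-8' if !defined($charset) || !find_encoding($charset);
    return $charset;
}

# Illustrative values; in practice both come from the HTTP response.
my $encoding     = charset_from_content_type('text/html; charset=UTF-8');
my $html         = "<pre>caf\xC3\xA9</pre>";      # raw bytes as received
my $decoded_html = decode($encoding, $html);       # characters, ready for Mojo::DOM->new
```

If the header carries no charset, the <meta> declaration inside the document is the only remaining hint, and the circular dependency described above comes back.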

(Of course, you could use Mojo::DOM to parse the XML too.)

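Note that the HTML fragment <p><pre></pre></p> means <p></p><pre></pre></p> (the <pre> start tag implicitly closes the <p>, which leaves the final </p> spurious), and both XML::LibXML and Mojo::DOM handle this correctly.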

edited Jun 28, 2019 at 20:58

answered Jun 27, 2019 at 23:42

ikegami


2 Comments

Grinnz Over a year ago

my $xml = Mojo::DOM->new($html)->at('pre')->text should be sufficient (and include decoding entities), if there are other <pre> tags the CSS selector will need to be more specific.

Grinnz Over a year ago

An example of using Mojo::DOM to parse the XML (if you do this, it would be better to leave it decoded from the extraction):

for my $key_node (Mojo::DOM->new->xml(1)->parse($xml)->find('localconfig > key')->each) { my $key = $key_node->{name}; my $val = $key_node->children('value')->first->text; ... }

zdim · Answer · 2019-07-19 07:38:47Z

1

The hard part of this problem is that the presented document mixes formats -- it has a valid HTML structure, but also with XML-like elements which appear "tossed-in" without a particular pattern. There are ways to disentangle these parts, even as they aren't bulletproof and come with trade-offs.

In this case XML::LibXML can do the whole job, as it can deal with bad data, but note the warnings below.

use warnings;
use strict;
use feature 'say';

use Encode qw(encode_utf8); 
use XML::LibXML;

my $html_doc = XML::LibXML->new(recover => 2)->parse_html_fh(\*DATA);
my $xml = encode_utf8(
    $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r
);
my $xml_doc = XML::LibXML->new->parse_string($xml);

say for $xml_doc->findnodes('//key');  # node object stringifies

__DATA__
<html>
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>

The parser option recover is what allows the above parsing to go through:

A true value turns on recovery mode which allows one to parse broken XML or HTML data. [...]

As useful as this can be, it of course begs for extreme caution as we are willfully using bad data (or, rather, non-conforming data here). This case brings two such issues.

Regex is needed for entities. The example deals with those under <pre>, but there may be more. We need to inspect input and may need code changes for different data.
This makes use of the observation that the XML-like "tags" are given by entities (&lt; etc), which are left as they are during parsing and only decoded later. However ...
... this isn't a rule and if some aren't given that way (but rather as <key>), then those can make the library parse the document into a (slightly) different tree. This again requires inspection of input, and possibly code adjustments for any new data.
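
To make the last two points concrete, here is a small sketch (the one-line HTML inputs and the helper count_keys are made up for illustration) that parses the same content once with the "tags" given as entities and once with them written literally. In the first case the parser sees only text under <pre>; in the second it builds <key> elements, so the tree differs.

```perl
use strict;
use warnings;
use feature 'say';
use XML::LibXML qw( );

my $parser = XML::LibXML->new( recover => 2 );

# "Tags" given as entities: to the HTML parser this is plain text under <pre>.
my $escaped = $parser->parse_html_string(
    '<html><body><pre>&lt;key name="a"&gt;1&lt;/key&gt;</pre></body></html>'
);

# Same content with literal tags: <key> becomes an (unknown) element node.
my $literal = $parser->parse_html_string(
    '<html><body><pre><key name="a">1</key></pre></body></html>'
);

# Hypothetical helper: count the <key> element nodes in a parsed document.
sub count_keys {
    my ($doc) = @_;
    my @nodes = $doc->findnodes('//key');
    return scalar @nodes;
}

say 'key elements, entities left in place: ', count_keys($escaped);
say 'key elements, literal tags:           ', count_keys($literal);
say 'text under <pre>, entities in place:  ', $escaped->findvalue('//pre');
```

The third line of output shows why no separate decode_entities call is needed after parsing: the text value handed back by the parser already has the entities decoded.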

Thanks to ikegami for bringing up the point of first parsing the data and only then dealing with the entities, for a discussion, and for the XML-code above. The original version of the XML-related code above first decoded and so ended up with a slightly different tree.

Also note that HTML::TreeBuilder does process this data with ignore_unknown set. Then the problem is that these new "tags" (<key> etc) are just data for it, so any practical use of the obtained tree would probably have to rely on regex.

One other way to deal with this data is with the flexible, high-level HTML parser, Marpa::HTML.

A very basic demo

use warnings;
use strict;
use feature 'say';

use Marpa::HTML qw(html);
use HTML::Entities qw(decode_entities);    

my $input = do { local $/; <DATA> };    
my $html = decode_entities($input);

my (@attrs, @cont);

my $marpa_key = Marpa::HTML::html( 
    \$html,
    {
        'key' => sub {
            push @attrs, Marpa::HTML::attributes();
            push @cont, Marpa::HTML::contents();
        },
    }
);

for my $i (0..$#cont) {
    say "For attribute \"name=$attrs[$i]->{name}\" the <key> has: $cont[$i]"
}

__DATA__
...the same as in the first example, data from the question...

This collects views as it parses, using API for attributes and contents, for element <key>.

It may in principle be suitable for your problem since it accepts the mere semantics of <...> as an element. But those aren't treated as XML, which may be a downside if your data relies on XML more than shown. And, of course, this is a different approach with its own rules.

Note that the basic logic and use of the module is that each coderef returns, and this return is used for the element that it fired on; the rest of text is unchanged. So this is natural for changing particular elements of a document.

I've used it differently above, only to collect information about the "tags." That code prints

For attribute "name=ssl_default" the <key> has: 
    <value>sha256</value>

For attribute "name=some variable" the <key> has: 
    <value>1024</value>

edited Jul 19, 2019 at 7:38

answered Jun 28, 2019 at 2:44

zdim


4 Comments

ikegami Over a year ago

Your approach of decoding entities before parsing (not to mention mixing XML and HTML together) will break things! You should do it in two steps as shown earlier.

zdim Over a year ago

@ikegami (1) Yeah, I am aware of problems and I warn. But this does give some handle on the (mixed!) data, which is the thorny part -- so I offer it, considering it still useful (2) With entities decoded one gets good tags; how is that going to break things? (3) All this is for (something like) this example (4) Altogether, there are problems with this but it gives them something to work with. (That "Once you've extracted" is the "devil in the details," as there is no cleanly "embedded XML". The Mojo approach would also require one to hack at it by hand first)

zdim Over a year ago

@ikegami "not to mention mixing XML and HTML together" --- that is precisely why I posted this. That's the hard part, that it's mixed, practically randomly. There's a price to pay, and the other option is regex (together with some other approach, like what you show). I'll edit and make warnings more dire, and will show a sample of Marpa ... but I am afraid that there is no clean way, in general, to parse a document that uses arbitrary (XML-like) tags within HTML.

ikegami Over a year ago

Let us continue this discussion in chat.
