
I have an HTTPS response like this:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>
  • The key names are static, and I need to use a variable to grab specific values.
  • I'm using decode_entities to decode the entity-encoded text into markup.
  • Sometimes a key appears twice in the response, but with the same value.

XML::LibXML doesn't help much here, since the response is not a well-formed XML file/string.

I tried to use a regex to get it, like this:

sub get_key {
    my $start = '<key name="'.$_[0].'">\n<value>';
    print $_[1];
    my $end = "</value>";
    print " [*] Trying to get $_[0]\n";
    print "Start: $start  --- End $end";
    if($_[1] =~ /\b$start\b(.*?)\b$end\b/s){
        my $result = $1;
        print $result, "\n\n";
        return $result;
    }
}

get_key("string_to_search", $string_from_response);

I need to extract the value of a given key, that is, the text inside its <value> element:

<key name="variable">
 <value>Grab me</value>
</key>
  • Try Mojo::DOM. It uses CSS rules to traverse HTML. Commented Jun 27, 2019 at 22:25
  • Do you have entities &lt;value&gt; (shown in code) or <value> (referred to in text) ? Commented Jun 27, 2019 at 22:37
  • @zdim i have &lt;value&gt; in the original response. Using decode_entities it transforms into <value> Commented Jun 27, 2019 at 23:00
  • I haven't swallowed the Mojo pill yet --which is to say I haven't even looked at it yet-- and I'm a bigger fan of XPath selectors than CSS selectors, but I'd definitely start with Mojo::DOM. The only other lax parser I know is HTML::Parser, and it's dated. Commented Jun 27, 2019 at 23:03
  • (Of course, I assume you meant XML::LibXML->new->parse_html_string failed to handle it) Commented Jun 27, 2019 at 23:05

2 Answers

5

Once you've extracted the embedded XML document, you should use a proper XML parser.

use feature     qw( say );
use XML::LibXML qw( );

my $xml_doc = XML::LibXML->new->parse_string($xml);

for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
   my $key = $key_node->getAttribute("name");
   my $val = $key_node->findvalue("value/text()");
   say "$key: $val";
}
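Since the question asks for a lookup by key name rather than a full listing, the loop above can be wrapped in a small helper. The following is only a sketch: the get_key name is borrowed from the question, the signature is an assumption, and the inline XML stands in for the document extracted from the response. The name is compared in Perl rather than interpolated into the XPath expression, which avoids quoting problems.

```perl
use strict;
use warnings;
use feature     qw( say );
use XML::LibXML qw( );

# Hypothetical helper (not part of XML::LibXML): return the <value> text of
# the first <key> whose name attribute equals $name, or nothing if none does.
sub get_key {
    my ($xml_doc, $name) = @_;
    for my $key_node ($xml_doc->findnodes('/localconfig/key')) {
        my $attr = $key_node->getAttribute('name');
        next if !defined($attr) || $attr ne $name;
        return $key_node->findvalue('value');
    }
    return;
}

# Stand-in for the XML extracted from the response.
my $xml = <<'XML';
<localconfig>
  <key name="ssl_default">
    <value>sha256</value>
  </key>
  <key name="some variable">
    <value>1024</value>
  </key>
</localconfig>
XML

my $xml_doc = XML::LibXML->new->parse_string($xml);
say get_key($xml_doc, 'some variable');   # prints 1024
```

Returning on the first match also covers the case where the same key is repeated with the same value.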

So that leaves us with the question how to extract the XML document.

Option 1: XML::LibXML

You could use XML::LibXML and simply tell it to ignore the error (the spurious </p> tag).

use Encode      qw( encode_utf8 );
use XML::LibXML qw( );
my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);  # $html must be a file handle here
my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );

Option 2: Regex Match

You could probably get away with using a regex pattern match.

use HTML::Entities qw( decode_entities );

my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );
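If the regex route is taken all the way to a single value, a sketch could look like the following. It is not the answer's code: get_key_regex is a hypothetical name, and the patterns assume the layout shown in the question (one <pre> block, a name attribute in double quotes, <value> directly inside <key>). Compared with the attempt in the question, it drops the \b anchors (there is no word boundary before <), quotes the name with \Q...\E, and allows arbitrary whitespace between the tags.

```perl
use strict;
use warnings;
use feature        qw( say );
use HTML::Entities qw( decode_entities );

# Hypothetical helper: pull one value out of the entity-encoded <pre> block.
sub get_key_regex {
    my ($name, $html) = @_;

    # Grab the raw contents of the first <pre> element, or give up.
    my ($pre) = $html =~ m{<pre>(.*?)</pre>}s
        or return;

    # Only now turn &lt;key&gt; etc. into <key> etc.
    my $text = decode_entities($pre);

    my ($value) = $text =~ m{<key\s+name="\Q$name\E"\s*>\s*<value>(.*?)</value>}s;
    return $value;
}

# Shortened stand-in for the response body from the question.
my $html = <<'HTML';
<html><body><p>some text:
<pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre></p></body></html>
HTML

say get_key_regex('some variable', $html);   # prints 1024
```

This stays fragile in the usual regex ways, for example if the attribute order or quoting changes.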

Option 3: Mojo::DOM

You could use Mojo::DOM to extract the embedded XML document.

use Encode    qw( decode encode_utf8 );
use Mojo::DOM qw( );

my $decoded_html = decode($encoding, $html);
my $html_doc = Mojo::DOM->new($decoded_html);    
my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );

The problem with Mojo::DOM is that you need to know the encoding of the document before you pass it to the parser (because you must pass it decoded), but you need to parse the document in order to extract its encoding from the document itself.

(Of course, you could use Mojo::DOM to parse the XML too.)
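A sketch of doing both steps with Mojo::DOM might look like this. It assumes the HTML is already a decoded character string (the sample is plain ASCII, so the encoding question does not arise), and the %value_for hash is just an illustrative name. The extracted text is left decoded because it is handed straight back to Mojo::DOM.

```perl
use strict;
use warnings;
use feature   qw( say );
use Mojo::DOM qw( );

# Shortened stand-in for the (already decoded) response body.
my $html = <<'HTML';
<html><body><p>some text:
<pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre></p></body></html>
HTML

# Step 1: parse as HTML; ->text returns the entity-decoded contents of <pre>.
# The substitution drops the stray text before the first "<".
my $xml = Mojo::DOM->new($html)->at('pre')->text =~ s/^[^<]*//r;

# Step 2: parse the extracted text in XML mode and collect name => value pairs.
my $xml_dom = Mojo::DOM->new->xml(1)->parse($xml);
my %value_for;
for my $key_node ($xml_dom->find('localconfig > key')->each) {
    $value_for{ $key_node->attr('name') } = $key_node->at('value')->text;
}

say "$_ = $value_for{$_}" for sort keys %value_for;
```

A repeated key simply overwrites its earlier entry in the hash, which is harmless when the values are identical.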


Note that the HTML fragment <p><pre></pre></p> means <p></p><pre></pre>, and both XML::LibXML and Mojo::DOM handle this correctly.


2 Comments

my $xml = Mojo::DOM->new($html)->at('pre')->text should be sufficient (and include decoding entities), if there are other <pre> tags the CSS selector will need to be more specific.
An example of using Mojo::DOM to parse the XML (if you do this, it would be better to leave it decoded from the extraction): for my $key_node (Mojo::DOM->new->xml(1)->parse($xml)->find('localconfig > key')->each) { my $key = $key_node->{name}; my $val = $key_node->children('value')->first->text; ... }
1

The hard part of this problem is that the presented document mixes formats -- it has a valid HTML structure, but also with XML-like elements which appear "tossed-in" without a particular pattern. There are ways to disentangle these parts, even as they aren't bulletproof and come with trade-offs.

In this case XML::LibXML can do the whole job, as it can deal with bad data, but note the caveats below.

use warnings;
use strict;
use feature 'say';

use Encode qw(encode_utf8); 
use XML::LibXML;

my $html_doc = XML::LibXML->new(recover => 2)->parse_html_fh(\*DATA);
my $xml = encode_utf8( 
    $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r 
);
my $xml_doc = XML::LibXML->new->parse_string($xml);

say for $xml_doc->findnodes('//key');  # node object stringifies

__DATA__
<html>
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>

The parser option recover is what allows the above parsing to go through. From the documentation:

A true value turns on recovery mode which allows one to parse broken XML or HTML data. [...]

As useful as this can be, it of course calls for extreme caution, as we are willfully using bad data (or, rather, non-conforming data here). This case brings a few such issues.

  • Regex is needed for entities. The example deals with those under <pre>, but there may be more. We need to inspect input and may need code changes for different data.

  • This makes use of the observation that the XML-like "tags" are given by entities (&lt; etc), which are left as they are during parsing and only decoded later. However ...

  • ... this isn't a rule and if some aren't given that way (but rather as <key>), then those can make the library parse the document into a (slightly) different tree. This again requires inspection of input, and possibly code adjustments for any new data.
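The inspection mentioned in the last point can be partly automated. The sketch below is an assumption-laden illustration, not part of the approach above: pre_has_literal_tags is a hypothetical helper that parses the HTML in recovery mode and reports whether <pre> ended up with child elements, which would suggest the embedded "tags" arrived literally rather than as entities.

```perl
use strict;
use warnings;
use feature     qw( say );
use XML::LibXML qw( );

# Hypothetical helper: true if the first <pre> has element children, i.e. the
# embedded markup was parsed into the tree instead of staying as text.
sub pre_has_literal_tags {
    my ($html) = @_;
    my $doc = XML::LibXML->new( recover => 2 )->parse_html_string($html);
    my ($pre) = $doc->findnodes('//pre');
    return 0 if !$pre;
    my @child_elements = $pre->findnodes('*');
    return @child_elements ? 1 : 0;
}

# Two made-up inputs: the same fragment sent as entities and as literal tags.
my $as_entities = '<html><body><pre>&lt;key name="a"&gt;&lt;value&gt;1&lt;/value&gt;&lt;/key&gt;</pre></body></html>';
my $as_tags     = '<html><body><pre><key name="a"><value>1</value></key></pre></body></html>';

say pre_has_literal_tags($as_entities) ? 'literal tags' : 'entities only';
say pre_has_literal_tags($as_tags)     ? 'literal tags' : 'entities only';
```

A check like this only flags the difference; the extraction code would still need a separate branch for the literal-tag case.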

Thanks to ikegami for bringing up the point of first parsing the data and only then dealing with the entities, for a discussion, and for the XML-code above. The original version of the XML-related code above first decoded and so ended up with a slightly different tree.

Also note that HTML::TreeBuilder does process this data with ignore_unknown set. Then the problem is that these new "tags" (<key> etc) are just data for it, so any practical use of the obtained tree would probably have to rely on regex.


One other way to deal with this data is with the flexible, high-level HTML parser, Marpa::HTML.

A very basic demo

use warnings;
use strict;
use feature 'say';

use Marpa::HTML qw(html);
use HTML::Entities qw(decode_entities);    

my $input = do { local $/; <DATA> };    
my $html = decode_entities($input);

my (@attrs, @cont);

my $marpa_key = Marpa::HTML::html( 
    \$html,
    {
        'key' => sub {
            push @attrs, Marpa::HTML::attributes();
            push @cont, Marpa::HTML::contents();
        },
    }
);

for my $i (0..$#cont) {
    say "For attribute \"name=$attrs[$i]->{name}\" the <key> has: $cont[$i]"
}

__DATA__
...the same as in the first example, data from the question...

This collects the attributes and contents of each <key> element as it parses, using the module's API calls for attributes and contents.

It may in principle be suitable for your problem, since it accepts anything with the bare <...> form as an element. But those elements aren't treated as XML, which may be a downside if your data relies on XML more than shown. And, of course, this is a different approach with its own rules.

Note that the basic logic of the module is that each coderef returns a value, and that value is used in place of the element it fired on; the rest of the text is unchanged. So this is a natural fit for changing particular elements of a document.

I've used it differently above, only to collect information about the "tags." That code prints

For attribute "name=ssl_default" the <key> has: 
    <value>sha256</value>

For attribute "name=some variable" the <key> has: 
    <value>1024</value>

4 Comments

Your approach of decoding entities before parsing (not to mention mixing XML and HTML together) will break things! You should do it in two steps as shown earlier.
@ikegami (1) Yeah, I am aware of problems and I warn. But this does give some handle on the (mixed!) data, which is the thorny part -- so I offer it, considering it still useful (2) With entities decoded one gets good tags; how is that going to break things? (3) All this is for (something like) this example (4) Altogether, there are problems with this but it gives them something to work with. (That "Once you've extracted" is the "devil in the details," as there is no cleanly "embedded XML". The Mojo approach would also require one to hack at it by hand first)
@ikegami "not to mention mixing XML and HTML together" --- that is precisely why I posted this. That's the hard part, that it's mixed, practically randomly. There's a price to pay, and the other option is regex (together with some other approach, like what you show). I'll edit and make warnings more dire, and will show a sample of Marpa ... but I am afraid that there is no clean way, in general, to parse a document that uses arbitrary (XML-like) tags within HTML.
