I'm reading an HTML file and trying to extract some information from it. I've tried HTML parsers, but can't figure out how to use them to get the key text out. The original script reads the HTML file; this version is a minimal working example for Stack Overflow purposes.

#!/usr/bin/env perl

use 5.036;
use warnings FATAL => 'all';
use autodie ':default';
use Devel::Confess 'color';

sub regex_test ( $string, $regex ) {
    if ($string =~ m/$regex/s) {
        say "$string matches $regex";
    } else {
        say "$string doesn't match $regex";
    }
}
# the HTML text is $s
my $s = '      rs577952184 was merged into
      
        <a target="_blank"
           href="rs59222162">rs59222162</a>
      
';

regex_test ( $s, 'rs\d+ was merged into.*\<a target="_blank".+href="rs(\d+)/');

However, this doesn't match.

I think the problem is that the newline after "merged into" isn't being matched.

How can I alter the above regex to match $s?

  • I think you need to escape the backslashes in the string. Commented Nov 4, 2022 at 21:19
  • @Barmar but the original HTML, which contains the string, cannot be modified. I'm only trying to figure out how to change $regex Commented Nov 4, 2022 at 21:21
  • 2
    href="rs(\d+)/ the / looks like a typo for " Commented Nov 4, 2022 at 21:21
  • Not the original HTML, the argument to regex_test(). Commented Nov 4, 2022 at 21:21
  • E.g. regex_test($s, 'rs\\d+ was merged...') Commented Nov 4, 2022 at 21:24

2 Answers

2

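The problem is the trailing / character in $regex, which should either be omitted or changed to ". In $s, the digits after href="rs are followed by a closing double quote, not a slash, so the pattern can never match as written.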
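
As a quick check of that diagnosis, here is a minimal sketch that runs the question's pattern and a corrected one against the same sample text. The variable names ($original, $fixed, $original_matches, $fixed_matches) are made up for this illustration; the only change to the pattern is the final character.

```perl
use strict;
use warnings;
use feature 'say';

# Same sample text as in the question.
my $s = '      rs577952184 was merged into

        <a target="_blank"
           href="rs59222162">rs59222162</a>

';

# Pattern from the question: ends in "/", which never follows the digits in $s.
my $original = 'rs\d+ was merged into.*\<a target="_blank".+href="rs(\d+)/';

# Hypothetical corrected pattern: the trailing "/" is replaced by the closing quote.
my $fixed = 'rs\d+ was merged into.*\<a target="_blank".+href="rs(\d+)"';

# The /s modifier lets "." match newlines, so ".*" and ".+" can span lines.
my $original_matches = $s =~ m/$original/s ? 1 : 0;
my $fixed_matches    = $s =~ m/$fixed/s    ? 1 : 0;

say 'original: ', $original_matches ? 'matches' : "doesn't match";
say 'fixed:    ', $fixed_matches    ? 'matches' : "doesn't match";
```

With the sample text above, the original pattern should report no match and the fixed one a match. This suggests the newline after "merged into" was not the obstacle, since the /s modifier already lets . match it.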

2
use strict;
use warnings;
use feature 'say';

my $s = '      rs577952184 was merged into
      
        <a target="_blank"
           href="rs59222162">rs59222162</a>
      
';

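# \s+ spans the newlines and indentation between the pieces of text;
# (\d+) captures the numeric part of the ID inside href.
my $re = qr/rs\d+ was merged into\s+<a target="_blank"\s+href="rs(\d+)">rs\d+<\/a>/;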

regex_test($s,$re);

exit 0;

sub regex_test {
    my $string = shift;
    my $regex  = shift;
    
    say $string =~ m/$regex/s 
        ? "$string matches $regex"
        : "$string doesn't match $regex";
}

Output

      rs577952184 was merged into

        <a target="_blank"
           href="rs59222162">rs59222162</a>

 matches (?^:rs\d+ was merged into\s+<a target="_blank"\s+href="rs(\d+)">rs\d+</a>)
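
Since the goal in the question is to get the key text out rather than just test for a match, the capture group can be read back directly. The following is a small sketch under that assumption; the variable name $merged_into is hypothetical and not part of the code above.

```perl
use strict;
use warnings;
use feature 'say';

# Same sample text as above.
my $s = '      rs577952184 was merged into

        <a target="_blank"
           href="rs59222162">rs59222162</a>

';

# Same pattern as above; (\d+) captures the digits of the ID in href.
my $re = qr/rs\d+ was merged into\s+<a target="_blank"\s+href="rs(\d+)">rs\d+<\/a>/;

# In list context, a successful match returns the captured groups,
# so $merged_into is undef when the pattern does not match.
my ($merged_into) = $s =~ $re;

say defined($merged_into) ? "merged into rs$merged_into" : 'no match';
```

For the sample text this should print "merged into rs59222162". Using \s+ instead of .* or .+ keeps the pattern from running across unrelated markup when the real HTML file contains more than one such link.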
