How can I extract or change links in HTML with Perl?

Question

I have this input text:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">   <tbody><tr>     <td><table cellspacing="0" cellpadding="0" border="0" width="603">       <tbody><tr>         <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td>         <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td>       </tr>     </tbody></table></td>   </tr>   <tr>     <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603">       <tbody><tr>         <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td>       </tr>       <tr>         <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td>         <td width="580"><p>&nbsp;what y all heard?</p><p>i'm shark oysters.</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p>             <p>&nbsp;</p></td>         <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td>       </tr>       <tr>         <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td>       </tr>     </tbody></table></td>   </tr> </tbody></table> <p>&nbsp;</p></body></html>

As you can see, there's no newline in this chunk of HTML text, and I need to look for all image links inside, copy them out to a directory, and change the line inside the text to something like ./images/file_name.

Currently, the Perl code that I'm using looks like this:

my ($old_src,$new_src,$folder_name);
    foreach my $record (@readfile) {
        ## so the if else case for the url replacement block below will be correct
        $old_src = "";
        $new_src = "";
        if ($record =~ /\<img(.+)/){
            if($1=~/src=\"((\w|_|\\|-|\/|\.|:)+)\"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                push (@images, $new_src);
                $folder_name = "images";
            }## end if
        }
        elsif($record =~ /background=\"(.+\.jpg)/){
            $old_src = $1;
            my @tmp = split(/\/Elearning/,$old_src);
            $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
            push (@images, $new_src);
            $folder_name = "images";
        }
        elsif($record=~/\<iframe(.+)/){
            if($1=~/src=\"((\w|_|\\|\?|=|-|\/|\.|:)+)\"/){
                $old_src = $1;
                my @tmp = split(/\/Elearning/,$old_src);
                $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
                ## remove the ?rand behind the html file name
                if($new_src=~/\?rand/){
                    my ($fname,$rand) = split(/\?/,$new_src);
                    $new_src = $fname;
                    my ($fname,$rand) = split(/\?/,$old_src);
                    $old_src = $fname."\\?".$rand;
                }
                print "old_src::$old_src\n"; ##s7test
                print "new_src::$new_src\n\n"; ##s7test
                push (@iframes, $new_src);
                $folder_name = "iframes";
            }## end if
        }## end if

        my $new_record = $record;
        if($old_src && $new_src){
            $new_record =~ s/$old_src/$new_src/ ;
            print "new_record:$new_record\n"; ##s7test
            my @tmp = split(/\//,$new_src);
            $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/;
##  print "new_record2:$new_record\n\n"; ##s7test
        }## end if
        print WRITEFILE $new_record;
    } # foreach

This only works when the HTML contains newlines, because each line gets a single match and a single substitution. I thought about just looping the regex match, but then I would have to change the matched text to something else so that it doesn't match again.

Is there an elegant Perl way to do this? Maybe I'm just missing the obvious way of doing it. I already know that simply adding the global option (/g) doesn't work.

thanks. ~steve

htmlRegexParserQuestions++ (Obviously, there has to be one each day)
– Tomalak, Commented Dec 12, 2008 at 7:09

PhiLho · Accepted Answer · 2008-12-12 06:22:46Z

10

There are excellent HTML parsers for Perl; learn to use them and stick with them. HTML is complex: it allows > in attribute values, uses nesting heavily, etc. Using regexes to parse it, beyond very simple tasks (or machine-generated code), is prone to problems.

4 Comments

melaos Over a year ago

Hi there, I'm using mod_perl and we're running on Unix. I need management approval to add a module, so I was hoping to find a simple Perl way to get it done, or maybe something in the default modules. Thanks.

brian d foy Over a year ago

Well, you can always look at the module source. As for management, you can tell them that someone has already done it correctly, and if you get to use the existing correct solution, they save time and money and you can move on to the next problem.

melaos Over a year ago

Makes sense, I would much rather use the tested, proven method than another one of my horrendous hacks... hope my pointy-haired boss obliges.

Brad Gilbert Over a year ago

Nothing like re-inventing the wheel, and ending up with a rectangular 'wheel'.

brian d foy · Accepted Answer · 2008-12-12 07:43:09Z

4

I think you want my HTML::SimpleLinkExtor module:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $file );

my @imgs = $extor->img;

I'm not sure what exactly you're trying to do, but it surely sounds like one of the HTML parsing modules should do the trick if mine doesn't.
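As a rough sketch of what could come after extracting the links (this is not part of the module): the list from `$extor->img` can be turned into a copy plan using only core modules. The helper name `plan_copies`, the document root `/media/www/vprimary`, and the `./images` target are assumptions taken from the question's code, not anything the module provides. Files with the same name in different directories would collide here, so treat it as a starting point.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Path qw(make_path);

# Hypothetical helper: map each site-relative link to the file on disk
# and to the name it should get inside $dest_dir. Duplicates are dropped.
sub plan_copies {
    my ($doc_root, $dest_dir, @links) = @_;
    my %seen;
    return map  { [ $doc_root . $_, "$dest_dir/" . basename($_) ] }
           grep { !$seen{$_}++ } @links;
}

# Stand-in for the list returned by $extor->img
my @imgs = (
    '/Elearning_Platform/dp_templates/dp-template-images/awards.jpg',
    '/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif',
);

for my $pair (plan_copies('/media/www/vprimary', './images', @imgs)) {
    my ($from, $to) = @$pair;
    if (-f $from) {
        make_path('./images');    # no-op if the directory already exists
        copy($from, $to) or warn "could not copy $from: $!\n";
    }
    else {
        print "missing: $from (would go to $to)\n";
    }
}
```

The second element of each pair is also the value the rewritten `src` attribute would need, so the same plan could drive both the copy and the link rewrite.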

2 Comments

melaos Over a year ago

Well, basically, I'm trying to export the HTML as an external file, so I need to copy the images out into an images folder and change the img src values in the exported HTML to point at them.

brian d foy Over a year ago

That's the sort of information you should include in your question, not buried in a comment. :)

VonC · Accepted Answer · 2008-12-12 07:10:20Z

2

If you must avoid any additional module, like an HTML parser, you could try:

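# $string holds the whole document; /g finds every link in it
while ($string =~ m/(?:<\s*(?:img|iframe)[^>]+src\s*=\s*"([\w\\\/.:?=-]+)"|background\s*=\s*"([^">]+\.jpg))/g) {
  # $1 is set for an img/iframe src, $2 for a background attribute
  my $old_src = defined $1 ? $1 : $2;
  my @tmp = split(/\/Elearning/, $old_src);
  my $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
  if ($new_src =~ /\?rand/) {
    # remove the ?rand suffix and push into @iframes
    $new_src =~ s/\?.*//s;
    push @iframes, $new_src;
  }
  else {
    push @images, $new_src;
  }
}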

That way, you apply the regex to the whole source at once (newlines included) and get more compact code. It also allows for extra whitespace between attributes and their values.
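The loop above only collects the links. For the rewriting half of the question, one possible core-only sketch is a single `s///ge` pass over the whole string, which avoids the "change the matched line to something else" problem because each match is replaced as it is found. The helper name `localize_links` is made up for illustration, and for brevity everything goes to `./images/` rather than to separate images and iframes folders. It is a plain regex, so it will also touch any other `src` attribute (a `script` tag, for instance) and it assumes double-quoted values.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: rewrite every src/background link in $html to
# "./images/<file name>" and return the new HTML plus the original links.
sub localize_links {
    my ($html) = @_;
    my @found;

    # Group 1 keeps the attribute name, '=' and the opening quote;
    # group 2 is the link. Any query string (such as ?rand=...) is
    # dropped before the file name is taken.
    $html =~ s{(\b(?:src|background)\s*=\s*")([^"]+)}{
        my ($prefix, $link) = ($1, $2);
        push @found, $link;
        (my $path = $link) =~ s/\?.*//s;
        $prefix . './images/' . basename($path);
    }gie;

    return ($html, \@found);
}

my $html = '<td background="/a/left.gif"><img height="1" src="/a/b/top.gif" alt="" /></td>';
my ($new_html, $links) = localize_links($html);
print "$new_html\n";
print "copy: $_\n" for @$links;
```

The returned list holds the original links, so it can be fed to whatever copies the files into the images directory.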

3 Comments

Axeman Over a year ago

People really should leave comments for down-voting. +1 because you're answering for a particular all-too-real case.

VonC Over a year ago

Just got back to my post. That was down-voted ? Sure, an HTML parser is the way to go, but I like to also answer to the actual case of the user. Thank you Axeman for recognizing this "answer" for what it is.

melaos Over a year ago

Yeah, this answer matches my case properly, as I really can't introduce more modules unless necessary :)
