How can I modify HTML files in Perl?

Question

I have a bunch of HTML files, and I want to look in each one for the keyword 'From Argumbay' and replace it with an href that I have. I thought it would be very simple at first: I opened each HTML file, loaded its content into an array (list), replaced each keyword with s///, and dumped the contents back to the file. The problem is that the keyword sometimes also appears inside an href, in which case I don't want it replaced, or it appears inside some tags and such.

An EXAMPLE: http://www.astrosociety.org/education/surf.html

I would like my script to replace each occurrence of the word 'here' with an href that I have in $href, but as you can see, there is another 'here' which is already href'ed, and I don't want it to href that one again. In this case there aren't any additional 'here's apart from the one in the href, but let's assume that there are.

I want to replace the keyword only if it's plain text. Any ideas?

BOUNTY EDIT: Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases the SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I am missing here?

Without seeing your code, it would be hard to say where the problem is.
– Ether, Commented Oct 10, 2010 at 15:30
I accepted, and it says 'You may award the bounty in 7 hours'. Why?
– snoofkin, Commented Oct 15, 2010 at 6:01

cjm · Accepted Answer · 2010-10-14 22:11:11Z

8

+50

To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.

A common idiom with HTML-Tree is to use a recursive function that modifies the tree:

use strict;
use warnings;
use 5.008;

use File::Slurp 'read_file';
use HTML::TreeBuilder;

sub replace_keyword
{
  my $elt = shift;

  return if $elt->is_empty;

  $elt->normalize_content;      # Make sure text is contiguous

  my $content = $elt->content_array_ref;

  for (my $i = 0; $i < @$content; ++$i) {
    if (ref $content->[$i]) {
      # It's a child element, process it recursively:
      replace_keyword($content->[$i])
          unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
    } else {
      # It's text:
      if ($content->[$i] =~ /here/) { # your keyword or regexp here
        $elt->splice_content(
          $i, 1, # Replace this text element with...
          substr($content->[$i], 0, $-[0]), # the pre-match text
          # A hyperlink with the keyword itself:
          [ a => { href => 'http://example.com' },
            substr($content->[$i], $-[0], $+[0] - $-[0]) ],
          substr($content->[$i], $+[0])   # the post-match text
        );
      } # end if text contains keyword
    } # end else text
  } # end for $i in content index
} # end replace_keyword


my $content = read_file('foo.shtml');

# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
$html->eof;   # Tell the parser the input is complete

my $body = $html->look_down(qw(_tag body));
replace_keyword($body);

# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;

print STDOUT $content; # Replace STDOUT with a suitable filehandle

The output from as_HTML will be syntactically correct HTML, but not necessarily nicely-formatted HTML for people to view the source of. You can use HTML::PrettyPrinter to write out the file if you want that.

edited Oct 14, 2010 at 22:11

answered Oct 11, 2010 at 0:17

cjm


10 Comments

snoofkin Over a year ago

WOOOOOOOOOOOOOOOOOOOOOOOOW! Seriously man, where did you come from? I couldn't ask for a better solution! Amazing. It works perfectly, but now I will need a few hours to understand what you did there (-: Thanks a lot!

cjm Over a year ago

I use HTML-Tree quite a bit. Also, the substr expressions are just copied out of the docs for @-, because using $&, etc. slows down your program.
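For anyone puzzling over those substr expressions, here is a minimal sketch of the same idiom outside of HTML-Tree. After a successful match, $-[0] holds the offset where the match starts and $+[0] the offset just past its end, so the pre-match, match, and post-match pieces can be cut out with substr. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'Click here for details';   # made-up sample text

my ($pre, $match, $post);
if ($text =~ /here/) {
    # $-[0] is the start offset of the whole match, $+[0] the offset just past it
    $pre   = substr($text, 0, $-[0]);               # text before the match
    $match = substr($text, $-[0], $+[0] - $-[0]);   # the matched text itself
    $post  = substr($text, $+[0]);                  # text after the match
}

print join('|', $pre, $match, $post), "\n";   # prints: Click |here| for details
```

These three pieces are what the answer passes to splice_content, with the middle one wrapped in an <a> element.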

cjm Over a year ago

@soulSurfer2010, you can't use new_from_file if you want to keep comments, because you have to call store_comments before loading the file. Instead, call new, then store_comments, then parse_file.
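A rough sketch of that call order, assuming HTML::TreeBuilder is installed. The helper name load_with_comments and the file name in the usage note are hypothetical; new, store_comments, and parse_file are the methods named in the comment above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: build a tree from a file while keeping comments.
sub load_with_comments {
    my ($path) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1);           # must be set before any input is parsed
    $tree->parse_file($path)
        or die "Cannot parse $path: $!";

    return $tree;
}
```

Calling something like `my $tree = load_with_comments('foo.shtml');` followed by `print $tree->as_HTML;` should then include the comments in the output. Note that for a whole-file parse like this, comments that sit outside <html> still end up as children of the <html> node.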

cjm Over a year ago

@soulSurfer2010, that's a current limitation of HTML::TreeBuilder. Everything is a child of the <html> node, even comments that appeared before or after it.

cjm Over a year ago

@soulSurfer2010, I've added a workaround to my example. Since your example isn't a complete HTML document, you can wrap it in a <body> tag, and the comments won't get rearranged.

Brad Mace · Answer · 2010-10-10 15:50

3

If tags matter in your search and replace, you'll need to use HTML::Parser.

This tutorial looks a bit easier to understand than the documentation with the module.

edited Oct 10, 2010 at 16:06

cjm

answered Oct 10, 2010 at 15:50

Brad Mace

7 Comments

snoofkin Over a year ago

Can I use HTML::TreeBuilder instead? I'm asking because I've never used either of them.

cjm Over a year ago

@soulSurfer2010, yes HTML::TreeBuilder can help you do that. (It's built on top of HTML::Parser.)

Brad Mace Over a year ago

@soulSurfer2010 Yeah, that looks like it would work too. The real point I was making is that you'll need to actually parse the HTML, not just apply regexes to the source, which is what I'm guessing you're doing based on what little info you provided.

snoofkin Over a year ago

Yes, I tried using regexes and it all worked fine until I hit something like 'From Argum bay in love' that was already inside an href. My script then href'ed it again, which is not what I'm looking for. I want to replace the text with my href (= hyperlink) only if it is NOT already href'ed.

snoofkin Over a year ago

Well, I can use HTML::TreeBuilder or HTML::TokeParser to find out whether a keyword is href'ed, but my problem at the moment is: if it's not, how do I replace it with my href? I'm parsing the file with the module rather than working directly on a list that I can edit and then print to a file. Any suggestions?

PP. · Answer · 2010-10-11 08:08:41Z

1

If you wanted to go a regular-expression-only type method and you're prepared to accept the following provisos:

- this will not work correctly within HTML comments
- this will not work where the < or > character is used within a tag
- this will not work where the < or > character is used and not part of a tag
- this will not work where a tag spans multiple lines (if you're processing one line at a time)

If any of the above conditions do exist then you will have to use one of the HTML/XML parsing strategies outlined by other answers.

Otherwise:

my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";

1 while $html =~ s/
  \A             # beginning of string
  (              # group all non-searchfor text
    (            # sub group non-tag followed by tag
      [^<]*?     # non-tags (non-greedy)
      <[^>]*>    # whole tags
    )*?          # zero or more (non-greedy)
    [^<]*?       # plain text between the last tag and the search text
  )
  \Q$searchfor\E # search text
/$1$replacewith/sx;

Note that this will NOT work if $replacewith contains $searchfor as plain text: the "1 while" loop would keep matching the text it just inserted and never finish (so don't put "From Argumbay" back into the replacement text).
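As a rough, self-contained illustration of this approach, the sketch below wraps the substitution in a hypothetical link_keyword helper and runs it on a made-up one-line input. It assumes plain text is allowed between the last tag and the keyword, and the sample HTML and the example.com URL are invented for the demo. The provisos listed above still apply, and it does not skip text that is already inside an <a> element.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each occurrence of $searchfor found outside a tag.
sub link_keyword {
    my ($html, $searchfor, $replacewith) = @_;

    1 while $html =~ s/
        \A                        # beginning of string
        (                         # everything before the keyword
          (?: [^<]* <[^>]*> )*?   # text followed by a complete tag, repeated
          [^<]*?                  # plain text after the last tag
        )
        \Q$searchfor\E            # the keyword itself
    /$1$replacewith/sx;

    return $html;
}

my $in  = q{<p title="From Argumbay">News From Argumbay today</p>};
my $out = link_keyword(
    $in,
    'From Argumbay',
    q{<a href="http://example.com/">From_Argumbay</a>},
);

print "$out\n";
```

With this input the keyword inside the title attribute is left alone and only the one in the paragraph text is replaced.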

answered Oct 11, 2010 at 8:08

PP.

10.9k7 gold badges48 silver badges60 bronze badges

1 Comment

snoofkin Over a year ago

I had already come up with a similar solution a few minutes before visiting this site today, and I couldn't accept these provisos. Thanks!
