I have a bunch of HTML files, and I want to look in each one for the keyword 'From Argumbay' and replace it with a hyperlink that I have. I thought it would be very simple at first: I opened each HTML file, loaded its content into an array (list), replaced each occurrence of the keyword with s///, and dumped the contents back to the file. The problem is that the keyword sometimes already appears inside an href, in which case I don't want it replaced, or it can appear inside tags and such.

An EXAMPLE: http://www.astrosociety.org/education/surf.html

I would like my script to replace each occurrence of the word 'here' with a hyperlink that I have in $href, but as you can see, there is another 'here' which is already href'ed, and I don't want it to be href'ed again. In this case there aren't any other 'here's apart from the one in the href, but let's assume that there are.

I want to replace the keyword only if it's plain text. Any ideas?

BOUNTY EDIT: Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I'm missing here?

  • Without seeing your code, it would be hard to say where the problem is. Commented Oct 10, 2010 at 15:30
  • Can you give sample HTML lines? Commented Oct 10, 2010 at 15:34
  • Maybe the accepted answer should get the bounty too? :) Commented Oct 14, 2010 at 22:11
  • I accepted, it says 'You may award the bounty in 7 hours', why? Commented Oct 15, 2010 at 6:01

3 Answers


To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.

A common idiom with HTML-Tree is to use a recursive function that modifies the tree:

use strict;
use warnings;
use 5.008;

use File::Slurp 'read_file';
use HTML::TreeBuilder;

sub replace_keyword
{
  my $elt = shift;

  return if $elt->is_empty;

  $elt->normalize_content;      # Make sure text is contiguous

  my $content = $elt->content_array_ref;

  for (my $i = 0; $i < @$content; ++$i) {
    if (ref $content->[$i]) {
      # It's a child element, process it recursively:
      replace_keyword($content->[$i])
          unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
    } else {
      # It's text:
      if ($content->[$i] =~ /here/) { # your keyword or regexp here
        $elt->splice_content(
          $i, 1, # Replace this text element with...
          substr($content->[$i], 0, $-[0]), # the pre-match text
          # A hyperlink with the keyword itself:
          [ a => { href => 'http://example.com' },
            substr($content->[$i], $-[0], $+[0] - $-[0]) ],
          substr($content->[$i], $+[0])   # the post-match text
        );
      } # end if text contains keyword
    } # end else text
  } # end for $i in content index
} # end replace_keyword


my $content = read_file('foo.shtml');

# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");

my $body = $html->look_down(qw(_tag body));
replace_keyword($body);

# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;

print STDOUT $content; # Replace STDOUT with a suitable filehandle

The output from as_HTML will be syntactically correct HTML, but not necessarily formatted nicely for people reading the source. You can use HTML::PrettyPrinter to write out the file if you want that.
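The three substr calls inside splice_content are the part that tends to need a second look. As a minimal sketch using only core Perl, here is the same split done on a plain string; the helper name split_around_match and the sample text are made up for illustration and are not part of HTML-Tree:

```perl
use strict;
use warnings;

# Hypothetical helper: split $text around the first match of $re.
# Returns (pre-match, matched text, post-match), or an empty list
# if the pattern does not match.
sub split_around_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    # $-[0] is the offset where the match starts, $+[0] where it ends.
    return (
        substr($text, 0, $-[0]),              # the pre-match text
        substr($text, $-[0], $+[0] - $-[0]),  # the matched keyword
        substr($text, $+[0]),                 # the post-match text
    );
}

my ($pre, $match, $post) = split_around_match('Click here for more', qr/here/);
print "[$pre][$match][$post]\n";   # prints "[Click ][here][ for more]"
```

In replace_keyword, the middle piece becomes the content of the new a element, while the first and last pieces stay behind as ordinary text nodes on either side of it.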

10 Comments

WOOOOOOOOOOOOOOOOOOOOOOOOW! Seriously man, where did you come from? I couldn't ask for a better solution! Amazing. It works perfectly, but now I will need a few hours to understand what you did there (-: Thanks a lot!
I use HTML-Tree quite a bit. Also, the substr expressions are just copied out of the docs for @-, because using $&, etc. slows down your program.
@soulSurfer2010, you can't use new_from_file if you want to keep comments, because you have to call store_comments before loading the file. Instead, call new, then store_comments, then parse_file.
@soulSurfer2010, that's a current limitation of HTML::TreeBuilder. Everything is a child of the <html> node, even comments that appeared before or after it.
@soulSurfer2010, I've added a workaround to my example. Since your example isn't a complete HTML document, you can wrap it in a <body> tag, and the comments won't get rearranged.

If tags matter in your search and replace, you'll need to use HTML::Parser.

This tutorial looks a bit easier to understand than the documentation with the module.

7 Comments

Can I use HTML::TreeBuilder instead? I'm asking because I've never used either of them.
@soulSurfer2010, yes HTML::TreeBuilder can help you do that. (It's built on top of HTML::Parser.)
@soulSurfer2010 Yeah, that looks like it would work too. The real point I was making is that you'll need to actually parse the HTML, not just apply regexes to the source, which is what I'm guessing you're doing based on what little info you provided.
Yes, I tried using regexes and all worked fine, until I had something similar to this: 'From Argum bay in love', which was already in an href. My script then href'ed it again, which is not what I'm looking for. I want to replace the text with my href (= hyperlink) only if it is NOT already href'ed.
Well, I can use HTML::TreeBuilder or HTML::TokeParser to find out whether a keyword is href'ed, but my problem at the moment is, if it's not, how do I replace it with my href? I'm parsing it using the module, not reading it directly into a list in which I can replace stuff and then print to a file. Any suggestions?

If you want to go with a regular-expression-only approach, you must be prepared to accept the following provisos:

  • this will not work correctly within HTML comments
  • this will not work where the < or > character is used within a tag
  • this will not work where the < or > character is used and not part of a tag
  • this will not work where a tag spans multiple lines (if you're processing one line at a time)

If any of the above conditions do exist then you will have to use one of the HTML/XML parsing strategies outlined by other answers.

Otherwise:

my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";

1 while $html =~ s/
  \A             # beginning of string
  (              # group all non-searchfor text
    (?:          # sub group non-tag followed by tag
      [^<]*?     # non-tags (non-greedy)
      <[^>]*>    # whole tags
    )*?          # zero or more (non-greedy)
    [^<]*?       # plain text just before the search text
  )
  \Q$searchfor\E # search text
/$1$replacewith/sx;

Note that this will NOT work if $searchfor matches $replacewith (so don't put "From Argumbay" back into the replacement text); the loop would then never terminate.
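To see the effect on a concrete string, here is a minimal, self-contained sketch. The sample $html and the example.com URL are invented for illustration, and the pattern allows plain text between the last tag and the keyword via an extra [^<]*?, which may differ slightly from the snippet above:

```perl
use strict;
use warnings;

# Hypothetical sample input: the keyword appears once inside an
# attribute (must stay untouched) and once in plain text.
my $html = q{<p title="From Argumbay">Greetings From Argumbay!</p>};

my $searchfor   = "From Argumbay";
my $replacewith = "<a href='http://example.com/'>From_Argumbay</a>";

1 while $html =~ s/
  \A                         # beginning of string
  (
    (?: [^<]*? <[^>]*> )*?   # text followed by a whole tag, repeated
    [^<]*?                   # plain text right before the keyword
  )
  \Q$searchfor\E             # the keyword, outside of any tag
/$1$replacewith/sx;

print "$html\n";
# prints: <p title="From Argumbay">Greetings <a href='http://example.com/'>From_Argumbay</a>!</p>
```

Because everything before the keyword must be either plain text or a complete tag, the copy inside the title attribute is never matched. The keyword would still be replaced if it appeared as the text of an existing link, so this only avoids matches inside tags, not inside anchors.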

1 Comment

I already came up with a similar solution a few minutes before visiting this site today, and I couldn't accept these provisos. Thanks!
