1

I have HTML tags in a document as follows:

><H2 
align="justify"
><FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER<FONT size="+2" color="#008AD9"><B>
</H2
>

I want to extract only ACCESS_NUMBER from the above HTML text.

How can I do this? I want to make sure only the text between all <H2> tags is extracted. Any help would be appreciated.

2
  • What have you tried? Also, is this literally how the code appears in the document (i.e., with < strangely wrapped), or did you improperly format the code section of your question? I did a bit of editing to make it all appear but otherwise left it as it was. The format of the code would affect a regular expression. Commented Jun 21, 2014 at 13:57
  • A parser might be able to extract the text you want if the document is consistent with some definition of "HTML" :-) Commented Jun 21, 2014 at 13:58

3 Answers

4

Use Mojo::DOM

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $HTML = <<"EOF";
<html>
<head>
<title>Test</title>
</head>
<body>
<h2>
<font><b>ACCESS_NUMBER</b></font> 
</h2>
</body>
</html>
EOF

my $dom = Mojo::DOM->new( $HTML );
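# at() returns the first matching element; find() returns a collection, which has no text method in current Mojolicious releases
print $dom->at('h2 font b')->text, "\n";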

For an 8-minute video tutorial on Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.
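The example above targets one specific nesting (h2, then font, then b). Since the question asks for the text between all <H2> tags, here is a minimal sketch that collects the text of every h2 element regardless of what is nested inside it. The helper name h2_texts and the sample HTML are made up for illustration; find, map, each and all_text are Mojo::DOM / Mojo::Collection methods. Whitespace is trimmed by hand because all_text does not trim in recent Mojolicious releases.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the trimmed text of every <h2> element,
# including text that sits inside nested tags such as <font> or <b>.
sub h2_texts {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    # all_text gathers text from all descendants; trim it ourselves.
    return map { s/^\s+|\s+$//gr } $dom->find('h2')->map('all_text')->each;
}

# Made-up sample document with two headings.
my $html = <<'EOF';
<html><body>
<h2 align="justify"><font size="+2"><b>ACCESS_NUMBER</b></font></h2>
<p>Not a heading</p>
<h2>Second heading</h2>
</body></html>
EOF

print "$_\n" for h2_texts($html);
```

Selecting on h2 alone, instead of 'h2 font b', keeps the script working if the markup inside the heading changes.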


4 Comments

Upvoted. 99.9% of the time it's much better to use a proper HTML parser like this method than a regex.
Exactly my thoughts ... and I was going to suggest Mojo::DOM :-)
Thank you for passing on the Mojo::DOM love while including the video. They really should add that to the pod. +1
They should make more videos.
1

Based on the sample given above, this will work, but something tells me your real HTML is more complicated and/or you actually want \d+ (digits only) in place of \w+.

#!/usr/bin/perl
use strict;
use warnings;

while(<DATA>){
    print "$1\n" if />(\w+)</;
}

__DATA__
<H2
   align="justify"
  <FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER<FONT size="+2" color="#008AD9"><B>S
  </H2
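The loop above reads one line at a time, so it prints any word that sits between > and <, whether or not it is inside an <H2> element. If a parser is not an option, one possible refinement is to read the whole input at once and match from the opening to the closing H2 tag. This is only a sketch: the helper name h2_texts is hypothetical, it assumes no > characters inside attribute values, and it is still far more fragile than a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text inside every <h2>...</h2> pair,
# with nested tags removed. Works on tags that are split across lines.
sub h2_texts {
    my ($html) = @_;
    my @texts;
    # /s lets . cross newlines; /i matches <H2> and <h2> alike.
    while ($html =~ m{<h2\b[^>]*>(.*?)</h2\s*>}gis) {
        my $inner = $1;
        $inner =~ s/<[^>]*>//g;      # drop nested tags such as <FONT> and <B>
        $inner =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @texts, $inner if length $inner;
    }
    return @texts;
}

# The markup from the question, wrapped exactly as it was posted.
my $html = <<'EOF';
<H2
align="justify"
><FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER<FONT size="+2" color="#008AD9"><B>
</H2
>
EOF

print "$_\n" for h2_texts($html);
```

Because [^>] also matches newlines, the opening tag is recognised even though its closing > sits on a later line.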


0

For each line, remove the HTML tags like this:

$l =~ s/<.+?>/ /g; # Replace each tag with a space so you don't get run-on words.

What you're left with is only the text, with no HTML tags.

I use software that uses tags (not HTML) which I have to remove, so I do this a lot.
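One caveat with the per-line substitution above: . does not match a newline, and each line is processed separately, so a tag that is split across lines (as in the question) is left behind. A minimal sketch of the same idea applied to the whole text at once follows. The helper name strip_tags is hypothetical, and the approach removes every tag, not only H2, so it fits only when the input contains nothing but the heading.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every tag from a string and tidy the whitespace.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;    # [^>] also matches newlines, so multi-line tags are removed
    $text =~ s/\s+/ /g;        # collapse runs of whitespace into one space
    $text =~ s/^ | $//g;       # trim the leading and trailing space
    return $text;
}

# The markup from the question, wrapped exactly as it was posted.
my $html = <<'EOF';
<H2
align="justify"
><FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER<FONT size="+2" color="#008AD9"><B>
</H2
>
EOF

print strip_tags($html), "\n";
```

Replacing each tag with a space, as in the original one-liner, keeps adjacent words from running together; the second substitution then cleans up the extra spaces.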
