Extracting text between HTML tags using perl

Question

I have HTML tags in a document as follows:

><H2 
align="justify"
><FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER<FONT size="+2" color="#008AD9"><B>
</H2
>

I want to extract only ACCESS_NUMBER from the above HTML text.

How can I do this? I want to make sure only the text between all <H2> tags is extracted. Any help would be appreciated.

What have you tried? Also, is this literally how the code appears in the document (i.e., with < strangely wrapped), or did you improperly format the code section of your question? I did a bit of editing to make it all appear but otherwise left it as it was. The format of the code would affect a regular expression.
– G. Cito, Commented Jun 21, 2014 at 13:57
A parser might be able to extract the text you want if the document is consistent with some definition of "HTML" :-)
– G. Cito, Commented Jun 21, 2014 at 13:58

Chankey Pathak · Accepted Answer · 2014-06-21 09:09:44Z

4

Use Mojo::DOM

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $HTML = <<"EOF";
<html>
<head>
<title>Test</title>
</head>
<body>
<h2>
<font><b>ACCESS_NUMBER</b></font> 
</h2>
</body>
</html>
EOF

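my $dom = Mojo::DOM->new( $HTML );
# at() returns the first matching element; find() returns a Mojo::Collection,
# which has no text method in current Mojolicious releases.
print $dom->at('h2 font b')->text, "\n";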

For an 8-minute video tutorial on Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.
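
The question asks for the text inside every <H2> element, not just the first one. A minimal sketch of how that might look with Mojo::DOM follows. The helper name h2_texts and the sample markup are made up for illustration, and the whitespace cleanup is done by hand because all_text is not assumed to trim for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the text of every <h2>, nested tags included.
sub h2_texts {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    my @texts;
    for my $h2 ($dom->find('h2')->each) {
        my $text = $h2->all_text;    # text of the element and all descendants
        $text =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
        $text =~ s/\s+/ /g;          # collapse internal runs of whitespace
        push @texts, $text;
    }
    return @texts;
}

# Illustrative sample document, not the asker's real file.
my $html = <<'EOF';
<html><body>
<h2 align="justify"><font size="+2" color="#008AD9"><b>ACCESS_NUMBER</b></font></h2>
<p>Not a heading</p>
<h2>Second heading</h2>
</body></html>
EOF

print "$_\n" for h2_texts($html);
```

With the sample above this should print ACCESS_NUMBER and Second heading on separate lines, skipping the paragraph.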


4 Comments

Dr.Avalanche Over a year ago

Upvoted. 99.9% of the time it's much better to use a proper HTML parser like this method than a regex.

G. Cito Over a year ago

Exactly my thoughts ... and I was going to suggest Mojo::DOM :-)

Miller Over a year ago

Thank you for passing on the Mojo::DOM love while including the video. They really should add that to the pod. +1

Chankey Pathak Over a year ago

They should make more videos.

fin · Accepted Answer · 2014-06-21 09:57:47Z

1

Based on what's given above, this will work, but something tells me you have more complicated HTML and/or you actually want \d+.

#!/usr/bin/perl
use strict;
use warnings;

while(<DATA>){
    print "$1\n" if />(\w+)</;
}

__DATA__
<H2
   align="justify"
  <FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER<FONT size="+2" color="#008AD9"><B>S
  </H2
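
If the tags really are wrapped across lines as in the question, a line-by-line match can miss a heading whose text and tags are split up. One possible variation, sketched here under the assumption that the whole document fits in memory and that no attribute value contains a literal >, is to match on the full string and capture everything between <H2 ...> and </H2>. The helper name h2_texts_regex is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: text of every <H2>...</H2> block, found with a regex.
sub h2_texts_regex {
    my ($html) = @_;
    my @texts;
    # /s lets . cross newlines, /i ignores tag case, /g finds every heading.
    while ($html =~ m{<h2\b[^>]*>(.*?)</h2\s*>}gis) {
        my $inner = $1;
        $inner =~ s/<[^>]*>/ /g;     # drop nested tags such as <FONT> and <B>
        $inner =~ s/^\s+|\s+$//g;    # trim
        $inner =~ s/\s+/ /g;         # collapse whitespace runs
        push @texts, $inner;
    }
    return @texts;
}

# Sample input modelled on the question, with the inner tags closed.
my $html = <<'EOF';
<H2
align="justify"
><FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER</B></FONT>
</H2
>
<P>not a heading</P>
<h2>Second heading</h2>
EOF

print "$_\n" for h2_texts_regex($html);
```

This prints ACCESS_NUMBER and Second heading. It is still a regex over markup, so the parser-based answer remains the more robust choice for irregular HTML.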


Bulrush · Accepted Answer · 2014-06-22 13:31:31Z

0

For each line, remove the HTML tags like this:

$l=~s/<.+?>/ /g; # Replace each tag with a space so you don't get run-on words.

What you're left with is only the text, with no HTML tags.

I use software that uses tags (not html) which I have to remove, so I do this a lot.
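
The substitution above works per line, so a tag that starts on one line and ends on another (as in the question's sample) would slip through. A rough sketch of the same idea applied to the whole text at once is shown below. The helper name strip_tags is made up for illustration, and the pattern assumes no attribute value or comment contains a literal >.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove every tag from a string, even multi-line ones.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;   # [^>] also matches newlines, so wrapped tags go too
    $text =~ s/\s+/ /g;       # collapse whitespace left behind by the tags
    $text =~ s/^ | $//g;      # trim the single leading/trailing space, if any
    return $text;
}

# Sample input modelled on the question, with the inner tags closed.
my $html = <<'EOF';
<H2
align="justify"
><FONT size="+2" color="#008AD9"><B>ACCESS_NUMBER</B></FONT>
</H2
>
EOF

print strip_tags($html), "\n";
```

For the sample this prints ACCESS_NUMBER. When reading from a file, the whole content can be slurped into one string first, for example with local $/ set to undef.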

