Using a Perl regex to print multi-line patterns from an HTML file

Question

I have an HTML file. Here is a sample

      <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
        <span title="A*" >Individual: <span><b>A*</b></span></span>
      </div>

    </td>

  </tr>

</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

  <tr class="ListItemColorNew">

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
        <div class="GrayTextShade">
          GREY TIDE LLC (LIC# 2222) 
        </div>
      </div>
    </td>

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">FRANK WHITE A&#39;SMALLS </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: JAMES SMALLS</i></div>
        <div class="GrayTextShade">
          WEST RIVER CORP LLC (LIC# 3333) 
        </div>
      </div>
    </td>


    <td style="width: 25%; vertical-align: top">
      <div class="gvListItemStyle">
        <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
        </div>
    </td>

    <td style="width:25%;text-align:right;vertical-align:top">
      <div class="gvListItemStyle">
        <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
    </td>

  </tr>

I'm trying to extract everything between <td style="width:50%"> and </td>. The data is stored in a file testFile.txt.

This is the Perl code I used:

 system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");

And here's the obligatory "don't parse HTML with regex" comment.
– theftprevention, Commented Sep 7, 2014 at 15:09
^ matches the start of the line. Unless the <td> element is right at the start of the line (i.e. no whitespace before it), you won't get any matches from your current regex.
– i alarmed alien, Commented Sep 7, 2014 at 15:09
Use an HTML parser such as Mojo::DOM or Web::Scraper, e.g. perl -Mojo -E 'say $_ for x(b("file.html")->slurp)->find(q{td[style="width:50%"]})'
– clt60, Commented Sep 7, 2014 at 15:20
You haven't tried very hard if your attempts are just a single line of Perl.
– Borodin, Commented Sep 7, 2014 at 21:28
Just because you see one line of Perl doesn't mean that's the only thing I've tried. Your last comment is completely unhelpful.
– SonOfSeuss, Commented Sep 8, 2014 at 14:12

Miller · Accepted Answer · 2014-09-07 16:52:13Z

1

The code below doesn't actually do anything:

system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");

1. You're matching m// in a void context with no captures, so the executed statement is meaningless.

2. Your pattern will never match your content because:

   a. You're using the any character ., but it won't match newlines unless you use the /s modifier.

   b. You're using -p for line-by-line processing of the file, but your pattern would need to span lines in order to match.

The following demonstrates both a regex solution (not recommended) and a solution using an actual HTML parser, in this case Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $data = do { local $/; <DATA> };

# Regex Solution:
if ( $data =~ m{<td style="width:50%">(.*?)</td>}s ) {
    print "Regex Solution:\n$1";
} else {
    warn "No pattern match found";
}

# Parser Solution:
my $dom = Mojo::DOM->new($data);

my $yourtd = $dom->at(q{td[style="width:50%"]})->content;

print "\nMojo::DOM:\n", $yourtd;

__DATA__
<html>
<head>
<title>Hello World</title>
</head>
<body>
<table>
    <tr>
        <td>
            <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
            </div>

        </td>
    </tr>
</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

    <tr class="ListItemColorNew">
        <td style="width:50%">
            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
        </td>
        <td style="width: 25%; vertical-align: top">
            <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>
        <td style="width:25%;text-align:right;vertical-align:top">
            <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
            </td>
    </tr>
</table>
</body>
</html>

Outputs:

Regex Solution:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

Mojo::DOM:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

3 Comments

SonOfSeuss Over a year ago

This solution only seems to work for one div section. It fails at the second section.

Miller Over a year ago

Your sample data and problem description only talked about one td with a width of 50%. If there is more to the problem it's up to you to include it in your question. Either way, the tools you will need are demonstrated above. Reading the docs and watching the linked video will go a long way toward teaching you how to adapt the solution yourself.

SonOfSeuss Over a year ago

If you want to get multiple sections from the same input, this is the way it's done: while ( $data =~ m{<td style="width:50%">(.*?)</td>}gs ) { print "\n$1"; }
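
For anyone who would rather stay with the parser when there are several matching cells, here is a minimal sketch. It assumes Mojolicious is installed; the helper name cells_at_half_width and the shortened sample HTML are made up for illustration and are not from the question. It uses find, which returns every match, where at returns only the first.

```perl
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical helper: return the inner HTML of every <td> whose style
# attribute is exactly "width:50%".
sub cells_at_half_width {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # find() returns a collection of all matching elements;
    # each() in list context turns it into a plain Perl list.
    return map { $_->content } $dom->find(q{td[style="width:50%"]})->each;
}

# Shortened, made-up sample with two matching cells and one non-matching cell.
my $html = <<'HTML';
<table><tr>
<td style="width:50%"><span>JAMES BOND</span></td>
<td style="width:50%"><span>FRANK WHITE</span></td>
<td style="width:25%">Broker</td>
</tr></table>
HTML

print "$_\n" for cells_at_half_width($html);
```

The selector is the same one used in the answer above, so a cell written as style="width: 50%" (with a space) would not match; that is a limitation of exact attribute matching.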

vks · Accepted Answer · 2014-09-07 15:18:47Z

0

  .*?(<td style="width:50%">((?!<\/td>).)*?<\/td>)

Use the g and s flags. See the demo:

http://regex101.com/r/oC3nN4/15

4 Comments

SonOfSeuss Over a year ago

Unmatched ( in regex; marked by <-- HERE in m/.*?(<td style="width:50%">(( <-- HERE ?!</ at -e line 1.

vks Over a year ago

@SonOfSeuss There's only one tag with width 50, and as you can see in the demo it does match. Are you sure you are using the correct flags? Don't use the m flag.

SonOfSeuss Over a year ago

I think your solution is great in the simulator and it works. I tested it with more data. I think the implementation in Perl is the problem. Here's your solution: system("perl -pi.bak -e '/.*?(<td style=\"width:50%\">((?!<\/td>).)*?<\/td>)/gs' testFile.txt"); Here's the error: Unmatched ( in regex; marked by <-- HERE in m/.*?(<td style="width:50%">(( <-- HERE ?!</ at -e line 1. I fixed this by escaping the double quotes but it still has no results in testFile.txt.

vks Over a year ago

@SonOfSeuss ... the regex is a perl regex but i got no knowledge of perl :( .......
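
A note on the "Unmatched (" error in the comments above: inside a double-quoted Perl string, \/ collapses to a plain /, so the inner perl sees a regex that ends at the first </ and is left with an unclosed group. The sketch below prints what the inner perl would receive, then builds an alternative command that avoids the quoting layers. It is only an illustration: it uses m{...} delimiters, the list form of system (no shell involved), and -0777 to read the whole file at once. The file name testFile.txt comes from the question; the variable names are made up.

```perl
use strict;
use warnings;

# Part of the pattern as typed inside a double-quoted string.
# Perl turns each \/ into a bare / before anything else sees it.
my $as_typed = "/((?!<\/td>).)*?<\/td>/gs";
print "inner perl would see: $as_typed\n";

# With m{...} delimiters there are no slashes to escape, and a
# single-quoted string keeps $1 and \n intact for the inner perl.
my $code = 'print "$1\n" while m{<td style="width:50%">(.*?)</td>}gs';

my $file = 'testFile.txt';    # file name taken from the question
my @cmd  = ( $^X, '-0777', '-ne', $code, $file );    # $^X is the running perl

# The list form of system() bypasses the shell, so no shell quoting is needed.
if ( -e $file ) {
    system(@cmd) == 0 or die "one-liner failed: $?";
}
else {
    print "would run: @cmd\n";
}
```

This prints the matches rather than editing the file in place, so -i.bak is left out on purpose.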

Fred Sullet · Accepted Answer · 2014-09-07 15:25:48Z

0

As said in the comments, remove the ^ in your regexp.

Also, use /s instead of /mg if you want to treat the file content as a single-line string; /s allows the '.' pattern to match newline characters ('\n') as well.

/<td style=\"width:50%\">.+?<\\/td>/s

.+? will stop the match at the first occurrence of </td>, not the last.
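
To make the effect of /s and of the non-greedy quantifier concrete, here is a small self-contained sketch. The two-cell sample string and the variable names are made up for illustration; only the patterns follow the ones discussed above.

```perl
use strict;
use warnings;

# Made-up sample: two matching cells, each spread over several lines.
my $html = join "\n",
    '<td style="width:50%">',
    '  first',
    '</td>',
    '<td style="width:50%">',
    '  second',
    '</td>';

# Without /s, . cannot cross the newline right after the opening tag.
my $matched_without_s = $html =~ m{<td style="width:50%">.+?</td>} ? 1 : 0;

# With /s, the non-greedy .+? stops at the first </td> ...
my ($lazy) = $html =~ m{(<td style="width:50%">.+?</td>)}s;

# ... while the greedy .+ runs on to the last </td>.
my ($greedy) = $html =~ m{(<td style="width:50%">.+</td>)}s;

printf "matched without /s: %s\n", $matched_without_s ? 'yes' : 'no';
printf "lazy match contains 'second':   %s\n", $lazy   =~ /second/ ? 'yes' : 'no';
printf "greedy match contains 'second': %s\n", $greedy =~ /second/ ? 'yes' : 'no';
```

Note that this only works because the whole text is in one string; with -p the pattern is still applied one line at a time, whatever the modifiers.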

2 Comments

SonOfSeuss Over a year ago

Not working. No bugs but simply doesn't work. The output is a file the exact same as testFile.txt. Thanks though.

Fred Sullet Over a year ago

perl -e 'open(FILE, "testFile.txt"); @data=<FILE>; $string=join("\n",@data); print $1 if $string=~/(<td style="width:50%">.+?<\/td>)/s;'

Borodin · Accepted Answer · 2014-09-07 21:44:04Z

I hope you've seen previous advice to avoid regexes to process HTML? It's really true! The only excuse I can think of for avoiding one of the several excellent HTML modules is that your data is so malformed that nothing else will process it.

Your "sample" of your HTML file is particularly unhelpful. Before I fixed the indentation the lines were scattered all over the place. After I looked at it I saw that it was the end of one table element followed by the start of another, so it left several elements unbalanced and either closed but not opened or vice-versa. Please don't do that to us.

I built a well-formed HTML file that contains your extract, and this is a program that processes it using HTML::TreeBuilder:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('html.html');
my @td50 = $tree->look_down(_tag => 'td', style => 'width:50%');
print $_->as_HTML('<>&', '  '), "\n\n" for @td50;

Output:

<td style="width:50%">
  <div class="gvListItemStyle"><span class="LargeText15">JAMES BOND A'MONEYPENNY </span> (LIC# 1111111) <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
    <div class="GrayTextShade"> GREY TIDE LLC (LIC# 2222) </div>
  </div>
</td>

In case you or others need it, here's the HTML input document that I used:

<html>
  <body>

    <table>
      <tr>
        <td>
          <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
          </div>
        </td>
      </tr>
    </table>

    <table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">
      <tr class="ListItemColorNew">

        <td style="width:50%">
          <div class="gvListItemStyle">
            <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
            <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
            <div class="GrayTextShade">
              GREY TIDE LLC (LIC# 2222) 
            </div>
          </div>
        </td>

        <td style="width: 25%; vertical-align: top">
          <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>

        <td style="width:25%;text-align:right;vertical-align:top">
          <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
        </td>

      </tr>
    </table>
  </body>
</html>
