I have an HTML file. Here is a sample:

      <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
        <span title="A*" >Individual: <span><b>A*</b></span></span>
      </div>

    </td>

  </tr>

</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

  <tr class="ListItemColorNew">

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
        <div class="GrayTextShade">
          GREY TIDE LLC (LIC# 2222) 
        </div>
      </div>
    </td>

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">FRANK WHITE A&#39;SMALLS </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: JAMES SMALLS</i></div>
        <div class="GrayTextShade">
          WEST RIVER CORP LLC (LIC# 3333) 
        </div>
      </div>
    </td>


    <td style="width: 25%; vertical-align: top">
      <div class="gvListItemStyle">
        <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
        </div>
    </td>

    <td style="width:25%;text-align:right;vertical-align:top">
      <div class="gvListItemStyle">
        <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
    </td>

  </tr>

I'm trying to extract everything between <td style="width:50%"> and </td>. The data is stored in a file testFile.txt.

This is the Perl code I used:

 system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");
  • And here's the obligatory "don't parse HTML with regex" comment. Commented Sep 7, 2014 at 15:09
  • ^ matches the start of the line. Unless the <td> element is right at the start of the line (i.e. no whitespace before it), you won't get any matches from your current regex. Commented Sep 7, 2014 at 15:09
  • Use an HTML parser such as Mojo::DOM or Web::Scraper, e.g. perl -Mojo -E 'say $_ for x(b("file.html")->slurp)->find(q{td[style="width:50%"]})' Commented Sep 7, 2014 at 15:20
  • You haven't tried very hard if your attempts are just a single line of Perl Commented Sep 7, 2014 at 21:28
  • Just because you see one line of Perl doesn't mean that's the only thing I've tried. Your last comment is completely unhelpful. Commented Sep 8, 2014 at 14:12

4 Answers

1

The code below doesn't actually do anything:

system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");
  1. You're matching m// in a void context with no captures, so the executed statement is meaningless.

  2. Your pattern will never match your content because:

    a. You're using the any character ., but it won't match newlines unless you use the /s Modifier.

    b. You're using -p for line by line processing of the file, but your pattern would need to span lines in order to match.
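Both points can be checked with a small script. This is only a sketch: the three-line `$html` string is a made-up stand-in for the real file, and the line-by-line loop merely imitates what `-p` does.

```perl
use strict;
use warnings;

# Hypothetical three-line snippet standing in for the real file content.
my $html = qq{<td style="width:50%">\n  JAMES BOND\n</td>\n};

# Without /s, '.' cannot cross a newline, so the pattern never reaches </td>.
my $without_s = $html =~ m{<td style="width:50%">(.+?)</td>} ? 1 : 0;

# With /s, '.' also matches newlines and the whole cell is captured.
my $with_s = $html =~ m{<td style="width:50%">(.+?)</td>}s ? 1 : 0;

# Line-by-line processing (as with -p) never sees both tags in one string,
# so even /s cannot help there.
my $per_line = grep { $_ =~ m{<td style="width:50%">(.+?)</td>}s } split /^/, $html;

print "without /s: $without_s\n";
print "with /s: $with_s\n";
print "per-line matches: $per_line\n";
```

Only the slurped string combined with `/s` matches; the other two attempts find nothing.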

The following demonstrates both a regex solution (not recommended) and one using an actual HTML parser, in this case Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $data = do { local $/; <DATA> };

# Regex Solution:
if ( $data =~ m{<td style="width:50%">(.*?)</td>}s ) {
    print "Regex Solution:\n$1";
} else {
    warn "No pattern match found";
}

# Parser Solution:
my $dom = Mojo::DOM->new($data);

my $yourtd = $dom->at(q{td[style="width:50%"]})->content;

print "\nMojo::DOM:\n", $yourtd;

__DATA__
<html>
<head>
<title>Hello World</title>
</head>
<body>
<table>
    <tr>
        <td>
            <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
            </div>

        </td>
    </tr>
</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

    <tr class="ListItemColorNew">
        <td style="width:50%">
            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
        </td>
        <td style="width: 25%; vertical-align: top">
            <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>
        <td style="width:25%;text-align:right;vertical-align:top">
            <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
            </td>
    </tr>
</table>
</body>
</html>

Outputs:

Regex Solution:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

Mojo::DOM:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

3 Comments

This solution only seems to work for one div section. It fails at the second section.
Your sample data and problem description only talked about one td with a width of 50%. If there is more to the problem it's up to you to include it in your question. Either way, the tools you will need are demonstrated above. Reading the docs and watching the linked video will go a long way toward teaching you how to adapt the solution yourself.
If you want to get multiple sections from the same input, loop over the matches with the /g flag: while ( $data =~ m{<td style="width:50%">(.*?)</td>}gs ) { print "\n$1"; }
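A minimal, self-contained sketch of collecting every matching cell follows. The helper name `extract_cells` and the shortened sample rows are made up for illustration, and the regex caveats from the answer above still apply.

```perl
use strict;
use warnings;

# Hypothetical helper: return the inner HTML of every 50%-wide cell.
sub extract_cells {
    my ($html) = @_;
    # In list context, /g returns every capture; /s lets '.' span newlines.
    return $html =~ m{<td style="width:50%">(.*?)</td>}gs;
}

# Shortened, made-up rows in the shape of the question's sample.
my $html = <<'HTML';
<tr>
  <td style="width:50%">
    <span>JAMES BOND</span>
  </td>
  <td style="width:50%">
    <span>FRANK WHITE</span>
  </td>
  <td style="width:25%">Broker</td>
</tr>
HTML

my @cells = extract_cells($html);
print scalar(@cells), " cells found\n";
print "---$_" for @cells;
```

Returning the captures as a list keeps the matching separate from the printing, which may be convenient if the cells need further processing.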
0
  .*?(<td style="width:50%">((?!<\/td>).)*?<\/td>)

Use the g and s flags. See demo:

http://regex101.com/r/oC3nN4/15

4 Comments

Unmatched ( in regex; marked by <-- HERE in m/.*?(<td style="width:50%">(( <-- HERE ?!</ at -e line 1.
@SonOfSeuss There's only one tag with width 50, and as you can see in the demo it does match. Are you sure you are using the correct flags? Don't use the m flag.
I think your solution is great in the simulator and it works. I tested it with more data. I think the implementation in Perl is the problem. Here's your solution: system("perl -pi.bak -e '/.*?(<td style=\"width:50%\">((?!<\/td>).)*?<\/td>)/gs' testFile.txt"); Here's the error: Unmatched ( in regex; marked by <-- HERE in m/.*?(<td style="width:50%">(( <-- HERE ?!</ at -e line 1. I fixed this by escaping the double quotes but it still produces no results in testFile.txt.
@SonOfSeuss ... The regex is a Perl regex, but I have no knowledge of Perl :(
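One likely reading of the "Unmatched (" error quoted in the comments above, offered as a sketch rather than a confirmed diagnosis: inside a double-quoted Perl string, `\/` collapses to a plain `/`, so the inner perl sees a regex that ends at the first `/` of `</td>`. The variable names below are made up for illustration.

```perl
use strict;
use warnings;

# Inside double quotes, "\/" is just "/": the backslash is dropped before
# the string ever reaches system().
my $one_backslash = "/(?!<\/td>)/";

# Doubling the backslash keeps one literal backslash in the string.
my $two_backslashes = "/(?!<\\/td>)/";

print "$one_backslash\n";
print "$two_backslashes\n";
```

The first line printed has an unescaped slash in the middle of the pattern, which would end the match operator early and leave the opening parentheses unclosed.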
0

As said in the comments, remove the ^ in your regexp.

Also, use /s instead of /mg so that the '.' pattern can match newline characters ('\n'). Note that /s only changes what '.' matches: the pattern can span lines only if the file content is read as a single string rather than line by line.

/<td style=\"width:50%\">.+?<\\/td>/s

.+? will stop the match at the first occurrence of </td>, not the last.
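The difference between `.+` and `.+?` can be seen on a made-up two-cell row (a sketch; the `$row` string is not taken from the real file):

```perl
use strict;
use warnings;

# Hypothetical row with two cells; only the first one is 50% wide.
my $row = qq{<td style="width:50%">A</td>\n<td style="width:25%">B</td>\n};

# Greedy: '.+' runs to the LAST </td>, swallowing the second cell too.
my ($greedy) = $row =~ m{(<td style="width:50%">.+</td>)}s;

# Non-greedy: '.+?' stops at the FIRST </td>.
my ($lazy) = $row =~ m{(<td style="width:50%">.+?</td>)}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture includes the 25% cell, while the non-greedy one ends after the first cell.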

2 Comments

Not working. No bugs but simply doesn't work. The output is a file the exact same as testFile.txt. Thanks though.
perl -e 'open(FILE, "testFile.txt"); @data=<FILE>; $string=join("\n",@data); print $1 if $string=~/(<td style="width:50%">.+?<\/td>)/s;'
0

I hope you've seen previous advice to avoid regexes to process HTML? It's really true! The only excuse I can think of for avoiding one of the several excellent HTML modules is that your data is so malformed that nothing else will process it.

Your "sample" of your HTML file is particularly unhelpful. Before I fixed the indentation the lines were scattered all over the place. After I looked at it I saw that it was the end of one table element followed by the start of another, so it left several elements unbalanced and either closed but not opened or vice-versa. Please don't do that to us.

I built a well-formed HTML file that contains your extract, and here is a program that processes it using HTML::TreeBuilder:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('html.html');
my @td50 = $tree->look_down(_tag => 'td', style => 'width:50%');
print $_->as_HTML('<>&', '  '), "\n\n" for @td50;

output

<td style="width:50%">
  <div class="gvListItemStyle"><span class="LargeText15">JAMES BOND A'MONEYPENNY </span> (LIC# 1111111) <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
    <div class="GrayTextShade"> GREY TIDE LLC (LIC# 2222) </div>
  </div>
</td>

In case you or others need it, here's the HTML input document that I used

<html>
  <body>

    <table>
      <tr>
        <td>
          <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
          </div>
        </td>
      </tr>
    </table>

    <table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">
      <tr class="ListItemColorNew">

        <td style="width:50%">
          <div class="gvListItemStyle">
            <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
            <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
            <div class="GrayTextShade">
              GREY TIDE LLC (LIC# 2222) 
            </div>
          </div>
        </td>

        <td style="width: 25%; vertical-align: top">
          <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>

        <td style="width:25%;text-align:right;vertical-align:top">
          <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
        </td>

      </tr>
    </table>
  </body>
</html>
