I'm trying to figure out how to use look-ahead to capture the descriptive text in an HTML page such as:

<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (&lt;?php ?&gt; &lt;%php ?&gt; &lt;% %&gt;). It will also replace sequence of new line characters (multiple) with only one. <b>Allow tags</b> feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.<p></p>You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.<p></p>
<b>Known issues:</b><br />

I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.

The closest I've gotten so far is:

(([^.<]){1,500})<

This still fails on periods and other characters before and after the string.

  • Don't use regex to parse HTML: stackoverflow.com/a/1732454/2812842 Commented May 22, 2014 at 3:49
  • @scrowler He's not parsing, he's just capturing a block of text. Commented May 22, 2014 at 3:54
  • Parsing would be the right way to do it. In this case the HTML looks like XHTML, so you could use an XML parser. Commented May 22, 2014 at 4:11

1 Answer

Your regex matches a run of 1 to 500 characters that are neither "." nor "<", followed by a "<". Because the character class excludes ".", a match can never span a period.
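A quick way to see this is to run the regex against a short string. The sketch below uses a made-up sample string, and the helper `long_text_runs` is a hypothetical name, not something from the question. It compares the original pattern with a variant closer to what the question describes: a look-behind for '>' and a look-ahead for '<' around a run of characters that excludes only '<'.

```ruby
# Made-up sample string for illustration; not from the page in the question.
sample = ">First sentence. Second sentence<"

# The regex from the question: the character class excludes ".",
# so a match can never span a period.
original = /(([^.<]){1,500})</
original_capture = sample[original, 1]

# Hypothetical variant: exclude only "<", and use a look-behind and a
# look-ahead so the surrounding ">" and "<" are required but not captured.
# [^<] also matches newlines, so multi-line text is covered.
def long_text_runs(html, min_len = 150)
  html.scan(/(?<=>)[^<]{#{min_len},}(?=<)/)
end

variant_capture = long_text_runs(sample, 10)

puts original_capture.inspect
puts variant_capture.inspect
```

The original pattern captures only " Second sentence", while the variant returns the whole run between '>' and '<'. Note that the variant still stops at the first inline tag such as <b>, so it is only a rough filter.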

Assuming you want to capture everything from the itemBanner div until the very next occurrence of a closing div, you can use these elements:

  • <div class="itemBanner"> - explicit match
  • () - parenthetical capture group for referencing, e.g. match[1]
  • .*? - any number of characters, matched non-greedily (as few as possible)
  • <\/div> - explicit match, with escaped '/'

to form this Ruby regex:

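# The /m flag lets "." match newlines; the div's text starts on a new line.
item_banner_div_regex = /<div class="itemBanner">(.*?)<\/div>/m
match = item_banner_div_regex.match(html)  # html: the page source as a String
# match is nil when nothing matched, so guard before reading the capture.
inside_item_banner_div = match && match[1]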

Note: The exact syntax will depend on the regex implementation you're using.
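One such implementation detail in Ruby: "." does not match a newline unless the regex carries the /m flag. The sketch below runs the same pattern with and without /m. It assumes a small hypothetical fragment loosely modeled on the HTML in the question, with closing tags added so there is something to match.

```ruby
# Hypothetical HTML fragment, loosely modeled on the one in the question.
html = <<~HTML
  <div class="itemBanner" style="float:left; padding:10px">
  <div class="itemBanner">
  HTML Tags Stripper is designed to strip HTML tags from the text.
  <b>Allow tags</b> feature is session sticky.
  </div>
  </div>
HTML

without_m = /<div class="itemBanner">(.*?)<\/div>/
with_m    = /<div class="itemBanner">(.*?)<\/div>/m

# Without /m, "." cannot cross the newline after the opening tag.
no_match = without_m.match(html)

# With /m, the capture runs up to the first closing div.
match = with_m.match(html)
text = match && match[1].strip

puts no_match.inspect
puts text
```

Here the pattern without /m finds nothing, and the /m version captures the two lines of text, inline <b> tag included. Because the match is non-greedy, it ends at the first </div>, so a div nested inside the banner would cut the capture short.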
