0

I know "Don't use regex for HTML", but seriously, loading an entire HTML parser isn't always an option.

So, here is the scenario:

<script...>
    some stuff
</script>

<script...>
    var stuff = '<';
    anchortext
</script>

If you do this:

<script[^>]*?>.*?anchor.*?</script>

You will capture from the first script tag to the /script in the second block. Is there a way to do a .*? but with the . replaced by a match block, something like:

<script[^>]*?>(^</script>)*?anchor.*?</script>

I looked at negative lookaheads etc., but I can't get anything to work properly. Usually I just use [^>]*? to avoid running past the closing block, but in this particular example, the script content has a "<" in it, and it stops matching on that before reaching the anchortext.

To simplify, I need something like [^z]*? but instead of a single character or character range, I need a capture group to fit a string.

.*?(?!z) doesn't have the same effect as [^z]*? as I assumed it would.

Here is where I am stuck: http://regexr.com?34llp

  • stackoverflow.com/a/1732454/500202 Commented Apr 24, 2013 at 19:21
  • Duplicate of around a million other roughly identical questions on StackOverflow. Commented Apr 24, 2013 at 19:24
  • So what exactly do you want to capture? Commented Apr 24, 2013 at 19:27
  • "loading an entire html parser isn't always an option" What is the specific reason with the current issue you are trying to solve for which DOMDocument is not an option? Commented Apr 24, 2013 at 19:29
  • @PeeHaa埽 Speed issues, loading the php_simple_dom_parser object concurrently for thousands of simultaneous documents takes quite an overhead hit. Commented Apr 24, 2013 at 19:41

2 Answers

3

Match-anything-but is indeed commonly implemented with a negative lookahead:

 ((?!exclude).)*?

The trick is not to repeat the bare . dot. Instead, the group consumes one character at a time, and before each character the lookahead checks that the excluded word does not begin at that position.
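
To see why the lookahead has to sit inside the repeated group, here is a small sketch comparing the three forms on a toy string. The input string and variable names are made up for illustration, and Python's re module is used only as a convenient stand-in; the same patterns should behave the same way in PCRE.

```python
import re

text = "abzcd"  # illustrative input: a 'z' sits between 'a' and 'd'

# Lookahead after the dot run: it is checked only once, at the position
# where the run stops, so the run itself can still consume the 'z'.
trailing = re.search(r"a.*?(?!z)d", text)

# Negated character class: no consumed character may be 'z'.
negated = re.search(r"a[^z]*?d", text)

# Lookahead inside the group: checked before every single character.
tempered = re.search(r"a(?:(?!z).)*?d", text)

print(trailing.group(0) if trailing else None)  # abzcd
print(negated)   # None
print(tempered)  # None
```

The first pattern runs straight past the z, while the last one refuses to, just like the negated character class. That difference is what lets the grouped form stand in for a multi-character "not this string" class.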

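In your case, you would use this in place of the initial .*? (since your script blocks span several lines, the dot also needs to match newlines, for example with the s modifier in PCRE):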

 <script[^>]*?>((?!</script>).)*?anchor.*?</script>
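
As a rough check, the sketch below runs the original pattern and the lookahead version against input modelled on the question. The sample HTML, the type attribute, and the variable names are assumptions for illustration; Python is used as a stand-in here, with re.DOTALL playing the role of a dot-matches-newline modifier.

```python
import re

# Hypothetical sample input modelled on the two script blocks in the question.
html = """<script type="text/javascript">
    some stuff
</script>

<script type="text/javascript">
    var stuff = '<';
    anchortext
</script>
"""

# re.DOTALL lets "." match newlines, which the multi-line blocks require.
naive = re.compile(r"<script[^>]*?>.*?anchor.*?</script>", re.DOTALL)
tempered = re.compile(
    r"<script[^>]*?>((?!</script>).)*?anchor.*?</script>", re.DOTALL
)

naive_match = naive.search(html).group(0)
tempered_match = tempered.search(html).group(0)

print(naive_match.count("<script"))     # 2: spans both blocks
print(tempered_match.count("<script"))  # 1: only the block with the anchor
print(tempered_match)
```

The lookahead version cannot cross the first closing tag, so the attempt starting at the first block fails and the engine moves on to the second block, where the stray "<" inside the script body is consumed without trouble.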

0

Like this:

$pattern = '~<script[^>]*+>((?:[^<]+?|<++(?!/script>))*?\banchor(?:[^<]+?|<++(?!/script>))*+)</script>~';

But using the DOM is by far the better way to do that.
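
For comparison, here is a minimal sketch of the parser-based route. It uses Python's standard-library html.parser as a stand-in rather than PHP's DOMDocument, and the class name ScriptCollector and the helper scripts_containing are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._buffer = None  # list of text chunks while inside a <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer))
            self._buffer = None


def scripts_containing(html, needle):
    """Return the bodies of script elements whose text contains needle."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [body for body in collector.scripts if needle in body]


sample = (
    "<script>\n    some stuff\n</script>\n\n"
    "<script>\n    var stuff = '<';\n    anchortext\n</script>\n"
)
found = scripts_containing(sample, "anchor")
print(len(found))
```

The parser treats script bodies as raw text up to the closing tag, so a "<" inside the script needs no special handling.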
