2

I'm looking for a regular expression that matches a word or sub-pattern only when it is NOT between HTML tags.

Say I want to replace (alpha|beta) in: the first two letters in the Greek alphabet are alpha and <b>beta</b>

I only want it to replace alpha, because beta is between <> tags. So anything matching (<(.*?)>(.*?)<\/(.*?)>) should be ignored.

:)

6
  • Consider using the code{} button when writing your question Commented Apr 16, 2011 at 18:19
  • Sorry, just joined this site. Will use it in the future. :) Commented Apr 16, 2011 at 18:22
  • It's ok :) it's just if you try to use tags it might not work without the code wrapper. Commented Apr 16, 2011 at 18:40
  • 5
    It looks to me like everything is between tags in HTML. Commented Apr 16, 2011 at 18:48
  • @sln, I mean on one line. Limited between \r\n at the beginning and end. Commented Apr 17, 2011 at 0:15

2 Answers

3

I haven't tested the logic used on this page (http://www.phpro.org/examples/Get-Text-Between-Tags.html), but I can confirm the point made at the top of the page in big bold letters: you shouldn't do what you're trying to do with regex.

HTML is not uniform, and edge cases will always bite you in the rear if you use regular expressions to handle the content of tags in any real-world situation. So unless your markup is extremely simple, uniform, 100% valid, and contains only HTML (no CSS, JavaScript or garbage), your best bet is a DOM parser library.

Admittedly, many DOM parser libraries have problems too, but you'll still be miles ahead of the regex counterparts. The most reliable way to get the text content of tags is to render the HTML in a browser and read the innerText property of the given DOM node (or have a human copy and paste the contents out manually), but that isn't always an option :D
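To make the parser-based route concrete, here is a minimal sketch. It is written in Python rather than PHP, and it uses the standard library's event-based html.parser instead of a full DOM library, so treat it as an illustration of the idea only. The class name TopLevelTextReplacer, the helper replace_outside_elements, and the short VOID list are assumptions made for this example. Note the \b word boundaries in the pattern: without them, the "alpha" inside "alphabet" would be replaced as well.

```python
import re
from html.parser import HTMLParser


class TopLevelTextReplacer(HTMLParser):
    """Rebuild the input, substituting only in text that is outside every element."""

    # Elements that never have a closing tag (deliberately short, illustrative list).
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self, pattern, replacement):
        # Keep entities such as &amp; untouched so they can be echoed back verbatim.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.replacement = replacement
        self.depth = 0  # number of currently open elements
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in self.VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: echo it, depth is unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag not in self.VOID:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth == 0:
            data = self.pattern.sub(self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def replace_outside_elements(html, pattern, replacement):
    parser = TopLevelTextReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


text = "the first two letters in the greek alphabet are alpha and <b>beta</b>"
result = replace_outside_elements(
    text, r"\b(?:alpha|beta)\b", lambda m: m.group(0).upper()
)
print(result)
```

This sketch drops doctype declarations and processing instructions and does not try to repair unbalanced markup, which a real DOM library would handle.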


1

It may be the 'wrong' way, but it works: when I need to do something similar, I first run preg_replace_callback to find what I don't want to match and encode it with something like base64.

Then I can happily run an ordinary preg_replace on the result, knowing that it has no chance of matching the strings I want to ignore. Finally, I unscramble using the same pattern in preg_replace_callback, this time sending the matches to be base64 decoded.

I often do this when automatically adding keyword or glossary links or tooltips to a text - I scramble the HTML tags themselves so that I don't try to create a link or a tooltip within the title of an anchor tag or somewhere equally ridiculous, for example.
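The mask, replace, unmask sequence described above could look roughly like the following. This is a sketch in Python, where re.sub with a function argument plays the role of PHP's preg_replace_callback. The PROTECTED pattern, the NUL-byte delimiters, and the function names are assumptions made for this example, not the answerer's actual code. The sketch assumes the input contains no NUL bytes and no nested elements of the same name.

```python
import base64
import re

# Hypothetical "ignore" pattern: a paired element with its content, or any lone tag.
PROTECTED = re.compile(r"<(\w+)\b[^>]*>.*?</\1>|<[^>]+>", re.DOTALL)
# Masked chunks are wrapped in NUL bytes so they can be found again afterwards.
MASKED = re.compile(r"\x00([A-Za-z0-9+/=]*)\x00")


def mask(match):
    """Encode a protected chunk so the later search cannot see its text."""
    encoded = base64.b64encode(match.group(0).encode("utf-8")).decode("ascii")
    return "\x00" + encoded + "\x00"


def unmask(match):
    """Restore a chunk that mask() encoded."""
    return base64.b64decode(match.group(1)).decode("utf-8")


def replace_outside_tags(html, pattern, replacement):
    masked = PROTECTED.sub(mask, html)               # step 1: hide tags and their content
    replaced = re.sub(pattern, replacement, masked)  # step 2: ordinary replace
    return MASKED.sub(unmask, replaced)              # step 3: restore the hidden chunks


text = "the first two letters in the greek alphabet are alpha and <b>beta</b>"
result = replace_outside_tags(text, r"\b(?:alpha|beta)\b", "X")
print(result)
```

One caveat: base64 output is itself made of letters, digits, +, / and =, so a short search term could in principle occur inside an encoded chunk by coincidence. If that matters, restrict the search so it cannot match between the delimiters.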
