27

I'm having some difficulty with a specific regex. I'm searching for every occurrence of a string (for my purposes, I'll say it's "mystring") in a document, EXCEPT where it's inside a tag, e.g.

<a href="_mystring_">

should not match, but

<a href="someotherstring">_mystring_</a>

should match, since it's not inside a tag (where "inside" means between the < and > markers). I'm using .NET's regex functions for this.

4
  • Do you mean your second example should not match? Commented Jun 5, 2009 at 20:55
  • 7
    [Insert obligatory "don't use regexes to parse HTML" answer here] Commented Jun 5, 2009 at 20:58
  • 1
    robbotic: no, it should match. It's not within the < and > markers. I need to do a replace on mystring but not when it's part of the tag as it is on the top example. Also, loading this into an XDocument or whatever isn't really doable in my situation. Commented Jun 5, 2009 at 21:16
  • 1
    Also, I've tried loading the strings into an XML document, but because most of the time they don't conform to the spec (not properly closed, a missing tag or two, etc.), I can't use it. Commented Jun 5, 2009 at 22:03

8 Answers 8

40

This should do it:

(?<!<[^>]*)_mystring_

It uses a negative lookbehind to check that the matched string is not preceded by a < that has no corresponding > after it. This relies on variable-length lookbehind, which .NET's regex engine supports.
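
For engines without variable-length lookbehind, the same check can be done procedurally. The sketch below is in Python, whose built-in re module rejects the pattern above because the lookbehind is not fixed-width. The helper name replace_outside_tags is hypothetical, and the code assumes a literal > never appears inside an attribute value.

```python
import re

def replace_outside_tags(text, needle, replacement):
    """Replace needle unless the nearest preceding < has no > after it."""
    def substitute(match):
        # Position of the last < and last > before this occurrence (-1 if none).
        last_open = text.rfind("<", 0, match.start())
        last_close = text.rfind(">", 0, match.start())
        # An unclosed < before the match means we are inside a tag: keep it.
        inside_tag = last_open > last_close
        return match.group(0) if inside_tag else replacement
    return re.sub(re.escape(needle), substitute, text)

print(replace_outside_tags('<a href="_mystring_">_mystring_</a>', "_mystring_", "NEW"))
# prints: <a href="_mystring_">NEW</a>
```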


15

Another regex that worked for me:

(?![^<]*>)_mystring_

Source: https://stackoverflow.com/a/857819/1106878
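
As a rough illustration, here is that lookahead pattern run through Python's re module, which accepts it because lookahead has no fixed-width restriction. The function name and the "NEW" replacement are made up for the example.

```python
import re

# Fails to match when, scanning forward, a > is reached before any <,
# i.e. when the occurrence sits inside a tag.
OUTSIDE_TAGS = re.compile(r"(?![^<]*>)_mystring_")

def replace_outside_tags(text, replacement):
    # A callable replacement avoids backslash processing in the new string.
    return OUTSIDE_TAGS.sub(lambda m: replacement, text)

print(replace_outside_tags('<a href="_mystring_">_mystring_</a>', "NEW"))
# prints: <a href="_mystring_">NEW</a>
```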

14

When your regex processor doesn't support variable-length lookbehind, try this:

(<.+?>[^<>]*?)(_mystring_)([^<>]*?<.+?>)

Preserve capture groups 1 and 3 and replace capture group 2:

For example, in Eclipse, find:

(<.+?>[^<>]*?)(_mystring_)([^<>]*?<.+?>)

and replace with:

$1_newString_$3

(Other regex processors might use a different capture group syntax, such as \1)
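
A small sketch of the same find-and-replace in Python, assuming the replacement text is "NEW" and using a hypothetical helper name. It also shows two limits of this pattern: the string must sit between two tags, and because each match consumes its surrounding tags, only one occurrence per tag pair is replaced in a single pass.

```python
import re

# Group 1: preceding tag plus leading text, group 2: the target,
# group 3: trailing text plus following tag.
BETWEEN_TAGS = re.compile(r"(<.+?>[^<>]*?)(_mystring_)([^<>]*?<.+?>)")

def replace_between_tags(text, new_string):
    # Keep groups 1 and 3, swap out group 2.
    return BETWEEN_TAGS.sub(lambda m: m.group(1) + new_string + m.group(3), text)

print(replace_between_tags('<a href="_mystring_">_mystring_</a>', "NEW"))
# prints: <a href="_mystring_">NEW</a>
print(replace_between_tags('<p>_mystring_ and _mystring_</p>', "NEW"))
# prints: <p>NEW and _mystring_</p>
```

Running the substitution repeatedly until the text stops changing would be one way to catch the remaining occurrences.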

2

A quick and dirty alternative is to use a regex replace function with callback to encode the content of tags (everything between < and >), for example using base64, then run your search, then run another callback to decode your tag contents.

This can also save a lot of head scratching when you need to exclude specific tags from a regex search - first obfuscate them and wrap them in a marker that won't match your search, then run your search, then deobfuscate whatever is in markers.
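
A minimal Python sketch of the encode, search, decode round trip. The NUL-character markers and the helper names are assumptions for illustration; the document is assumed not to contain NUL characters already, and the search string must not be something that could show up inside base64 output (the underscores in _mystring_ rule that out here).

```python
import base64
import re

TAG = re.compile(r"<[^>]*>")
# An encoded tag is wrapped in NUL markers so it can be found again later.
MASKED = re.compile(r"\x00([A-Za-z0-9+/=]*)\x00")

def mask_tags(text):
    def encode(match):
        payload = base64.b64encode(match.group(0).encode("utf-8")).decode("ascii")
        return "\x00" + payload + "\x00"
    return TAG.sub(encode, text)

def unmask_tags(text):
    return MASKED.sub(lambda m: base64.b64decode(m.group(1)).decode("utf-8"), text)

def replace_outside_tags(text, needle, replacement):
    # Tags are unreadable while masked, so a plain replace cannot touch them.
    return unmask_tags(mask_tags(text).replace(needle, replacement))

print(replace_outside_tags('<a href="_mystring_">_mystring_</a>', "_mystring_", "NEW"))
# prints: <a href="_mystring_">NEW</a>
```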

1

Why use regex?

For xhtml, load it into XDocument / XmlDocument; for (non-x)html the Html Agility Pack would seem a more sensible choice...

Either way, that will parse the html into a DOM so you can iterate over the nodes and inspect them.
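
To give a feel for the parser route without .NET, here is a rough Python analogue using the standard library's html.parser. It is not the Html Agility Pack and does not build a DOM: it is an event-based sketch that re-emits tags verbatim and only edits text. The class and function names are hypothetical, text inside script or style elements is treated as ordinary text, and constructs not handled below (such as processing instructions) are dropped.

```python
from html.parser import HTMLParser

class TextNodeReplacer(HTMLParser):
    """Re-emit the markup unchanged, replacing the needle in text only."""

    def __init__(self, needle, replacement):
        # Keep entities like &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.needle, self.replacement = needle, replacement
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.pieces.append(self.get_starttag_text())  # original tag source

    def handle_startendtag(self, tag, attrs):
        self.pieces.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        self.pieces.append(data.replace(self.needle, self.replacement))

    def handle_entityref(self, name):
        self.pieces.append(f"&{name};")

    def handle_charref(self, name):
        self.pieces.append(f"&#{name};")

    def handle_comment(self, data):
        self.pieces.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.pieces.append(f"<!{decl}>")

def replace_in_text_nodes(html, needle, replacement):
    parser = TextNodeReplacer(needle, replacement)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.pieces)

print(replace_in_text_nodes('<a href="_mystring_">_mystring_</a>', "_mystring_", "NEW"))
# prints: <a href="_mystring_">NEW</a>
```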

1

_mystring_(?![^<]*?>)

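But a valid HTML structure is required: every < needs its matching >, and a literal > in the text (rather than &gt;) will make the lookahead treat the preceding text as if it were inside a tag.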
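
A short Python sketch of that caveat, with a made-up helper name and "NEW" as the replacement. The first input is well formed, the second has an unescaped > in its text, and the third escapes it.

```python
import re

# Match only when no > is reachable ahead without first crossing a <.
OUTSIDE_TAGS = re.compile(r"_mystring_(?![^<]*?>)")

def replace_outside_tags(text, replacement="NEW"):
    return OUTSIDE_TAGS.sub(lambda m: replacement, text)

print(replace_outside_tags('<a href="_mystring_">_mystring_</a>'))
# prints: <a href="_mystring_">NEW</a>
print(replace_outside_tags("<b>_mystring_ > 0</b>"))
# prints: <b>_mystring_ > 0</b>   (the stray > blocks the match)
print(replace_outside_tags("<b>_mystring_ &gt; 0</b>"))
# prints: <b>NEW &gt; 0</b>
```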

0

Setting aside that there are indeed other ways, and that I'm no real regex expert, one thing that popped into my head was:

  • find all the mystrings that ARE in tags first - because I can't write the expression to do the opposite :)
  • change those to something else
  • then replace all the other mystrings (the ones left, which are not in tags) as you need
  • restore the original mystrings that were in tags

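So, using <[^>]*?(mystring)[^>]*> you can find the tagged ones. Replace the captured mystring in those with otherstring (which must be something that doesn't occur anywhere else in the document). Do your normal replace on the mystrings that are left. Then replace otherstring back to mystring.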

Crude but effective....maybe.
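
The steps above can be sketched in Python roughly as follows. This is a variation rather than the exact expression given: it runs a callback over each whole tag so that several occurrences in one tag are all masked. The placeholder value and the helper name are assumptions, and the placeholder must not already occur in the document.

```python
import re

PLACEHOLDER = "\x00MASKED\x00"  # hypothetical stand-in for "otherstring"
TAG = re.compile(r"<[^>]*>")

def replace_outside_tags(text, needle, replacement):
    # 1. Hide every occurrence that sits inside a tag.
    masked = TAG.sub(lambda m: m.group(0).replace(needle, PLACEHOLDER), text)
    # 2. Replace the occurrences that are left (outside tags).
    replaced = masked.replace(needle, replacement)
    # 3. Restore the hidden occurrences.
    return replaced.replace(PLACEHOLDER, needle)

html = '<a href="_mystring_" title="_mystring_">_mystring_</a>'
print(replace_outside_tags(html, "_mystring_", "NEW"))
# prints: <a href="_mystring_" title="_mystring_">NEW</a>
```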

-2

Regular expression searches are typically not a good idea in XML. It's too easy to run into problems with search expressions matching too much or too little. It's also almost impossible to formulate a regex that can correctly identify and handle CDATA sections, processing instructions (PIs), and escape sequences that XML allows.

Unless you have complete control over the XML content you're getting and can guarantee it won't include such constructs (and won't change), I would advise using an XML parser of some kind (XDocument or XmlDocument in .NET, for instance).

Having said that, if you're still intent on using regex as your search mechanism, something like the following should work using the Regex class in .NET. You may want to test it out with some of your own test cases at a site like Regexlib. You may also be able to search their regular expression catalog to find something that might fit your needs.

[>][^<>]*(_mystring_)[^<>]*[<]
