I have a document with 100 thousand lines of HTML filled with <tr> ... </tr> tags. Somewhere inside every one of these multi-line tag sets is an element with the word "purpose", except for one. I need to find the page-long string that starts with <tr>, has a bunch of characters before the ending </tr> tag, and has no instance of the string "purpose" within that tag set. I am working with Notepad++ v7 search with Regex and ". matches newline" enabled. Matching the tr string is easy by searching on <tr>(.*?)</tr>
This matches one and only one set of tags with all of the text in between. What I CAN'T do is find an expression that matches the one string that doesn't have "purpose" in it. I have tried <tr>(?!.*?"purpose")(.*?)</tr>, which finds the first tr string after the last one that contains "purpose" (yes, I need to include the quotes), along with many variations, and I have read tutorials on regex negative lookahead and lookbehind, but to no avail. I have many similar problems with this text missing stuff, so thanks very much in advance if someone has a clue how to do this!
1 Answer
This should do the trick:
<tr>((?!"purpose").)*?</tr>
Essentially, it:
1. Finds the opening tag and steps to the character just afterward.
2. Checks that this character and the ones following it don't spell out "purpose" (including quotes).
3. Steps forward one character, and if it hasn't reached the ending tag, returns to step 2.
4. Stops on the ending tag.
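The steps above can be tried outside Notepad++ as well. The following is a minimal Python sketch, not the Notepad++ engine itself: the sample rows are made up for illustration, and `re.DOTALL` is assumed to stand in for the ". matches newline" option. It also runs the pattern from the question for comparison.

```python
import re

# Hypothetical sample: three rows; only the middle one lacks the quoted word "purpose".
html = '''<tr>
<td class="purpose">a</td>
</tr>
<tr>
<td class="syntax">b</td>
</tr>
<tr>
<td class="purpose">c</td>
</tr>'''

# re.DOTALL stands in for Notepad++'s ". matches newline" checkbox.
# The lookahead is re-checked before every single character that gets consumed.
stepwise = re.compile(r'<tr>((?!"purpose").)*?</tr>', re.DOTALL)

# The pattern from the question: its lookahead runs once, right after <tr>,
# and can see "purpose" anywhere later in the document, not just in this row.
one_shot = re.compile(r'<tr>(?!.*?"purpose")(.*?)</tr>', re.DOTALL)

matches = [m.group(0) for m in stepwise.finditer(html)]
print(matches)                 # only the middle row
print(one_shot.search(html))   # None: a later row still contains "purpose"
```

In this sample the one-shot lookahead rejects the middle row because the third row contains "purpose", which is consistent with the behaviour described in the question.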
4 Comments
Chris Morgan
You are a genius. It works! But I don't understand exactly what the .)*? characters do and why they make this work. The "." is for any character, the "*" is for multiple, and I really don't understand what the "?" does in this case as a practical matter. What does the regex engine try to do if you take one of these three out?
Chris Morgan
Well, almost. I fixed the missing "purpose" section of that tr row, reran the expression, and it didn't find any others. Great. On to the next problem with the document: "syntax" is missing in some of the tr rows. I took the same expression, substituted the word syntax for the word purpose with no other changes, and now clicking Find selects the ENTIRE DOCUMENT. The document does not even begin or end with one of these tags, so I have no clue why it is doing this, consistently. Help?
Chris Morgan
More info: depending on the missing term, sometimes it finds that tag string as desired, and sometimes it just selects the entire document, even when it is clear that the term IS missing (I can count the number of occurrences of a term, and it is less than the number of tr tag sets). Any ideas?
Somdudewillson
The "." does step 3, stepping the regex forward one character. The "*" repeats steps 2 and 3, and the "?" makes sure it exits that loop and proceeds to step 4 as soon as it reaches an ending tag.
<tr>((?!\bpurpose\b).)*?</tr>
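To make the roles of "*" and "?" concrete, here is a small Python sketch (an illustration with made-up sample strings, not Notepad++ itself). It compares the lazy pattern with a greedy one that drops the "?", and then tries the word-boundary pattern above, which appears to look for the bare word purpose instead of the quoted one.

```python
import re

# Hypothetical sample: two consecutive rows, neither containing "purpose".
html = '<tr><td>a</td></tr><tr><td>b</td></tr>'

lazy = re.compile(r'<tr>((?!"purpose").)*?</tr>', re.DOTALL)
greedy = re.compile(r'<tr>((?!"purpose").)*</tr>', re.DOTALL)  # "?" removed

lazy_match = lazy.search(html).group(0)      # stops at the first </tr>
greedy_match = greedy.search(html).group(0)  # runs on to the last </tr> it can reach

print(lazy_match)
print(greedy_match)

# Hypothetical row where the word appears without quotes.
unquoted = '<tr><td>purpose: x</td></tr>'
word_boundary = re.compile(r'<tr>((?!\bpurpose\b).)*?</tr>', re.DOTALL)

quoted_hit = lazy.search(unquoted)         # matches: no quoted "purpose" in the row
bare_hit = word_boundary.search(unquoted)  # None: the bare word blocks the match
```

Without the "?", the loop keeps consuming characters as long as it does not run into "purpose", so one match can swallow several rows at once.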