I want to find and replace parts of the *innerHTML* with regex. I want to do this across a whole document, acting on everything except anchor elements.
I thought I could do this with querySelectorAll, by setting it to select all elements except anchor elements. The problem is that the anchor elements are nested within other elements, as in the code below. So, even if I exclude anchor elements, the regex still traverses them because they are nested within other elements (below, nested in a 'p' element) that are caught by querySelectorAll.
My next step was to try to exclude all elements that are the parent of an anchor element. But that results in HTML being missed by my regex search. For example, in the 'p' element below, the text "hello I am some text" is a *child* of the 'p' element. So, by excluding the 'p' element, that text falls outside the scope of my regex. I need that text to be included in my regex.
<p class="1 2">
  <span class="3">
    some writing here
    <strong class="4">some more here</strong>
  </span>
  <strong class="5">
    <span class="6">
      <span class="7"></span>
      <a class="8" href="#abc" title="TITLE" id="9">some text</a>
      <span class="10">some text</span>
      <span class="11"></span>
    </span>
    <span class="12"></span>
  </strong>
  hello I am some text
</p>
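To illustrate the problem, here is a minimal sketch that runs without a browser. The string stands in for part of the 'p' element's innerHTML from the example above (shortened), and the class name `hit` is a hypothetical placeholder. Because an element's innerHTML serialises all of its descendants, the anchor's markup is part of the string even though the anchor itself was never selected.

```javascript
// Stand-in for part of the <p> element's innerHTML from the example above.
const parentInnerHTML =
  '<a class="8" href="#abc" title="TITLE" id="9">some text</a> hello I am some text';

// "hit" is a hypothetical class name used only for this illustration.
const replaced = parentInnerHTML.replace(/abc/g, '<span class="hit">$&</span>');

console.log(replaced);
```

The replacement lands inside the href attribute, which is exactly the kind of change I am trying to avoid.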
There are two further complexities. First, the document I need to traverse is very long, in the region of 250,000 words of HTML, all in a complex nested format perhaps 10 to 15 levels deep. Second, it is not a single regex I am running. I have an array of 300 regexes, and I need to traverse the document for every one of them. The point is that this is quite resource intensive and time consuming. At the moment it takes about an hour to run my code. But that code is wrong because it acts on the anchor elements.
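One idea that might reduce the number of passes, sketched here under assumptions: combine the regexes into a single alternation so the text is scanned once rather than 300 times. The `patterns` array and the `combinePatterns` helper are hypothetical stand-ins, and the sketch assumes all patterns can share the same flags and contain no numbered backreferences (which would point at the wrong group once the sources are joined).

```javascript
// Hypothetical stand-ins for the real array of 300 regexes.
const patterns = [/directive/i, /regulation/i, /uk law/i];

// Join the pattern sources into one alternation.
// Assumes shared flags and no numbered backreferences in any pattern.
function combinePatterns(list, flags = "gi") {
  const source = list.map((re) => `(?:${re.source})`).join("|");
  return new RegExp(source, flags);
}

const combined = combinePatterns(patterns);
const sample = " UK law which is a directive ";
const hits = sample.match(combined);

console.log(hits);
```

Whether this is actually faster on a document of this size would need measuring, and it only fits if every hit is wrapped in the same way.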
I thought of simply removing the anchor elements along the lines:
const anchors = document.querySelectorAll("a");
anchors.forEach((anchor) => anchor.parentNode.removeChild(anchor));
but then I am left with a document that lacks the anchor elements, and I need them in the document; I just don't want the regex to traverse them. I also thought of recording the location of the deleted anchor elements and reinserting them after the regex pass, but I will be inserting new spans, which makes it hard to track where each anchor should be reinserted. That method becomes too complex.
I would be grateful for suggestions as to how to proceed. Is there some way of avoiding traversing **nested** anchor elements?
EDIT 1. Apologies if my question wasn't clear and thank you for the really helpful responses. I've already learnt a lot. Here is some further explanation.
Here is another example of the HTML:
<p class="A B">
  <span class="H LegLHS F" id="123">
    <span class="D">
      <span class="C">(b)</span>
    </span>
  </span>
  <span class="H G LegP3Text">
    <a class="LegCitation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C">Directive 2020/44/UK</span>
      </span>
    </a>
    <span class="D">
      <span class="C"> UK law which is a directive </span>
    </span>
    <a class="Citation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C"> Directive 2020/44/UK </span>
      </span>
    </a>
    <span class="D">
      <span class="C">.</span>
      <span class="E"></span>
    </span>
  </span>
</p>
There are two issues I'm struggling with. (1) I don't want to act on anything between the opening `<a>` and the closing `</a>`; I am trying to completely exclude anchor elements and anything within them. (2) The regex I am running acts on the innerHTML rather than on text nodes, because I use the replace operation to wrap the found term in a span with a class, like this: `<span class="xxxx">{match}</span>`. So, for example, in the HTML above, if I search for the term "directive", I want to avoid matches within this anchor element:
<a class="Citation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
  <span class="D">
    <span class="C"> Directive 2020/44/UK </span>
  </span>
</a>
But I want to match the term "directive" in the line below, because it is not a descendant of any anchor element:
<span class="C"> UK law which is a directive </span>
Perhaps I am going about this the wrong way and there is a more elegant way of doing what I want. What I really want to do is search the text of the document for certain regexes and then wrap any hits in a new span. It doesn't really matter whether I hit the text content of the anchor element, so long as I don't hit content like the href link. I'm just completely at a loss as to why, no matter how I exclude anchor elements, changes are still made to the stuff between the opening and closing tags, e.g. the href bit.
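Since the goal is really to search text and wrap hits, one direction that might work is to walk text nodes instead of editing innerHTML, so attributes such as href are never seen by the regex. The following is a rough sketch, not a tested solution for the full document. `splitOnMatches` and `wrapMatches` are hypothetical helper names, the regex is assumed to carry the `g` flag (required by `matchAll`), and `wrapMatches` only runs in a browser.

```javascript
// Split a string into plain and matched pieces; `regex` must have the g flag.
function splitOnMatches(text, regex) {
  const parts = [];
  let last = 0;
  for (const m of text.matchAll(regex)) {
    if (m[0] === "") continue; // ignore empty matches
    if (m.index > last) parts.push({ text: text.slice(last, m.index), hit: false });
    parts.push({ text: m[0], hit: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) parts.push({ text: text.slice(last), hit: false });
  return parts;
}

// Browser-only: wrap matches in text nodes that are not inside any <a>.
function wrapMatches(root, regex, className) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode: (node) =>
      node.parentElement && node.parentElement.closest("a")
        ? NodeFilter.FILTER_REJECT
        : NodeFilter.FILTER_ACCEPT,
  });
  // Collect first: replacing nodes while walking would disturb the walker.
  const textNodes = [];
  while (walker.nextNode()) textNodes.push(walker.currentNode);

  for (const node of textNodes) {
    const parts = splitOnMatches(node.nodeValue, regex);
    if (!parts.some((p) => p.hit)) continue;
    const fragment = document.createDocumentFragment();
    for (const part of parts) {
      if (part.hit) {
        const span = document.createElement("span");
        span.className = className;
        span.textContent = part.text;
        fragment.appendChild(span);
      } else {
        fragment.appendChild(document.createTextNode(part.text));
      }
    }
    node.replaceWith(fragment);
  }
}
```

In a browser this could be called as `wrapMatches(document.body, /directive/gi, "xxxx")`. If matching the anchor's visible text is acceptable, the `closest("a")` check can be dropped. One limitation of this sketch: a term that is split across two text nodes (for example by a span boundary) would not be found.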
Would `document.querySelectorAll("*:not(a):not(a *)")` work? That is, everything except anchors and descendants of anchors?