I want to find and replace some of the *innerHTML* with regex. I want to do this across a document, acting on everything except anchor elements.

I thought I could do this with querySelectorAll, by setting it to select all elements except anchor elements. The problem is that the anchor elements are nested within other elements, as in the code below. So, even if I exclude anchor elements, I still traverse them in the regex because they are nested within other elements (below, nested in a para element) that are caught by querySelectorAll.

My next step was to try to exclude all elements that are the parent of an anchor element. But that results in HTML being missed from my regex search. For example, in the para element below, the text "hello I am some text" is the *child* of the 'p' element. So, by excluding the 'p' element, that text falls outside the scope of my regex. I need that text to be included in my regex.

<p class="1 2">
  <span class="3">
   some writing here
    <strong class="4">some more here</strong>
  </span>
  <strong class="5">
    <span class="6">
      <span class="7"></span>
      <a class="8" href="#abc" title="TITLE" id="9">some text</a>
      <span class="10">some text</span>
      <span class="11"></span>
    </span>
    <span class="12"></span>
  </strong>
  hello I am some text
</p>

There are two further complexities. First, the document I need to traverse is very long, in the region of 250,000 words of HTML, all in a complex nested format perhaps 10 to 15 levels deep. Second, it is not a single regex I am running. I have an array of 300 regexes, and I need to traverse the document for every one of them. The point is that it is quite resource intensive and time consuming. At the moment it takes about an hour to run my code. But that code is wrong because it acts on the anchor elements.

I thought of simply removing the anchor elements along the lines:

anchors.forEach((anchor) => anchor.parentNode.removeChild(anchor));

but then I am left with a document that lacks the anchor elements, and I need them in the document; I just don't want to traverse them with the regex. I thought of recording the location of the deleted anchor elements and reinserting them after the regex, but I will be inserting new spans, which makes it hard to track where each anchor should be reinserted. This method just becomes too complex.

I would be grateful for suggestions as to how to proceed. Is there some way of avoiding traversing **nested** anchor elements?

EDIT 1. Apologies if my question wasn't clear and thank you for the really helpful responses. I've already learnt a lot. Here is some further explanation.

Here is another example bit of html

<p class="A B">
  <span class="H LegLHS F" id="123">
    <span class="D">
      <span class="C">(b)</span>
    </span>
  </span>
  <span class="H G LegP3Text">
    <a class="LegCitation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C">Directive 2020/44/UK</span>
      </span>
    </a>
    <span class="D">
      <span class="C"> UK law which is a directive </span>
    </span>
    <a class="Citation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C"> Directive 2020/44/UK </span>
      </span>
    </a>
    <span class="D">
      <span class="C">.</span>
      <span class="E"></span>
    </span>
  </span>
</p>

There are two issues I'm struggling with: (1) I don't want to act on anything between the opening <a> and the closing </a>. I am trying to completely exclude anchor elements and anything within them. (2) The regex I am running acts on the innerHTML rather than on text nodes, because I use the replace operation to wrap the found term in a span with a class, like this: <span class="xxxx">{match}</span>. So, for example, in the HTML above, suppose I search for the term "directive". I want to avoid matches within this anchor element.

    <a class="Citation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C"> Directive 2020/44/UK </span>
      </span>
    </a>

But I want to match the term “directive” in the below, because it is not the descendant of any anchor element.

      <span class="C"> UK law which is a directive </span>

Perhaps I am going about this wrong and there is some more elegant way of doing what I want to do. What I really want to do is search the text of the document for certain regexes, then wrap any hits in a new span. It doesn't really matter whether I hit the text content of the anchor element, so long as I don't hit content like the href link. I'm just completely at a loss as to why, no matter how I exclude anchor elements, changes are still made to the stuff between the opening and closing tags, e.g. the href bit.

  • Can you explain "nested anchor elements"? Wouldn't all anchors be nested? Or do you mean nothing that's a child of an anchor? Would something like document.querySelectorAll("*:not(a):not(a *)") work, i.e. everything except anchors and descendants of anchors? Commented Jan 11, 2024 at 20:28
  • thank you for this, really useful. I've added some context and tried to better explain the issue. Really grateful if you had any further thoughts. Commented Jan 12, 2024 at 11:22
  • Did the selectors I provided solve your problem? If not, where are you still running into issues? Commented Jan 12, 2024 at 15:05

2 Answers

Select all non-anchor elements, iterate over their child nodes, and change the text nodes. The question is hard to follow, though: it's not clear what exactly should be changed inside anchors, so I assume their child elements should be changed too, and I added a span inside an anchor to show this. It would be better if the OP provided a more extensive input HTML and added the desired output.

// For every non-anchor element, rewrite only its direct text-node children.
document.querySelectorAll('*:not(a)')
  .forEach(el => [...el.childNodes]
    .forEach(node => {
      if (node.nodeType === Node.TEXT_NODE) node.textContent = node.textContent.replace(/some/g, 'any');
    }));
<p class="1 2">
  <span class="3">
   some writing here
    <strong class="4">some more here</strong>
  </span>
  <strong class="5">
    <span class="6">
      <span class="7"></span>
      <a class="8" href="#abc" title="TITLE" id="9">some text <span> i am some in an anchor</span></a>
      <span class="10">some text</span>
      <span class="11"></span>
    </span>
    <span class="12"></span>
  </strong>
  hello I am some text
</p>


3 Comments

Thank you for taking a look at this. I tried the code. I don't think I explained the issue properly, and I've made an edit [edit 1] to the main question to try to explain. I've spent so long on this it's ridiculous. Really grateful if you had any further thoughts.
@Jezza I think this could be an XY problem: you may have chosen the wrong approach by running regexes over HTML, because that way you end up working on the raw HTML of the whole page. On the other hand, why regex the HTML at all if you can change the logic and, for example, find elements by tag through the DOM? So again, the goal of your task is still unclear; you haven't described what you want to achieve as the final result.
the goal is to wrap certain terms in certain bits of a document with a particular class. The 'certain terms' are held in an array. The 'certain bits of a document' are the text content bits of the document not including any content between the opening and closing tags of an anchor element.

Try the following:

const p=document.querySelector("p"); // selector for top level parent element
[p,...p.querySelectorAll("*:not(a):not(a *)")]
 .forEach(e=>[...e.childNodes]
  .filter(n => n.nodeType == Node.TEXT_NODE)
  .forEach(n=>n.textContent=n.textContent.replace(/some/g,"lots of")));
<div>This "some" should not be changed.
  <p class="1 2">But this "some" needs to be replaced.
    <span class="3">
      some writing here
      <strong class="4">some more here</strong>
    </span>
    <strong class="5">
      <span class="6">
        <span class="7"></span>
        <a class="8" href="#abc" title="TITLE" id="9">some text <i>and some italic text</i></a>
        <span class="10">some text</span>
        <span class="11"></span>
      </span>
      <span class="12"></span>
    </strong>
    hello I am some text
  </p>
</div>

Starting with the parent <p> element, the .querySelectorAll("*:not(a)") collection will contain all non-<a> elements. The text nodes among the child nodes of each of these elements are then processed further: in each .textContent, the string "some" is replaced by "lots of".

1. Update:
While the HTML provided by OP does not strictly require it, @mykaf's suggested selector *:not(a):not(a *) (see comment under the question) would be necessary if we want to exclude anything within <a> tags.
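
Note that assigning to .textContent can only substitute plain text; a replacement string containing <span> markup would be displayed literally rather than parsed. If the hits need to be wrapped in elements, one possible approach is to walk the text nodes with a TreeWalker, skip any that sit inside an <a>, and replace each matching text node with a fragment made of text and new spans. The following is a sketch under assumptions, not a solution tested against the OP's document: splitByMatches, wrapOutsideAnchors and the class name xxx are hypothetical names, the regex passed in must carry the g flag, and skipping script and style elements is my own addition.

```javascript
// Pure helper (hypothetical name): split a string into plain and matched pieces.
// `regex` must have the g flag, otherwise matchAll throws a TypeError.
function splitByMatches(text, regex) {
  const parts = [];
  let last = 0;
  for (const m of text.matchAll(regex)) {
    if (m[0] === "") continue; // ignore empty matches
    if (m.index > last) parts.push({ text: text.slice(last, m.index), hit: false });
    parts.push({ text: m[0], hit: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) parts.push({ text: text.slice(last), hit: false });
  return parts;
}

// Browser-only (hypothetical name): wrap hits in text nodes that are not inside an <a>.
function wrapOutsideAnchors(root, regex, className = "xxx") {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode: node =>
      node.parentElement.closest("a, script, style")
        ? NodeFilter.FILTER_REJECT
        : NodeFilter.FILTER_ACCEPT
  });
  // Collect the text nodes first, then mutate, so the walker is not disturbed.
  const textNodes = [];
  while (walker.nextNode()) textNodes.push(walker.currentNode);

  for (const node of textNodes) {
    const parts = splitByMatches(node.data, regex);
    if (!parts.some(p => p.hit)) continue; // nothing to wrap in this node
    const frag = document.createDocumentFragment();
    for (const p of parts) {
      if (p.hit) {
        const span = document.createElement("span");
        span.className = className;
        span.textContent = p.text;
        frag.append(span);
      } else {
        frag.append(p.text); // plain strings are appended as text nodes
      }
    }
    node.replaceWith(frag);
  }
}
```

In a browser this could be called as wrapOutsideAnchors(document.body, /directive/gi). Because only text nodes are read and written, attribute values such as href are never touched.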

2. Update:
After OP updated their question it has now become clear that the intention was to wrap any rendered text on the page in a <span class="xxx"> element.

This can be achieved most easily by applying a regular expression on the .innerHTML of the page's body:

document.body.innerHTML=document.body.innerHTML.replace(/(?<=^|>)[^<]+/gm,t=>
 t.replace(/directive/ig,'<span class="xxx">$&</span>'));
.xxx {background-color:#8f8}
Some initial text with the word directive in it.
<p class="A B">
  <span class="H LegLHS F" id="123">
    <span class="D">
      <span class="C">(b)</span>
    </span>
  </span>
  <span class="H G LegP3Text">
    <a class="LegCitation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C">Directive 2020/44/UK</span>
      </span>
    </a>
    <span class="D">
      <span class="C"> UK law which is a directive </span>
    </span>
    <a class="Citation" title="Go to item" rel="cite" href="/uk/directive/2020/0044">
      <span class="D">
        <span class="C"> Directive 2020/44/UK </span>
      </span>
    </a>
    <span class="D">
      <span class="C">.</span>
      <span class="E"></span>
    </span>
  </span>
</p>

The regular expression /(?<=^|>)[^<]+/gm will find any run of text that starts right after a > (or at the beginning of a line, since the m flag is set) and extends up to the next <. Each matched fragment is then passed to the replace callback, where a series of .replace() calls on that fragment does the actual wrapping (in my snippet there is only a single replace in action).
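
To run a whole array of patterns through this callback, one option is to merge them into a single alternation instead of chaining one .replace() per pattern. Chained replacements would also re-scan the <span class="xxx"> markup inserted by earlier ones, so a later pattern matching "span" or "class" could corrupt it. The sketch below is only an illustration: wrapTerms, its parameters and the sample patterns are hypothetical, it ignores each pattern's own flags and applies gi throughout, and it assumes the patterns contain no backreferences or named groups, which could clash once the sources are joined.

```javascript
// Hypothetical helper: wrap every match of any pattern in <span class="...">,
// touching only the text between tags, never the tags or their attributes.
function wrapTerms(html, patterns, className = "xxx") {
  // One combined regex, so every text fragment is scanned exactly once.
  const combined = new RegExp(
    patterns.map(p => `(?:${p.source})`).join("|"),
    "gi"
  );
  return html.replace(/(?<=^|>)[^<]+/gm, text =>
    text.replace(combined, `<span class="${className}">$&</span>`)
  );
}

// Sample input: the href contains "directive" but must stay untouched.
const sample = '<a href="/uk/directive/1">x</a> a directive';
console.log(wrapTerms(sample, [/directive/, /law/]));
```

Like the one-pattern version, this relies on tags not containing a literal > inside attribute values and not being split across lines.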

2 Comments

Thank you, I have edited the main question to try to explain the issue better. Thanks for taking a look and helping.
I found a regular expression based way of achieving your objective.
