I need to truncate text to a maximum length, but the input string may contain HTML, e.g. <p>hello</p>. The HTML tags do not count towards the max length.

For example, with this text and a max length of 3, I would need to slice it like so: <p>hel. I'm trying to think of a clever way to do this, and I'm wondering if anyone has ideas while I try to work it out. I can strip the tags from the text, so I can find where in the plain text I need to stop; the tricky part is finding where that position lives in the original string.

UPDATE: Thanks to both comments I was inspired to write this, hopefully it helps anyone else with this problem:

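// Parses the HTML into a detached element, truncates its text nodes in place,
// and serializes the result so that all tags stay properly closed.
const cutHTMLCharactersByMaxlength = (htmlString: string, maxLength: number): string => {
  const parent = document.createElement('div');
  parent.innerHTML = htmlString;
  countHTMLCharactersRecurse(parent, maxLength, 0);

  return parent.innerHTML;
};

// Depth-first walk that returns the running count of text characters kept so far.
const countHTMLCharactersRecurse = (node: Node, maxLength: number, count: number): number => {
  if (node.nodeType === Node.TEXT_NODE) {
    const text = node.textContent ?? "";
    if (count >= maxLength) {
      // The limit was already reached: blank out any later text.
      node.textContent = "";
    } else {
      count += text.length;
      if (count > maxLength) {
        // This node crosses the limit: keep only the part that still fits.
        node.textContent = text.slice(0, text.length - (count - maxLength));
        count = maxLength;
      }
    }
    return count;
  }

  // Elements after the cut stay in the tree as empty tags; only their text is removed.
  for (const child of Array.from(node.childNodes)) {
    count = countHTMLCharactersRecurse(child, maxLength, count);
  }
  return count;
};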

2 Replies

Write your own length function that iterates over the characters of your input and counts only the characters outside the tags (you are not interested in the type of the tags). For that you need a primitive kind of HTML parsing that can detect tags.
If you can ensure that no attribute value contains '<' or '>' (counterexample: <img alt="the > sdkfj">), you can probably get away with skipping everything between a '<' and the next '>'.

If you want to handle the example above, you need to at least keep track of whether you are inside a quoted attribute value.
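
A minimal sketch of that scanning idea, assuming the input contains only ordinary tags (no comments, no script or style content) and that character entities such as &amp; may be counted as several characters. The names findCutIndex and truncateVisible are made up for this illustration:

```typescript
// Hypothetical helper: returns the index in `html` just past the
// `maxLength`-th character that lies outside of any tag.
function findCutIndex(html: string, maxLength: number): number {
  let visible = 0;              // characters counted outside tags so far
  let inTag = false;            // true between '<' and the matching '>'
  let quote: string | null = null; // open quote character inside a tag, if any

  for (let i = 0; i < html.length; i++) {
    if (visible >= maxLength && !inTag) {
      return i; // enough visible characters collected: cut here
    }
    const ch = html.charAt(i);
    if (inTag) {
      if (quote !== null) {
        // Inside a quoted attribute value: only the same quote ends it,
        // so a '>' in here does not close the tag.
        if (ch === quote) quote = null;
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        inTag = false;
      }
    } else if (ch === "<") {
      inTag = true;
    } else {
      visible++;
    }
  }
  return html.length; // the whole string fits within maxLength
}

// Hypothetical wrapper: slices the original string at the computed index.
function truncateVisible(html: string, maxLength: number): string {
  return html.slice(0, findCutIndex(html, maxLength));
}

console.log(truncateVisible("<p>hello</p>", 3));
console.log(truncateVisible('<img alt="the > sdkfj">abc', 2));
```

Note that this only finds the cut position in the original string; it does not close any tags that are still open at that point.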

I would need to slice it like so <p>hel

But you want valid HTML as result, right? So that would actually have to be <p>hel</p> then.

And if you have nested HTML, it gets more complex. <p>hello <strong>people</strong>!</p>, truncated to 8 characters, would require <p>hello <strong>pe</strong></p> as the result.

So this should be handled using DOM methods, not string functionality or regex. You iterate over your nodes in a Depth-First Search approach, going down until you end up at the text nodes, summing up the length of the texts you encounter. Once you pass your threshold length, the current text node gets cut at that position. Then you keep iterating over "the rest" - those elements and text nodes need to be discarded.
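
The traversal described above can be sketched roughly as follows. To keep it runnable outside a browser, this uses a simplified stand-in tree instead of real DOM nodes; the TreeNode types and the truncateTree and serialize functions are assumptions made for this illustration, not a DOM API:

```typescript
// Simplified stand-in for DOM nodes (hypothetical, for illustration only).
type TextNode = { kind: "text"; text: string };
type ElementNode = { kind: "element"; tag: string; children: TreeNode[] };
type TreeNode = TextNode | ElementNode;

// Depth-first truncation in place. `remaining` is the number of text
// characters still allowed; the return value is what is left afterwards.
function truncateTree(node: TreeNode, remaining: number): number {
  if (node.kind === "text") {
    // Cut the text at the threshold (a no-op if it fits entirely).
    node.text = node.text.slice(0, remaining);
    return remaining - node.text.length;
  }
  const kept: TreeNode[] = [];
  for (const child of node.children) {
    if (remaining <= 0) break; // threshold reached: discard the rest
    remaining = truncateTree(child, remaining);
    kept.push(child);
  }
  node.children = kept;
  return remaining;
}

// Serializes the tree back to markup; closing tags are always emitted.
function serialize(node: TreeNode): string {
  if (node.kind === "text") return node.text;
  const inner = node.children.map(serialize).join("");
  return `<${node.tag}>${inner}</${node.tag}>`;
}

// <p>hello <strong>people</strong>!</p>
const paragraph: ElementNode = {
  kind: "element",
  tag: "p",
  children: [
    { kind: "text", text: "hello " },
    { kind: "element", tag: "strong", children: [{ kind: "text", text: "people" }] },
    { kind: "text", text: "!" },
  ],
};

truncateTree(paragraph, 8);
console.log(serialize(paragraph)); // <p>hello <strong>pe</strong></p>
```

With real DOM nodes the same shape applies: text nodes get their textContent sliced, and siblings after the cut get removed from their parent.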

Another thing to keep in mind: White-space handling. Depending on the context (pre element, white-space formatting applied, etc.), you might need to count multiple consecutive whitespace characters as individual characters - or as a single space. And whitespace immediately before/after opening/closing tags also gets some special treatment in HTML, in certain situations.
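
As a rough illustration of the counting difference, here is a sketch that treats each run of ASCII whitespace as one character unless told to preserve it. The function name and the preserveWhitespace flag are hypothetical, and the sketch deliberately ignores the special treatment of whitespace next to tags:

```typescript
// Hypothetical helper: length of `text` as it would roughly be rendered.
// With preserveWhitespace (e.g. inside a pre element) every character counts;
// otherwise each run of spaces, tabs and line breaks counts as one space.
function renderedLength(text: string, preserveWhitespace = false): number {
  if (preserveWhitespace) return text.length;
  // Non-breaking spaces (U+00A0) are not matched here, so they are
  // counted individually rather than collapsed.
  return text.replace(/[ \t\n\r\f]+/g, " ").length;
}

console.log(renderedLength("a  \n b"));    // 3
console.log(renderedLength("a  b", true)); // 4
```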
