How can I convert HTML to Object structure with text and formatting?

Question

I need to convert an HTML string with nested tags like this one:

const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"

into the following array of objects with this structure:

const result = [{
  text: "Hello World",
  format: null
}, {
  text: "I am a text with",
  format: null
}, {
  text: "bold",
  format: ["strong"]
}, {
  text: " word",
  format: null
}, {
  text: "I am bold text with nested",
  format: ["strong"]
}, {
  text: "italic",
  format: ["strong", "em"]
}, {
  text: "Word.",
  format: ["strong"]
}];

I managed the conversion with DOMParser() as long as there are no nested tags. I cannot get it working with nested tags, as in the last paragraph: there the whole paragraph is bold, but the word "italic" should be both bold and italic. I have not been able to implement this as a recursion.

Any help would be appreciated.

This is the code I have written so far:

export interface Phrase {
    text: string;
    format: string | string[];
}

export class HTMLParser {

    public parse(text: string): void {
        const parser = new DOMParser();
        const sourceDocument = parser.parseFromString(text, "text/html");
        this.parseChildren(sourceDocument.body.childNodes);

        // HERE SHOULD BE the result
        console.log("RESULT of CONVERSION", this.phrasesProcessed);
    }

    public phrasesProcessed: Phrase[] = [];

    private parseChildren(toParse: NodeListOf<ChildNode>) {
        this.phrasesProcessed = [];
        try {
            Array.from(toParse)
                .map(item => {
                    if (item.nodeType === Node.ELEMENT_NODE && item instanceof HTMLElement) {
                        return Array.from(item.childNodes).map(child => ({ text: child.textContent, format: (child.nodeType === Node.ELEMENT_NODE && child instanceof HTMLElement) ? child.tagName : null }));
                    } else {
                        return Array.from(item.childNodes).map(child => ({ text: child.textContent, format: null }));
                    }
                })
                .filter(line => line.length) // only non-empty arrays
                .map(element => ([...element, { text: "\n", format: null }])) // add linebreak after each P
                .reduce((acc: (Phrase)[], val) => acc.concat(val), []) // flatten
                .forEach(
                    element => {
                        // console.log("ELEMENT", element);
                        this.phrasesProcessed.push(element);
                    }
                );
        } catch (e) {
            console.warn(e);
        }
    }

}

Include the code that uses DOMParser(); we need to see how you're doing what you claim to be doing so it can be fixed. ATM it really looks like you are asking us to write the whole code.
– zer00ne, Commented Jun 8, 2022 at 13:01
Why is "p" not included in format? What are your rules to include a tag in the format arrays?
– trincot, Commented Jun 8, 2022 at 13:13
Ok. Sorry, it wasn't my intention to have you write the whole code...
– thooyork, Commented Jun 8, 2022 at 13:15
This data format doesn't make it clear which tags should be nested really. I would say you need to rethink that part first.
– Michał Sadowski, Commented Jun 8, 2022 at 13:21

trincot · Accepted Answer · 2022-06-08 17:04:57Z

4

You can use recursion, and this seems a good case for a generator function. As it was not clear which tags should be retained in format (apparently not p), I left this as a configurable set:

const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

function* iterLeafNodes(nodes, format=[]) {
    for (let node of nodes) {
        if (node.nodeType == 3) {
            yield ({text: node.nodeValue, format: format.length ? [...format] : null});
        } else {
            const tag = node.tagName.toLowerCase();
            yield* iterLeafNodes(node.childNodes, 
                                 formatTags.has(tag) ? format.concat(tag) : format);
        }
    }
}

// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];

console.log(result);

Note that this will still split the text when it is spread over multiple tags that are not considered formatting tags, like span.
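If that splitting is unwanted, one option is a post-processing pass that joins neighbouring entries whose format is the same. The following is only a sketch: `sameFormat` and `mergeAdjacent` are hypothetical helper names that do not appear in the answer, and it assumes `format` is either null or an array of tag names.

```javascript
// Hypothetical helpers, not part of the answer above.
// Two formats are equal when both are null/empty or list the same tags in the same order.
function sameFormat(a, b) {
    const x = a ?? [];
    const y = b ?? [];
    return x.length === y.length && x.every((tag, i) => tag === y[i]);
}

// Joins neighbouring phrases that share the same format into one phrase.
function mergeAdjacent(phrases) {
    const merged = [];
    for (const phrase of phrases) {
        const last = merged[merged.length - 1];
        if (last && sameFormat(last.format, phrase.format)) {
            last.text += phrase.text;
        } else {
            merged.push({ ...phrase }); // copy so the input objects are not mutated
        }
    }
    return merged;
}

// Shape of what iterLeafNodes yields for
// "<p>Hello <span>big</span> <strong>World</strong></p>"
const split = [
    { text: "Hello ", format: null },
    { text: "big", format: null },
    { text: " ", format: null },
    { text: "World", format: ["strong"] },
];

console.log(mergeAdjacent(split));
// two entries: "Hello big " (format null) and "World" (format ["strong"])
```

Be aware that such a pass also joins text across paragraph boundaries when the formats match, so it may need a paragraph marker if that distinction matters.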

Secondly, I'm not convinced that having null as a possible value for format is more useful than just an empty array [], but anyway, the above produces null in that case.
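To illustrate the difference for code that consumes the result, here is a small sketch. The sample arrays and variable names are made up for the example.

```javascript
// Hypothetical sample data in both shapes.
const withNull = [{ text: "Hello", format: null }, { text: "bold", format: ["strong"] }];
const withEmpty = [{ text: "Hello", format: [] }, { text: "bold", format: ["strong"] }];

// With null, every access needs a guard (here optional chaining plus a fallback)...
const boldFromNull = withNull.filter(p => p.format?.includes("strong") ?? false);

// ...while with [] the array methods can be called directly.
const boldFromEmpty = withEmpty.filter(p => p.format.includes("strong"));

console.log(boldFromNull, boldFromEmpty);
```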

Special case - insertion of `\n`

In comments you ask for the insertion of a line break after each p element.

The code below will generate that extra element. Here I also used [] instead of null for format:

const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

function* iterLeafNodes(nodes, format=[]) {
    for (let node of nodes) {
        if (node.nodeType == 3) {
            yield ({text: node.nodeValue, format: [...format]});
        } else {
            const tag = node.tagName.toLowerCase();
            yield* iterLeafNodes(node.childNodes, 
                                 formatTags.has(tag) ? format.concat(tag) : format);
            if (tag === "p") yield ({text: "\n", format: [...format]});
        }
    }
}

// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];

console.log(result);
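One edge case to be aware of: a node that is neither text nor an element, such as an HTML comment (nodeType 8), has no tagName, so the else branch would throw on input like `<p>a<!-- note -->b</p>`. Below is a sketch of a guarded variant. The name `iterLeafNodesSafe`, the shortened tag set, and the plain objects standing in for DOM nodes are assumptions made so the example runs outside a browser; in a browser you would pass `new DOMParser().parseFromString(html, 'text/html').body.childNodes` instead.

```javascript
const formatTagSet = new Set(["em", "strong"]); // shortened list for the example

const TEXT_NODE = 3;    // same value as Node.TEXT_NODE
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE

function* iterLeafNodesSafe(nodes, format = []) {
    for (const node of nodes) {
        if (node.nodeType === TEXT_NODE) {
            yield { text: node.nodeValue, format: [...format] };
        } else if (node.nodeType === ELEMENT_NODE) {
            const tag = node.tagName.toLowerCase();
            yield* iterLeafNodesSafe(node.childNodes,
                formatTagSet.has(tag) ? format.concat(tag) : format);
            if (tag === "p") yield { text: "\n", format: [...format] };
        }
        // Any other node type (comments, processing instructions) is skipped.
    }
}

// Plain objects mimicking the nodes of "<p>a<!-- note --><strong>b</strong></p>"
const mockNodes = [{
    nodeType: 1, tagName: "P", childNodes: [
        { nodeType: 3, nodeValue: "a" },
        { nodeType: 8, nodeValue: " note " },
        { nodeType: 1, tagName: "STRONG", childNodes: [{ nodeType: 3, nodeValue: "b" }] },
    ],
}];

console.log([...iterLeafNodesSafe(mockNodes)]);
```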

edited Jun 8, 2022 at 17:04

answered Jun 8, 2022 at 13:32

trincot


5 Comments

thooyork Over a year ago

Thanks a lot! That absolutely made my day! You're right, an empty array is better than null in that case. However, how can I add {text: "\n", format: null} after each "P" tag?

trincot Over a year ago

That's what you would get if in the original HTML you would have that line break after each P tag. If not, then you need to explicitly make an exception for that case in the algorithm, which makes it less elegant.

thooyork Over a year ago

Ok, I see. Thanks a lot. Just for better understanding: how would I add this exception? It is not really clear to me where to hook into the algorithm. Sorry, but I am kind of a newbie here.

trincot Over a year ago

Added a version to my answer that does this. Let me know if this suits your needs.

thooyork Over a year ago

You're awesome! Thanks a lot, you really made my day with this, since I couldn't figure out how to implement this with recursion. Also, the concept of generator functions / yield is new to me. I will definitely dive deeper into this!

Som Shekhar Mukherjee · Accepted Answer · 2022-06-08 13:35:40Z

You can recursively loop over the child nodes and construct the desired array using an array like FORMAT_NODES.

const FORMAT_NODES = ["strong", "em"];

function getText(node, parents = [], res = []) {
  if (node.nodeName === "#text") {
    const text = node.textContent.trim();
    if (text) {
      const format = parents.filter((p) => FORMAT_NODES.includes(p));
      res.push({ text, format: format.length ? format : null });
    }
  } else {
    node.childNodes.forEach((child) =>
      getText(child, parents.concat(child.nodeName.toLowerCase()), res)
    );
  }
  return res;
}

const container = document.querySelector("#container");
const result = getText(container);
console.log(result);

<div id="container">
  <p>Hello World</p>
  <p>I am a text with <strong>bold</strong> word</p>
  <p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>
</div>

Scott Sauyet · Accepted Answer · 2022-06-08 16:16:38Z

A version not all that different from the other two posted here, but with a different breakdown in responsibilities.

const getTextNodes = (node, path = []) =>
  node .nodeType === 3
    ? {text: node .nodeValue, path}
    : [... node .childNodes] .flatMap ((child) => getTextNodes (child, [... path, node .tagName .toLowerCase()]))

const extract = (keep) => (html) =>
  [...new DOMParser () .parseFromString (html, 'text/html') .body .childNodes] 
    .flatMap (node => getTextNodes (node))
    .map (({text, path = []}) => ({text, format: [...new Set (path .filter (p => keep .includes (p)))]}))

const reformat = extract (["em", "strong"])

const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"

console .log (reformat (strHTML))

This goes through an intermediate format, which might be useful for other purposes:

[
  {text: "Hello World", path: ["p"]},
  {text: "I am a text with ", path: ["p"]},
  {text: "bold", path: ["p", "strong"]},
  {text: " word", path: ["p"]},
  {text: "I am bold text with nested ", path: ["p", "strong"]},
  {text: "italic", path: ["p", "strong", "em"]},
  {text: " Word.", path: ["p", "strong"]}
]

While this looks similar to your final format, the path includes the entire tag history to the text node, and could be used for various purposes. getTextNodes extracts this format from a given node. Thus a path might look like ["div", "div", "div", "nav", "ol", "li", "a", "div", "div", "strong"], with repeated elements, and many non-formatting tags.
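As one illustration of such another purpose, the intermediate format above can be queried directly. This is a sketch only; `textInside` is a hypothetical helper that is not part of the answer, and the data is copied from the listing above.

```javascript
// Intermediate format, copied from the listing above
const textNodes = [
  {text: "Hello World", path: ["p"]},
  {text: "I am a text with ", path: ["p"]},
  {text: "bold", path: ["p", "strong"]},
  {text: " word", path: ["p"]},
  {text: "I am bold text with nested ", path: ["p", "strong"]},
  {text: "italic", path: ["p", "strong", "em"]},
  {text: " Word.", path: ["p", "strong"]}
];

// Hypothetical helper: the text of every node that sits anywhere inside the given tag
const textInside = (tag) => (nodes) =>
  nodes.filter(({path}) => path.includes(tag)).map(({text}) => text);

console.log(textInside("em")(textNodes));     // ["italic"]
console.log(textInside("strong")(textNodes)); // the four text nodes nested in <strong>
```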

The final map call in extract simply filters this path into your configured collection of formatting tags.

While we can easily do this in a single pass, getTextNodes is itself a useful function we might use elsewhere in our system.
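For comparison, a single-pass version might look roughly like the sketch below, which filters the tags while descending instead of recording the whole path. The names `extractDirect`, `walk`, `el` and `txt` are hypothetical, and plain objects stand in for DOM nodes so the example runs outside a browser; in a browser you would pass `new DOMParser().parseFromString(html, 'text/html').body.childNodes`.

```javascript
// Single pass: keep only the configured tags while walking down the tree.
const extractDirect = (keep) => {
  const walk = (node, format = []) => {
    if (node.nodeType === 3) return [{text: node.nodeValue, format}];
    const tag = (node.tagName || "").toLowerCase();
    // Add the tag once, mirroring the Set-based de-duplication in extract
    const next = keep.includes(tag) && !format.includes(tag) ? [...format, tag] : format;
    return [...node.childNodes].flatMap((child) => walk(child, next));
  };
  // Wrap in an arrow so flatMap's index argument is not mistaken for `format`
  return (nodes) => [...nodes].flatMap((node) => walk(node));
};

// Tiny stand-ins for DOM element and text nodes (assumption for this example)
const el = (tagName, ...childNodes) => ({nodeType: 1, tagName, childNodes});
const txt = (nodeValue) => ({nodeType: 3, nodeValue, childNodes: []});

const mockBody = [
  el("P", txt("Hello World")),
  el("P", el("STRONG", txt("I am bold text with nested "), el("EM", txt("italic")), txt(" Word."))),
];

console.log(extractDirect(["em", "strong"])(mockBody));
```

The trade-off is the one described above: this is shorter, but the reusable intermediate result from getTextNodes is no longer available.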
