Converting HTML to text in Node.js (outside of the browser)

Question

How do you convert HTML to text efficiently in Node.js, i.e. outside of the browser? I also want to convert entities such as &auml; to ä, not just strip the tags from the HTML.

Here is a Jest unit test for a function, convertHtmlToText, that performs this conversion:

it('when extract from partial html should extract text', () => {
  const html = `<p>&nbsp;&auml;&uuml;
\t<img alt="" src="http://www.test.org:80/imageupload/userfiles/2/images/world med new - 2022.jpg" style="width: 2000px; height: 1047px; max-width: 100%; height: auto;" /></p>
<p>
\tAn evening of music, silence and guiding thoughts to help us experience inner peace, connect with the Divine and share loving vibrations with the world. Join millions of people throughout the world to contribute in creating a wave of peace.</p>
<div>
\t&nbsp;</div>
<div>
\t<strong>Please join ....</strong></div>
<div>
\t&nbsp;</div>
<div>
\t<strong>Watch live:&nbsp;<a href="https://test.org/watchlive" target="_blank">test.org/watchlive</a></strong></div>`
  const text = convertHtmlToText(html)
  console.log(text)
  expect(text).toContain("ä");
  expect(text).toContain("ü");
  // Assert on the result itself; a bare expect.not.stringContaining(...) checks nothing.
  expect(text).not.toContain("<");
  expect(text).not.toContain(">");
});

gil.fernandes · Accepted Answer · 2022-02-18 11:02:32Z

2

One possible solution is to use a library such as jsdom.

This function removes the tags and also decodes the entities in any HTML text:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const convertHtmlToText = (html) => {
  if(!html) {
    return ""
  }
  const dom = new JSDOM(html)
  const textContent = dom.window.document.documentElement.textContent
  // collapse runs of whitespace (including non-breaking spaces) into single spaces
  return textContent.replace(/\s+/g, ' ').trim()
}

module.exports = {
  convertHtmlToText
}

Segun Adeniji · Answer · 2022-02-18 11:33:33Z

-1

const HTMLContent = `<div> my&apos; <a href="profile/lol">profile</a></div>`;

// Strip anything that looks like a tag, then decode the remaining entities.
const strippedHtml = decodeHTMLEntities(HTMLContent.replace(/<[^>]+>/g, ''));
console.log(strippedHtml);

function decodeHTMLEntities(text) {
  // Only the entities listed here are decoded.
  // 'amp' comes last so that '&amp;lt;' decodes to '&lt;' rather than '<'.
  const entities = [
    ['apos', '\''],
    ['#x27', '\''],
    ['#x2F', '/'],
    ['#39', '\''],
    ['#47', '/'],
    ['lt', '<'],
    ['gt', '>'],
    ['nbsp', ' '],
    ['quot', '"'],
    ['amp', '&']
  ];

  for (const [name, char] of entities) {
    text = text.replace(new RegExp('&' + name + ';', 'g'), char);
  }
  return text;
}

Try this.
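
As a hedged aside on the entity table above: numeric character references such as &#228; or &#xE4; do not need one table row each, because they can be decoded generically. The sketch below is only an illustration. The helper name decodeNumericEntities is hypothetical and not part of the answer, and named entities such as &auml; are deliberately left alone, since they still require a full lookup table or an HTML parser.

```javascript
// Hypothetical helper (not from the original answer): decodes decimal (&#228;)
// and hexadecimal (&#xE4;) character references. Named entities such as
// &auml; are NOT handled here and pass through unchanged.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (match, ref) => {
    const isHex = ref[0] === 'x' || ref[0] === 'X';
    const codePoint = parseInt(isHex ? ref.slice(1) : ref, isHex ? 16 : 10);
    // Leave out-of-range references untouched instead of throwing a RangeError.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities('&#228;&#xFC; &#x27;ok&#39; &auml;'));
// prints: äü 'ok' &auml;
```

This covers every numeric reference in one pass, but it does not make the approach a substitute for a real parser.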

edited Feb 18, 2022 at 11:33

answered Feb 18, 2022 at 11:17

5 Comments

gil.fernandes Over a year ago

Hello, this is not bad, but I also want entities like &auml; and &uuml; to be properly converted to text.

Segun Adeniji Over a year ago

Sorry, syntax fixed. Try this, thanks.

Quentin Over a year ago

That's very fragile (e.g. it will break if an attribute value contains >), the list of supported entities is very short, and it doesn't process whitespace correctly.

Segun Adeniji Over a year ago

@Quentin can you provide him with a better solution? He is using Node.js, not browser JS, which has a DOM to manipulate.

Quentin Over a year ago

@SegunAdeniji — gil.fernandes already has
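
The fragility raised in the comments (a ">" inside an attribute value) can be reproduced in plain Node.js without any library. This is a minimal sketch: stripTags is a hypothetical name wrapping the same regular expression used in the regex-based answer, and the sample markup is made up for the demonstration.

```javascript
// Hypothetical wrapper around the regex from the regex-based answer.
const stripTags = (html) => html.replace(/<[^>]+>/g, '');

// A simple tag pair is stripped as expected.
const simple = stripTags('<a href="profile/lol">profile</a>');

// An attribute value containing ">" ends the match too early, so the rest
// of the attribute leaks into the output.
const leaky = stripTags('<img alt="a > b" src="x.png">done');

console.log(JSON.stringify(simple)); // prints: "profile"
console.log(JSON.stringify(leaky));  // prints: " b\" src=\"x.png\">done"
```

A real HTML parser, such as the one jsdom uses in the other answer, tokenizes attribute values properly and does not have this problem.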
