How do you convert HTML to text efficiently in Node.js, i.e. outside the browser? I also want entities such as &auml; converted to ä, and not just the tags removed from the HTML.

Here is a Jest unit test for a function convertHtmlToText that does this conversion:

it('when extract from partial html should extract text', () => {
  const html = `<p>&nbsp;&auml;&uuml;
\t<img alt="" src="http://www.test.org:80/imageupload/userfiles/2/images/world med new - 2022.jpg" style="width: 2000px; height: 1047px; max-width: 100%; height: auto;" /></p>
<p>
\tAn evening of music, silence and guiding thoughts to help us experience inner peace, connect with the Divine and share loving vibrations with the world. Join millions of people throughout the world to contribute in creating a wave of peace.</p>
<div>
\t&nbsp;</div>
<div>
\t<strong>Please join ....</strong></div>
<div>
\t&nbsp;</div>
<div>
\t<strong>Watch live:&nbsp;<a href="https://test.org/watchlive" target="_blank">test.org/watchlive</a></strong></div>`
  const text = convertHtmlToText(html)
  console.log(text)
  expect(text).toContain("ä");
  expect(text).toContain("ü");
  // expect.not.stringContaining(...) alone only builds a matcher and asserts nothing
  expect(text).not.toContain("<");
  expect(text).not.toContain(">");
});

2 Answers

2

One possible solution is to use a library such as jsdom.

This function removes the tags and also converts the entities in any HTML text:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const convertHtmlToText = (html) => {
  if(!html) {
    return ""
  }
  const dom = new JSDOM(html)
  const textContent = dom.window.document.documentElement.textContent
  // removing unnecessary spaces
  return textContent.replace(/\s+/gm, ' ').trim()
}

module.exports = {
  convertHtmlToText
}

-1

let HTMLContent = `<div> my&apos; <a href="profile/lol">profile</a></div>`;

let strippedHtml = decodeHTMLEntities(HTMLContent.replace(/<[^>]+>/g, ''));
console.log(strippedHtml)

function decodeHTMLEntities(text) {
  var entities = [
    ['apos', '\''],
    ['#x27', '\''],
    ['#x2F', '/'],
    ['#39', '\''],
    ['#47', '/'],
    ['lt', '<'],
    ['gt', '>'],
    ['nbsp', ' '],
    ['quot', '"'],
    // '&amp;' must be decoded last, otherwise '&amp;lt;' would end up as '<'
    ['amp', '&']
  ];

  for (var i = 0, max = entities.length; i < max; ++i) {
    text = text.replace(new RegExp('&' + entities[i][0] + ';', 'g'), entities[i][1]);
  }
  return text;
}

try this
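
A possible extension, sketched under assumptions rather than offered as a complete decoder: the table below adds the `auml` and `uuml` entities from the question, handles numeric references generically, and decodes in a single pass so that `&amp;lt;` yields `&lt;` and not `<`. The names `NAMED_ENTITIES` and `decodeEntities` are hypothetical, and the table is deliberately tiny compared with the full HTML entity list.

```javascript
// Hypothetical, deliberately small table; extend it as needed.
const NAMED_ENTITIES = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'",
  nbsp: '\u00a0', auml: 'ä', ouml: 'ö', uuml: 'ü', szlig: 'ß'
};

function decodeEntities(text) {
  // One pass over the input, so already-decoded output is never decoded again.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

Note that `&nbsp;` decodes to a real no-break space (U+00A0) here, not to an ordinary space; a later `replace(/\s+/g, ' ')` would normalize it, since `\s` in JavaScript matches U+00A0.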

5 Comments

Hello, this is not bad, but I also want entities like &nbsp;&auml;&uuml to be properly converted to text.
sorry syntax fixed, try this. thanks
That's very fragile (e.g. it will break if an attribute value contains >), the list of supported entities is very short, and it doesn't process whitespace correctly.
@Quentin can you provide him with a better solution? he is using node.js, not browser js that has DOM to manipulate
@SegunAdeniji — gil.fernandes already has
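
The fragility raised in the comments can be reproduced with a short sketch. It uses the same tag-stripping regex as the answer; the helper name `stripTags` and the sample inputs are made up for illustration.

```javascript
// The tag-stripping regex from the answer above.
const stripTags = (html) => html.replace(/<[^>]+>/g, '');

// A ">" inside a quoted attribute value ends the match too early,
// so the rest of the tag leaks into the output.
const broken = stripTags('<img alt="a > b" src="x.png">text');
console.log(JSON.stringify(broken)); // " b\" src=\"x.png\">text"

// Whitespace between elements is kept verbatim: tabs and newlines survive.
const spaced = stripTags('<p>\n\tone</p>\n<p>\n\ttwo</p>');
console.log(JSON.stringify(spaced)); // "\n\tone\n\n\ttwo"
```

A real HTML parser such as jsdom avoids the first problem because it tokenizes attribute values instead of scanning for the next `>`.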
