Turn html text into markdown manually (javascript / nodejs)

Question

I'm a bit stuck. I have scraped a website and would now like to convert it into markdown. My html looks like this:

Some text more text, and more text. Some text more text, and more text. 
Once in a while  <span class="bold">something is bold</span>. 
Then some more text. And <span class="bold">more bold stuff</span>.

There are HTML-to-Markdown modules available; however, they would only work if the text <b> looked like this </b>.

How could I go through the HTML and, every time I find a span that is supposed to bold something, turn that piece of the HTML into bold Markdown, that is, make it **look like this**?

You could basically replace all the span tags with bold tags and then convert to Markdown.
– Shivam, Commented Jul 7, 2017 at 20:48
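
A minimal sketch of the approach suggested in this comment, assuming the spans are not nested and always appear exactly as `<span class="bold">`. The function name `spansToBoldTags` is made up for illustration; the result could then be passed to any HTML-to-Markdown module that understands `<b>`.

```javascript
// Rewrite <span class="bold">...</span> as <b>...</b> so that a stock
// HTML-to-Markdown converter recognises the bold text.
// Assumption: spans are not nested and carry exactly the attribute class="bold".
function spansToBoldTags(html) {
  // [\s\S]*? matches lazily across newlines, up to the first closing </span>.
  return html.replace(/<span class="bold">([\s\S]*?)<\/span>/g, '<b>$1</b>');
}

const scraped = 'Once in a while <span class="bold">something is bold</span>. ' +
  'And <span class="bold">more bold stuff</span>.';

console.log(spansToBoldTags(scraped));
// Once in a while <b>something is bold</b>. And <b>more bold stuff</b>.
```

A regular expression is fine for markup this regular; for arbitrary HTML a real parser is the safer choice.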

Yi Kai · Accepted Answer · 2017-07-08 08:04:48Z

2

Try to-markdown (https://github.com/domchristie/to-markdown), an HTML to Markdown converter written in JavaScript.

It can be extended by passing in an array of converters to the options object:

toMarkdown(stringOfHTML, { converters: [converter1, converter2, …] });

In your case, the converter can be

{
    filter: 'span',
    replacement: function (content) {
        return '**' + content + '**';
    }
}

Refer to its readme for more details.

edited Jul 8, 2017 at 8:04

answered Jul 8, 2017 at 3:38

Yi Kai

640 reputation, 6 silver badges, 9 bronze badges

5 Comments

George Welder Over a year ago

hey, that's amazing! I just tried it and it works really great. Is there any way I can specify that it not only should be a span, but a span with the class bold ?

Yi Kai Over a year ago

@GeorgeWelder The filter can be a function that returns a boolean depending on whether a given node should be replaced. The function is passed a DOM node as its only argument. In your case, the filter can be function (node) { return node.nodeName === 'SPAN' && /bold/i.test(node.className); }
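
Put together, the converter described in this comment might look like the sketch below. The name `boldSpanConverter` is hypothetical, and the plain objects at the end only stand in for real DOM nodes so the filter can be tried without installing anything.

```javascript
// Converter for to-markdown: only <span class="bold"> elements become **bold**.
const boldSpanConverter = {
  // `node` is the DOM node the library passes to the filter.
  filter: function (node) {
    return node.nodeName === 'SPAN' && /bold/i.test(node.className);
  },
  // `content` is the already-converted inner text of the matched node.
  replacement: function (content) {
    return '**' + content + '**';
  }
};

// Intended use, following the answer above (requires the to-markdown package):
//   const toMarkdown = require('to-markdown');
//   toMarkdown(stringOfHTML, { converters: [boldSpanConverter] });

// Quick check with plain objects standing in for DOM nodes.
console.log(boldSpanConverter.filter({ nodeName: 'SPAN', className: 'bold' })); // true
console.log(boldSpanConverter.filter({ nodeName: 'SPAN', className: 'note' })); // false
console.log(boldSpanConverter.replacement('something is bold')); // **something is bold**
```

Note that `/bold/i` also matches class names that merely contain "bold", such as `semibold`; splitting `className` on whitespace and comparing each entry would be stricter.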

George Welder Over a year ago

this is amazing. how can i learn this, or dive more into this? What is this called exactly? I don't understand any of it, but it works :)

Yi Kai Over a year ago

@GeorgeWelder You can read the module's README on GitHub; I learned it from there. You can read its source code too; it's not very difficult to understand.

Yi Kai Over a year ago

@GeorgeWelder To understand it you'll need some knowledge of the HTML DOM.

Bill Bell · Answer · 2017-07-07 21:15:23Z

-1

Notepad++ is an open-source editor that supports regex. This picture shows the basic idea.

You know how to use an editor to find and replace strings. In an editor like Notepad++ you can look for string patterns, replace parts of the patterns, and keep what's left. In your case, you want to find strings that are framed by HTML markup. Here the regex in the 'Find what' edit box does that, with the special notation ([^<]*) meaning: save zero or more characters other than '<' for use in the replacement string. The 'Replace with' edit box says to use what was saved (as \1) in the expression **\1**, which gives you what you want in the text file. All that remains is to click 'Replace All'.

To be able to do this you need to install Notepad++ and learn some basic Perl regex. To get this dialogue box, press Ctrl-H. Of course, if you get it wrong there's always Ctrl-Z.
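
The same find-and-replace idea can be sketched in Node.js. Only the capture group ([^<]*) is spelled out above, so the rest of the pattern below is an assumption about what surrounds it, and the function name `boldSpansToMarkdown` is made up for illustration.

```javascript
// JavaScript counterpart of the editor find-and-replace described above.
// ([^<]*) captures the text inside the span; $1 plays the role of \1.
// Assumption: the bold spans contain plain text only (no nested tags).
function boldSpansToMarkdown(html) {
  return html.replace(/<span class="bold">([^<]*)<\/span>/g, '**$1**');
}

console.log(boldSpansToMarkdown(
  'Once in a while <span class="bold">something is bold</span>.'
));
// Once in a while **something is bold**.
```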

answered Jul 7, 2017 at 21:15

Bill Bell

21.7k reputation, 6 gold badges, 48 silver badges, 62 bronze badges

2 Comments

Bill Bell Over a year ago

I do appreciate hit-and-run down votes that say nothing about what was wrong with the answer.

kjsmita6 Over a year ago

OP asked for a way to do it in NodeJS/javascript, not in a text editor.
