0

I'm a bit stuck. I have scraped a website and would now like to convert it into markdown. My html looks like this:

Some text more text, and more text. Some text more text, and more text. 
Once in a while  <span class="bold">something is bold</span>. 
Then some more text. And <span class="bold">more bold stuff</span>.

There are HTML-to-Markdown modules available; however, they would only work if the text <b>looked like this</b>.

How could I go through the HTML and, every time I find a span that is supposed to make something bold, turn that piece of the HTML into bold Markdown, that is, make it **look like this**?

3
  • can you use regex to do a string replace? Commented Jul 7, 2017 at 20:28
  • I'm not sure how that would work. Commented Jul 7, 2017 at 20:35
  • You could basically replace all the span tags with bold tags and then convert to Markdown. Commented Jul 7, 2017 at 20:48
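
The replacement suggested in the comments could be sketched roughly as follows. This is only an illustration: `spansToBoldTags` is a hypothetical helper name, and the pattern assumes each span carries exactly `class="bold"` and is not nested inside another span. A regex like this is fragile on real-world HTML, so treat it as a quick pre-processing step before handing the result to an HTML-to-Markdown module.

```javascript
// Hypothetical helper: rewrite <span class="bold">...</span> as <b>...</b>
// so that a converter which only understands <b> can handle the text.
function spansToBoldTags(html) {
  // Assumption: the class attribute is exactly "bold" and spans are not nested.
  // [\s\S]*? lazily captures the span's content, including line breaks.
  const boldSpan = /<span\s+class="bold"\s*>([\s\S]*?)<\/span>/gi;
  return html.replace(boldSpan, '<b>$1</b>');
}

const sample =
  'Once in a while <span class="bold">something is bold</span>. ' +
  'And <span class="bold">more bold stuff</span>.';

console.log(spansToBoldTags(sample));
// prints: Once in a while <b>something is bold</b>. And <b>more bold stuff</b>.
```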

2 Answers

2

Try to-markdown (https://github.com/domchristie/to-markdown), an HTML-to-Markdown converter written in JavaScript.

It can be extended by passing in an array of converters to the options object:

toMarkdown(stringOfHTML, { converters: [converter1, converter2, …] });

In your case, the converter can be

{
    filter: 'span',
    replacement: function (content) {
        return '**' + content + '**';
    }
}

Refer to its readme for more details.
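
As a sketch of how the converter could be narrowed to spans with the class bold: the object below keeps the same `filter`/`replacement` shape, but uses a function as the filter. It assumes that `filter` may be a function receiving a DOM node, as described in the comments on this answer. The name `boldSpanConverter` and the plain objects standing in for DOM nodes are made up for illustration, and the check splits `className` on whitespace rather than using a regex test so that a class such as `semibold` does not match.

```javascript
// Converter in the shape shown above; it would be passed to the library as
// toMarkdown(stringOfHTML, { converters: [boldSpanConverter] }).
const boldSpanConverter = {
  // Assumption: the library calls filter(node) with a DOM node and
  // applies the replacement only when the function returns true.
  filter: function (node) {
    const classes = (node.className || '').split(/\s+/);
    return node.nodeName === 'SPAN' && classes.includes('bold');
  },
  // Wrap the already-converted inner content in Markdown bold markers.
  replacement: function (content) {
    return '**' + content + '**';
  }
};

// Plain objects standing in for DOM nodes, just to exercise the filter.
console.log(boldSpanConverter.filter({ nodeName: 'SPAN', className: 'bold' }));   // true
console.log(boldSpanConverter.filter({ nodeName: 'SPAN', className: 'italic' })); // false
console.log(boldSpanConverter.replacement('something is bold')); // **something is bold**
```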


5 Comments

Hey, that's amazing! I just tried it and it works really well. Is there any way I can specify that it should not just be a span, but a span with the class bold?
@GeorgeWelder The filter can be a function that returns a boolean depending on whether a given node should be replaced. The function is passed a DOM node as its only argument. In your case, the filter can be function (node) { return node.nodeName === 'SPAN' && /bold/i.test(node.className); }
this is amazing. how can i learn this, or dive more into this? What is this called exactly? I don't understand any of it, but it works :)
@GeorgeWelder You can read the module's README on GitHub; I learned it from there. You can read its source code too; it's not very difficult to understand.
@GeorgeWelder To understand it you'll need some knowledge of HTML DOM
-1

Notepad++ is an open-source editor that supports regex. This picture shows the basic idea.

You know how to use an editor to find and replace strings. In an editor like Notepad++ you can search for string patterns, replace parts of those patterns, and keep what's left. In your case, you want to find strings that are framed by HTML markup. The regex in the 'Find what' edit box expresses that, with the special notation ([^<]*) meaning "save zero or more of any character other than '<'" for use in the replacement string. The 'Replace with' edit box says to use what was saved (as \1) in the expression **\1**, which gives you what you want in the text file. All that remains is to click 'Replace all'.

using Notepad++

To be able to do this you need to install Notepad++ and learn some basic Perl regex. To get this dialogue box, press Ctrl-H. Of course, if you get it wrong there's always Ctrl-Z.
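
The same capture-and-replace idea can be sketched in JavaScript. The text above names only the ([^<]*) group and the **\1** replacement, so the span markup around the group is an assumption filled in from the question's HTML, and `boldSpansToMarkdown` is a hypothetical name. In a JavaScript replacement string the saved group is written `$1` rather than `\1`.

```javascript
// Assumed 'Find what' pattern: the span markup from the question wrapped
// around the ([^<]*) group described above.
const findWhat = /<span class="bold">([^<]*)<\/span>/g;

function boldSpansToMarkdown(text) {
  // '**$1**' plays the role of the **\1** 'Replace with' expression.
  return text.replace(findWhat, '**$1**');
}

console.log(boldSpansToMarkdown('And <span class="bold">more bold stuff</span>.'));
// prints: And **more bold stuff**.
```

Because the group excludes '<', a bold span that itself contains other tags will not match and is left unchanged.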

2 Comments

I do appreciate hit-and-run down votes that say nothing about what was wrong with the answer.
OP asked for a way to do it in NodeJS/javascript, not in a text editor.
