I'm trying to parse a Markdown-style list into HTML. I am using several regular expressions for this, all in the JavaScript flavor. I know there are several tools out there that already do this, but I thought it would be a good way to practice my regex skills. However, I ran into an issue.

After retrieving a list "block" containing both ordered and unordered lists, I need to split the block into its individual list items. An item can contain indented sub-items, so a single item may span multiple lines, like so:

1. text
2. text
  1. text
  2. text
* text
* text
  - text
  + text
1. text
  * text
  1. text
* text
  1. text
  * text

I have created this regex to separate out the first-level list elements, each including the Markdown of its sub-list.

/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-]))/gm

Which should achieve these matches...

What I am trying to achieve

1. text

2. text
  1. text
  2. text

* text

* text
  - text
  + text

1. text
  * text
  1. text

* text
  1. text
  * text

However, this matches every list element except the last one, because the positive lookahead only matches list elements that are followed by another list element. This results in the following:

What actually happens when using this RegEx

1. text

2. text
  1. text
  2. text

* text

* text
  - text
  + text

1. text
  * text
  1. text

As you can see, the last list element is missing.

My thought was to match only list elements that are followed by another list element OR match list elements that are followed by an end of string, like this.

/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-])|$)/gm

This doesn't work because I am using the multiline flag, which makes $ match at the end of every line. I can't use \Z either, since I'm working in JavaScript.

Does anyone know of another way to tackle this problem? Regex101: see this page for the example

  • Yeah, use $(?![^]). And escape the dot, since you want to match a literal dot after a digit. Commented Dec 28, 2019 at 22:13

1 Answer

If you want to match the very end of the string in a JavaScript regex that has the m flag, you can use a pattern like $(?![^]) or $(?![\s\S]). Your pattern will then look like this:

/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-])|$(?![^]))/gm
                                       ^^^^^^^^ 

See the regex demo. The $(?![^]) (or $(?![\s\S])) matches the end of a line that has no other char right after it (so, the very end of the string).
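
To make the difference concrete, here is a minimal sketch that runs both variants against a shortened sample block. The sample string and the variable names are assumptions made for illustration, and the dot after \d is escaped as suggested in the comment on the question.

```javascript
// Shortened sample block (an assumption for illustration); note there is no trailing newline.
const sample = ["1. text", "  * text", "* text", "  1. text"].join("\n");

// With the m flag, a bare $ also matches before every line break,
// so the lazy [^]*? stops at the end of the first line of each item.
const plainDollar = /^(?:\d\.|[*+-]) [^]*?(?=^(?:\d\.|[*+-])|$)/gm;

// $(?![^]) only succeeds where no character follows,
// i.e. at the very end of the string.
const endOfString = /^(?:\d\.|[*+-]) [^]*?(?=^(?:\d\.|[*+-])|$(?![^]))/gm;

console.log(sample.match(plainDollar));
// → ["1. text", "* text"]  (sub-items are lost)

console.log(sample.match(endOfString));
// → ["1. text\n  * text\n", "* text\n  1. text"]  (last item included)
```

Note that every match except the last one keeps its trailing line break, because the lazy part runs up to the start of the next item.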

However, you should consider unrolling the lazy [^]*? part to make the pattern more efficient.

Here is an example:

/^(?:\d+\.|[*+-]) .*(?:\r?\n(?!(?:\d+\.|[*+-]) ).*)*/gm

See the regex demo

Details

  • ^ - start of a line
  • (?:\d+\.|[*+-]) - one or more digits followed by a dot, or a single * / + / - character
  • (a literal space character) - a space
  • .* - zero or more characters other than line break characters, as many as possible
  • (?:\r?\n(?!(?:\d+\.|[*+-]) ).*)* - zero or more sequences of a CRLF or LF line ending that is not followed by a list marker (one or more digits and a dot, or * / + / -) plus a space, and then the rest of that line.
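
As a rough sketch of how the unrolled pattern could be used, the snippet below splits a list block into its top-level items. The helper name splitTopLevelItems and the sample input are hypothetical and only serve the illustration.

```javascript
// Unrolled pattern from above: one whole line that starts with a list marker,
// then any following lines that do not start with a top-level list marker.
const topLevelItem = /^(?:\d+\.|[*+-]) .*(?:\r?\n(?!(?:\d+\.|[*+-]) ).*)*/gm;

// splitTopLevelItems is a hypothetical helper name, not part of any library.
function splitTopLevelItems(listBlock) {
  // match() with the g flag returns all matches, or null when there are none.
  return listBlock.match(topLevelItem) || [];
}

// Sample input (an assumption): the first eight lines of the list in the question.
const listBlock = [
  "1. text",
  "2. text",
  "  1. text",
  "  2. text",
  "* text",
  "* text",
  "  - text",
  "  + text",
].join("\n");

const items = splitTopLevelItems(listBlock);
console.log(items.length); // 4
console.log(items[1]);     // "2. text", followed by its two indented sub-items
```

Unlike the lookahead version, these matches do not include a trailing line break, since each match ends at the end of its last line.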

4 Comments

Learnt something new today, thanks. I'm curious whether you think there's any performance difference between searching for ^ in multi-line mode and searching for \n without multi-line mode?
@Nick Any group with an alternation operator involves backtracking and the more to the left of the longer pattern it is located, the worse it is from the performance point of view. However, real performance difference is better calculated in the target environment.
Thank you so much, this solution works perfectly! The only thing I altered was adding a space after "^(?:\d+\.|[*+-])", as it is required by the Markdown standard. I was also wondering if you could explain why the second expression you offered is more efficient, and whether you know of any sources where I could learn more about regex efficiency.
@Laika The lazy [^]*? pattern is very inefficient when working with long strings as it has to expand each time the subsequent subpatterns fail to find a match. It only expands by one char each time, and it may lead to time out issues (the pattern might turn out too slow). When you use .* to match whole lines, matching is faster assuming your input does not consist of just millions of lines.
