\z PCRE equivalent in JavaScript regex to match all markdown list items

Question

I'm trying to parse a Markdown-style list into HTML. I am using several regular expressions for this, all written for the JavaScript regex flavor. I know there are tools that already do this, but I thought it would be good regex practice. I ran into an issue, however.

After retrieving a list "block" containing both ordered and unordered lists, I need to split the block into separate list items. Items can contain indented sub-items, so a single item may span multiple lines, like so:

1. text
2. text
  1. text
  2. text
* text
* text
  - text
  + text
1. text
  * text
  1. text
* text
  1. text
  * text

I created this regex to separate out the first-level list items, each one including the Markdown of its sub-list.

/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-]))/gm

Which should achieve these matches...

What I am trying to achieve

1. text

2. text
  1. text
  2. text

* text

* text
  - text
  + text

1. text
  * text
  1. text

* text
  1. text
  * text

However, this matches every list item except the last one, because the positive lookahead only accepts an item that is followed by another list item. This results in the following...

What actually happens when using this RegEx

1. text

2. text
  1. text
  2. text

* text

* text
  - text
  + text

1. text
  * text
  1. text

As you can see, the last list element is missing.

My thought was to match only list elements that are followed by another list element OR match list elements that are followed by an end of string, like this.

/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-])|$)/gm

This doesn't work because I am using the multiline flag, so $ also matches at the end of every line. I can't use \z either, since I'm working in JavaScript.
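
A minimal reproduction of both attempts, assuming a shortened four-line list (not the full block above) so the results are easy to read. The variable names are made up for illustration.

```javascript
// Hypothetical shortened input: two top-level items, each with one sub-item.
const sample = "1. a\n  * b\n* c\n  1. d";

// First attempt: the lookahead demands a following top-level item,
// so the last item can never match.
const firstAttempt = sample.match(/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-]))/gm);

// Second attempt: with the m flag, $ also matches at the end of every line,
// so the lazy [^]*? stops at the first line break and the sub-items are lost.
const secondAttempt = sample.match(/^(?:\d.|[*+-]) [^]*?(?=^(?:\d.|[*+-])|$)/gm);

console.log(firstAttempt);  // [ '1. a\n  * b\n' ]
console.log(secondAttempt); // [ '1. a', '* c' ]
```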

Does anyone know of another way to tackle this problem? Regex101: see this page for the example

Yeah, use $(?![^]). And escape the dot, since you want to match a literal dot after a digit.
– Wiktor Stribiżew, Commented Dec 28, 2019 at 22:13

Wiktor Stribiżew · Accepted Answer · 2019-12-28 22:41:27Z


If you want to match the very end of the string in a JavaScript regex that has the m flag, you can use a pattern like $(?![^]) or $(?![\s\S]). Your pattern will look like

/^(?:\d+\.|[*+-]) [^]*?(?=^(?:\d+\.|[*+-])|$(?![^]))/gm
                                           ^^^^^^^^

See the regex demo. The $(?![^]) (or $(?![\s\S])) matches the end of a line that has no other char right after it (so, the very end of the string).
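
As a quick check of that claim, here is a small sketch that lists the positions where each anchor matches. The input string and the helper name matchPositions are made up for illustration.

```javascript
// Hypothetical input with three lines and no trailing line break.
const text = "a\nb\nc";

// Collect the index of every zero-width match of a global pattern.
function matchPositions(re, input) {
  return Array.from(input.matchAll(re), (m) => m.index);
}

// With the m flag, $ matches before every line break and at the end of the string.
const lineEnds = matchPositions(/$/gm, text);

// (?![^]) rejects any position that still has a character after it,
// including a line break, so only the very end of the string is left.
const stringEnd = matchPositions(/$(?![^])/gm, text);

console.log(lineEnds);  // [ 1, 3, 5 ]
console.log(stringEnd); // [ 5 ]
```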

However, you should think of unrolling the lazy dot part to make the pattern work more efficiently.

Here is an example:

/^(?:\d+\.|[*+-]) .*(?:\r?\n(?!(?:\d+\.|[*+-]) ).*)*/gm

See the regex demo

Details

^ - start of a line
(?:\d+\.|[*+-]) - 1+ digits and a dot or a * / + / -
(a single space) - a literal space
.* - any 0 or more chars other than line break chars, as many as possible
(?:\r?\n(?!(?:\d+\.|[*+-]) ).*)* - 0 or more sequences of a CRLF or LF line ending that is not followed by a list marker (1+ digits and a dot, or a * / + / -) plus a space, and then the rest of that line.
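
To see the unrolled pattern at work, the sketch below runs it against the list block from the question. The variable names are illustrative only.

```javascript
// The list block from the question, joined with LF line endings.
const block = [
  "1. text",
  "2. text",
  "  1. text",
  "  2. text",
  "* text",
  "* text",
  "  - text",
  "  + text",
  "1. text",
  "  * text",
  "  1. text",
  "* text",
  "  1. text",
  "  * text",
].join("\n");

// Unrolled pattern: a marker line, then every following line
// that does not itself start with a top-level marker and a space.
const itemRe = /^(?:\d+\.|[*+-]) .*(?:\r?\n(?!(?:\d+\.|[*+-]) ).*)*/gm;

const items = block.match(itemRe) || [];

console.log(items.length);            // 6
console.log(items[items.length - 1]); // the last item, including its two sub-items
```

Unlike the lazy version, each match here ends at the last character of the item, so no trailing line break needs to be trimmed.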

edited Dec 28, 2019 at 22:41

answered Dec 28, 2019 at 22:21


4 Comments

Nick Over a year ago

Learnt something new today, thanks. I'm curious as to whether you think there's any performance difference between searching for ^ in multi-line mode and searching for \n without multi-line mode?

Wiktor Stribiżew Over a year ago

@Nick Any group with an alternation operator involves backtracking and the more to the left of the longer pattern it is located, the worse it is from the performance point of view. However, real performance difference is better calculated in the target environment.

Laika Over a year ago

Thank you so much, this solution works perfectly! The only thing I altered was adding a space after "^(?:\d+\.|[*+-])" as it is necessary for the Markdown standard. I was also wondering if you could explain as to why the second expression you offered is more efficient, and if you have any sources I could learn more about RegEx efficiency.

Wiktor Stribiżew Over a year ago

@Laika The lazy [^]*? pattern is very inefficient when working with long strings as it has to expand each time the subsequent subpatterns fail to find a match. It only expands by one char each time, and it may lead to time out issues (the pattern might turn out too slow). When you use .* to match whole lines, matching is faster assuming your input does not consist of just millions of lines.
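
One way to explore this yourself is a rough comparison like the sketch below. It assumes a short made-up input repeated 2000 times and a hypothetical timeMatch helper. The timings depend on the engine and the input, so treat them as an indication rather than a benchmark.

```javascript
// Hypothetical small input; repeated below to get a longer string for timing.
const small = "1. a\n  * b\n* c\n  1. d";

// Lazy pattern (dot escaped) and the unrolled pattern from the answer.
const lazyRe = /^(?:\d+\.|[*+-]) [^]*?(?=^(?:\d+\.|[*+-])|$(?![^]))/gm;
const unrolledRe = /^(?:\d+\.|[*+-]) .*(?:\r?\n(?!(?:\d+\.|[*+-]) ).*)*/gm;

// The lazy pattern keeps the trailing line break of each item, so trim it
// before comparing the two results.
const lazyItems = (small.match(lazyRe) || []).map((s) => s.trimEnd());
const unrolledItems = small.match(unrolledRe) || [];

// Rough timing helper: counts matches and measures elapsed milliseconds.
function timeMatch(re, input) {
  const start = Date.now();
  const count = (input.match(re) || []).length;
  return { count, ms: Date.now() - start };
}

const big = Array(2000).fill(small).join("\n");
const lazyResult = timeMatch(lazyRe, big);
const unrolledResult = timeMatch(unrolledRe, big);

console.log(lazyItems, unrolledItems);
console.log(lazyResult, unrolledResult);
```

Both patterns should find the same items on this input. Any difference in speed grows with the length of each item, since that is the span the lazy quantifier has to walk one character at a time.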
