
I am trying to write a JavaScript regex that matches only NASM-style comments in HTML. For example, it should match "; interrupt" in "INT 21h ; interrupt".

You may know that /;.*/ can't be the answer, because there can be an HTML entity before the comment. I thought /(?:[^&]|&.+;)*(;.*)$/ would work, but I found it has two problems:

  1. "      ; hello world".match(/(?:[^&]|&.+;)*(;.*)$/) is an array, ["      ; hello world", "; hello world"]. I don't want an array.
  2. "      ; hello world; a message".match(/(?:[^&]|&.+;)*(;.*)$/) is ["      ; hello world; a message", "; a message"]; even worse, the second element is not the whole comment.

Question:

  1. Why is the (?:) block returned?
  2. Why "; a message", not "; hello world; a message"?
  3. What's the right regex I can use?

2 Answers

1

1) The (?:) group is not being returned. For a regex without the g flag, the .match() method returns an array (or null when nothing matches): the first element is the whole match, and the following elements (if any) are the captured groups. In this case, you have one capturing group, (;.*), so the array contains two items.
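To make the array layout concrete, here is a minimal sketch using the pattern from the question. The variable names are only illustrative.

```javascript
// Pattern copied from the question; variable names are illustrative.
const questionPattern = /(?:[^&]|&.+;)*(;.*)$/;
const sampleLine = "      ; hello world";

const matchResult = sampleLine.match(questionPattern);
// matchResult[0] is the whole match, matchResult[1] is what (;.*) captured.
// The non-capturing (?:...) group does not get a slot of its own.
const commentText = matchResult === null ? null : matchResult[1];
console.log(commentText);

// A line without any ";" yields null rather than an array.
const noMatch = "mov ax, 1".match(questionPattern);
console.log(noMatch);
```

So if only the comment text is wanted, index 1 of the result is the piece to use, after checking for null.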

2) Because of the first half of your regex:

(?:[^&]|&.+;)*

This is not a good idea! It will match just about anything, including newlines. In fact, the only thing it won't match is a "&" that is not followed by a ";" later on the same line. Because * is greedy, it first consumes the whole line and then backtracks only as far as needed for (;.*)$ to match, which is at the last ";" in each of your lines.
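A quick way to see this is to run the first half on its own. The sketch below uses the second example string from the question; the variable names are illustrative.

```javascript
// First half of the question's regex, tested in isolation.
const greedyPrefix = /(?:[^&]|&.+;)*/;
const twoSemicolons = "      ; hello world; a message";

// With nothing after it, the prefix swallows the entire string, semicolons included.
const prefixAlone = twoSemicolons.match(greedyPrefix)[0];
console.log(prefixAlone === twoSemicolons);

// [^&] also matches a newline, so the prefix runs across lines.
const withNewline = "a\nb".match(greedyPrefix)[0];
console.log(JSON.stringify(withNewline));

// Once (;.*)$ is appended, the prefix gives back only enough to expose the LAST ";".
const capturedTail = twoSemicolons.match(/(?:[^&]|&.+;)*(;.*)$/)[1];
console.log(capturedTail);
```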

3) I'm not at all familiar with NASM-style comments in HTML, so I'd need to see a more extensive list of what you want matched/not matched in order to confidently give a good answer here.

But here's something I've thrown together very quickly, to at least solve the two examples you gave above:

.*&.*?;\s(;.*)$

1 Comment

Thanks! While your example cannot be adopted for the general case, the point about newlines was very helpful.
0

ad 1.) the ?: block is not returned. instead, the complete match is returned in the first array element, followed by the captured group. this behavior follows the specification for non-global matching (i.e. without the g flag).

ad 2.) the first part of your regex, (?:[^&]|&.+;)*, matches too much; in fact it would match the complete line if you dropped the second portion. in plain english, you asked to match either any character other than &, or an & followed by as many characters as possible followed by a ;, and you asked the engine to repeat this as often as possible. with (;.*)$ appended, the engine backs off only as far as the last ; in the test string (if there is one).
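The effect of the "as many characters as possible" part can be checked on the entity alternative alone. This is a small sketch with a made-up input line; the bounded character class in the second pattern is one possible way to stop at the first ";" and is shown only for contrast.

```javascript
// Made-up sample line with one entity and two semicolons after it.
const entityLine = "&nbsp;INT 21h ; first ; second";

// &.+; is greedy: it runs from the "&" to the LAST ";" on the line.
const greedyEntity = entityLine.match(/&.+;/)[0];
console.log(greedyEntity);

// Excluding ";" and whitespace from the entity body stops at the first ";".
const boundedEntity = entityLine.match(/&[^;\s]+;/)[0];
console.log(boundedEntity);
```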

ad 3.) try

(?:[^&;]*(&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$

it fixes the broken entity matching and returns the longest ;-initial suffix.

tested with pagecolumn regex tester (i'm not affiliated with this website).

1 Comment

Thanks! After changing your example to (?:[^&;]*(?:&[^\s;]+;[^&;]*)*)(;.*)$ to handle numeric entities such as &#44033; (각), I decided to use (?:[^&;]|&[^;\s]+;)*(;.*)$.
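For reference, the regex from this comment could be wrapped as follows. This is only a sketch: the helper name extractNasmComment is hypothetical, and the leading ^ is an assumption added here (it is not in the regex above) so that matching cannot start in the middle of an entity. It also assumes the input is a single line.

```javascript
// Regex from the comment above, plus a leading ^ (an assumption added in this sketch).
const NASM_COMMENT = /^(?:[^&;]|&[^;\s]+;)*(;.*)$/;

// Hypothetical helper: returns the comment text, or null if the line has none.
function extractNasmComment(htmlLine) {
  const m = NASM_COMMENT.exec(htmlLine);
  return m === null ? null : m[1];
}

console.log(extractNasmComment("INT 21h ; interrupt"));
console.log(extractNasmComment("&nbsp;&nbsp;; hello world; a message"));
console.log(extractNasmComment("&#44033;&nbsp;; hello"));
// The ";" here only closes the entity, so no comment is reported.
console.log(extractNasmComment("mov ax, 1 &amp; 2"));
```

Without the ^, the engine is free to retry from a later position such as the "a" in "&amp;", and would then report "; 2" as a comment for the last line.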
