Could anyone explain this regex

Question

I just need someone to correct my understanding of this regex, which is a stopgap for matching HTML tags.

< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >

My understanding:

< - Match the tag open symbol
(?: - I can't understand what's going on here. What do these symbols mean?
"[^"]*" ['"]* - An arbitrary string in double quotes. Is something else going on here?
'[^']*'['"]* - Some string in single quotes
[^'">] - Any character other than ' " >.

So it's a '<' symbol, followed by a string in double quotes, or a string in single quotes, or any other string which doesn't contain ' " or >, repeated one or more times, followed by a '>'.
That's the best I could make out.

I think your understanding looks sound. But as with all things regex, you should get yourself a 'regular expressions tester' and check a few scenarios to be sure (I use a Firefox plugin that does the job).
– Stewart Ritchie, Commented Oct 4, 2012 at 7:39

Community · Accepted Answer · 2017-05-23 12:18:49Z

5

<       # literally < followed by a space
(       # the bracket opens a subpattern; it's necessary as a boundary for
        # the | later on
?:      # makes the just-opened subpattern non-capturing (so you can't access it
        # as a separate match later)
"       # literally "
[^"]    # any character but " (this is called a character class)
*       # arbitrarily many of those (as many as possible)
"       # literally "
['"]    # either ' or "
*       # arbitrarily many of those (and possibly alternating! it doesn't have
        # to be the same character for the whole string)
|       # OR
'       # literally '
[^']    # any character but ' (again a character class)
*       # arbitrarily many of those (as many as possible)
'       # literally '
['"]*   # as above
|       # OR
[^'">]  # any character but ', ", >
)       # closes the subpattern
+       # arbitrarily many repetitions but at least one
>       # literally >, closing the tag

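Note that all the spaces in the regex are treated just like any other character: each one matches exactly one space (unless the pattern is compiled in a free-spacing mode such as the x flag, which ignores them).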
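To see what the literal spaces do in practice, here is a small sketch in Python. Python's `re` module is only my choice for illustration (the question does not name a regex flavor), and the names `PATTERN`, `literal` and `verbose` are made up for this example.

```python
import re

# The regex from the question, spaces included.
PATTERN = r'''< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >'''

# Default mode: every space in the pattern must match a space in the input.
literal = re.compile(PATTERN)

# re.VERBOSE: unescaped whitespace outside character classes is ignored.
verbose = re.compile(PATTERN, re.VERBOSE)

# Spaces after < and before > are present, so the literal form matches.
print(bool(literal.fullmatch('< a >')))

# No spaces at all, so the literal form finds nothing.
print(bool(literal.search('<a>')))

# Without the literal spaces, an ordinary tag matches, and a > inside
# a quoted attribute value does not end the match early.
print(bool(verbose.fullmatch('<a href="x>y">')))
```

This suggests the pattern may have been written for a free-spacing mode in the first place, but that is a guess on my part.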

Also take special note of the ^ at the beginning of character classes. It's not treated as a separate character, but inverts the whole character class.
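A quick way to convince yourself of the role of ^ is to move it around inside the brackets. This is a minimal sketch using Python's `re` module (my assumption, any flavor with the same class syntax behaves alike); the sample `text` is invented.

```python
import re

text = 'say "hi" ^ now'

# ^ first inside [...]: negation. Matches runs of anything except a double quote.
not_quotes = re.findall(r'[^"]+', text)

# ^ anywhere else inside [...]: an ordinary character. Matches a quote or a caret.
quote_or_caret = re.findall(r'["^]', text)

print(not_quotes)
print(quote_or_caret)
```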

I may also (obligatorily) add that regular expressions are not appropriate for parsing HTML.
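For comparison, here is roughly what the parser route looks like. This is only a sketch, assuming Python and its standard-library `html.parser` module; the class name `TagCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records each start tag with its attributes as the parser meets it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already unquoted.
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed('<p class="x"><a href="a>b" title=\'t\'>link</a></p>')
collector.close()
print(collector.tags)
```

The parser deals with quoting and with a > inside an attribute value on its own, which is exactly the part the regex is struggling to approximate.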

edited May 23, 2017 at 12:18

CommunityBot

answered Oct 4, 2012 at 7:38

Martin Ender

6 Comments

Samhan Salahuddin Over a year ago

Thanks for the great answer. Non-capturing subpattern... googling it now.

Martin Ender Over a year ago

That's probably a good idea. It's quite a powerful concept when you want to extract data from within larger structures, or when you need to replace those structures but keep the data within (using regex).

Samhan Salahuddin Over a year ago

There's just one more thing I can't understand... The pattern "[^"]*" ['"]* should match "some random stuff here", but why is there ['"]* at the end? Does the * apply to the whole expression or just to the character set ['"]?

Martin Ender Over a year ago

It only applies to the character class ['"]. I'm not really sure what its purpose is, because these characters are already taken care of by the third part of the alternation (the part after the second |). Also note that this regex does NOT match self-closing tags, because they have no space in front of the closing >.

Samhan Salahuddin Over a year ago

I found this regex in one of the answers to the post you linked in your answer above. Since I'm just beginning with regex, I needed clarification on it. Since we're both stumped, I guess it must be there to handle some obscure corner case we can't think of.
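On the question in the comments above of what the trailing * binds to, a short check settles it. This is a sketch in Python (my choice of flavor), with the literal space of the original piece left out to keep the test strings simple; the name `piece` is made up.

```python
import re

# One double-quoted string followed by any mix of stray quote characters.
piece = re.compile(r'''"[^"]*"['"]*''')

# A quoted string followed by a stray ' and a stray ": the class absorbs them.
print(bool(piece.fullmatch('"abc"\'"')))

# Two quoted strings in a row: this would only match if * repeated the
# whole expression rather than just the character class.
print(bool(piece.fullmatch('"abc""def"')))
```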

Blender · Accepted Answer · 2012-10-04 07:40:42Z

2

Split it up by the |s, which denote ors:

<
  (?:
    "[^"]*" ['"]* |
    '[^']*'['"]* |
    [^'">]
  )+
>

(?: denotes a non-capturing group. The insides of that group match these things (in this order):

"stuff"
'stuff'
asd=

In effect, this is a regex that attempts to match HTML tags with attributes.
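Here is a hedged sketch of the same layout as runnable Python. It assumes `re.VERBOSE`, which ignores the whitespace between the pieces, so it is not identical to the one-line original where spaces are literal; the name `tag` and the sample string are invented.

```python
import re

# Free-spacing layout so each alternative sits on its own line.
# Assumption: re.VERBOSE drops the literal spaces of the one-line form.
tag = re.compile(r'''
    <
      (?:
        "[^"]*" ['"]* |    # a double-quoted string
        '[^']*' ['"]* |    # a single-quoted string
        [^'">]             # any single character that is not a quote or >
      )+
    >
''', re.VERBOSE)

m = tag.search('see <img alt="a > b" src=\'x.png\'> here')
print(m.group(0))

# (?: ... ) groups without capturing, so the pattern has no numbered groups.
print(tag.groups)
```

Because the quoted alternatives are tried first, the > inside `alt="a > b"` is consumed as part of the string instead of ending the tag.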

answered Oct 4, 2012 at 7:40

Blender

Toto · Accepted Answer · 2012-10-04 08:07:32Z

Here is the result of YAPE::Regex::Explain

(?-imsx:< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <                        '< '
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
     "                       ' "'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '" '
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
     '                       ' \''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^'">]                   any character except: ''', '"', '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
   >                       ' >'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
