1

I just need someone to correct my understanding of this regex, which is a stopgap arrangement for matching HTML tags.

< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >

My understanding:

  • < - Match the tag open symbol
  • (?: - Can't understand what's going on here. What do these symbols mean?
  • "[^"]*" ['"]* - An arbitrary string in double quotes. Is something else going on here?
  • '[^']*'['"]* - Some string in single quotes
  • [^'">] - Any character other than ', " or >.

So it's a '<' symbol, followed by a string in double quotes, a string in single quotes, or any other character that isn't ', " or >, repeated one or more times, followed by a '>'.
That's the best I could make out.

I think your understanding looks sound. But as with all things regex, you should get yourself a regular expression tester and check a few scenarios to be sure (I use a Firefox plugin that does the job). Commented Oct 4, 2012 at 7:39

3 Answers

5
<       # literally just an opening tag followed by a space
(       # the bracket opens a subpattern, it's necessary as a boundary for
        # the | later on
?:      # makes the just opened subpattern non-capturing (so you can't access it
        # as a separate match later)
"       # literally "
[^"]    # any character but " (this is called a character class)
*       # arbitrarily many of those (as much as possible)
"       # literally "
['"]    # either ' or "
*       # arbitrarily many of those (and possibly alternating! it doesn't have
        # to be the same character for the whole string)
|       # OR
'       # literally '
[^']    # any character but ' (this is called a character class)
*       # arbitrarily many of those (as much as possible)
'       # literally '
['"]*   # as above
|       # OR
[^'">]  # any character but ', ", >
)       # closes the subpattern
+       # arbitrarily many repetitions but at least once
>       # closing tag

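Note that all the spaces in the regex are treated just like any other character: each one matches exactly one space (unless the regex is compiled in extended, free-spacing mode, i.e. with the x flag, in which case they are ignored).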
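To make the point about spaces concrete, here is a minimal sketch in Python (my choice for illustration; the question does not name a language). It compiles the pattern from the question twice, once as-is and once with `re.VERBOSE`, which is Python's free-spacing flag. The sample strings `tag` and `spaced` are made up for this example.

```python
import re

# The pattern exactly as written in the question, spaces included.
PATTERN = r"""< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >"""

literal = re.compile(PATTERN)              # every space must appear in the input
verbose = re.compile(PATTERN, re.VERBOSE)  # spaces in the pattern are ignored

tag = '<a href="x>y" title=\'t\'>'  # ordinary tag, with a > inside a quoted value
spaced = '< a >'                    # artificial input with the literal spaces

print(bool(literal.fullmatch(tag)))     # False: no space after <
print(bool(literal.fullmatch(spaced)))  # True
print(bool(verbose.fullmatch(tag)))     # True
```

So the pattern only behaves like a tag matcher when the whitespace is ignored, which suggests (though I can't confirm it) that it was originally written for free-spacing mode.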

Also take special note of the ^ at the beginning of character classes. It's not treated as a separate character, but inverts the whole character class.
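A tiny Python sketch of that difference, using a made-up sample string: the same characters inside the brackets mean opposite things depending on whether ^ comes first.

```python
import re

# [^'">] matches exactly one character that is none of ', " or >.
negated = re.compile(r"""[^'">]""")
# With ^ moved away from the first position it is just an ordinary member.
plain = re.compile(r"""['">^]""")

sample = """a'"^>"""
print(negated.findall(sample))  # ['a', '^']
print(plain.findall(sample))    # ["'", '"', '^', '>']
```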

I should also (obligatorily) add that regular expressions are not an appropriate tool for parsing HTML.
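For comparison, here is a rough sketch of collecting tags with a real parser instead, assuming Python and its standard-library `html.parser` module. The class name `TagCollector` and the sample markup are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed('<a href="x>y" title=\'t\'>link</a><br/>')
collector.close()
print(collector.tags)
```

The parser deals with quoting and self-closing tags itself, so there is no need to encode those rules in a pattern.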


6 Comments

Thanks for the great answer. Non-capturing subpattern... googling it now.
That's probably a good idea. It's quite a powerful concept when you want to extract data from within larger structures, or when you need to replace those structures but keep the data within (using regex).
There's just one more thing I can't understand. The pattern "[^"]*" ['"]* should match "some random stuff here", but why is there ['"]* at the end? Does the * apply to the whole expression or just to the character set ['"]?
It only applies to the character class ['"]. I'm not really sure what its purpose is. As far as I can tell, its only effect is to swallow stray quote characters right after a quoted string, since the third part of the alternation (the part after the second |) excludes both quote characters. Also note that this regex does NOT match self-closing tags such as <br />, because they have no space directly in front of the closing >.
I found this regex in one of the answers to the post you linked in your answer above. Since I'm just beginning with regex, I needed clarification on it. Since we're both stumped, I guess it must be there to handle some obscure corner case we can't think of.
2

Split it up at the | characters, which denote alternation (OR):

<
  (?:
    "[^"]*" ['"]* |
    '[^']*'['"]* |
    [^'">]
  )+
>

(?: denotes a non-capturing group. The insides of that group match these things (in this order):

  1. "stuff"
  2. 'stuff'
  3. any single character other than ', " or > (repeated by the +, this covers runs such as asd=)

In effect, this is a regex that attempts to match HTML tags with attributes.
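A short Python sketch of what the non-capturing group means in practice. It uses a whitespace-free form of the pattern; the second variant, with an extra capturing group around the tag body, is my own addition for comparison and is not part of the original regex.

```python
import re

# Whitespace-free form of the pattern from the question.
non_capturing = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")
# Same pattern with a capturing group wrapped around the tag body.
capturing = re.compile(r"""<((?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+)>""")

html = """<p class="a>b">text</p>"""

m = non_capturing.search(html)
print(m.group(0))  # <p class="a>b">
print(m.groups())  # () because (?: ... ) creates no numbered group

m = capturing.search(html)
print(m.group(1))  # p class="a>b"
```

Note how the > inside the quoted attribute value does not end the match: the first alternative consumes the whole quoted string before the closing > is tried.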


0

Here is the result of YAPE::Regex::Explain

(?-imsx:< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <                        '< '
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
     "                       ' "'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '" '
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
     '                       ' \''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^'">]                   any character except: ''', '"', '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
   >                       ' >'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
