Seeking regex for HTML attributes meeting specific criteria

Question

I'm trying to remove single quotes and double quotes around HTML attributes with the following restrictions:

1) The quoted material MUST exist within a tag <> (e.g., <mytag b="yes"> becomes <mytag b=yes>, but <script>var b="yes"</script> stays intact).

2) The quoted material may not have a space character nor an equal sign (e.g., <mytag b="no no" c="no=no"> stays intact).

3) The quoted material may not be in an href or src definition.

4) The regex should be good for UTF-8 (duh!)

Someone posted a virtually identical question here that received an answer that works within the confines of the question:

Removing single and double quote from html attributes with no white spaces on all attributes except href and src

So:

((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')

...works, except it fails to isolate text within tags (i.e., text in between opening and closing tags is erroneously edited, e.g. <mytag>"The quotes are stripped out here!"</mytag>), and it doesn't check for equal signs (=) within the quoted text (e.g. <mytag b="OhNo=TheRoutineRemovedTheQuotesBecauseItDidNotCheckForAnEqualSignInTheQuotedText!">).
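
Both failure modes can be reproduced with preg_match. This is only a sketch: the pattern is the one quoted above, the ~ delimiters are an assumption (the post shows the bare regex), and the two sample strings come from the restrictions listed earlier.

```php
<?php
// The pattern quoted above, wrapped in ~ delimiters (assumed; the post shows none).
$old = '~((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')~';

// Failure 1: a match is found inside <script>, outside any tag's attribute list.
preg_match($old, '<script>var b="yes"</script>', $inScript);
echo $inScript[0], "\n"; // b="yes"

// Failure 2: a quoted value containing = is still captured (Group 5).
preg_match($old, '<mytag b="no=no">', $withEquals);
echo $withEquals[5], "\n"; // no=no
```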

Bonus points: I wish to integrate this into this PHP HTML minification routine, which works well except for the edits described above:

https://gist.github.com/tovic/d7b310dea3b33e4732c0

His solution pairs the patterns and replacement parameters in two arrays, as you'll see, so I need to conform to his syntax, which uses # as the regex delimiter, etc.

Your solution gets my upvote!

This seems like a bad idea. You should try using an HTML parser instead.
– Laurel, Commented May 2, 2016 at 4:54

Wiktor Stribiżew · Accepted Answer · 2016-05-02 14:44:37Z

1

Here is a pure regex way of getting rid of the quotes:

~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|'[^']*'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|'([^\s'=]*)')~u

See the regex demo; replace each match with '$1'.

IDEONE demo:

// Pattern with ~ delimiters and the u (UTF-8) flag; single quotes are escaped for the PHP string.
$re = '~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';
$str = "<mytag src=\"src_here\" b=\"yes\" href=\"href_here\"> becomes <mytag src=\"src_here\" b=yes href=\"href_here\">\n<mytag b='yes'> becomes <mytag b=yes>\nbut <script>var b=\"yes\"</script> stays intact\n<mytag b=\"no no\" c=\"no=no\"> stays intact\n<tag href=\"something\"> text <tag src=\"dddd\"> intact";
// Replace each match with Group 1, the attribute value without its quotes.
$subst = '$1';
$result = preg_replace($re, $subst, $str);
echo $result;
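
The question asks for a form that fits a routine pairing patterns and replacements in two arrays with # delimiters. As a sketch under that assumption (the array names, the whitespace rule, and the sample input are illustrative and not taken from the gist), the same pattern could be dropped in like this:

```php
<?php
// Hypothetical pattern/replacement arrays in the style the question describes.
$patterns = [
    // Illustrative rule: collapse runs of whitespace to one space.
    '#\s{2,}#',
    // The pattern from this answer, rewritten with # delimiters.
    '#(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')#u',
];
$replacements = [
    ' ',
    '$1',
];

$html = '<a class="btn"  href="/home">Go</a>';
// preg_replace applies the patterns in order, pairing each with the
// replacement at the same index.
$minified = preg_replace($patterns, $replacements, $html);
echo $minified, "\n"; // <a class=btn href="/home">Go</a>
```

The pattern contains no # character, so switching delimiters needs no extra escaping.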

Pattern details:

(?:<\w+|(?!^)\G) - matches the start of a tag, i.e. < followed by the tag name (<\w+), or (|) the end of the previous successful match ((?!^)\G)
(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))* - matches zero or more href and src attributes so that they are skipped and later dropped from the match with \K
\s+ - matches 1+ whitespace characters
(?!(?:href|src)=)\w+= - 1+ alphanumeric or underscore characters (\w+) followed by =, provided they are not href= or src= (see the (?!(?:href|src)=) negative lookahead)
\K - discards the text matched so far from the returned match, so only the quoted value is replaced
(?|"([^\s"=]*)"|\'([^\s\'=]*)\') - a branch reset group capturing into Group 1 either:
- "([^\s"=]*)" - a double-quoted value with no =, " or whitespace
- | - or
- \'([^\s\'=]*)\' - a single-quoted value with no =, ' or whitespace
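
To see the three less common constructs in isolation, here is a small sketch with simplified patterns (cut-down illustrations, not the full pattern above; the sample strings are made up):

```php
<?php
// \K: text matched before it is left out of the reported match,
// so only the quoted value gets replaced.
$k = preg_replace('~\w+=\K"(\w+)"~', '$1', '<a b="yes">');
echo $k, "\n"; // <a b=yes>

// \G: continue only where the previous match ended, so the chain stops at >.
$g = preg_replace('~(?:<\w+|(?!^)\G)\s+\w+=\K"(\w+)"~', '$1', '<a b="1" c="2"> d="3"');
echo $g, "\n"; // <a b=1 c=2> d="3"

// (?|...): both alternatives store their capture in Group 1.
preg_match('~(?|"([^"]*)"|\'([^\']*)\')~', "x='one'", $m);
echo $m[1], "\n"; // one
```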

edited May 2, 2016 at 14:44

answered May 2, 2016 at 13:30

Wiktor Stribiżew


18 Comments

Tom Over a year ago

Thanks! This works on my initial testing of it. And I appreciate your clear explanation of the components of the regex!

Tom Over a year ago

I replied too soon...it seems that once "src" or "href" is encountered in a tag, all subsequent elements in that tag are ignored. Try putting

<img test="xxx" href="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100"> <img test="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100">

into your regex demo and you'll see.

Wiktor Stribiżew Over a year ago

Tom Over a year ago

Thanks! But I discovered a wrinkle: if an attribute name has a hyphen in it, the regex fails to recognize it as an attribute, and it and the subsequent attribute definitions are ignored; curiously, it's fine with underscores (_). Try:

<img test-time="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100"> <img test_time="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100">

Wiktor Stribiżew Over a year ago

It is rather unwieldy, but this is the price you pay for such complex requirements:

(?:<[a-z][\w:.-]*|(?!^)\G)(?:\s+(?:(?:src|href)=(?:"[^"]*"|'[^']*')|[a-z][\w:.-]*="(?:[^"=]*\s[^"=]*|[^"\s]*=[^"\s]*)"))*\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K(?|"([^\s"=]*)"|'([^\s'=]*)')

All exceptions must be matched in the same branch with \G.
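
A sketch of applying this longer pattern to the sample tag from the comment above and to a tag with skipped values. The ~ delimiters and the u flag are assumptions carried over from the answer's original pattern, since the comment shows only the bare regex:

```php
<?php
// Bare regex from the comment, wrapped in ~...~u (assumed delimiters and flag).
$re = '~(?:<[a-z][\w:.-]*|(?!^)\G)(?:\s+(?:(?:src|href)=(?:"[^"]*"|\'[^\']*\')|[a-z][\w:.-]*="(?:[^"=]*\s[^"=]*|[^"\s]*=[^"\s]*)"))*\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';

// Hyphenated attribute names are now recognized; src is left quoted.
$img = '<img test-time="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100">';
$outImg = preg_replace($re, '$1', $img);
echo $outImg, "\n";
// <img test-time=xxx src="http://example.com/images/Vögel.jpg" alt=Vögel height=100 width=100>

// Values with a space or an = are skipped, and later attributes are still processed.
$mixed = '<mytag b="no no" c="no=no" d="yes">';
$outMixed = preg_replace($re, '$1', $mixed);
echo $outMixed, "\n"; // <mytag b="no no" c="no=no" d=yes>
```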


cdm · Accepted Answer · 2016-05-02 17:10:43Z

0

Use this regex: (<[^>]*?(?<!href)(?<!src)=)"((\p{L}|\d)+)"(.*?>) and replace each match with its 1st, 2nd and 4th capturing groups using preg_replace, repeating for as long as replacements occur.

$a = '<aaa href="123ff" bbb="aaa">';
// Each pass unquotes at most one attribute per tag, so repeat until nothing changes.
do {
  $a = preg_replace('/(<[^>]*?(?<!href)(?<!src)=)"((\\p{L}|\\d)+)"(.*?>)/u', '$1$2$4', $a, -1, $count);
} while ($count > 0);

edited May 2, 2016 at 17:10

answered May 2, 2016 at 7:51

cdm


2 Comments

Tom Over a year ago

Thanks, but this erroneously edits <script>var b="yes"</script>, as described in the first condition in my description. The loop suggests a long processing time.

cdm Over a year ago

I updated the regex. Now it should match things only in attributes.
