
I'm trying to remove single quotes and double quotes around HTML attributes with the following restrictions:

1) The quoted material MUST exist within a tag <> (e.g., <mytag b="yes"> becomes <mytag b=yes>, but <script>var b="yes"</script> stays intact).

2) The quoted material must contain neither a space character nor an equals sign (e.g., <mytag b="no no" c="no=no"> stays intact).

3) The quoted material may not be in an href or src definition.

4) The regex should be good for UTF-8 (duh!)

Someone posted a virtually identical question here that received an answer that works within the confines of the question:

Removing single and double quote from html attributes with no white spaces on all attributes except href and src

So:

((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')

...works, except it fails to isolate text within tags (i.e., text in between opening and closing tags is erroneously edited, e.g. <mytag>"The quotes are stripped out here!"</mytag>), and it doesn't check for equal signs (=) within the quoted text (e.g. <mytag b="OhNo=TheRoutineRemovedTheQuotesBecauseItDidNotCheckForAnEqualSignInTheQuotedText!">).

Bonus points: I wish to integrate this into this php HTML minification routine, which works well except for the edits described above:

https://gist.github.com/tovic/d7b310dea3b33e4732c0

His solution pairs the patterns and replacement params in two arrays, as you'll see, so I need to conform to his syntax, which uses #, etc.

Your solution gets my upvote!

  • This seems like a bad idea. You should try using an HTML parser instead. Commented May 2, 2016 at 4:54
  • You'd be better off using a DocumentFragment Commented May 2, 2016 at 14:48

2 Answers


Here is a pure regex way of getting rid of the quotes:

~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|'[^']*'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|'([^\s'=]*)')~u

See the regex demo, replace with '$1'.

IDEONE demo:

$re = '~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';
$str = "<mytag src=\"src_here\" b=\"yes\" href=\"href_here\"> becomes <mytag src=\"src_here\" b=yes href=\"href_here\">\n<mytag b='yes'> becomes <mytag b=yes>\nbut <script>var b=\"yes\"</script> stays intact\n<mytag b=\"no no\" c=\"no=no\"> stays intact\n<tag href=\"something\"> text <tag src=\"dddd\"> intact"; 
$subst = '$1'; // single quotes keep $1 a literal backreference to Group 1
$result = preg_replace($re, $subst, $str);
echo $result;

Pattern details:

  • (?:<\w+|(?!^)\G) - match the start of a tag (<\w+) or (|) the end of the previous successful match ((?!^)\G)
  • (?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))* - match zero or more href and src attributes, quotes included, so that \K can later drop them from the match and leave them untouched
  • \s+ - match 1+ whitespace characters
  • (?!(?:href|src)=)\w+= - 1+ alphanumeric or underscore characters (\w+) followed by =, provided they are not href= or src= (see the (?!(?:href|src)=) negative lookahead)
  • \K - discard the text matched so far, so only the quoted value is replaced
  • (?|"([^\s"=]*)"|\'([^\s\'=]*)\') - a branch reset group capturing into Group 1 either:
    • "([^\s"=]*)" - a double-quoted attribute value with no =, " or whitespace
    • | - or
    • \'([^\s\'=]*)\' - a single-quoted attribute value with no =, ' or whitespace
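
For the bonus request, here is a minimal sketch of how the pattern might slot into paired pattern/replacement arrays. The array names, the whitespace rule, and the sample markup are illustrative assumptions and are not taken from the linked gist; only the # delimiter convention comes from the question. The pattern contains no #, so swapping the ~ delimiters for # needs no extra escaping.

```php
<?php
// Hypothetical arrays in the paired style the question describes.
$patterns = array(
    // Illustrative rule only (not from the gist): collapse runs of whitespace.
    '#\s{2,}#',
    // The quote-stripping pattern from this answer, delimited with # instead of ~.
    '#(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')#u',
);
$replacements = array(
    ' ',
    '$1', // keep only the captured value, without its quotes
);

// Sample input (an assumption for the demo).
$html = '<img src="a.png"   alt="logo" title="my logo">';

// preg_replace applies each pattern in order to the result of the previous one.
$min = preg_replace($patterns, $replacements, $html);
echo $min, "\n";
```

Here alt loses its quotes, while src (excluded by name) and title (its value contains a space) keep theirs.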

18 Comments

Thanks! This works on my initial testing of it. And I appreciate your clear explanation of the components of the regex!
I replied too soon...it seems that once "src" or "href" is encountered in a tag, all subsequent elements in that tag are ignored. Try putting <img test="xxx" href="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100"> <img test="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100"> into your regex demo and you'll see.
You may match them before omitting with \K: (?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|'[^']*'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|'([^\s'=]*)'). I updated the regex, demos and explanation.
Thanks! But I discovered a wrinkle: If the element has a hyphen in it, then it fails to recognize it as a tag element, and it and the subsequent attribute definitions are ignored; curiously, it's fine with underscores (_). Try: <img test-time="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100"> <img test_time="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100">
It is rather unwieldy, but this is the price you pay for such complex requirements: (?:<[a-z][\w:.-]*|(?!^)\G)(?:\s+(?:(?:src|href)=(?:"[^"]*"|'[^']*')|[a-z][\w:.-]*="(?:[^"=]*\s[^"=]*|[^"\s]*=[^"\s]*)"))*\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K(?|"([^\s"=]*)"|'([^\s'=]*)'). All exceptions must be matched in the same branch with \G.
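
A minimal sketch of that last pattern wrapped in a function, to make it easier to try against the hyphenated-attribute case. The function name and the sample inputs are hypothetical; the pattern is copied from the comment above, with the single quotes escaped for a PHP string and the ~ delimiters and u flag from the answer added.

```php
<?php
// Hypothetical helper name; the pattern itself comes from the comment above.
function strip_safe_attribute_quotes(string $html): string
{
    $pattern = '~(?:<[a-z][\w:.-]*|(?!^)\G)'                      // tag start, or end of previous match
        . '(?:\s+(?:(?:src|href)=(?:"[^"]*"|\'[^\']*\')'         // skip src/href attributes
        . '|[a-z][\w:.-]*="(?:[^"=]*\s[^"=]*|[^"\s]*=[^"\s]*)"))*' // skip values containing whitespace or =
        . '\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K'                  // attribute name, then reset the match start
        . '(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';                   // the quoted value to unquote

    $result = preg_replace($pattern, '$1', $html);

    // preg_replace returns null on a PCRE error; fall back to the input.
    return $result === null ? $html : $result;
}

$in = '<img test-time="xxx" src="http://example.com/images/Vögel.jpg" alt="Vögel" height="100" width="100">';
echo strip_safe_attribute_quotes($in), "\n";
echo strip_safe_attribute_quotes('<mytag b="no no" c="no=no" d=\'ok\'>'), "\n";
```

One caveat worth checking before relying on this: [^\s"=]* also accepts an empty value, so alt="" would come out as alt=. Changing the two * quantifiers in the final group to + should leave empty values quoted.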

Use this (<[^>]*?(?<!href)(?<!src)=)"((\p{L}|\d)+)"(.*?>) and keep the 1st, 2nd and 4th capturing groups in the replacement, calling preg_replace repeatedly for as long as replacements occur.

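$a = '<aaa href="123ff" bbb="aaa">';
// Each pass unquotes one attribute per matched tag, so repeat until nothing changes.
do {
  // $1 = tag text up to and including "=", $2 = the value, $4 = the rest of the tag.
  $b = preg_replace('/(<[^>]*?(?<!href)(?<!src)=)"((\\p{L}|\\d)+)"(.*?>)/u', '$1$2$4', $a, -1, $count);
  if(!$count) {
    break;
  }
  $a = $b;
}while(true);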

2 Comments

Thanks, but this erroneously edits <script>var b="yes"</script>, as described in the first condition in my description. The loop suggests a long processing time.
I updated the regex. Now it should match things only in attributes.
