I want to selectively strip HTML from a string. I have used strip_tags to allow divs, but I don't want to keep the div styles/classes from the string. That is, I want:

<div class="something">text</div>
<div style="something">text</div>

to simply become:

<div>text</div>

in both cases.

Can anyone help? Thanks!

4 Answers

2

Replace every match of the following regex with an empty string:

(?<=<div.*?)(?<!=\t*?"?\t*?)(class|style)=".*?"
5 Comments

What if there is an attribute containing class= or style= like <div title="style=" class="foo">?
@J V: That won’t fix it, see for example <div title=" style=" class="foo">.
Ok, it's getting complicated now, but I think I got it... Honestly, if the html is so screwed up regex is the last thing to worry about :)
Never mind the whitespace, this regex won't work because it requires variable-length lookbehinds, and PHP (like most flavors) doesn't do that. Lookbehinds should never be your first resort anyway; there's almost always an easier way.
Ah, in that case I cave to vincent :)
1

Here is an example:

$str = preg_replace('`<div (?:style="[^"]*"|class="[^"]*")>([^<]*)</div>`i', '<div>$1</div>', $str);

Basically, this matches a div whose only attribute is style or class and captures its text content. The replacement then drops everything else and keeps only <div>content</div>.

It's longer than J V's version, but it won't replace something like <div style="blablabla" color="blablabla">content</div>, for instance. May or may not be what you want.
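A minimal sketch of how this pattern behaves on the question's inputs and on the multi-attribute case mentioned above. The function name stripSoleDivAttribute is hypothetical, and the attribute alternation is written as a non-capturing group so that the div's text is capture group 1.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
// It only rewrites divs whose single attribute is style="..." or class="...".
function stripSoleDivAttribute(string $html): string
{
    return preg_replace(
        // (?:...) does not capture, so ([^<]*) is group 1 (the div's text).
        '`<div (?:style="[^"]*"|class="[^"]*")>([^<]*)</div>`i',
        '<div>$1</div>',
        $html
    );
}

echo stripSoleDivAttribute('<div class="something">text</div>'), "\n";
echo stripSoleDivAttribute('<div style="blablabla" color="blablabla">content</div>'), "\n";
```

The second call leaves its input unchanged, because the pattern requires the closing > right after the single style or class attribute.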

5 Comments

I see a problem using the very example the OP gave :) (Hint, repeaters are greedy)
Actually, the . class is greedy. [^"] is not, it stops after the first " encountered. No worries, I test my code before I post (usually at least!)
Think about it, it doesn't make sense. I have a class that matches every character but ". What happens when it encounters a "? It stops matching. This has nothing to do with * or any quantifier. As I said, I tested my code with OP's example, it works correctly.
Ah yes I see... Although mine only deletes the style/class attribute itself so any other attributes remain.
The problem with this code is that, if we have another attribute before class or style attribute (example: title="my page"), it will not work.
0

As an alternative to regex (which always freaks me out), I'd suggest using xml_parse_into_struct.

See the php.net documentation and its first example.
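For comparison, here is a rough sketch of the parser-based idea. Note that it swaps in PHP's DOMDocument rather than xml_parse_into_struct, since DOMDocument can serialize the modified markup back to a string. It assumes the dom extension is available and that the input is a UTF-8 HTML fragment; the function name stripDivClassAndStyle is made up for illustration.

```php
<?php
// Hypothetical helper: removes class and style from every div using a DOM parser.
function stripDivClassAndStyle(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a full document so the body can be located afterwards.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Other attributes (id, title, ...) are left untouched.
    foreach ($doc->getElementsByTagName('div') as $div) {
        $div->removeAttribute('class');
        $div->removeAttribute('style');
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripDivClassAndStyle('<div id="a" style="something">text</div>'), "\n";
```

Because the parser understands attribute boundaries, a value such as title="style=" is not mistaken for a style attribute, which is the case the regex answers struggle with.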

0

I found that it's very difficult to build a single regex that removes both the class and style attributes from a tag in a single pass. That's because we don't know where these attributes will appear relative to the other attributes inside the tag (supposing that we want to preserve the other ones). However, we can achieve the same result by splitting the task into two simpler search-and-replace operations: one for the class attribute and another for the style attribute.

To capture the first part of a div containing a class attribute, with one or more values enclosed in double quotes, the regex is as follows:

(<div\s+)([^>]*)(class\s*=\s*\"[^\">]*\")(\s|/|>)

The same code modified for single quotes:

(<div\s+)([^>]*)(class\s*=\s*\'[^\'>]*\')(\s|/|>)

Or no quotes:

(<div\s+)([^>]*)(class\s*=\s*[^\"\'=/>\s]+)(\s|/|>)

The matched string must then be replaced by the first, second and fourth capture groups, which in PHP's preg_replace() is written as the replacement string $1$2$4.

To eliminate a style attribute instead of a class attribute, just replace the substring class with the substring style in the regex. To eliminate these attributes in any tag (not only divs), replace the substring div with the substring [a-z][a-z0-9]* in the regex.

Note: the regexes above will not eliminate class or style attributes that contain syntax errors. Examples: class="xxxxx (missing a quote after the value), class='xxxxx'' (an extra quote after the value), class="xxxx"title="yyyy" (no space between attributes), and so on.

Short explanation:

<div\s+                  # beginning of the div tag, followed by one or more whitespaces
[^>]*                    # any set of attributes before the class (optional)
class\s*=\s*\"[^\">]*\"  # class attribute, with optional whitespaces
\s|/|>                   # one of these characters always follows the end of an attribute
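The two-pass approach described above could be sketched in PHP roughly as follows. The function names, the ~ delimiter, and the case-insensitive i flag are assumptions of this sketch, not part of the regexes above. The final whitespace cleanup is also an addition: the $1$2$4 replacement on its own leaves the whitespace that followed <div, so <div class="x"> becomes <div > rather than <div>.

```php
<?php
// Hypothetical helper: removes one named attribute from div tags using the
// three regex variants above (double-quoted, single-quoted, unquoted values).
function stripDivAttribute(string $html, string $attr): string
{
    $name = preg_quote($attr, '~');
    $patterns = [
        '~(<div\s+)([^>]*)(' . $name . '\s*=\s*"[^">]*")(\s|/|>)~i',
        '~(<div\s+)([^>]*)(' . $name . '\s*=\s*\'[^\'>]*\')(\s|/|>)~i',
        '~(<div\s+)([^>]*)(' . $name . '\s*=\s*[^"\'=/>\s]+)(\s|/|>)~i',
    ];
    // Keep groups 1, 2 and 4; group 3 (the attribute itself) is dropped.
    return preg_replace($patterns, '$1$2$4', $html);
}

// Hypothetical helper: one pass per attribute, then trim leftover whitespace
// before the closing > or /> of each div tag (this cleanup is an addition).
function stripDivClassAndStyleRegex(string $html): string
{
    $html = stripDivAttribute($html, 'class');
    $html = stripDivAttribute($html, 'style');
    return preg_replace('~(<div\b[^>]*?)\s+(/?>)~i', '$1$2', $html);
}

echo stripDivClassAndStyleRegex('<div id="a" class="b" style="c">x</div>'), "\n";
```

Each pass removes at most one occurrence of the attribute per tag, and the limitations listed in the note above (malformed quotes, missing spaces between attributes) apply to this sketch as well.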
