PHP regular expression help

Question

I am using preg_replace to strip out <p> and <li> tags and turn them into line breaks. My string also contains some <a> tags, and I want to strip those out too, but keep the value of the href attribute. For instance, given <a href = "http://www.example.com">Click Here</a>, what I want is: http://www.example.com Click Here

Here is what I have so far:

$text .= preg_replace(array("/<p[^>]*>/iU","/<\/p[^>]*>/iU","/<ul[^>]*>/iU","/<\/ul[^>]*>/iU","/<li[^>]*>/iU","/<\/li[^>]*>/iU"), array("","\r\n\r\n","","\r\n\r\n","","\r\n"), $content);

Thanks

Your life would probably be much easier if you used an HTML parser instead.
– GWW, Commented Mar 30, 2011 at 1:52

sepehr · Accepted Answer · 2016-11-05 14:15:39Z

3

If I were you, I would use SimpleHTMLDom (the PHP Simple HTML DOM Parser library). Here's a usage example from the docs:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; 
// Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
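For the conversion asked about in the question, a parser-based version might look roughly like the sketch below. Note that it uses PHP's bundled DOM extension (DOMDocument) rather than SimpleHTMLDom, and that html_to_text() is a hypothetical helper name chosen for illustration. It assumes ASCII input; other encodings may need an explicit charset hint before loading.

```php
<?php
// Hypothetical helper built on PHP's bundled DOM extension (not SimpleHTMLDom).
function html_to_text(string $content): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // Tolerate sloppy real-world markup.
    $doc->loadHTML($content);           // The parser wraps the fragment in <html><body>.
    libxml_clear_errors();

    // Replace each <a> with "href text". Copy the live node list first,
    // because replacing nodes while iterating over it can skip elements.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        $label = trim($a->getAttribute('href') . ' ' . $a->textContent);
        $a->parentNode->replaceChild($doc->createTextNode($label), $a);
    }

    // Add a line ending at the end of every <p> and <li>.
    foreach (array('p', 'li') as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $node->appendChild($doc->createTextNode("\r\n"));
        }
    }

    // The text content of the root element is the markup-free result.
    return $doc->documentElement->textContent;
}

echo html_to_text('<a href = "http://www.example.com">Click Here</a>'), "\n";
// http://www.example.com Click Here
```

Because the parser does the tokenizing, attributes containing angle brackets or unusual quoting should not need special handling here.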

edited Nov 5, 2016 at 14:15

answered Mar 30, 2011 at 1:57

sepehr

ridgerunner · Accepted Answer · 2011-03-30 04:55:03Z

If a regex solution is desired, here is a tested function which handles the anchor tags as you requested (with the caveats noted below). The anchor regex is presented in verbose mode with comments:

function process_markup($content) {
    return preg_replace(
        array( // Regex patterns
            '%<(?:p|ul|li)\b[^>]*>%i',      // Open tags (\b avoids matching <pre>, <link>, etc.).
            '%<\/(?:p|ul|li)\b[^>]*>\s*%i', // Close tags (\b avoids matching </pre>, etc.).
            '% # Match A element (with no "<>" in attributes!)
            <a\b         # Start tag name.
            [^>]+?       # anything up to HREF attribute.
            href\s*=\s*  # HREF attribute name and "="
            (["\']?)     # $1: Optional quote delimiter
            ([^>\s]+)    # $2: HREF attribute value.
            (?(1)\1)     # If open quote, match close quote.
            [^>]*>       # Remainder of start tag
            (.*?)        # $3: A element contents.
            </a\s*>      # A element end tag.
            %ix'
        ),
        array( // Replacement strings
            "",          # Simply strip P, UL, and LI open tags.
            "\r\n",      # Replace close tags with line endings.
            "$2 $3"      # Keep A element HREF value and contents.
        ), $content);
}

I took the liberty of modifying the other regexes as well. Adjust as necessary.
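As a quick check, the same three rules can be run against the example from the question. The sketch below is a compact, non-verbose restatement of the patterns; markup_to_text() is a hypothetical name, and the \b after the tag names is an assumption added here so that tags such as <pre> are left alone.

```php
<?php
// Hypothetical compact variant of the three rules, for a quick check.
function markup_to_text(string $content): string
{
    return preg_replace(
        array(
            '%<(?:p|ul|li)\b[^>]*>%i',       // Open tags.
            '%</(?:p|ul|li)\b[^>]*>\s*%i',   // Close tags.
            // A element: $2 is the href value, $3 is the element contents.
            '%<a\b[^>]+?href\s*=\s*(["\']?)([^>\s]+)(?(1)\1)[^>]*>(.*?)</a\s*>%i',
        ),
        array('', "\r\n", '$2 $3'),
        $content
    );
}

$sample = '<p>Docs: <a href = "http://www.example.com">Click Here</a></p>'
        . '<ul><li>One</li><li>Two</li></ul>';

// json_encode makes the line endings visible in the output.
echo json_encode(markup_to_text($sample), JSON_UNESCAPED_SLASHES), "\n";
// "Docs: http://www.example.com Click Here\r\nOne\r\nTwo\r\n\r\n"
```

Note that every close tag becomes a single "\r\n" here, whereas the code in the question used a double line ending after </p> and </ul>. Splitting the close-tag rule in two would restore that behavior.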

CAVEATS: This regex solution assumes that (1) no A, P, UL or LI element has angle brackets (<>) in its attributes, and (2) no A, P, UL or LI start or end tags appear inside CDATA sections such as SCRIPT or STYLE elements, inside HTML comments, or inside other start tag attributes. Note also that the HREF value is matched as a run of non-whitespace characters, so a URL containing spaces will not match. Otherwise, this should work pretty well for a lot of HTML markup.
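The two assumptions can be seen in a small experiment. This sketch applies only the anchor rule, in compact form; strip_anchor() is a hypothetical helper name, and the inputs are made up for illustration.

```php
<?php
// Hypothetical helper: only the anchor rule, in compact form.
function strip_anchor(string $html): string
{
    return preg_replace(
        '%<a\b[^>]+?href\s*=\s*(["\']?)([^>\s]+)(?(1)\1)[^>]*>(.*?)</a\s*>%i',
        '$2 $3',
        $html
    );
}

// Caveat 1: a ">" inside an attribute before href stops the match,
// so the element is left in place untouched.
echo strip_anchor('<a title="a > b" href="http://www.example.com">Y</a>'), "\n";
// <a title="a > b" href="http://www.example.com">Y</a>

// Caveat 2: the regex cannot tell that this anchor sits inside a comment,
// so it is rewritten anyway.
echo strip_anchor('<!-- <a href="http://www.example.com/old">Old</a> -->'), "\n";
// <!-- http://www.example.com/old Old -->
```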

I realize that many wince when they hear the words HTML and REGEX spoken in the same breath, but in this particular case, I think a regex solution will work quite well (within the above limitations). A elements cannot be nested inside one another, so a regex can easily match the start tag, contents and end tag all in one whack. The same goes for the start and end tags of the other elements (which can be nested), because each tag is matched independently rather than as a pair.
