
I am using preg_replace to strip out <p> and <li> tags and turn them into line breaks. My string also contains some <a> tags. I want to strip those out too, but keep the value of the href attribute. For instance, if I have: <a href = "http://www.example.com">Click Here</a>, what I want is: http://www.example.com Click Here

Here is what I have so far:

$text .= preg_replace(array("/<p[^>]*>/iU","/<\/p[^>]*>/iU","/<ul[^>]*>/iU","/<\/ul[^>]*>/iU","/<li[^>]*>/iU","/<\/li[^>]*>/iU"), array("","\r\n\r\n","","\r\n\r\n","","\r\n"), $content);

Thanks

    Your life would probably be much easier if you used an HTML parser instead. Commented Mar 30, 2011 at 1:52

2 Answers


If I were you I would use SimpleHTMLDom. Here's a usage example from the docs:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; 
// Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
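
For the <a> case from the question specifically, the parser route could look something like the sketch below. Note that it uses PHP's bundled DOM extension (DOMDocument) rather than SimpleHTMLDom, so that nothing extra has to be installed. The helper name anchors_to_text and the wrapper <div> are made up for illustration, and the sketch assumes plain ASCII input (loadHTML needs extra care with other encodings).

```php
<?php
// Hypothetical helper: replace each <a> element with "href text".
// Uses DOMDocument (bundled with PHP), not SimpleHTMLDom.
function anchors_to_text($html)
{
    $doc = new DOMDocument();

    // Silence libxml warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so its contents can be read back later.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so copy it before replacing nodes.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        $href = trim($a->getAttribute('href'));
        // ltrim() drops the leading space when an anchor has no href.
        $text = ltrim($href . ' ' . $a->textContent);
        $a->parentNode->replaceChild($doc->createTextNode($text), $a);
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo anchors_to_text('See <a href = "http://www.example.com">Click Here</a> now');
```

The returned string still contains any <p>, <ul> and <li> tags, so it can be passed through the existing preg_replace call afterwards.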

If a regex solution is desired, here is a tested function that handles the anchor tags as you requested (with the caveats noted below). The anchor regex is written in verbose mode with comments:

function process_markup($content) {
    return preg_replace(
        array( // Regex patterns
        '%<(?:p|ul|li)\b[^>]*>%i',      // Open tags (\b keeps PRE, LINK etc. from matching).
        '%</(?:p|ul|li)\b[^>]*>\s*%i',  // Close tags.
            '% # Match A element (with no "<>" in attributes!)
            <a\b         # Start tag name.
            [^>]+?       # anything up to HREF attribute.
            href\s*=\s*  # HREF attribute name and "="
            (["\']?)     # $1: Optional quote delimiter
            ([^>\s]+)    # $2: HREF attribute value.
            (?(1)\1)     # If open quote, match close quote.
            [^>]*>       # Remainder of start tag
            (.*?)        # $3: A element contents.
            </a\s*>      # A element end tag.
            %ix'
        ),
        array( // Replacement strings
            "",          # Simply strip P, UL, and LI open tags.
            "\r\n",      # Replace close tags with line endings.
            "$2 $3"      # Keep A element HREF value and contents.
        ), $content);
}

I took the liberty of modifying the other regexes as well. Adjust as necessary.

CAVEATS: This regex solution assumes that: (1) no A, P, UL or LI element has angle brackets (< or >) in its attributes, and (2) no A, P, UL or LI start or end tags appear inside CDATA sections such as SCRIPT or STYLE elements, inside HTML comments, or inside other start tag attributes. Within those limits, this should work pretty well for a lot of HTML markup.
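
To make the behavior and one of the caveats concrete, here is a small standalone check. It is only a sketch: the anchor pattern is the same one as in the function above, condensed onto a single line without the x flag, and the variable names and the second sample string are invented for illustration.

```php
<?php
// Anchor pattern from the answer, condensed onto one line (no x flag).
$anchor = '%<a\b[^>]+?href\s*=\s*(["\']?)([^>\s]+)(?(1)\1)[^>]*>(.*?)</a\s*>%i';

// The sample from the question: spaces around "=" and a quoted URL.
$sample = '<a href = "http://www.example.com">Click Here</a>';
$plain = preg_replace($anchor, '$2 $3', $sample);
echo $plain, "\n"; // http://www.example.com Click Here

// Caveat: the regex cannot tell that this anchor sits inside an HTML
// comment, so it gets rewritten as well.
$commented = '<!-- <a href="old.html">old link</a> -->';
$rewritten = preg_replace($anchor, '$2 $3', $commented);
echo $rewritten, "\n"; // <!-- old.html old link -->
```

If commented-out or script-embedded anchors matter for your input, that is the point where an HTML parser becomes the safer choice.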

I realize that many wince when they hear the words HTML and REGEX spoken in the same breath, but in this particular case I think a regex solution will work quite well (within the above limitations). A elements cannot be nested, so a regex can easily match the start tag, contents and end tag all in one whack. The same goes for the individual start and end tags of the other elements (which can be nested), as long as each tag is considered independently.
