I'm looking for a way in PHP (with regex, maybe?) to convert a string of HTML that includes links into a string of plain text that adds the URL of the link after the text.

Here's an example of what I'm thinking:

$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a> 
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

// Regex to find the URLs
????

// Add the found URLs as strings after the closing a tags
????

// Convert to plain text
$text = trim(strip_tags($html));

Ideally, I'd end up with this string:

Link name [http://www.example.com/maybe/something/here/] for something or another.
Another link [https://www.examplesecure.com/] to something else.

  • Do not use Regular Expressions to parse HTML. Never. Ever. Try DomCrawler, or DomDocument for a native solution. Commented Feb 21, 2016 at 19:32
  • I would use a DOM parser like Simple HTML DOM Parser to break the HTML apart. If you are just looking for the URLs, take a look at stackoverflow.com/questions/11588542/… Commented Feb 21, 2016 at 19:35
  • So no preg_match, just to find the URLs? Commented Feb 21, 2016 at 19:47

1 Answer

Use DOMDocument with DOMXPath for this:

$htmlString = '<div><p><a href="/page">some text</a><non-standart-tag><a href="/page-2">more text</a></non-standart-tag>';

libxml_use_internal_errors(true); //suppress errors when importing invalid HTML
$dom = new DOMDocument();
$dom->loadHTML($htmlString);
$xpath = new DOMXPath($dom);

$links = [];
$linksAsString = '';

foreach ($xpath->query('//a') as $linkElement){
    /**
     * @var DOMElement $linkElement
     */
    $link = [
        'href' => $linkElement->getAttribute('href'),
        'text' => $linkElement->textContent
    ];
    $links[] = $link;
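    // Trim the link text and put a space before the bracket, matching the "Link name [URL]" format from the question
    $linksAsString .= trim($link['text']) . " [{$link['href']}] ";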
}
libxml_clear_errors();

var_dump($links);
echo $linksAsString;
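
The loop above collects the links on their own. To get the whole plain-text string the question asks for, one option is to replace each `<a>` element in the DOM with a text node of the form "text [href]" and then read the text content of the document. This is only a minimal sketch: `htmlToTextWithUrls` is a hypothetical helper name, not a built-in, and it assumes the input is ASCII or otherwise decodes correctly with `loadHTML()` (non-ASCII input may need extra encoding handling).

```php
<?php
// Hypothetical helper (not a PHP built-in): put each link's URL in square
// brackets after its text, then return the document as plain text.
function htmlToTextWithUrls(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // tolerate invalid HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->documentElement;
    if ($root === null) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    // Snapshot the node list first so replacing nodes does not disturb the loop.
    foreach (iterator_to_array($xpath->query('//a[@href]')) as $a) {
        // Normalize whitespace inside the link text.
        $label = trim(preg_replace('/\s+/', ' ', $a->textContent));
        $replacement = $dom->createTextNode($label . ' [' . $a->getAttribute('href') . ']');
        $a->parentNode->replaceChild($replacement, $a);
    }

    // Collapse line breaks and indentation left over from the markup.
    return trim(preg_replace('/\s+/', ' ', $root->textContent));
}
```

Calling `echo htmlToTextWithUrls($html);` with the `$html` string from the question should give the requested text on a single line, with each URL in square brackets after its link text. Links without an `href` attribute are left as plain text.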

5 Comments

Added code to implode links into string with square brackets.
I would love to get this to work because it looks like a great elegant solution. But I can't get my HTML to validate because it's generated by a WP script that I can't control and so it won't parse as XML. :(
I don't understand. In the question you said that you want to extract links from HTML with regex, but now you say that you can't get the HTML. If you can't intercept the WP script, you can just take its output; it is HTML anyway.
It's a string that contains HTML, but the HTML isn't validating. I just don't have control over the HTML that is output and can't make it validate, so it won't convert to XML.
Got it. Corrected the answer to use DOMDocument, which is able to process invalid HTML (SimpleXMLElement can't).