PHP regex: How to convert HTML string with links into plain text that shows URL after text in brackets

Question

I'm looking for a way in PHP (with regex, maybe?) to convert a string of HTML that includes links into a string of plain text that adds the URL of the link after the text.

Here's an example of what I'm thinking:

$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a> 
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

// Regex to find the URLs
????

// Add the found URLs as strings after the closing a tags
????

// Convert to plain text
$text = trim(strip_tags($html));

Ideally, I'd end up with this string:

Link name [http://www.example.com/maybe/something/here/] for something or another.
Another link [https://www.examplesecure.com/] to something else.

Do not use regular expressions to parse HTML. Never. Ever. Try DomCrawler, or DomDocument for a native solution.
– BugHunterUK, Commented Feb 21, 2016 at 19:32
I would use a DOM parser like Simple HTML DOM Parser to break up the HTML. If you are just looking for the URLs, take a look at stackoverflow.com/questions/11588542/…
– redelschaap, Commented Feb 21, 2016 at 19:35

Aleksey Ratnikov · Accepted Answer · 2016-02-24 09:07:31Z

Use DOMDocument with DOMXPath for this:

$htmlString = '<div><p><a href="/page">some text</a><non-standart-tag><a href="/page-2">more text</a></non-standart-tag>';

libxml_use_internal_errors(true); //suppress errors when importing invalid HTML
$dom = new DOMDocument();
$dom->loadHTML($htmlString);
$xpath = new DOMXPath($dom);

$links = [];
$linksAsString = '';

foreach ($xpath->query('//a') as $linkElement){
    /**
     * @var DOMElement $linkElement
     */
    $link = [
        'href' => $linkElement->getAttribute('href'),
        'text' => $linkElement->textContent
    ];
    $links[] = $link;
    $linksAsString .= $link['text'] . " [{$link['href']}] "; // space before the bracket, as in the question's desired output
}
libxml_clear_errors();

var_dump($links);
echo $linksAsString;
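
The loop above collects the links but drops the text around them. Here is a minimal sketch of one way to get the full plain-text string the question asks for. It assumes it is acceptable to replace each <a> element in the DOM with a text node of the form "text [href]" and then read the text content of what remains. The function name htmlToTextWithLinks is hypothetical, and character-encoding handling is left out.

```php
<?php
// Hypothetical helper, not part of the original answer.
function htmlToTextWithLinks(string $html): string
{
    // Tolerate invalid HTML, then restore the previous error-handling setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Snapshot the matched nodes before modifying the tree.
    foreach (iterator_to_array($xpath->query('//a[@href]')) as $a) {
        // Normalise whitespace inside the link text (e.g. line breaks before </a>).
        $label = trim(preg_replace('/\s+/', ' ', $a->textContent));
        $replacement = $dom->createTextNode($label . ' [' . $a->getAttribute('href') . ']');
        $a->parentNode->replaceChild($replacement, $a);
    }

    // Collapse the whitespace runs left over from the markup.
    return trim(preg_replace('/\s+/', ' ', $dom->documentElement->textContent));
}

$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a>
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

echo htmlToTextWithLinks($html), PHP_EOL;
```

With the question's sample input this should print "Link name [http://www.example.com/maybe/something/here/] for something or another. Another link [https://www.examplesecure.com/] to something else." on a single line, because all whitespace runs are collapsed to one space.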

edited Feb 24, 2016 at 9:07

answered Feb 21, 2016 at 23:12

5 Comments

Aleksey Ratnikov Over a year ago

Added code to join the links into a string with square brackets.

isabisa Over a year ago

I would love to get this to work because it looks like a great elegant solution. But I can't get my HTML to validate because it's generated by a WP script that I can't control and so it won't parse as XML. :(

Aleksey Ratnikov Over a year ago

I don't understand. In the question you said you want to extract links from HTML with regex, but now you say you can't get the HTML. If you can't intercept the WP script, you can just take its output; it is HTML anyway.

isabisa Over a year ago

It's a string that contains HTML, but the HTML isn't validating. I just don't have control over the HTML that is output and can't make it validate, so it won't convert to XML.

Aleksey Ratnikov Over a year ago

Got it. Corrected the answer to use DOMDocument, which can process invalid HTML (SimpleXMLElement can't).
