1

I can't figure out the right regular expression for extracting specific numbers from a string. I have an HTML string containing a number of img tags, and I want to extract part of each src value. They follow this format:

<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />

So, varying lengths of numbers before what 'usually' is a .jpg (it may be a .gif, .png, or something else too). I want to only extract the number from that string.

The 2nd part of this is that I want to use that number to look up an entry in a database and grab the alt/title tag for that specific id of image. Lastly, I want to add that returned database value into the string and throw it back into the HTML string.

Any thoughts on how to proceed with it would be great...

Thus far, I've tried:

$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);
  • You just need to use a capture group. What have you tried? Commented Mar 14, 2012 at 15:59
  • post edited with what I've tried thus far Commented Mar 14, 2012 at 16:02

7 Answers

2

I think this is the best approach:

  1. Use an HTML parser to extract the image tags
  2. Use a regular expression (or perhaps string manipulation) to extract the ID
  3. Query for the data
  4. Use the HTML parser to insert the returned data

Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);

foreach( $doc->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    // Skip images whose src does not contain a numeric ID
    if( !preg_match( '#/images/([0-9]+)\.#i', $src, $matches))
    {
        continue;
    }
    $id = $matches[1];
    echo 'Fetching info for image ID ' . $id . "\n";

    // Query stuff here
    $result = 'Got this from the DB';

    $img->setAttribute( 'title', $result);
    $img->setAttribute( 'alt', $result);
}

$newHTML = $doc->saveHtml();

4 Comments

I love this approach, but how should I deal with the warnings of malformed HTML (the img tags are a hodge podge of XHTML with a trailing />).
The HTML parser should be pretty good with handling malformed HTML - Can you post a few examples of what's going wrong in your original post?
figured it out - it was just a warning, but it was parsing correctly so I just threw a @ in front of the loadHTML line. Another question though, instead of creating a whole HTML document to save, can I save just partial HTML? The string I'm searching isn't a whole document, but just a portion enclosed in <p> tags.
@jpea: See libxml_use_internal_errors. And yes, loadHTML works pretty well with HTML chunks too. Otherwise: sprintf("<body>%s</body>", $htmlChunk); though I assume that isn't necessary in your case. See also my answer, which is similar but takes a different route.
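The two points raised in these comments (silencing parser warnings and getting back only a fragment) could be sketched roughly as follows. This is a minimal sketch, not the answerer's code: $chunk and the 'Example title' value are made-up placeholders, and passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Placeholder fragment; in the question this would be the <p>-wrapped HTML string.
$chunk = '<p><img src="http://domain.com/images/59.jpg" class="something" /></p>';

// Collect libxml warnings internally instead of emitting them (rather than using @).
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<body>' . $chunk . '</body>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($doc->getElementsByTagName('img') as $img) {
    // Stand-in for the value fetched from the database.
    $img->setAttribute('alt', 'Example title');
}

// Serialize only the children of <body>, so no doctype/html/body wrapper is returned.
$body = $doc->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $doc->saveHTML($child);
}
echo $fragment, "\n";
```

Restoring the previous libxml_use_internal_errors() setting keeps the change local to this piece of code.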
1

Using regular expressions, you can get the number really easily. The third argument for preg_match_all is a by-reference array that will be populated with the matches that were found.

preg_match_all('/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);

$matches[0] will contain the full matched strings and $matches[1] will contain the captured numbers.
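As a rough illustration of what ends up in $matches, here is a small self-contained sketch. The sample $html and the variable names $ids and $sets are assumptions made for the example; PREG_SET_ORDER is a standard preg_match_all flag that groups results per match instead of per capture group.

```php
<?php
// Sample input, shortened from the question (the .png is there to show the extension can vary).
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/6.png" class="something" />';

$pattern = '/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/';

// Default ordering: $matches[0] = full matches, $matches[1] = captured digits.
preg_match_all($pattern, $html, $matches);
$ids = array_map('intval', $matches[1]);
print_r($ids);

// PREG_SET_ORDER: one entry per <img>, each holding [full match, captured digits].
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo 'ID ', $set[1], ' came from ', $set[0], "\n";
}
```

The set-ordered form can be more convenient when each ID has to be processed together with the tag it came from.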


1

Consider using preg_replace_callback.

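Use this regex: (images/([0-9]+)[^"]+") (PHP accepts a matching bracket pair as pattern delimiters, so the outer parentheses are the delimiters here and the number is capture group 1)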

Then, as the callback argument, use an anonymous function. Result:

$output = preg_replace_callback(
    "(images/([0-9]+)[^\"]+\")",
    function($m) {
        // $m[1] is the number.
        $t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
        return $m[0]." title=\"".$t."\"";
    },
    $input
);
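For a version that runs on its own, the callback approach might look like the sketch below. The getTitleFromDatabase() body is a hypothetical stand-in backed by a hard-coded array, and the htmlspecialchars() call is an addition not present in the answer, included on the assumption that titles may contain quotes.

```php
<?php
// Hypothetical stand-in for the database lookup; a real version would query by ID.
function getTitleFromDatabase($id)
{
    $titles = array(59 => 'Sunset "over" the bay', 549 => 'City skyline');
    return isset($titles[$id]) ? $titles[$id] : '';
}

$input = '<img src="http://domain.com/images/59.jpg" class="something" />';

$output = preg_replace_callback(
    '~images/([0-9]+)[^"]+"~',
    function ($m) {
        // $m[1] is the number; escape the title so quotes cannot break the attribute.
        $title = htmlspecialchars(getTitleFromDatabase((int) $m[1]), ENT_QUOTES, 'UTF-8');
        // $m[0] ends with the closing quote of src, so the new attributes follow it.
        return $m[0] . ' title="' . $title . '" alt="' . $title . '"';
    },
    $input
);
echo $output, "\n";
```

Because the match ends at the closing quote of the src attribute, the new attributes are inserted right after src and the rest of the tag is left untouched.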


1

Use preg_match_all:

preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => <img src="http://domain.com/images/59.
            [1] => <img src="http://domain.com/images/549.
            [2] => <img src="http://domain.com/images/1249.
            [3] => <img src="http://domain.com/images/6.
        )

    [1] => Array
        (
            [0] => 59
            [1] => 549
            [2] => 1249
            [3] => 6
        )

)

1 Comment

that captures every number in the string, not just in <img> tags
0

This regex should match the number parts:

\/images\/(?P<digits>[0-9]+)\.[a-z]+

When the pattern is used with preg_match_all, $matches['digits'] should hold all of the numbers you want as an array.
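A short sketch of how the named group could be used with preg_match_all follows. The sample $html is an assumption for illustration, and the delimiters were chosen for the example since the answer gives only the bare pattern.

```php
<?php
// Sample input with two different extensions.
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/1249.gif" class="something" />';

// (?P<digits>...) names the capture group, so it is available by name and as index 1.
preg_match_all('/\/images\/(?P<digits>[0-9]+)\.[a-z]+/', $html, $matches);

print_r($matches['digits']);
```

The captured values are strings; cast them with (int) before using them as database IDs if an integer is expected.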


0

Regular expressions alone tend to be on the losing side when it comes to parsing crappy HTML. DOMDocument's HTML handling copes well with tag soup, XPath selects your image srcs, and a simple sscanf extracts the number:

$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
    if (sscanf($src, '%*[^0-9]%d', $number)) {
        $ids[] = $number;
    }
}

Because that only gives you an array, why not encapsulate it?

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';

$imageNumbers = new ImageNumbers($html);

var_dump((array) $imageNumbers);

Which gives you:

array(4) {
  [0]=>
  int(59)
  [1]=>
  int(549)
  [2]=>
  int(1249)
  [3]=>
  int(6)
}

That result comes from wrapping the code above in an ArrayObject:

class ImageNumbers extends ArrayObject
{
    public function __construct($html) {
        parent::__construct($this->extractFromHTML($html));
    }
    private function extractFromHTML($html) {
        $numbers = array();
        $doc = new DOMDocument();
        $preserve = libxml_use_internal_errors(TRUE);
        $doc->loadHTML($html);
        foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
            if (sscanf($src, '%*[^0-9]%d', $number)) {
                $numbers[] = $number;
            }
        }
        libxml_use_internal_errors($preserve);
        return $numbers;
    }
}

If your HTML is so malformed that even DOMDocument::loadHTML() can't handle it, you only need to deal with that inside the ImageNumbers class.


0
$matches = array();
// The POSIX class is [:digit:] and must sit inside a bracket expression
preg_match_all('/[[:digit:]]+/', $htmlString, $matches);

Then loop through the matches array both to reconstruct the HTML and to do your lookup in the database.
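One thing to watch with a bare digit pattern is that it matches every run of digits in the string, not only the image IDs. The sketch below illustrates this with a made-up $htmlString containing a width attribute, and shows one possible way to scope the match to the /images/ path. The variable names $all and $scoped are assumptions for the example.

```php
<?php
// Made-up input: the width attribute also contains digits.
$htmlString = '<img src="http://domain.com/images/59.jpg" width="100" />';

// A bare digit class picks up every run of digits in the string.
preg_match_all('/[[:digit:]]+/', $htmlString, $all);
print_r($all[0]);

// Anchoring on the /images/ path and the dot before the extension narrows it down.
preg_match_all('#/images/([[:digit:]]+)\.#', $htmlString, $scoped);
print_r($scoped[1]);
```

With the scoped pattern, the loop over the matches only ever sees IDs taken from image paths.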
