PHP preg_match_all regex to extract only number in string

Question

I can't seem to figure out the proper regular expression for extracting specific numbers from a string. I have an HTML string containing various img tags, and I want to extract part of each src value. They follow this format:

<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />

So, varying lengths of numbers before what 'usually' is a .jpg (it may be a .gif, .png, or something else too). I want to only extract the number from that string.

The 2nd part of this is that I want to use that number to look up an entry in a database and grab the alt/title tag for that specific id of image. Lastly, I want to add that returned database value into the string and throw it back into the HTML string.

Any thoughts on how to proceed with it would be great...

Thus far, I've tried:

$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);

You just need to use a capture group. What have you tried?

– jordanm, Commented Mar 14, 2012 at 15:59
Post edited with what I've tried thus far.

– jpea, Commented Mar 14, 2012 at 16:02

nickb · Accepted Answer · 2012-03-14 16:10:47Z

2

I think this is the best approach:

1. Use an HTML parser to extract the image tags
2. Use a regular expression (or perhaps string manipulation) to extract the ID
3. Query for the data
4. Use the HTML parser to insert the returned data

Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);

foreach( $doc->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    // Skip any image whose src does not contain a numeric ID
    if( !preg_match( '#/images/([0-9]+)\.#i', $src, $matches))
    {
        continue;
    }
    $id = $matches[1];
    echo 'Fetching info for image ID ' . $id . "\n";

    // Query stuff here
    $result = 'Got this from the DB';

    $img->setAttribute( 'title', $result);
    $img->setAttribute( 'alt', $result);
}

$newHTML = $doc->saveHtml();
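
The string-manipulation alternative mentioned above could look roughly like the sketch below. The function name imageIdFromSrc is hypothetical (it is not part of the answer), and the sketch assumes the ID is always the whole file name before the extension. It does not check that the file lives under /images/.

```php
<?php
// Hypothetical helper (name invented for illustration): extract the numeric
// image ID from a src URL without using a regular expression.
function imageIdFromSrc(string $src): ?int
{
    $path = parse_url($src, PHP_URL_PATH);      // e.g. "/images/59.jpg"
    if (!is_string($path)) {
        return null;                            // no usable path component
    }
    $name = pathinfo($path, PATHINFO_FILENAME); // e.g. "59"
    // Accept only file names made up entirely of digits
    return ctype_digit($name) ? (int) $name : null;
}

var_dump(imageIdFromSrc('http://domain.com/images/59.jpg'));
var_dump(imageIdFromSrc('http://domain.com/images/logo.png'));
```

Returning null for non-numeric names lets the calling loop skip those images instead of querying the database with a bad ID.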


4 Comments

jpea Over a year ago

I love this approach, but how should I deal with the warnings about malformed HTML? (The img tags are a hodgepodge of XHTML with a trailing />.)

nickb Over a year ago

The HTML parser should be pretty good with handling malformed HTML - Can you post a few examples of what's going wrong in your original post?

jpea Over a year ago

figured it out - it was just a warning, but it was parsing correctly so I just threw a @ in front of the loadHTML line. Another question though, instead of creating a whole HTML document to save, can I save just partial HTML? The string I'm searching isn't a whole document, but just a portion enclosed in <p> tags.

hakre Over a year ago

@jpea: See libxml_use_internal_errors. And yes, loadHTML handles HTML chunks pretty well too. Otherwise, wrap the chunk yourself: sprintf("<body>%s</body>", $htmlChunk); but I assume that is not necessary in your case. See also my answer, which is similar but takes a different route.
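
As a rough sketch of the two points raised in these comments (silencing parser warnings without @, and getting back only the fragment rather than a whole document), something like the following should work. The fragment and the 'Placeholder title' value are made up for illustration. Passing a node to saveHTML() requires PHP 5.3.6 or later.

```php
<?php
// Made-up fragment for illustration; not taken from the original post.
$fragment = '<p><img src="http://domain.com/images/59.jpg" class="something" /></p>';

// Collect libxml warnings internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($fragment);
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous setting

foreach ($doc->getElementsByTagName('img') as $img) {
    $img->setAttribute('alt', 'Placeholder title'); // stand-in for the DB value
}

// loadHTML wraps the fragment in <html><body>; serialize only the
// children of <body> to get the fragment back.
$body = $doc->getElementsByTagName('body')->item(0);
$partial = '';
foreach ($body->childNodes as $child) {
    $partial .= $doc->saveHTML($child);
}
echo $partial, "\n";
```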

kingcoyote · Accepted Answer · 2012-03-14 16:01:44Z

1

Using regular expressions, you can get the number really easily. The third argument for preg_match_all is a by-reference array that will be populated with the matches that were found.

preg_match_all('/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);

$matches[0] will contain the full text of each match, and $matches[1] will contain just the captured numbers.
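
To illustrate how the captured numbers can be read back, here is a small sketch using the PREG_SET_ORDER flag, which the answer itself does not use. With that flag each entry of $matches describes one match, which can be easier to loop over. The sample HTML is shortened from the question.

```php
<?php
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.png" class="something" />';

// With PREG_SET_ORDER, $match[0] is the full match and $match[1] the captured ID
preg_match_all(
    '/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/',
    $html,
    $matches,
    PREG_SET_ORDER
);

$ids = array();
foreach ($matches as $match) {
    $ids[] = (int) $match[1];
}
print_r($ids);
```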


Niet the Dark Absol · Accepted Answer · 2012-03-14 16:06:38Z

1

Consider using preg_replace_callback.

Use this regex (the outer parentheses act as the pattern delimiters, so the number ends up in group 1): (images/([0-9]+)[^"]+")

Then, as the callback argument, use an anonymous function. Result:

$output = preg_replace_callback(
    "(images/([0-9]+)[^\"]+\")",
    function($m) {
        // $m[1] is the number.
        $t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
        // escape the title so quotes in it cannot break the attribute
        return $m[0]." title=\"".htmlspecialchars($t)."\"";
    },
    $input
);


Toto · Accepted Answer · 2012-03-14 16:20:32Z

1

use preg_match_all:

preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);

output:

Array
(
    [0] => Array
        (
            [0] => <img src="http://domain.com/images/59.
            [1] => <img src="http://domain.com/images/549.
            [2] => <img src="http://domain.com/images/1249.
            [3] => <img src="http://domain.com/images/6.
        )

    [1] => Array
        (
            [0] => 59
            [1] => 549
            [2] => 1249
            [3] => 6
        )

)

edited Mar 14, 2012 at 16:20

answered Mar 14, 2012 at 16:02


1 Comment

jpea Over a year ago

that captures every number in the string, not just in <img> tags

Jon Grant · Accepted Answer · 2012-03-14 16:02:37Z

0

This regex should match the number parts:

\/images\/(?P<digits>[0-9]+)\.[a-z]+

Your $matches['digits'] should have all of the digits you want as an array.
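
A minimal sketch of that named group in use. The '#' delimiters and the i flag are additions for illustration (they avoid escaping the slashes and allow upper-case extensions), and the sample HTML is shortened from the question.

```php
<?php
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/1249.gif" class="something" />';

// (?P<digits>...) makes the capture available under the key 'digits'
// as well as under its numeric index 1.
preg_match_all('#/images/(?P<digits>[0-9]+)\.[a-z]+#i', $html, $matches);

$digits = $matches['digits'];
print_r($digits);
```

Note that the captured values are strings, so cast them with (int) before using them as integer IDs.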


hakre · Accepted Answer · 2012-03-14 16:59:58Z

Regular expressions alone are on losing ground when it comes to parsing crappy HTML. DOMDocument's HTML handling copes well with tag soup. Combine it with XPath to select your image srcs and a simple sscanf to extract the number:

$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
    if (sscanf($src, '%*[^0-9]%d', $number)) {
        $ids[] = $number;
    }
}

Because that only gives you an array, why not encapsulate it?

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';

$imageNumbers = new ImageNumbers($html);

var_dump((array) $imageNumbers);

Which gives you:

array(4) {
  [0]=>
  int(59)
  [1]=>
  int(549)
  [2]=>
  int(1249)
  [3]=>
  int(6)
}

This comes from the function above, wrapped into an ArrayObject:

class ImageNumbers extends ArrayObject
{
    public function __construct($html) {
        parent::__construct($this->extractFromHTML($html));
    }
    private function extractFromHTML($html) {
        $numbers = array();
        $doc = new DOMDocument();
        $preserve = libxml_use_internal_errors(TRUE);
        $doc->loadHTML($html);
        foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
            if (sscanf($src, '%*[^0-9]%d', $number)) {
                $numbers[] = $number;
            }
        }
        libxml_use_internal_errors($preserve);
        return $numbers;
    }
}

If your HTML is so malformed that even DOMDocument::loadHTML() can't handle it, you only need to deal with that internally in the ImageNumbers class.

Alan Moore · Accepted Answer · 2012-03-14 17:11:22Z

0

$matches = array();
// [[:digit:]] is the POSIX digit class; it must sit inside a bracket expression
preg_match_all('/[[:digit:]]+/', $htmlString, $matches);

Then loop through the matches array both to reconstruct the HTML and to do your lookup in the database.

answered Mar 14, 2012 at 16:04 by Ed Heal; edited Mar 14, 2012 at 17:11 by Alan Moore
