PHP regex to gain data-id attribute

Question

I have this link: Alchemilla vulgaris. It is a Google Images link for a certain herb. I want to search the source of that page for <div> tags that have a data-id attribute and extract the data-id values using preg_match_all.

I have this code, but it does not show any results. I think the problem is in the regular expression. Can you please help me get it right?

<!DOCTYPE HTML>
<html lang="sk">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Image searcher</title>
    </head>
    <body>
        <?php
            $search_query = "Alchemilla vulgaris";
            $search_query = urlencode( $search_query );
            $url = "https://www.google.com/search?q=$search_query&tbm=isch&ved=2ahUKEwi_0dbpjJrxAhUU_BoKHVkmDOwQ2-cCegQIABAA&oq=$search_query"; 
            echo $url;
            echo "\n"; 
            $html = file_get_contents( $url );
            preg_match_all('#<div\s.*?(?:data-id=[\'"](.*?)[\'"]).*?>#is',$html, $matches );
            var_dump($matches);
        ?>
    </body>
</html>

Thank you

Why not use the built-in functions of DOMDocument? Nobody is using regex for that. See stackoverflow.com/questions/26240471/…
– Daniel W., Commented Jul 12, 2021 at 11:29

ŽaMan · Accepted Answer · 2021-07-26 13:58:35Z


First and foremost, some required reading if you haven't seen it: https://stackoverflow.com/a/1732454/4907162

So yes, as pointed out in the comments, a true DOM/XML parser would be much more appropriate. Regex has its time and place; parsing HTML with it really isn't the best approach, but it is doable for some things.
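For comparison, here is a minimal sketch of the parser route using DOMDocument and DOMXPath. The sample markup is made up for illustration and is not what Google actually returns; with a real page you would pass the fetched $html instead.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<div class="wrap"><span data-id="not-a-div"></span></div>'
      . '<div class="item" data-id="abc123"></div>'
      . '<div data-id="def456">text</div>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select only <div> elements that carry a data-id attribute.
$xpath = new DOMXPath($dom);
$ids = [];
foreach ($xpath->query('//div[@data-id]') as $div) {
    $ids[] = $div->getAttribute('data-id');
}

print_r($ids);
```

Because the XPath expression names the element and the attribute explicitly, the data-id on the <span> is ignored without any extra pattern work.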

A few points to note:

(PHP manual) https://www.php.net/manual/en/function.file-get-contents.php#example-2121
(Stack Overflow question) file_get_contents with context to change user agent didn't work

Google doesn't like bots scraping it; you might even be asked to solve a (re-)?captcha if you look like a bot. At the time of writing (this may change), if your User-Agent doesn't match a known "friendly" UA, you get filtered out and receive a different HTML result. You may have done an echo $html; just to confirm you were getting content, but if you search that output manually you will see that it does not include the data-id string you're trying to find.

So for your situation, using the PHP function file_get_contents, you'll want to do something like:

$opts = array('http' =>
  array(
    'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67'
  )
);

$context  = stream_context_create($opts);

$html = file_get_contents( $url, false, $context );

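For the regex it's a small change to make: replace the first .*? with [^>]*? so the match cannot run past the end of the opening <div> tag and pick up a data-id from a later element: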

preg_match_all('#<div\s[^>]*?(?:data-id=[\'"](.*?)[\'"]).*?>#is',$html, $matches );
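To see what that change does, here is a small sketch that runs the question's pattern and the adjusted one against the same string. The sample markup is hypothetical and only meant to show the difference in behaviour.

```php
<?php
// Hypothetical sample markup: the first data-id sits on a <span>, not a <div>.
$html = '<div class="wrap"><span data-id="not-a-div"></span></div>'
      . '<div class="item" data-id="abc123"></div>';

// Pattern from the question: ".*?" with the s flag may cross tag boundaries.
$loose = '#<div\s.*?(?:data-id=[\'"](.*?)[\'"]).*?>#is';
// Adjusted pattern: "[^>]*?" stops at the end of the opening <div> tag.
$strict = '#<div\s[^>]*?(?:data-id=[\'"](.*?)[\'"]).*?>#is';

preg_match_all($loose, $html, $looseMatches);
preg_match_all($strict, $html, $strictMatches);

// Index 1 holds the captured data-id values.
print_r($looseMatches[1]);
print_r($strictMatches[1]);
```

The looser pattern also reports the value that belongs to the <span>, because it starts at the outer <div and keeps scanning until it meets any data-id. The adjusted pattern only reports values found inside a <div> opening tag.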

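While I was trying to simply get the script to work at all, I ended up with this regex, if you'd like to see another way. It uses [^>]* rather than [^>]+ before data-id, so a tag whose first attribute is data-id (for example <div data-id="x">) still matches:

preg_match_all('#<div\s+[^>]*data-id=[\'"]([^\'"]+)[\'"][^>]*>#is', $html, $matches );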

Tim Toady Bicarbonate

To answer your comment as best I could find (maybe someone else can elaborate more):

In PHP, the context passed to file_get_contents lets you attach additional information to the HTTP request made for a URL.

If you test file_get_contents against a URL on a server you own, you might notice in the logs that the User-Agent is empty. At least on the server I'm using, the User-Agent is an empty string. The context lets you specify the User-Agent sent to the server you're pulling data from.

The server you're pulling data from decides what to do with that information. Google does check the User-Agent, so you'll want to use a "known friendly" (as I call it) User-Agent.

In short, a stream context lets you supply information that the server expects to see. Or at least that's how I can describe it for PHP in the context of reading file/URL resources.
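As a rough illustration, the sketch below builds a context and prints back what it holds, without making any network request. The User-Agent string and the other option values are placeholders chosen for this example; the option names themselves (method, header, user_agent, timeout) are standard HTTP context options.

```php
<?php
// PHP's default User-Agent comes from the user_agent ini setting,
// which is typically empty unless it has been configured.
var_dump(ini_get('user_agent'));

// Placeholder values for illustration only.
$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        'header'     => "Accept-Language: en\r\n",
        'user_agent' => 'Mozilla/5.0 (example)',
        'timeout'    => 10,
    ],
]);

// Inspect the options stored in the context; these are what
// file_get_contents($url, false, $context) would send along.
$options = stream_context_get_options($context);
print_r($options);
```

Nothing is sent until the context is passed to a stream function such as file_get_contents, so this is a safe way to check what a request would carry.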

I hope this helps. I'll admit I'm not sure how to respond with more useful information.

edited Jul 26, 2021 at 13:58

answered Jul 12, 2021 at 15:04


3 Comments

Oliver Kurnava Over a year ago

Thank you very much. In the end I decided to use DOMDocument, but I found your response very useful. Thank you.

Oliver Kurnava Over a year ago

I have just one last question. The first 5 lines of code you sent look like a declaration. I can see that it helps, but I do not understand it. Can you please tell me how it works, or what it is called, so I can google more about it myself? Thank you.

ŽaMan Over a year ago

Of course, it's a good question. You'll want to start with the information at php.net/manual/en/function.stream-context-create.php - the rest can be found in the PHP sources, which show what it's actually doing. I'll update the answer.
