
This is my code:

    <?php
    $begin = microtime(TRUE);
    $result = get_web_page('WEB_PAGE');

    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);

    // Get the links
    $matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class = ""]/a[@class = "lnk"]/@href');
    if ($matches === FALSE) {
        echo 'error';
        exit();
    }
    $links = array();
    foreach ($matches as $match) {
        $links[] = 'WEB_PAGE'.$match->value;
    }

    $data = array();
    $index = 0;

    // For each link
    foreach ($links as $link) {
        echo $index.' loop '.(microtime(TRUE) - $begin).'<br>';
        $result = get_web_page($link);

        $dom = new DOMDocument();
        $dom->loadHTML($result['content']);
        $xpath = new DOMXPath($dom);

        $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
        if ($match === FALSE) {
            exit();
        }
        $data[$index]['name'] = $match;

        $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
        if ($matches === FALSE) {
            exit();
        }
        foreach ($matches as $match) {
            $data[$index]['types'][] = $match->data;
        }

        $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
        if ($matches === FALSE) {
            exit();
        }
        foreach ($matches as $match) {
            $data[$index]['info'][] = $match->data;
        }

        $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
        if ($matches === FALSE) {
            exit();
        }
        foreach ($matches as $match) {
            $data[$index]['names'][] = $match->data;
        }

        ++$index;
    }
    ?>

This is what is being printed:

    0 loop 1.66981506348
    1 loop 2.49688410759
    2 loop 3.00950098038
    3 loop 3.5253970623
    4 loop 4.01076102257
    5 loop 4.67162799835
    6 loop 5.2378718853
    7 loop 5.74008488655
    8 loop 6.26041197777
    9 loop 6.78747105598
    10 loop 7.47332000732
    11 loop 8.03243994713
    12 loop 8.50538802147
    13 loop 9.37472701073
    14 loop 11.5049209595
    15 loop 12.2112920284
    16 loop 12.6640410423
    17 loop 13.1369791031
    18 loop 13.8875179291
    19 loop 14.4746370316
    20 loop 14.9760200977
    21 loop 15.5332159996
    22 loop 16.1946868896
    23 loop 17.0584990978
    24 loop 17.840462923
    25 loop 18.6889989376
    26 loop 19.6185629368
    27 loop 20.8282380104
    28 loop 22.0119960308
    29 loop 22.9078469276
    30 loop 24.0000309944
    31 loop 24.6960549355
    32 loop 25.1580710411
    33 loop 25.5702528954
    34 loop 26.2709059715
    35 loop 26.7621939182
    36 loop 27.2691950798
    37 loop 27.88843894
    38 loop 28.6984479427
    39 loop 29.4622280598
    40 loop 30.2815680504
    41 loop 31.1307020187

What the script does is connect to a remote page, retrieve it, and parse the links out of it.
Then, for each link, it retrieves and parses that page and builds a data structure from it (the part that uses the structure is not included here).
Each printed line shows the elapsed time, in seconds, since the script started.
The whole run takes about 30 seconds, which is far too slow.
How can I improve it?

  • Can you please edit your question to explain what your code is supposed to do? Commented Aug 5, 2016 at 14:43
  • @webNeat Edited. Commented Aug 5, 2016 at 16:29
  • First, make sure what the cause of your problem is. It should mainly be the get_web_page function call. But in the case of bigger HTML data, your parsing could also take some time. Commented Aug 5, 2016 at 22:31
  • @thelastblack get_web_page is just a function that uses curl with some custom headers and returns the output. I think the cause is big HTML data. Commented Aug 6, 2016 at 0:30
  • Still, benchmark it and post results. Don't just think. And as @MikeBrant's answer suggests, try using curl_multi_exec. Commented Aug 6, 2016 at 6:26

1 Answer


Really, the best way to optimize performance here is to make the HTTP requests in parallel. Once you have done that, you could consider further optimizations with regard to parsing.

Consider using curl_multi_exec() or similar for this.

I have a REST client based on curl_multi_exec() that you can feel free to take a look at for inspiration (or just use as-is in your application based on MIT license).

https://github.com/mikecbrant/php-rest-client
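
To illustrate the idea, here is a minimal sketch of fetching all the links concurrently with the curl_multi API. The `fetch_all()` helper and its curl options are illustrative placeholders, not the question's `get_web_page()`; adapt timeouts and headers to your case:

```php
<?php
// Fetch several URLs in parallel and return their bodies, keyed like $urls.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,  // capture the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 30,
        ));
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers at once; curl_multi_select() waits for
    // socket activity instead of busy-looping.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status === CURLM_OK);

    $results = array();
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

// Usage sketch: fetch every link first, then parse each body in turn.
// $pages = fetch_all($links);
// foreach ($pages as $index => $html) { /* DOMDocument/DOMXPath parsing as before */ }
```

With ~40 pages taking roughly 0.5-1 s each sequentially, overlapping the requests should cut the wall-clock time to roughly that of the slowest response.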

