
I am trying to scrape content from a website that uses JavaScript to load dynamic content. Previously I used request and cheerio, and they worked fine, but I have just found out that they cannot scrape dynamically loaded content. After some research, I found PhantomJS, which can get all of the content after the page has loaded. My problem now is that I cannot use jQuery selectors the way I did in cheerio. This is my sample code, but the selector returns nothing.

var page = require('webpage').create();
var url = 'http://angkorauto.com/vehicle';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            // console.log(page.content);
            page.includeJs('https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js', function(){

                page.evaluate(function(){
                    console.log($('.divTitle').find('a').attr('href'));
                });
            });

            phantom.exit();
        }, 1500);
    }
});

Could you help me with this problem? I am really stuck.

Thanks for your time.

  • Do you want to scrape only this website (angkorauto.com), or other websites too? Commented Dec 10, 2016 at 15:44
  • I want to scrape other websites too, if request and cheerio cannot be used. Commented Dec 11, 2016 at 4:01
  • What? What is cheerio? Commented Dec 11, 2016 at 4:02

1 Answer


The website you want to scrape already includes jQuery (like many other websites), so you don't have to load it again.

This works fine:

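var page = require('webpage').create();
var url = 'http://angkorauto.com/vehicle';
page.open(url, function(status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit(1);
        return;
    }

    // Runs inside the page, where the site's own jQuery is already defined.
    var href = page.evaluate(function(){
        return jQuery('.divTitle').find('a').attr('href');
    });

    // Log from the PhantomJS context, then exit so the script does not hang.
    console.log(href);
    phantom.exit();
});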
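If the same script has to handle sites that may or may not ship jQuery, one option is to check for it first and inject it only when it is missing. The following is a rough sketch, not a tested recipe: `withJQuery` and `main` are hypothetical names introduced here, and the CDN URL is simply the one from the question.

```javascript
// Hypothetical helper: call `done` once jQuery is available in the page.
// Injects jQuery from `jqueryUrl` only if the page does not already define it.
function withJQuery(page, jqueryUrl, done) {
    var hasJQuery = page.evaluate(function () {
        return typeof window.jQuery === 'function';
    });
    if (hasJQuery) {
        done();
    } else {
        // includeJs is asynchronous; `done` runs after the script has loaded.
        page.includeJs(jqueryUrl, done);
    }
}

function main() {
    var page = require('webpage').create();
    var cdn = 'https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js';
    page.open('http://angkorauto.com/vehicle', function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        withJQuery(page, cdn, function () {
            var href = page.evaluate(function () {
                return jQuery('.divTitle').find('a').attr('href');
            });
            console.log(href);
            // Exit only inside the callback, after the result has been read.
            phantom.exit();
        });
    });
}

// Only start the scrape when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    main();
}
```

Note that `phantom.exit()` is called inside the callback. Calling it right after `page.includeJs(...)`, as in the question, would likely end the script before the callback has a chance to run.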

2 Comments

I will try this
This code solves two problems without clearly stating why. 1. Loading another jQuery version tends to break all jQuery functionality on the site, so if jQuery is already available, it should not be loaded again. 2. console.log calls made inside the page context (inside page.evaluate) are not printed to the console by default; they need the page.onConsoleMessage event handler. (@DooDoo)
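To illustrate the second point, here is a minimal sketch of forwarding page-context log messages to the PhantomJS console. `forwardPageConsole` and `main` are hypothetical names used only for this example, and the `[page]` prefix is an arbitrary choice.

```javascript
// Hypothetical helper: forward console.log calls made inside the page
// context to `log`, which runs in the PhantomJS context.
function forwardPageConsole(page, log) {
    page.onConsoleMessage = function (msg) {
        log('[page] ' + msg);
    };
}

function main() {
    var page = require('webpage').create();
    forwardPageConsole(page, function (line) {
        console.log(line);
    });
    page.open('http://angkorauto.com/vehicle', function (status) {
        if (status === 'success') {
            page.evaluate(function () {
                // This console.log runs in the page; without the handler
                // above it would not show up in the terminal.
                console.log(jQuery('.divTitle').find('a').attr('href'));
            });
        } else {
            console.log('Unable to load the address!');
        }
        phantom.exit();
    });
}

// Only start the scrape when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    main();
}
```

Returning the value from `page.evaluate` and logging it outside, as the answer does, avoids the need for the handler altogether; the handler is mostly useful for debugging what happens inside the page.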
