Scraping through CSS selectors

Question

I need to write a scraper in Java + Groovy.

Is there something that can parse HTML documents and select the information I need through simple CSS selectors, instead of walking the whole document tree and picking out nodes manually? Something like Nokogiri for Ruby, to give you an idea of what I'm after.

Thanks in advance!
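
To make the request concrete, here is a minimal sketch of the kind of lookup being asked for, using only the JDK. It is not a real selector engine: the class name CssSelectSketch and the helpers cssToXPath and select are hypothetical, only tag, .class, #id and the descendant combinator are handled, and the input is assumed to be well-formed XHTML. Real-world HTML would need a lenient HTML parser in front of it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CssSelectSketch {

    // Translate a tiny CSS subset (tag, .class, #id, descendant combinator) to XPath.
    // Hypothetical helper for illustration; anything beyond that subset is not handled.
    static String cssToXPath(String css) {
        StringBuilder xpath = new StringBuilder();
        for (String part : css.trim().split("\\s+")) {
            String tag = part;
            String predicate = "";
            int hash = part.indexOf('#');
            int dot = part.indexOf('.');
            if (hash >= 0) {
                tag = part.substring(0, hash);
                predicate = "[@id='" + part.substring(hash + 1) + "']";
            } else if (dot >= 0) {
                tag = part.substring(0, dot);
                String cls = part.substring(dot + 1);
                // Match one class name inside a space-separated class attribute.
                predicate = "[contains(concat(' ', normalize-space(@class), ' '), ' " + cls + " ')]";
            }
            xpath.append("//").append(tag.isEmpty() ? "*" : tag).append(predicate);
        }
        return xpath.toString();
    }

    // Parse well-formed XHTML and return the trimmed text of every matching element.
    static List<String> select(String xhtml, String css) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(cssToXPath(css), doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page, not taken from any real site.
        String page = "<html><body>"
                + "<ul class='nav links'><li><a href='/a'>First</a></li>"
                + "<li><a href='/b'>Second</a></li></ul>"
                + "<p id='note'>Not a link</p>"
                + "</body></html>";
        System.out.println(select(page, "ul.links a")); // prints [First, Second]
        System.out.println(select(page, "#note"));      // prints [Not a link]
    }
}
```

The point of the sketch is the calling style: one selector string replaces a hand-written tree walk. A dedicated library or a browser-backed tool would cover far more of the selector syntax and tolerate broken markup.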

My first thought: finally, someone who didn't ask this question in relation to regular expressions. ;) Of course, this has been covered in detail.
– ChrisLively, Commented Nov 15, 2010 at 22:40
I've been using C# for scraping. I've written a jQuery port, but I don't dare post it here for fear of being down-voted into oblivion for self-promotion.
– mpen, Commented Nov 17, 2010 at 5:13
So what if you get marked down? I would be interested to see it, and I wouldn't be the only one.
– hoju, Commented Nov 18, 2010 at 4:55

hoju · Accepted Answer · 2010-11-18 05:00:08Z

1

I do something like this by loading a page with Qt WebKit and including jQuery.

It's a hack, but it works well for my use case. I needed a solution that requires no configuration: just run sudo apt-get install libqt4-webkit and you're ready to go.

edited Nov 18, 2010 at 5:00

answered Nov 17, 2010 at 4:37


Reverend Gonzo · Accepted Answer · 2010-11-15 22:39:59Z

0

If you can rely on a browser (that is, use a browser to render the pages), Selenium would be perfect. It has the added benefit of fully supporting Ajax websites.

If not, something like WebDriver would probably work.

I've only used Selenium myself.

answered Nov 15, 2010 at 22:39


DisappointedByUnaccountableMod · Accepted Answer · 2021-06-10 14:26:06Z

0

I use Selenium RC + jQuery for screen scraping.

Example code: HERE

I use PHP as the client, but you can implement it in any language you like (as long as it can talk to Selenium RC).

I have tried several CSS selector libraries before, but honestly, the best parser is your browser. The Selenium RC approach is not fast, but it is very reliable.

edited Jun 10, 2021 at 14:26 by DisappointedByUnaccountableMod

answered Nov 17, 2010 at 16:50 by tszming
