33

Is there a PHP class/library that would allow me to query an XHTML document with CSS selectors? I need to scrape some pages for data that is very easily accessible if I could somehow use CSS selectors (jQuery has spoiled me!). Any ideas?

8 Answers

44

After Googling further (initial results weren't very helpful), it seems there is actually a Zend Framework library for this, along with some others:


5 Comments

+1 phpQuery is absolutely wonderful.
I tried out 3 of the items you listed. In the end, my choice is Simple HTML DOM, purely because its usage is explained simply and clearly. phpQuery got the job done, but I felt there was a lack of documentation and support. Zend successfully ran my query and counted the matches, but failed when it came to getting the values. Again, my suggestion is Simple HTML DOM.
Although Simple HTML DOM is quite popular, (a) it doesn't cover the full selector syntax well and (b) it doesn't appear to be in active development.
I'm working with phpQuery for now: Zend_Dom_Query probably only helps if you're already using Zend Framework. Simple HTML DOM Parser looks too small. phpQuery looks good, also wraps DOMDocument which I'm already using everywhere in my tests, so it doesn't require reparsing for me. DomQuery has disappeared. pqLite is an option, but uses its own node structure, so requires reparsing the document.
Fair warning! pqLite appears to be dead. The one search result I found linked out to a malware site.
9

XPath is a fairly standard way to access XML (and XHTML) nodes, and provides much more precision than CSS.
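
As a rough illustration of that point, here is a minimal sketch using PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`). The sample markup and variable names are made up for the example. It shows an XPath expression that approximates the CSS selector `div.post`, plus a "parent of" query, which has no equivalent in the classic CSS selector syntax:

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<div class="post featured"><p>Hello <strong>world</strong></p></div>'
      . '<div class="sidebar"><p>Other</p></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Rough equivalent of the CSS selector "div.post":
// match "post" as a whole token inside the class attribute.
$posts = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " post ")]'
);

// The parent of every <strong> element.
$parents = $xpath->query('//strong/..');

echo $posts->length, "\n";
echo $parents->item(0)->nodeName, "\n";
```

The class test is wordier than `.post` because XPath 1.0 has no notion of space-separated class tokens, so the expression pads the attribute value with spaces and searches for the padded name.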

10 Comments

+1 to bring to 0, but mainly because alternatives are always good.
wow, I was downvoted for this? I'm kinda interested as to why...
Wasn't me (the OP)! :-) I actually think this would be the best alternative, since XHTML is just an application of XML.
Sometimes people here are rather random. I agree that XPath is the better tool to use, if it's available. It's standard, more powerful, and quite similar to CSS selectors anyway.
In CSS you couldn't do anything like "select the parent of a 'strong' tag"
6

Another one:
http://querypath.org/

1 Comment

Looks better than all the other options, to me - thanks!
6

A great one is the Symfony 2 component CssSelector\Parser. It converts CSS selectors into XPath expressions. Take a look =)
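
To give a feel for what such a conversion involves, here is a deliberately tiny sketch. The function `simpleCssToXPath` is hypothetical and is not Symfony's implementation; it assumes selectors made only of a tag name, `#id`, and `.class` parts, joined by descendant (space) combinators. The real component covers far more of the selector grammar.

```php
<?php
// Hypothetical helper, NOT Symfony's code: supports only "tag", "#id" and
// ".class" compounds separated by descendant (whitespace) combinators.
function simpleCssToXPath(string $selector): string
{
    $xpath = '';
    foreach (preg_split('/\s+/', trim($selector)) as $compound) {
        if ($compound === ''
            || !preg_match('/^([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)$/', $compound, $m)) {
            throw new InvalidArgumentException("Unsupported selector part: '$compound'");
        }
        // "//" selects descendants at any depth; default to "*" when no tag is given.
        $step = '//' . (($m[1] ?? '') !== '' ? $m[1] : '*');

        preg_match_all('/([#.])([\w-]+)/', $m[2] ?? '', $parts, PREG_SET_ORDER);
        foreach ($parts as [, $kind, $name]) {
            $step .= $kind === '#'
                ? "[@id='$name']"
                : "[contains(concat(' ', normalize-space(@class), ' '), ' $name ')]";
        }
        $xpath .= $step;
    }
    return $xpath;
}

echo simpleCssToXPath('div.post strong'), "\n";
echo simpleCssToXPath('#main'), "\n";
```

The resulting string can be passed to `DOMXPath::query()`, which is how a CSS-to-XPath converter lets you keep using the DOM extension while writing jQuery-style selectors.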

Source code


5

For jQuery users, the most interesting option may be phpQuery, a port of jQuery to PHP. Almost all sections of the library are ported. It also contains a WebBrowser plugin, which can be used for scraping a whole site's paths and processes (e.g. accessing data that is only available after logging in). It simply simulates a web browser on the server (events and cookies too). The latest versions have experimental support for XML namespaces and the CSS3 "|" selector.


3

I ended up using PHP Query Lite; it's very simple and has all I need.

1 Comment

Downvoted because this doesn't appear to exist any more.
2

For document parsing I use DOM. This can quite easily solve your problem if you know the tag name (in this example "div"):

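 libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings quietly
 $doc = new DOMDocument();
 $doc->loadHTML($html); // $html holds the markup of the page being scraped

 $elements = $doc->getElementsByTagName("div");
 foreach ($elements as $e) {
  // Compare against each class token so class="someclass other" also matches
  $classes = preg_split('/\s+/', $e->getAttribute("class"), -1, PREG_SPLIT_NO_EMPTY);
  if (!in_array("someclass", $classes, true)) continue;

  // it's a div.someclass
 }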

Not sure if DOM lets you get all elements of a document at once... you might have to do a tree traversal.
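
On that open question: `getElementsByTagName()` accepts `"*"` as a wildcard that matches every element, so a manual tree traversal is not required. A minimal sketch, with made-up sample markup, that collects every element carrying the class `someclass` regardless of tag name:

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<div class="someclass other"><span class="someclass">a</span></div><p>b</p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$matches = [];
// "*" is the wildcard tag name: it yields every element in document order.
foreach ($doc->getElementsByTagName('*') as $e) {
    // Split the class attribute into tokens so multi-class elements still match.
    $classes = preg_split('/\s+/', $e->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('someclass', $classes, true)) {
        $matches[] = $e->nodeName;
    }
}

print_r($matches);
```

Note that `loadHTML()` wraps the fragment in `html` and `body` elements, so the wildcard loop visits those as well; they simply fail the class check here.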

1 Comment

This method is the fastest of all I've tested. Another one to consider is SmartDOMDocument.
1

I wrote my own, based on the MooTools CSS selector engine: http://selectors.svn.exyks.org/. It relies on the SimpleXML extension (so it's read-only).
