My question is related to another question found here: Scraping an HTML table in Common Lisp?

I am trying to extract data from a webpage in Common Lisp. I am currently using drakma to send the HTTP request, and I'm trying to use chtml to extract the data I am looking for. The webpage I'm trying to scrape is http://erg.delph-in.net/logon. Here is my code:

(defun send-request (sentence)
 "Sends SENTENCE in an HTTP request to LOGON for parsing, and receives
  back the webpage containing the MRS output."
 (drakma:http-request "http://erg.delph-in.net/logon" 
                   :method :post 
                   :parameters `(("input" . ,sentence)
                                 ("task" . "Analyze")
                                 ("roots" . "sentences")
                                 ("output" . "mrs")
                                 ("exhaustivep" . "best")
                                 ("nresults" . "1"))))

And here's the function I am having trouble with:

(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    (stp:filter-recursively (stp:of-name "mrsFeatureTop") document)))

Basically, all the data I need to extract is in an HTML table, but it's too big to paste here. In my get-mrs function, I was just trying to get the tag with the name mrsFeatureTop. I am not sure if this is correct, though, since I am getting an error: not an NCName 'onclick. Any help with scraping the table will be greatly appreciated. Thank you.

  • Thanks wvxvw, I agree that a lot of webpages are rubbish. I will try out your suggestion. You've helped me a lot before with other questions btw, I thank you for that :). Commented May 19, 2013 at 17:08

1 Answer

Ancient question, I know, but one that defeated me for a long time. It's true that a lot of webpages are rubbish, but nearly all of Web 2.0 is built upon screen scraping, integrating heterogeneous websites with hack upon hack. That should be an ideal application for Lisp!

The key (in addition to drakma) is lquery, which allows you to access the page's contents using a lispy transliteration of CSS selectors (what jQuery uses).
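To get a feel for the selector style before touching a live site, here is a minimal offline sketch. It assumes Quicklisp is installed; the sample HTML and the names `*sample-html*`, `*sample-doc*` and the `results` class are made up for illustration. It passes the parsed document explicitly to `lquery:$`, which is the pattern the Common Lisp Cookbook uses, rather than relying on lquery's global document.

```lisp
(ql:quickload :lquery)

;; Made-up sample markup, standing in for a fetched page.
(defparameter *sample-html*
  "<table class=\"results\">
     <tr><td><a href=\"/a\">First</a></td></tr>
     <tr><td><a href=\"/b\">Second</a></td></tr>
   </table>")

;; Parse the string once and keep the resulting node vector around.
(defparameter *sample-doc* (lquery:$ (initialize *sample-html*)))

;; Select every <a> inside an element with class "results",
;; then read an attribute or the text from each match.
;; Both calls return a vector with one entry per matched element.
(defparameter *sample-hrefs* (lquery:$ *sample-doc* ".results a" (attr :href)))
(defparameter *sample-texts* (lquery:$ *sample-doc* ".results a" (text)))
```

The results are vectors, so `elt`, `length` and `map` work on them directly.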

Let's get the links from the media strip on Google's news page! If you open https://news.google.com in a browser and view the source, you'll be overwhelmed by the complexity of the page. But if you view the page in the browser's developer tools (Firefox: F12, Inspector), you'll see the page has some logic to it. Use the search box to find .media-strip-table. That element contains the images we want. Now open your favourite REPL. (Well, let's be honest here, Emacs: M-x slime)

(ql:quickload '(:drakma :lquery))

;;; Get the links from the media strip on Google's news page.
(defparameter *response* (drakma:http-request "https://news.google.com/"))

;;; lquery parses the page and gets it ready to be queried.
(lquery:$ (initialize *response*))

Now let's explore the results:

;;; Package-qualified '$' operator. Barbaric!
;;; Use (use-package :lquery) to omit the package prefix.
(lquery:$ ".media-strip-table" (html))

Wow! That's just a tiny section of the page? OK, how about the first element?

(elt (lquery:$ ".media-strip-table" (html)) 0)

OK, that's a little more manageable. Let's see if there's an image tag in there somewhere (Emacs: C-s img). Yay! There it is.

(lquery:$ ".media-strip-table img" (html))

Hmmm... It's finding something, but only returning empty text... Oh yeah, image tags are supposed to be empty!

(lquery:$ ".media-strip-table img" (attr :src))

Holy crap! GIFs aren't just used for unfunny, grainy animations?
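To tie this back to the table in the original question, here is a rough sketch of pulling rows and cells out of a table with lquery. It is only a sketch under assumptions: `table-rows` is a hypothetical helper, the sample markup is invented, and treating `mrsFeatureTop` as a CSS class on the table is a guess, since the real page's markup isn't shown in the question. It only looks at `td` cells, so header cells (`th`) would need their own handling.

```lisp
(ql:quickload :lquery)

(defun table-rows (page table-selector)
  "Return a list of rows for the tables matching TABLE-SELECTOR in the
HTML string PAGE. Each row is a list of whitespace-trimmed <td> texts."
  (let* ((doc (lquery:$ (initialize page)))
         ;; Every <tr> below any element matching the selector.
         (rows (lquery:$ doc table-selector "tr")))
    (loop for row across rows
          collect (loop for cell across (lquery:$ row "td" (text))
                        collect (string-trim '(#\Space #\Tab #\Newline #\Return)
                                             cell)))))

;; Invented sample markup; the class name is an assumption about the real page.
(defparameter *sample-table*
  "<table class=\"mrsFeatureTop\">
     <tr><td> key-1 </td><td>value-1</td></tr>
     <tr><td>key-2</td><td>value-2</td></tr>
   </table>")

(defparameter *sample-rows* (table-rows *sample-table* ".mrsFeatureTop"))
```

For the real page, the string returned by the question's `send-request` would be passed in place of `*sample-table*`, with the selector adjusted to whatever the browser's inspector shows.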


1 Comment

+1 for lquery! It lives here: shinmera.github.io/lquery (there is also a little tutorial for web scraping: lispcookbook.github.io/cl-cookbook/web-scraping.html)
