I'd like to extract some information from a web page that's contained in an HTML <table>. How can I extract all of the table data into a pipe-separated (|) file?

Author|Book|Year|Comments
Bill Bryson|Short History of Nearly Everything|2004
Stephen Hawking|A Brief History of Time|1998|Still haven't read.

Ideally, I'd like a function that takes a URL and an output file name as parameters and writes the above output to that file.

(defun extract-table (url filename)
  (extract-from-html-table (fetch-web-page url)))

(extract-table "http://www.mypage.com" "output.txt")

Sample HTML input for the above output:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Lisp</title>
</head>
<body>
<h1>Welcome to Lisp</h1>
<table class="any" style="font-size: 14px;">
  <TR class="header">
    <td>Author</td>
    <TD>Book</TD>
    <td>Year</td>
    <td>Comments</td>
  </TR>
  <tr class="odd">
    <td>Bill Bryson</td>
    <td>Short History of Nearly Everything</td>
    <td>2004</td>
  </tr>
  <tr>
    <td>Stephen Hawking</td>
    <td>A Brief History of Time</td>
    <td>1998</td>
    <td>Still haven't read.</td>
  </tr>
</table>
</body>
</html>

1 Answer

Start with Drakma to fetch the page. For parsing, cxml may help, but it is an XML parser, so it only works if the page is well-formed XML. Better yet, use closure-html, which should parse arbitrary HTML 4. The closure-html page on Common-Lisp.net has a screen-scraping example.
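
Here is a rough sketch of how the pieces might fit together, not a tested solution. It assumes Drakma and closure-html are already loaded (for example through Quicklisp), that drakma:http-request returns the page body as a string or octet vector, and that closure-html's LHTML builder represents each element as a list of the form (name attributes . children) with keyword names. The helpers node-text, collect-elements, table-rows and write-rows are names made up for this example. It collects the rows of every table on the page and searches recursively, because the parser may insert a TBODY element between the table and its rows.

```lisp
;; Assumes the libraries are loaded first, e.g.
;; (ql:quickload '(:drakma :closure-html))

(defun node-text (node)
  "Concatenate all text found under an LHTML NODE."
  (cond ((stringp node) node)
        ((consp node)
         (format nil "~{~A~}" (mapcar #'node-text (cddr node))))
        (t "")))

(defun collect-elements (name node)
  "Return every element named NAME (a keyword) under NODE, in document order.
Does not descend into an element once it has matched."
  (when (consp node)
    (if (eq (car node) name)
        (list node)
        (mapcan (lambda (child) (collect-elements name child))
                (cddr node)))))

(defun table-rows (lhtml)
  "Return a list of rows; each row is a list of trimmed cell strings."
  (mapcar (lambda (tr)
            (mapcar (lambda (cell)
                      (string-trim '(#\Space #\Tab #\Newline #\Return)
                                   (node-text cell)))
                    ;; Keep only the TD/TH children; skip whitespace text nodes.
                    (remove-if-not (lambda (child)
                                     (and (consp child)
                                          (member (car child) '(:td :th))))
                                   (cddr tr))))
          (collect-elements :tr lhtml)))

(defun write-rows (rows stream)
  "Write each row to STREAM as one line, with cells separated by |."
  (dolist (row rows)
    (format stream "~{~A~^|~}~%" row)))

(defun extract-table (url filename)
  "Fetch URL, parse it as HTML, and write all table rows to FILENAME."
  (let* ((page (drakma:http-request url))
         (lhtml (chtml:parse page (chtml:make-lhtml-builder))))
    (with-open-file (out filename :direction :output
                                  :if-exists :supersede)
      (write-rows (table-rows lhtml) out))))
```

With that in place, the call from the question, (extract-table "http://www.mypage.com" "output.txt"), should write one line per table row. Cells that themselves contain a | are not escaped here, so add that if your data needs it.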
