codetoad.com

HTTP screen-scraping and caching

Article by: Troy Wolf (5/31/2005)
Summary: An ASP class that makes it easy to leverage data from web pages that may be your own or pages from other websites. The class has methods that make it easy to pull entire pages, specific sections of content, tables of data, or even just a specific image from third-party websites.

The technique of stealing... er... leveraging content from other people's web pages is commonly referred to as "screen-scraping". Screen-scraping is how you get content from a website that doesn't offer a more natural way to get it, such as an RSS feed or an API.

Many companies that are common targets of screen-scraping (such as Amazon.com and IMDB.com) have strict policies against it, so please respect the usage policies of any site you wish to scrape.

The script and techniques in this article provide an ASP class to make screen-scraping easy and powerful.

Since many sites don't appreciate you scraping their content, and those that do don't want you to hit them every 10 seconds, it's a good idea to build some kind of caching mechanism into your screen-scraping code. This way you can scrape the other site's content once per hour rather than every time you get a hit on your own site. For example, if you scrape stock market data from finance.yahoo.com to place content on your home page, you'd prefer (and so would Yahoo) not to hit Yahoo for every hit to your site. Instead, hit Yahoo once, cache the result, then use that cached result to serve additional hits to your site for whatever period of time makes sense.

For example, let's say you want to grab the Google home page and display it in your own site. Using the httpcache class, you could simply do the following:


[Code listing missing from source]


The code above pulls the page source for http://www.google.com, then displays it as the output of your page.

Now, let's add caching. The code above hits Google every time you refresh the page. Let's hit Google at most once per hour instead. The code below shows how to use httpcache's caching feature.

[Code listing missing from source]


You did two things. First, you provided a cache filename for the class to use when saving the scraped results to disk. Second, you provided a Time To Live of one hour (3600 seconds). Now, when you refresh the page, you pull the data from your web server's hard drive rather than hitting Google, until the cache file is more than one hour old. The first hit to your page after that point causes a new hit to Google and a new cache file save.
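The cache-file-plus-TTL behavior described above can be sketched in a few lines. The following is an illustration in Python, not the httpcache class itself; the names `is_fresh`, `cached_fetch`, and `fetch_with_urllib` are hypothetical, and it assumes the cache file's modification time is what determines its age.

```python
import os
import time
from urllib.request import urlopen


def is_fresh(cache_path, ttl_seconds, now=None):
    """Return True if the cache file exists and is younger than ttl_seconds."""
    if not os.path.exists(cache_path):
        return False
    now = time.time() if now is None else now
    age = now - os.path.getmtime(cache_path)  # assumption: age is based on mtime
    return age < ttl_seconds


def cached_fetch(url, cache_path, ttl_seconds, fetch):
    """Serve from disk while the file is fresh; otherwise fetch and re-save."""
    if is_fresh(cache_path, ttl_seconds):
        with open(cache_path, "rb") as f:
            return f.read()
    data = fetch(url)  # only reached on a missing or expired cache file
    with open(cache_path, "wb") as f:
        f.write(data)
    return data


def fetch_with_urllib(url):
    """One possible fetcher: a plain HTTP GET that returns the raw bytes."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()
```

A call such as `cached_fetch("http://www.google.com", "google.cache", 3600, fetch_with_urllib)` would then hit the remote site at most once per hour. Passing the fetcher in as a parameter keeps the caching logic easy to try out without touching the network.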

In this next example, we'll pull only a specific table of data.

[Code listing missing from source]


We used httpcache's table_dump() method to show a default table view of the data we extracted from the external web page. In practice, you probably wouldn't use table_dump(); you'd want to work with the individual cells of data. For example, if you wanted to output just the 3rd column of the 5th data row:

[Code listing missing from source]
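To make the idea of addressing individual cells concrete, here is a rough Python sketch using only the standard library. It is not how httpcache parses tables; `TableGrabber` and `extract_table` are hypothetical names, the sample HTML is made up, and the parser assumes well-formed, non-nested tables with explicit closing tags.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of every <table> as rows of strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of cells
        self._row = None   # row currently being built, or None
        self._cell = None  # text pieces of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_table(html, index=0):
    """Return the rows of the table at position `index` in the page."""
    parser = TableGrabber()
    parser.feed(html)
    return parser.tables[index]


# Made-up sample: one header row followed by six data rows.
sample = "<table><tr><th>Sym</th><th>Name</th><th>Price</th></tr>" + "".join(
    f"<tr><td>S{i}</td><td>Name {i}</td><td>{i * 10}</td></tr>" for i in range(1, 7)
) + "</table>"

rows = extract_table(sample)
data_rows = rows[1:]     # skip the header row
print(data_rows[4][2])   # 3rd column of the 5th data row; prints 50
```

Real-world pages are often messier than this sample (unclosed tags, nested layout tables), so a production scraper would likely need a more forgiving parser.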


Finally, another powerful feature of httpcache is the ability to output binary data directly. You can use this to output data directly within an <img> tag. You need to create a simple ASP page that uses httpcache to return the image. That code is shown at the end of this article. I name this file img_cache.asp.

[Code listing missing from source]


The above will output an image from www.snippetedit.com. The image is cached locally on your web server's hard drive for 60 seconds. Strictly speaking, the file stays there indefinitely, but a hit to your page once the file is more than 60 seconds old causes a new hit to the source site.

Here is the code for img_cache.asp:

[Code listing missing from source]


Notice that the code above includes md5.asp. httpcache uses the md5 function to automatically create a cache filename if you don't supply one. I did not write md5.asp, but it works well. Here it is; copy the code and save it as a file named md5.asp.

[Code listing missing from source]
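As a rough illustration of deriving a cache filename from an MD5 hash, here is a small Python sketch. What httpcache actually feeds into md5 and what extension it uses are not stated in the article, so hashing the URL and the `.cache` suffix are both assumptions, and `cache_filename` is a hypothetical name.

```python
import hashlib


def cache_filename(url, suffix=".cache"):
    """Build a filesystem-safe cache filename from a URL.

    Assumption: the URL is the value being hashed. The hex digest contains
    only 0-9 and a-f, so characters like '/', '?', and ':' never reach disk.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + suffix
```

Because the same URL always hashes to the same 32-character digest, repeated requests for a page land on the same cache file without the caller having to invent a name.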


Finally, here is the httpcache_class.asp code that you include in your pages like the examples in this article.

[Code listing missing from source]


This article was a bit more complex than some, but if you put all the pieces together and play with the examples, you'll find httpcache to be a powerful and simple-to-use class for screen-scraping. The caching feature will make you a good neighbor, as it keeps you from causing unnecessary hits to those external sites.

Hope you enjoyed this article and the code. Troy Wolf is the author of SnippetEdit, a website editor written in PHP. SnippetEdit is as simple as it gets for non-technical users to edit pre-defined snippets of content in their websites.




User Comments on 'HTTP screen-scraping and caching'
Posted by :  ghubbell at 15:11 on Tuesday, October 25, 2005
Error in Win Server 2003, IIS6:

Microsoft VBScript runtime error '800a01fa'
Class not defined: 'httpcache'
test1.asp, line 4

Any help?


© Copyright codetoad.com 2001-2005