
I'm interested in writing a script, preferably one that is easy to add to browsers with tools such as Greasemonkey, that sends a page's HTML source code to an external server, where it will later be parsed and the useful data stored in a database.

However, I haven't seen anything like that and I'm not sure how to approach this task. I would imagine some sort of HTTP POST would be the best approach, but I'm completely new to those ideas, and I'm not even sure exactly where to send the data to parse it (it doesn't make sense to send an entire HTML document to a database, for instance).

So basically, my overall goal is something that works like this (note that I only need help with steps 1 and 2. I am familiar with data parsing techniques; I've just never applied them to the web):

  1. User views a particular page
  2. Source code is sent via Greasemonkey or some other tool to a server
  3. The code is parsed into meaningful data that is stored in a MySQL database.

Any tips or help is greatly appreciated, thank you!

Edit: Code

ihtml = document.body.innerHTML;
GM_xmlhttpRequest({
method:'POST',
url:'http://www.myURL.com/getData.php',
data:"SomeData=" + escape(ihtml)
});

Edit: Current JS Log:

Namespace/GMScriptName: Server Response: 200
OK
4
Date: Sun, 19 Dec 2010 02:41:55 GMT
Server: Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 PHP-CGI/0.9
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

Array
(
)

http://www.url.com/getData.php
  • This sounds like something that would be used for behavioural targeting of advertising or similar - be aware that the page source in question may contain sensitive information (emails, bank records, etc.) Personally, I'd avoid this entirely but if you insist on doing it, make sure your code is VERY secure. Commented Dec 18, 2010 at 0:00

1 Answer


As mentioned in the comment on your question, I'm not convinced this is a good idea, and personally I'd avoid any extension that did this like the plague, but...

You can use the innerHTML property, available on all HTML elements, to get the HTML inside that node (e.g. the body element). You could then use an AJAX HTTP(S!) request to post the data.

You might also want to consider some form of compression as some pages can be very large and most users have better download speeds than upload speeds.

NB: innerHTML gets a representation of the source code that would display the page in its current state, NOT the actual source that was sent from the web server. For example, if you used JS to add an element, the source for that element would be included in innerHTML even though it was never sent across the web.

An alternative would be to use an AJAX request to GET the current URL and send yourself the response. This would be exactly what was sent to the client, but the server in question will see the page requested twice (and in some web applications that may cause problems, e.g. by "pressing" a delete button twice).
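A rough sketch of that GET-then-forward approach, assuming a Greasemonkey-style request function. The helper name `refetchAndForward` is hypothetical, and the endpoint and the `PageContents` field name are placeholders borrowed from the code further down. The request function is passed in as a parameter so the flow can be tried outside a userscript.

```javascript
// Hypothetical helper: re-download the page, then forward the raw response.
// `request` is expected to behave like GM_xmlhttpRequest.
function refetchAndForward(request, pageUrl, endpoint) {
  request({
    method: 'GET',
    url: pageUrl,
    onload: function (response) {
      // responseText is the markup as the server sent it, before any scripts ran
      request({
        method: 'POST',
        url: endpoint,
        data: 'PageContents=' + encodeURIComponent(response.responseText),
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
      });
    }
  });
}

// Only runs inside a userscript, where GM_xmlhttpRequest is defined
if (typeof GM_xmlhttpRequest === 'function') {
  refetchAndForward(GM_xmlhttpRequest, location.href, 'http://example.com/getData.php');
}
```

Note that the second download is a fresh request, so the markup may differ from what the user originally received if the page changes between requests.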

One final suggestion would be to simply send the current URL to yourself and do the download on your own servers. This would also mitigate some of the security risks, as you wouldn't be able to retrieve the content of pages that aren't public.
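As a minimal sketch of the URL-only variant: the helper name `buildUrlReport` and the `PageUrl` field name are made up for illustration, and the endpoint is a placeholder. The helper only builds the details object that would be handed to GM_xmlhttpRequest.

```javascript
// Hypothetical helper: build the request details for reporting only the page URL
function buildUrlReport(pageUrl, endpoint) {
  return {
    method: 'POST',
    url: endpoint,
    // The URL itself contains characters such as & and =, so it must be encoded too
    data: 'PageUrl=' + encodeURIComponent(pageUrl),
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
  };
}

// Only runs inside a userscript, where GM_xmlhttpRequest is defined
if (typeof GM_xmlhttpRequest === 'function') {
  GM_xmlhttpRequest(buildUrlReport(location.href, 'http://example.com/getData.php'));
}
```

The server would then read the field (here `$_POST['PageUrl']`) and fetch the page itself.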

EDIT:

NB: I've deleted a lot of spurious information that was used in tracking down the problem. Check the edit logs if you want full details.

PHP Code:

<?php
    // The field name must match the key used in the script's POST body
    $PageContents = $_POST['PageContents'];
?>

GreaseMonkey script:

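 // Serialized DOM of the page as currently rendered
 var ihtml = document.body.innerHTML;
 GM_xmlhttpRequest({
  method: 'POST',
  url: 'http://example.com/getData.php',
  // encodeURIComponent (unlike the deprecated escape) also encodes non-ASCII text correctly
  data: 'PageContents=' + encodeURIComponent(ihtml),
  // Without this header PHP does not populate $_POST
  headers: {'Content-Type': 'application/x-www-form-urlencoded'}
 });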
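
PHP only finds `PageContents` in `$_POST` when the request body is a list of URL-encoded name=value pairs joined by `&`. A small sketch of building such a body from an object follows; `encodeFormBody` is a hypothetical helper, not part of Greasemonkey.

```javascript
// Hypothetical helper: turn { name: value } pairs into a form-encoded body
function encodeFormBody(fields) {
  return Object.keys(fields)
    .map(function (name) {
      // Encode both sides so characters like & and = inside the HTML cannot split the pair
      return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
    })
    .join('&');
}
```

In the script above, this could replace the hand-built string, e.g. `data: encodeFormBody({ PageContents: ihtml })`.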

26 Comments

I should have mentioned that I'll only use this script on one particular website that wouldn't contain any sensitive information, just data that I'm looking to parse easily. Could someone explain the possibility of utilizing HTTP via Ajax or other tools to do this? I've looked around for examples, and the best I've found are scripts that are intended to fill in forms through URL information, which I don't think would apply to an entire page's source code.
Thanks for that! Something in Greasemonkey is perfect for my needs. I understand the fields in the GM_xmlhttpRequest object, but could you give me an idea of how the MyScript.php receives the information?
I've been playing around with that code for a while and I'm having an issue. I think it's because $_POST can't look up an entry for 'SomeData' because it's never defined in the GM script. I looked for some equivalent to the name parameter in forms, but there doesn't seem to be any. Do I need some other way to identify the data coming in from the GM script or am I just doing something completely wrong?
When you POST data, the data should be in the same format as URL parameters eg name1=value1&name2=value2 - PHP can then look at the $_POST['name1'], $_POST['name2'] variables. If you can post the code you're using in the GM AJAX call and the PHP, we can identify the problem
The only code I have is the piece of the GM script you provided and the MyScript.php, with some added database insertions/queries just to test the script. Did I miss some step in setting the parameters?