Hi, I want to save a website's source code to a file using Java. From the source code I want to get only the contents of the <script> </script> tags. How can I do that?
2 Answers
Use an HTML parser in Java to extract text from HTML.
1 Comment
BalusC
To expand the (right) answer a bit: several are listed here: java-source.net/open-source/html-parsers
Once you've loaded the source code to a variable in Java, find the position of <script> and the position of </script> in the file and delete everything that's not inside that range.
Something like:
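String sourceCode = "source code here";
String startTag = "<script>";
String endTag = "</script>";
// indexOf() returns -1 if the tag is not found, so check both values before calling substring()
int startInt = sourceCode.indexOf(startTag);
// look for the closing tag only after the opening tag
int endInt = sourceCode.indexOf(endTag, startInt);
So the substring would be (skipping past the opening tag itself):
String jsCode = sourceCode.substring(startInt + startTag.length(), endInt);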
(This may be plainly wrong, I can't test it at the moment, sorry)
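The snippet above assumes the page is already in a String, but the question also asks about saving the page to a file. A minimal sketch of that step, using only the JDK, might look like the following. The class and method names (PageSaver, savePage, loadSource) are made up for illustration, and UTF-8 is an assumption about the page's encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of any library.
final class PageSaver {

    // Copies the raw bytes returned by the URL into the target file,
    // overwriting the file if it already exists.
    static void savePage(URL pageUrl, Path target) throws IOException {
        try (InputStream in = pageUrl.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Reads the saved file back into a String.
    // Assumption: the page is encoded as UTF-8.
    static String loadSource(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

Calling PageSaver.savePage(someUrl, Paths.get("page.html")) and then PageSaver.loadSource(Paths.get("page.html")) would give you the sourceCode string used above, where someUrl and page.html are placeholders for your own page and file name.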
5 Comments
user236501
I'm not sure how to do that. Can you please direct me to a tutorial or example? I have been searching Google for a few days and still cannot find exactly what I want.
user236501
Hi, thanks, but my source code has multiple script elements. Is there any solution to grab multiple <script> elements?
Val
newbie, take Johnny's code and put it in a loop. indexOf() accepts a starting position as a second argument, so it can find the NEXT occurrence of the string. The first time through the loop, you find the first start/end pair. On the next loop iteration, set your starting position to 1 past endInt, and you'll find the next pair. Each time through the loop, add the jsCode string to a Collection. When there are no more matches, you're done, and your collection has an item for each script element you found. Note that this only gets you the code of inline scripts, not the source of included scripts, e.g. [script src='foo.js'][/script]
user236501
Can I put this in a while loop, and if so, what condition should I use? My program hangs when I call the method, and I think the problem is an infinite loop: while(){ int startInt = sourceCode.indexOf(startTag); int endInt = sourceCode.indexOf(endTag); }
Val
indexOf() returns -1 when there are no more matches, so in the while() clause, test that startInt is not -1. Each time through the loop, you'll have to make indexOf() start just past the last endInt: sourceCode.indexOf(startTag, endInt + 1) or some such. Without that second argument it finds the same first pair every time, which is why your loop hangs. Look at java-samples.com/showtutorial.php?tutorialid=225. That's all from me, gotta go home now...
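Putting the comments above together, the loop could be sketched roughly as below. The names ScriptExtractor and extractInlineScripts are hypothetical. The sketch matches only the literal, lowercase <script> tag with no attributes, so as noted above it skips included scripts and anything like <script type="...">; an HTML parser is the more robust route.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the indexOf() loop from the comments.
final class ScriptExtractor {

    // Returns the text between each literal "<script>" and the next "</script>".
    static List<String> extractInlineScripts(String sourceCode) {
        final String startTag = "<script>";
        final String endTag = "</script>";
        List<String> scripts = new ArrayList<>();

        int startInt = sourceCode.indexOf(startTag);
        // indexOf() returns -1 once there are no more opening tags
        while (startInt != -1) {
            int contentStart = startInt + startTag.length();
            int endInt = sourceCode.indexOf(endTag, contentStart);
            if (endInt == -1) {
                break; // opening tag without a closing tag: stop
            }
            scripts.add(sourceCode.substring(contentStart, endInt));
            // continue searching just past the closing tag that was found
            startInt = sourceCode.indexOf(startTag, endInt + endTag.length());
        }
        return scripts;
    }
}
```

Because every search starts after the previous closing tag, the loop always moves forward through the string and ends when indexOf() returns -1.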