Batch HTML file editing

Question

I have a collection of one thousand HTML files that I need to trim. I want to delete every tag inside the <body></body> area except one, <div.pg>, so that the pages print cleanly. The excess consists of navigation links, which make the prints messy and use more paper. The contents differ from file to file, so I can't simply find and replace a fixed excerpt, but the tags are the same: for example, there are 3 <table> tags to delete, each with a specific class. How can I manipulate specific tags inside a batch of HTML files?

Is there a batch-processing technique or software for this job? What would be an easy solution on Windows?

If it's for print, why not simply add a @media print stylesheet to hide any page sections you DON'T want printed?
– Marc B, Commented Sep 27, 2011 at 20:57
In fact, I want to convert them to PDF before printing. Does that change anything? Would Acrobat render the HTML files as they would be printed and then make the PDFs?
– z403, Commented Sep 27, 2011 at 20:59

FailedDev · Accepted Answer · 2011-09-27 20:58:18Z

2

I would use an XSLT transform on each HTML page you have. Batch is not the right tool for manipulating HTML files, but you can use a batch script as a "manager" that passes each file to the XSL transform. Windows also has a rudimentary msxml utility that you can download and install on your machine: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21714

That's how I would do it. I am sure there are more options.
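The "manager" idea above can be sketched in Python instead of a batch file. This is only an illustration under stated assumptions: `process_folder` and its parameters are hypothetical names, the files are assumed to be UTF-8, and `transform` stands for whatever per-file cleanup you choose (an XSLT call, a parser, and so on).

```python
from pathlib import Path
from typing import Callable


def process_folder(src_dir, dst_dir, transform: Callable[[str], str],
                   pattern: str = "*.html") -> int:
    """Apply `transform` to every matching file and write results to dst_dir.

    The originals are left untouched. Returns the number of files processed.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob(pattern)):
        # Assumption: the pages are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.name).write_text(transform(text), encoding="utf-8")
        count += 1
    return count


# Example call (hypothetical paths and transform function):
# process_folder(r"C:\pages", r"C:\pages_clean", my_cleanup_function)
```

Writing to a separate output folder keeps the thousand source files intact, so a faulty transform can simply be rerun.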


4 Comments

z403 Over a year ago

Thanks, but by Batch I meant processing a group of files at once.

FailedDev Over a year ago

Ah, OK. Sorry, my mistake. Do the images fit on one A4 page? Does each HTML page contain only one image?

z403 Over a year ago

Each HTML file has one <div.pg>, usually with more than one image inside it, and I want every <div.pg> that is directly inside the <body>.

FailedDev Over a year ago

I would still go with XSLT. I would select the <div.pg> directly under the body and create a new HTML file containing only the elements I wanted. Then it would be relatively easy to print. For transforming the pages to PDF, I can suggest an open-source tool that I have used in the past: code.google.com/p/wkhtmltopdf
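The selection step described above (keep only the div directly under the body, write a new page) can also be sketched without XSLT, using Python's standard `html.parser`. This is a rough sketch, not the answerer's method. It assumes that <div.pg> means `<div class="pg">` and that nothing is left unclosed between `<body>` and that div. `PgExtractor` and `keep_pg` are made-up names, and the original `<head>` (including any stylesheets) is dropped.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PgExtractor(HTMLParser):
    """Collects the markup of every <div class="pg"> that is a child of <body>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. verbatim
        self.stack = []            # names of the currently open elements
        self.capture_depth = None  # stack depth at which the kept div opened
        self.out = []

    def _emit(self, text):
        if self.capture_depth is not None:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if (self.capture_depth is None and self.stack[-1:] == ["body"]
                and tag == "div" and "pg" in classes):
            self.capture_depth = len(self.stack)
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):  # e.g. <img ... />
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return  # stray closing tag: ignore it
        while self.stack.pop() != tag:
            pass  # discard elements that were never closed
        if self.capture_depth is not None:
            if len(self.stack) >= self.capture_depth:
                self.out.append(f"</{tag}>")
            if len(self.stack) <= self.capture_depth:
                self.capture_depth = None  # the kept div has ended

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")


def keep_pg(html_text: str) -> str:
    """Return a minimal page containing only the kept div(s)."""
    parser = PgExtractor()
    parser.feed(html_text)
    parser.close()
    return "<html><body>\n" + "".join(parser.out) + "\n</body></html>\n"
```

Calling `keep_pg` on the text of each file and saving the result under a new name gives pages with only the kept content, which can then be passed to a PDF converter.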

abrausch · Accepted Answer · 2011-09-27 21:03:19Z

0

If it is XHTML, you could use XSLT to transform your HTML into "another" format. See, for example, http://www.w3schools.com/xsl/ or http://help.hannonhill.com/discussions/how-do-i/269-strip-specific-html-tag-in-xslt
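If the files really are well-formed XHTML, any XML tool can do the same job. As an illustration, here is a minimal sketch that swaps XSLT for Python's standard `xml.etree.ElementTree`. It assumes the pages declare the standard XHTML namespace and that <div.pg> means `<div class="pg">`. `strip_body_xhtml` is a hypothetical name. Pages that use named entities such as `&nbsp;` will likely fail to parse this way.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_body_xhtml(xhtml_text: str) -> str:
    """Remove every direct child of <body> except <div class="pg">."""
    # Serialize with a default namespace instead of ns0: prefixes.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml_text)
    body = root.find(f"{{{XHTML_NS}}}body")
    if body is None:
        return xhtml_text  # nothing to trim
    for child in list(body):  # copy the list: we mutate body while looping
        classes = (child.get("class") or "").split()
        is_pg = child.tag == f"{{{XHTML_NS}}}div" and "pg" in classes
        if not is_pg:
            body.remove(child)  # also drops the text that followed the child
    body.text = None  # drop stray text before the first child
    return ET.tostring(root, encoding="unicode")
```

Unlike a text-based find and replace, this works on the element tree, so it does not matter that the navigation tables contain different content in each file.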

