-1

After converting my PDF to HTML with pdf2htmlEX, I translated the HTML and then converted it back to PDF with wkhtmltopdf. The resulting PDF is correct, but it is extremely large: the original PDF was 116.4 KB, while the new, translated PDF is 3.4 MB.

Here are my HTML and PDF files on GitHub.

Here's the command I used to convert PDF to HTML with pdf2htmlEX:

pdf2htmlEX --fit-width 1024 --space-as-offset 1 fss4.pdf fss4.html

Here's the command I used to convert HTML to PDF with wkhtmltopdf:

xvfb-run -a wkhtmltopdf --no-images --quiet --dpi 150 --disable-smart-shrinking fss4.html fss4-fr.pdf

What can I do to reduce the size of this PDF file? I'm confused and don't know what to try next.

Any help would be much appreciated. Does anyone have an idea how I should solve this problem?

2 Answers

0

The file is a highly specialised Adobe format with scripting, and other readers can handle it only in a restricted way.

Acrobat shows a warning, and third-party readers recommend Adobe Reader, because the file only works there. It contains a lot of proprietary forms data and is therefore not suitable for use in any other application. It is designed purely for use with Adobe licensed server applications. (Really, these files should not carry the .PDF extension but something like .XFA; however, that is Adobe's prerogative, and they universally use .PDF for Reader-based files.)

  • You should convert using an XFA-to-PDF application rather than trying to bypass it with an inferior, messy conversion.

  • You should have no need to convert such a form, as it will not work in any other application.

Even if you neuter the scripting and the Adobe enhancements, Acrobat will still say that you cannot use this file as a simple e-form and that it must simply be printed out, as if it were a paper record!

The only suitable means of converting such a file is to PRINT to PDF.

  • So the best option for filling in on paper is a printout, that is, a flattened page image, at the cost of a gigantic increase in size and a file that is less acceptable as an online resource.

If you want a smaller electronic file of 33 KB, use Ghostscript to remove all the baggage and attempt to "fix" the file into a conventional PDF. You will then need to add conventional PDF fields to the result.

Note the comment that the Adobe file format does not meet Adobe's published PDF standard. (Perfectly correct, as it is an Acrobat Designer specific format!)

gs -sDEVICE=pdfwrite -oform.pdf fss4.pdf
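To see how much a pass like this changes the file size, a small helper along the following lines can compare the input and the output. This is only a sketch: `size_report` is a hypothetical name introduced here, and the file names in the usage note are the ones from the command above.

```shell
# Print the byte sizes of two files and the ratio new/old.
# size_report is a hypothetical helper name, not part of Ghostscript.
size_report() {
  old_bytes=$(wc -c < "$1") || return 1
  new_bytes=$(wc -c < "$2") || return 1
  awk -v o="$old_bytes" -v n="$new_bytes" 'BEGIN {
    if (o == 0) { print "old file is empty"; exit 1 }
    printf "old=%d new=%d ratio=%.2f\n", o, n, n / o
  }'
}
```

After running the Ghostscript command, `size_report fss4.pdf form.pdf` would print both sizes and their ratio; a ratio below 1 means the rewritten file is smaller.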


Finally

Now you have a new file to which you can transfer the old fields. You can use any suitable PDF SDK to copy the fields across, and the final file will be much smaller without all that XFA nonsense.

This is what an XFA-to-PDF converter such as Apryse, Aspose or another powerful PDF product will do, faster and better than my manual approach.

Copying the fields over produces a TrueForm.pdf of 82.47 KB (84,451 bytes).

The collateral damage is that you should always test radio-button features, since they have enhanced group logic, so a manual copy may not work correctly without manual grouping. In the OP's example, the copy (without additional editing) does not enforce YES or NO: it allows both to be selected!


0

We might expect this to have something to do with images, but no, it is the actual PDF content streams:

$ cpdf -composition fss4.pdf
Images: 0 bytes (0.00%)
Fonts: 16988 bytes (14.06%)
Content streams: 38147 bytes (31.57%)
Structure Info: 9452 bytes (7.82%)
Attached Files: 0 bytes (0.00%)
XRef Table: 12658 bytes (10.48%)
Unclassified: 43593 bytes (36.08%)
$ cpdf -composition fss4-fr.pdf
Images: 226713 bytes (7.19%)
Fonts: 12694 bytes (0.40%)
Content streams: 2908943 bytes (92.31%)
Structure Info: 0 bytes (0.00%)
Attached Files: 0 bytes (0.00%)
XRef Table: 760 bytes (0.02%)
Unclassified: 2017 bytes (0.06%)
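As a rough way to pick out the dominant category from output like the above, the listing can be piped through a small filter. This is a sketch that assumes the `Name: N bytes (P%)` line layout shown here; `largest_component` is a hypothetical helper name, not a cpdf feature.

```shell
# Read `cpdf -composition` output on stdin and print the largest category.
# Assumes lines of the form "Name: 12345 bytes (1.23%)"; other lines are ignored.
largest_component() {
  awk -F': ' '
    NF == 2 {
      split($2, parts, " ")          # parts[1] holds the byte count
      if (parts[1] + 0 > max) { max = parts[1] + 0; name = $1 }
    }
    END { if (name != "") printf "%s: %d bytes\n", name, max }
  '
}
```

With cpdf installed, `cpdf -composition fss4-fr.pdf | largest_component` would report the category that accounts for most of the file.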

Upon closer inspection, it's not even inline images in the content streams. Something in your process has converted most of the text in the file to shapes, and in an inefficient way, so that each letter is stored separately. So you have a content stream of about 15 MB uncompressed and about 3 MB compressed. Why, I can't tell you; that's a wkhtmltopdf problem.
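One rough way to check for this kind of text-to-shapes conversion is to decompress the PDF and compare how often text-showing operators (`Tj`, `TJ`) appear against path-filling operators (`f`, `f*`, `F`). The filter below is a heuristic sketch, not a PDF parser: `count_paint_ops` is a hypothetical name, tokens inside string literals or binary streams can be miscounted, and some awk implementations may handle binary data poorly.

```shell
# Count text-showing vs. path-filling operators in decompressed PDF data on stdin.
# Heuristic only: it splits on whitespace and does not parse PDF syntax.
count_paint_ops() {
  LC_ALL=C awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "Tj" || $i == "TJ") text++
        else if ($i == "f" || $i == "f*" || $i == "F") fill++
      }
    }
    END { printf "text=%d fill=%d\n", text, fill }
  '
}
```

For example, after `cpdf -decompress fss4-fr.pdf -o tmp.pdf`, running `count_paint_ops < tmp.pdf` gives the two counts; a fill count far above the text count would be consistent with glyphs having been drawn as outlines.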
