0

I'm working on a Java program that converts .doc and .docx files to PDF. I've tested several approaches, including a number of open-source Java libraries, but these libraries often messed up the layout of the documents.

I then stumbled upon a JavaScript (JScript, run through Windows Script Host) script that uses the underlying Microsoft Word instance to open the file and save it as a PDF (found at: https://superuser.com/questions/17612/batch-convert-word-documents-to-pdfs-free/28303#28303):

// Converts a Word document to PDF by automating Microsoft Word.
// Arguments: 0 = path of the source document, 1 = path of the target PDF.
var fso = new ActiveXObject("Scripting.FileSystemObject");
var docPath = WScript.Arguments(0);
var pdfPath = WScript.Arguments(1);
// Word needs an absolute path to open the document.
docPath = fso.GetAbsolutePathName(docPath);
var objWord = null;
try {
    WScript.Echo("Saving '" + docPath + "' as '" + pdfPath + "'...");
    objWord = new ActiveXObject("Word.Application");
    objWord.Visible = false;
    var objDoc = objWord.Documents.Open(docPath);
    var wdFormatPdf = 17; // WdSaveFormat value for PDF
    objDoc.SaveAs(pdfPath, wdFormatPdf);
    objDoc.Close();
    WScript.Echo("The CV was successfully converted.");
} catch (err) {
    WScript.Echo("An error occurred: " + err.message);
} finally {
    // Always shut down the Word instance started above.
    if (objWord != null) {
        objWord.Quit();
    }
}

This script is called synchronously from my Java program, once for each document.

On a small scale this seems to work great, but when processing several thousand documents I ran into a couple of problems:

  • Sometimes a Word process would hang at the 'Save as' prompt. The process then blocked until a user dismissed the prompt.
  • Sometimes a Word process would hang at a 'Bookmark' prompt. Again, the process blocked until a user dismissed the prompt.

I'm looking for the cleanest way to control these Word processes better, for example by giving them a deadline: each process would get 5 seconds to open the Word document and save it as a PDF, and would be killed if it is still active after those 5 seconds.

I've dealt with something similar in the past, and the solution then included a 'kill Word processes' batch script that killed any Word processes still stuck after the program ended. Not very clean, but it did its job.
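
For reference, such a cleanup step could be sketched in Java roughly as follows. This is only an illustration, assuming Windows, Java 9 or later, and that Word runs under the image name WINWORD.EXE; the class and method names are made up for the example. Note that it kills every Word instance on the machine, including any a user has open interactively.

```java
import java.io.IOException;
import java.util.List;

public class WordCleanup {

    /** Command that force-kills all processes with the image name WINWORD.EXE. */
    public static List<String> killWordCommand() {
        // taskkill: /F forces termination, /IM selects processes by image name.
        return List.of("taskkill", "/F", "/IM", "WINWORD.EXE");
    }

    /** Runs the command; returns true only if taskkill exited with code 0. */
    public static boolean killStrayWordProcesses() {
        try {
            Process p = new ProcessBuilder(killWordCommand())
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            // taskkill could not be started (for example, not on Windows).
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```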

Any experiences or ideas would be appreciated!

6
  • That is javascript or worse, not Java. Commented Jan 7, 2013 at 16:29
  • 1
    Unless you're trying to learn the technology, just install a PDF printer and "print" the documents to PDF. I used the (non-free) one available with Adobe Acrobat, but there seem to be many free utilities available that do the same thing. Commented Jan 7, 2013 at 16:36
  • Does stackoverflow.com/questions/607669/… suffer the same problem? (C# alike) Commented Jan 7, 2013 at 16:50
  • support.microsoft.com/kb/257757/en-us - Microsoft's notes on automating Office (they don't recommend it). Commented Jan 7, 2013 at 17:11
  • @mlk, the warning applies only if the automation is done on the server side, which is not the case here (it's not mentioned in the question). Commented Jan 7, 2013 at 21:22

3 Answers

2

You can use https://www.npmjs.com/package/@nativedocuments/docx-wasm in a serverless setup (e.g. AWS Lambda) to perform your conversions in parallel; Lambda takes care of the concurrency. docx-wasm is self-contained (i.e. there is no need to run Microsoft Word). It uses a freemium model.

Edit April 2019

https://github.com/NativeDocuments/docx-to-pdf-on-AWS-Lambda is a sample project for using it on Lambda.

1 Comment

docx-wasm is no longer available. Their site has been taken down and they are no longer issuing licences.
1

I managed to get around the issue of the process getting stuck at a prompt in Microsoft Word. In my final solution I changed my Java code to start the script in a separate Thread. My main Thread sleeps for a few seconds and then checks on the other Thread.

The other Thread keeps a reference to the Process instance it uses to run the script. The main Thread then checks the exitValue of that Process. If the script is stuck at a Microsoft Word prompt, the process has not exited yet, so exitValue() throws an IllegalThreadStateException. I handle that exception by killing the process and cleaning up any temporary files left behind by Microsoft Word.
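
A minimal sketch of the same deadline idea, not the exact code from my solution: instead of sleeping and catching the exception from exitValue(), it uses Process.waitFor(timeout, unit), which reports whether the process finished in time. It assumes Java 9 or later and that the script is run through cscript; the class name, method names and script path are placeholders.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DeadlineRunner {

    /** Outcome of one conversion attempt. */
    public enum Result { COMPLETED, FAILED, TIMED_OUT }

    /** Builds the command line for the conversion script (paths are placeholders). */
    public static List<String> buildCommand(String scriptPath, String docPath, String pdfPath) {
        // cscript is the console host of Windows Script Host; //nologo hides its banner.
        return List.of("cscript", "//nologo", scriptPath, docPath, pdfPath);
    }

    /** Runs the command and kills it if it has not exited within the deadline. */
    public static Result runWithDeadline(List<String> command, long timeout, TimeUnit unit) {
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child cannot block on a full pipe buffer.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            return Result.FAILED; // the command could not be started
        }
        try {
            if (process.waitFor(timeout, unit)) {
                return process.exitValue() == 0 ? Result.COMPLETED : Result.FAILED;
            }
            // Deadline passed: the script is probably stuck at a prompt.
            process.destroyForcibly();
            process.waitFor();
            return Result.TIMED_OUT;
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt();
            return Result.FAILED;
        }
    }
}
```

A call such as `runWithDeadline(buildCommand("convert.js", "cv.docx", "cv.pdf"), 5, TimeUnit.SECONDS)` would give each document five seconds. Two caveats: the script in the question catches its own errors, so an exit code of 0 does not prove the PDF was written (checking that the file exists is safer), and killing the script host may leave the Word process it started running, which would then need separate cleanup.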

-2

Microsoft support advises against using Office unattended or on the server side.

If you only need simple conversion, LibreOffice has a command-line option for it: --convert-to.
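
As a rough sketch of how that could be called from Java: the example below assumes the LibreOffice launcher is on the PATH under the name soffice, and the class and method names are invented for illustration. The --headless, --convert-to and --outdir options are LibreOffice's own.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class LibreOfficeCommand {

    /** Builds a headless LibreOffice command that converts one document to PDF. */
    public static List<String> convertToPdf(Path input, Path outputDir) {
        return List.of(
                "soffice",                 // assumed to be on the PATH
                "--headless",              // run without a user interface
                "--convert-to", "pdf",     // target format
                "--outdir", outputDir.toString(),
                input.toString());
    }

    public static void main(String[] args) throws Exception {
        // args[0] = document to convert, args[1] = output directory
        List<String> command = convertToPdf(Paths.get(args[0]), Paths.get(args[1]));
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```

Since LibreOffice renders documents with its own layout engine, the resulting PDF may not match Word's layout exactly, so it is worth checking a few representative documents first.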
