12

I want to convert all the .doc files in a particular folder to .docx files.

I tried the following code:

import subprocess
import os
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        print filename
        subprocess.call(['soffice', '--headless', '--convert-to', 'docx', filename])

But it gives me an error: OSError: [Errno 2] No such file or directory

7 Answers

23

Here is a solution that worked for me. The other solutions proposed did not work on my Windows 10 machine using Python 3.

from glob import glob
import re
import os
import win32com.client as win32
from win32com.client import constants

# Create list of paths to .doc files
paths = glob('C:\\path\\to\\doc\\files\\**\\*.doc', recursive=True)

def save_as_docx(path):
    # Opening MS Word
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(path)
    doc.Activate()

    # Rename path with .docx
    new_file_abs = os.path.abspath(path)
    new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)

    # Save and Close
    word.ActiveDocument.SaveAs(
        new_file_abs, FileFormat=constants.wdFormatXMLDocument
    )
    doc.Close(False)

for path in paths:
    save_as_docx(path)
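
The extension swap inside save_as_docx can also be done without a regular expression. Below is a minimal sketch using pathlib. The helper name docx_path is made up for illustration, and PureWindowsPath is used only so the example behaves the same on any OS; in the real script, pathlib.Path would be the natural choice.

```python
from pathlib import PureWindowsPath

def docx_path(path):
    """Return `path` with its final extension replaced by .docx."""
    return str(PureWindowsPath(path).with_suffix('.docx'))

# Hypothetical file name under the example folder used above
print(docx_path(r'C:\path\to\doc\files\report.doc'))
```

Like the re.sub call, with_suffix only touches the last extension, so a name such as a.b.doc becomes a.b.docx.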

3 Comments

I am getting this error --> com_error: (-2147352567, 'Exception occurred.', (0, 'Microsoft Word', "Sorry, we couldn't find your file. Was it moved, renamed, or deleted?\r (C:\\//Users/shreyajain/Documents/Docum...)", 'wdmain11.chm', 24654, -2146823114), None) Any suggestion?
@Shreyansjain Based on the error message, I'm guessing you typed in the file path incorrectly. Although, it's difficult to tell without seeing your code.
1) This also lets you convert PDF files to DOCX, so you can read the content of PDF documents. 2) I would suggest adding a try at the start of the program to check that MS Word is installed: MSWord_OK = True try: word = win32.gencache.EnsureDispatch('Word.Application')
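
The snippet in the last comment is cut off, so the following is only a guess at its intent: a small check that reports whether Word can be started through COM. The function name word_is_available is hypothetical, and the broad except is deliberate, because the failure may be a missing pywin32 package or a COM error when Word is not installed.

```python
def word_is_available():
    """Return True if MS Word can be started through COM, False otherwise."""
    try:
        # Imported here so the check also works where pywin32 is not installed
        import win32com.client as win32
        word = win32.gencache.EnsureDispatch('Word.Application')
    except Exception:
        # pywin32 missing, or Word is not installed/registered
        return False
    word.Quit()
    return True

if __name__ == '__main__':
    print('MS Word available:', word_is_available())
```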
5

I prefer the glob module for tasks like this. Put the following in a file named doc2docx.py. To make it executable, run chmod +x doc2docx.py. Optionally, put the file somewhere on your $PATH to make it available "everywhere".

#!/usr/bin/env python

import glob
import subprocess

for doc in glob.iglob("*.doc"):
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc])

Though ideally you'd leave the expansion to the shell itself, and call doc2docx.py with the files as arguments, like doc2docx.py *.doc:

#!/usr/bin/env python

import subprocess
import sys

if len(sys.argv) < 2:
    sys.stderr.write("SYNOPSIS: %s file1 [file2] ...\n"%sys.argv[0])
    sys.exit(1)  # nothing to convert, so stop after printing the usage line

for doc in sys.argv[1:]:
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc])

As requested by @pyd, to output to a target directory myoutputdir use:

#!/usr/bin/env python

import subprocess
import sys

if len(sys.argv) < 2:
    sys.stderr.write("SYNOPSIS: %s file1 [file2] ...\n"%sys.argv[0])
    sys.exit(1)  # nothing to convert, so stop after printing the usage line

for doc in sys.argv[1:]:
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', '--outdir', 'myoutputdir', doc])
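
The OSError: [Errno 2] in the question is most likely raised because the soffice program itself could not be found, not because of the document, so it can help to look it up before looping. Here is a hedged sketch for Python 3. The helper names build_command and convert are hypothetical, and it assumes the executable is called soffice on your system.

```python
import shutil
import subprocess

def build_command(doc, outdir=None, exe='soffice'):
    """Assemble the soffice argument list used in the scripts above."""
    cmd = [exe, '--headless', '--convert-to', 'docx']
    if outdir is not None:
        cmd += ['--outdir', outdir]
    return cmd + [doc]

def convert(doc, outdir=None):
    """Run the conversion and return soffice's exit code.

    Returns None without running anything if soffice is not on PATH.
    """
    exe = shutil.which('soffice')
    if exe is None:
        return None
    return subprocess.call(build_command(doc, outdir, exe))
```

If convert returns None, adding the LibreOffice program folder to PATH, or passing the full path to soffice, is the first thing to try.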

2 Comments

From my tests this only fails when the working/target directory in question is the root of the filesystem, e.g. directly C:\ or D:\. Any other folder works fine. Looks like a bug in soffice. You can specify the output directory by using the option --outdir <directory-name>.
Do I need to pass one more argument? Can you edit your answer?
3

If you don't want to rely on subprocess calls, here is a version that uses a COM client. It is useful if you are targeting Windows users without LibreOffice installed.

#!/usr/bin/env python

import glob
import os  # needed for os.path.abspath below
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

for i, doc in enumerate(glob.iglob("*.doc")):
    in_file = os.path.abspath(doc)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath("out{}.docx".format(i))
    wb.SaveAs2(out_file, FileFormat=16) # file format for docx
    wb.Close()

word.Quit()

4 Comments

It is clean. However, I wonder whether there is any platform-independent way to convert doc to docx?
@longbowking There was no Swiss-army-knife library for this when I looked last year. One possible method is to detect the OS with sys.platform and use Jan Christoph Terasa's approach on Linux and my approach on Windows. Not sure what works on Mac.
Just tried unoconv with this docker image, doc -> docx, but the resulting docx was damaged (files contained comments that I needed to preserve).
@longbowking It's possible if you install LibreOffice on the platform.
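
The idea from these comments, preferring LibreOffice where it is installed and falling back to Word on Windows, could be sketched roughly as below. The function pick_backend and its return values are invented for illustration; the actual conversion would still be done by one of the scripts in the answers.

```python
import shutil
import sys

def pick_backend(platform, soffice_path):
    """Choose a conversion backend.

    platform     -- a sys.platform string such as 'win32' or 'linux'
    soffice_path -- result of shutil.which('soffice'), or None if not found
    """
    if soffice_path:
        return 'soffice'   # LibreOffice is available on any OS
    if platform == 'win32':
        return 'word'      # fall back to the COM approach
    return None            # no known converter on this machine

if __name__ == '__main__':
    print(pick_backend(sys.platform, shutil.which('soffice')))
```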
2

Use os.path.join to specify the correct directory.

import os, subprocess

main_dir = os.path.join('/', 'Users', 'username', 'Desktop', 'foldername')

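for filename in os.listdir(main_dir):
    if filename.endswith('.doc'):
        print(filename)
        # os.listdir returns bare names, so build the full path before calling soffice
        doc_path = os.path.join(main_dir, filename)
        subprocess.call(['soffice', '--headless', '--convert-to', 'docx', '--outdir', main_dir, doc_path])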

1 Comment

For me it doesn't work; it displays XML inside the docx.
2

Based on dshefman's code, this version takes the folder from the command line and walks it recursively:

import re
import os
import sys
import win32com.client as win32
from win32com.client import constants

# Get path from command line argument
ABS_PATH = sys.argv[1]

def save_as_docx(path):
    # Opening MS Word
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(path)
    doc.Activate()

    # Rename path with .docx
    new_file_abs = os.path.abspath(path)
    new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)

    # Save and Close
    word.ActiveDocument.SaveAs(new_file_abs, FileFormat=constants.wdFormatXMLDocument)
    doc.Close(False)

def main():
    source = ABS_PATH

    for root, dirs, filenames in os.walk(source):
        for f in filenames:
            filename, file_extension = os.path.splitext(f)

            if file_extension.lower() == ".doc":
                file_conv = os.path.join(root, f)
                save_as_docx(file_conv)
                print("%s ==> %sx" %(file_conv,f))

if __name__ == "__main__":
    main()


1

This version uses doc2docx, which I believe only works on Windows or macOS. I believe it is the cleanest version so far, if you can use Windows. You must install doc2docx first, which can be done from Anaconda (or pip).

import doc2docx
from glob import glob
import os
def convert_doc_to_docx(folder):
    # Stores all doc files to be removed later
    doc_files = glob('{}/*.doc'.format(folder))
    
    # Now do the conversion. Note that doc2docx converts all files in a given folder
    doc2docx.convert(folder)

    # Remove all old doc_files
    for doc_file in doc_files:
        os.remove(doc_file)

convert_doc_to_docx('C:/Users/user/folder_containing_doc_files/')
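
Because the function above deletes every .doc file whether or not its conversion succeeded, a more cautious cleanup step may be worth considering. The sketch below assumes that doc2docx writes each .docx next to its source with the same base name; remove_converted is a hypothetical helper, not part of doc2docx.

```python
import os
from glob import glob

def remove_converted(folder):
    """Delete each .doc in `folder` only if a matching .docx exists.

    Returns the list of .doc paths that were kept because no .docx was found.
    """
    kept = []
    for doc_file in glob(os.path.join(folder, '*.doc')):
        docx_file = os.path.splitext(doc_file)[0] + '.docx'
        if os.path.exists(docx_file):
            os.remove(doc_file)
        else:
            kept.append(doc_file)
    return kept
```

Calling remove_converted(folder) after doc2docx.convert(folder) in place of the unconditional os.remove loop leaves any unconverted originals on disk.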


0

By default, the os.path.exists() function in Python on Windows is case-insensitive, regardless of whether you have enabled case sensitivity for a specific folder. This means that:

Checking for "Cv.pdf" will return True if "cv.pdf" exists, even if the cases don't match.

If you want to enforce case-sensitive checks for file existence, you can manually check the case using os.listdir() to compare the actual filenames:

import os

def case_sensitive_exists(file_path):
    directory, file_name = os.path.split(file_path)
    # os.listdir('') raises an error, so fall back to the current directory
    return file_name in os.listdir(directory or '.')
